Scientific report: "Unsupervised Segmentation of Chinese Text by Use of Branching Entropy"

Zhihui Jin and Kumiko Tanaka-Ishii
Graduate School of Information Science and Technology, University of Tokyo

Abstract

We propose an unsupervised segmentation method based on an assumption about language data: that the increasing point of entropy of successive characters is the location of a word boundary. A large-scale experiment was conducted using 200 MB of unsegmented training data and 1 MB of test data, and a precision of 90% was attained with recall being around 80%. Moreover, we found that the precision was stable at around 90% independently of the learning data size.

1 Introduction

The theme of this paper is the following assumption:

The uncertainty of tokens coming after a sequence helps determine whether a given position is at a boundary. (A)

Intuitively, as illustrated in Figure 1, the variety of successive tokens at each character inside a word monotonically decreases according to the offset length, because the longer the preceding character n-gram, the longer the preceding context and the more it restricts the appearance of possible next tokens. For example, it is easier to guess which character comes after "natura" than after "na". On the other hand, the uncertainty at the position of a word border becomes greater, and the complexity increases, as the position is out of context. With the same example, it is difficult to guess which character comes after "natural ". This suggests that a word border can be detected by focusing on the differentials of the uncertainty of branching.

Figure 1: Intuitive illustration of a variety of successive tokens and a word boundary

In this paper, we report our study on applying this assumption to Chinese word segmentation by formalizing the uncertainty of successive tokens via the branching entropy (which we mathematically define in the next section). Our intention in this paper is above all to study the fundamental and scientific statistical property underlying language data, so that it can be applied to language engineering.

The above assumption (A) dates back to the fundamental work done by Harris (Harris, 1955), where he says that when the number of different tokens coming after every prefix of a word marks ...
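The intuition behind assumption (A) can be illustrated with a small sketch. The paper's formal definition of branching entropy is given in its next section and is not part of this excerpt, so the code below assumes the standard Shannon entropy of the character that follows a given prefix. The function names `successor_counts` and `branching_entropy` and the toy English corpus are hypothetical, chosen only to mirror the "natural" example; they are not the authors' implementation or data.

```python
import math
from collections import Counter


def successor_counts(text: str, prefix: str) -> Counter:
    """Count the characters that immediately follow each occurrence of prefix."""
    counts = Counter()
    start = text.find(prefix)
    while start != -1:
        nxt = start + len(prefix)
        if nxt < len(text):
            counts[text[nxt]] += 1
        # Move one character forward so overlapping occurrences are also found.
        start = text.find(prefix, start + 1)
    return counts


def branching_entropy(text: str, prefix: str) -> float:
    """Shannon entropy (in bits) of the character following prefix in text.

    Assumption: this is the usual conditional entropy of the next character,
    used here as a stand-in for the definition given later in the paper.
    """
    counts = successor_counts(text, prefix)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Toy corpus invented for illustration (not data from the paper).
corpus = "natural language. natural science. natural history. nation. name. navy."

for prefix in ["na", "natura", "natural "]:
    print(repr(prefix), round(branching_entropy(corpus, prefix), 3))
```

On this toy corpus the entropy falls as the prefix grows from "na" to "natura" and rises again after "natural ", where the next word begins. That rise is the kind of increase the assumption treats as evidence of a word boundary.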
