One Tokenization per Source

Jin GUO
Kent Ridge Digital Labs
21 Heng Mui Keng Terrace, Singapore 119613

Abstract

We report in this paper the observation of one tokenization per source: the same critical fragment in different sentences from the same source almost always realizes one and the same of its many possible tokenizations. This observation is shown to be very helpful in sentence tokenization practice, and is argued to have far-reaching implications for natural language processing.

1 Introduction

This paper sets out to establish the hypothesis of one tokenization per source: if an ambiguous fragment appears two or more times in different sentences from the same source, it is extremely likely that all occurrences share the same tokenization.

Sentence tokenization is the task of mapping sentences from character strings into streams of tokens. It is a long-standing problem in Chinese language processing, since Chinese lacks explicit word delimiters such as the white-spaces of English. Researchers have gradually been turning to model the task as a general lexicalization or bracketing problem in computational linguistics, in the hope that the research might also benefit the study of similar problems in other languages. For instance, in machine translation it is widely agreed that many multiple-word expressions, such as idioms, compounds, and some collocations, while not explicitly delimited in sentences, are ideally treated as single lexicalized units.

The primary obstacle in sentence tokenization is the existence of uncertainty, both in the notion of words and tokens and in the recognition of words and tokens in context. The same fragment in different contexts may have to be tokenized differently. For instance, the character string "todayissunday" would normally be tokenized as "today is Sunday", but could also reasonably be tokenized as "today is sun day". In terms of possibility, it has been argued that no lexically possible tokenization can not …
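To make the two ideas above concrete (enumerating the lexically possible tokenizations of an undelimited fragment, and reusing one tokenization per source), here is a minimal Python sketch. It is not from the paper: the toy lexicon and all function and class names are illustrative assumptions, and the fewest-tokens fallback merely stands in for whatever disambiguation model is actually used.

    # A sketch of tokenization ambiguity and the one-tokenization-per-source
    # heuristic. Toy lexicon and all names are illustrative, not from the paper.

    LEXICON = {"today", "to", "day", "is", "sun", "sunday", "i", "s"}

    def all_tokenizations(s, lexicon=LEXICON):
        """Enumerate every split of s into lexicon words
        (the lexically possible tokenizations)."""
        if not s:
            return [[]]
        results = []
        for i in range(1, len(s) + 1):
            prefix = s[:i]
            if prefix in lexicon:
                for rest in all_tokenizations(s[i:], lexicon):
                    results.append([prefix] + rest)
        return results

    # The critical fragment admits several lexically possible tokenizations:
    for tokens in all_tokenizations("todayissunday"):
        print(tokens)
    # ['today', 'is', 'sunday'], ['today', 'is', 'sun', 'day'],
    # ['to', 'day', 'is', 'sunday'], ... among others.

    class SourceTokenizer:
        """Reuse one tokenization per (source, fragment) pair: once a critical
        fragment has been disambiguated within a source, every later occurrence
        in the same source receives the same tokenization."""

        def __init__(self, disambiguate):
            self.disambiguate = disambiguate  # fallback picking one tokenization
            self.memo = {}  # (source_id, fragment) -> chosen tokenization

        def tokenize(self, source_id, fragment):
            key = (source_id, fragment)
            if key not in self.memo:
                self.memo[key] = self.disambiguate(fragment)
            return self.memo[key]

    # Hypothetical fallback: prefer the tokenization with the fewest tokens.
    tok = SourceTokenizer(lambda f: min(all_tokenizations(f), key=len))
    print(tok.tokenize("doc1", "todayissunday"))  # ['today', 'is', 'sunday']
    print(tok.tokenize("doc1", "todayissunday"))  # reused, identical by construction

The memo table is what encodes the hypothesis: within one source, consistency across occurrences of the same fragment is assumed, so disambiguation runs once per fragment per source rather than once per sentence.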
