Scientific report: "Tokenization: Returning to a Long Solved Problem"

A Survey, Contrastive Experiment, Recommendations, and Toolkit

Rebecca Dridan, Stephan Oepen
Institutt for Informatikk, Universitetet i Oslo
rdridan oe @

Abstract

We examine some of the frequently disregarded subtleties of tokenization in Penn Treebank style, and present a new rule-based preprocessing toolkit that not only reproduces the Treebank tokenization with unmatched accuracy, but also maintains exact stand-off pointers to the original text and allows flexible configuration to diverse use cases (e.g. to genre- or domain-specific idiosyncrasies).

1 Introduction: Motivation

The task of tokenization is hardly counted among the grand challenges of NLP and is conventionally interpreted as breaking up natural language text into distinct meaningful units, or tokens (Kaplan, 2005). Practically speaking, however, tokenization is often combined with other string-level preprocessing, for example normalization of punctuation (of different conventions for dashes, say), disambiguation of quotation marks (into opening vs. closing quotes), or removal of unwanted mark-up, where the specifics of such preprocessing depend both on properties of the input text and on assumptions made in downstream processing.

Applying some string-level normalization prior to the identification of token boundaries can improve or simplify tokenization, and a sub-task like the disambiguation of quote marks would in fact be hard to perform after tokenization, seeing that it depends on adjacency to whitespace. In the following, we thus assume a generalized notion of tokenization, comprising all string-level processing up to and including the conversion of a sequence of characters (a string) to a sequence of tokens. Obviously, some of the normalization we include in the tokenization task (in this generalized interpretation) could be left to downstream analysis, where a tagger or parser, for example, could be expected to accept non-disambiguated quote marks.
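To make the dependence on whitespace adjacency concrete, the following is a minimal Python sketch. It is not the toolkit described in this paper, and the function name `tokenize`, the regular expression, and the opening-quote heuristic are all assumptions made for illustration. It splits a string into word and punctuation tokens, rewrites straight double quotes into Treebank-style opening (``) or closing ('') forms by looking at the character before the quote in the original string, and records for every token its character offsets into the unmodified input, in the spirit of stand-off pointers.

```python
import re
from typing import List, Tuple

# A token is (form, start, end); start and end are character offsets
# into the ORIGINAL text, so the normalized form never loses its source span.
Token = Tuple[str, int, int]


def tokenize(text: str) -> List[Token]:
    """Naive illustrative tokenizer (an assumption, not Penn Treebank rules).

    Words are runs of word characters; every other non-space character
    becomes a token of its own.
    """
    tokens: List[Token] = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        form = match.group()
        start, end = match.span()
        if form == '"':
            # Heuristic: a straight quote is treated as opening when it is
            # at the start of the text or follows whitespace or an opening
            # bracket; otherwise it is treated as closing. This check needs
            # the raw string, because the whitespace is gone once only the
            # token sequence remains.
            opening = start == 0 or text[start - 1].isspace() or text[start - 1] in "([{"
            form = "``" if opening else "''"
        tokens.append((form, start, end))
    return tokens


if __name__ == "__main__":
    sample = 'She said "hi" twice.'
    for form, start, end in tokenize(sample):
        # Print the normalized form next to the original span it came from.
        print(f"{form!r:10} <- {sample[start:end]!r} [{start}:{end}]")
```

Because each token keeps its original span, the normalized forms can always be mapped back to the input text, even though the quote characters themselves were changed.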
