Scientific report: "Insertion, Deletion, or Substitution? Normalizing Text Messages without Pre-categorization nor Supervision"

Fei Liu (1), Fuliang Weng (2), Bingqing Wang (3), Yang Liu (1)
(1) Computer Science Department, The University of Texas at Dallas
(2) Research and Technology Center, Robert Bosch LLC
(3) School of Computer Science, Fudan University
feiliu yangll@ wbq@

Abstract

Most text message normalization approaches are based on supervised learning and rely on human-labeled training data. In addition, the nonstandard words are often categorized into different types, and specific models are designed to tackle each type. In this paper, we propose a unified letter transformation approach that requires neither pre-categorization nor human supervision. Our approach models the generation process from the dictionary words to nonstandard tokens under a sequence labeling framework, where each letter in the dictionary word can be retained, removed, or substituted by other letters/digits. To avoid the expensive and time-consuming hand labeling process, we automatically collected a large set of noisy training pairs using a novel web-based approach and performed character-level alignment for model training. Experiments on both Twitter and SMS messages show that our system significantly outperformed the state-of-the-art deletion-based abbreviation system and the Jazzy spell checker (absolute accuracy gain of [...] and [...] over the Jazzy spell checker on the two test sets, respectively).

1 Introduction

Recent years have witnessed the explosive growth of text message usage, including mobile phone text messages (SMS), chat logs, emails, and status updates from social network websites such as Twitter and Facebook. These text message collections serve as valuable information sources, yet the nonstandard contents within them often degrade [...]

Nonstandard tokens with their accompanying counts:
2gether 6326, 2getha 1266, 2gthr 178, 2qetha 46
togetha 919, togather 207, togehter 94, togethor 29
tgthr 250, t0gether 57, togeter 49, tagether [...]
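
To make the letter transformation idea from the abstract concrete, the following is a minimal sketch of how a (dictionary word, nonstandard token) pair could be turned into per-letter labels. It is not the authors' alignment procedure: it assumes a plain edit-distance alignment, and the function names `letter_labels` and `apply_labels` are hypothetical. Each dictionary letter is paired with the character that replaces it (the same character means it was retained) or with `None` if it was removed. Characters of the token that align to no dictionary letter are kept as separate entries with an empty source, which is a simplification made only for this illustration.

```python
def letter_labels(word, token):
    """Align a dictionary word to a nonstandard token, letter by letter.

    Returns a list of (source, output) pairs:
      (c, c)     letter c retained
      (c, d)     letter c substituted by d
      (c, None)  letter c removed
      ("", d)    d has no source letter (illustrative simplification)
    Assumption: plain edit-distance alignment, not the paper's procedure.
    """
    n, m = len(word), len(token)
    # cost[i][j] = minimum number of edits turning word[:i] into token[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (word[i - 1] != token[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover one best alignment.
    labels = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + (word[i - 1] != token[j - 1])):
            labels.append((word[i - 1], token[j - 1]))  # retain or substitute
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            labels.append((word[i - 1], None))          # remove
            i -= 1
        else:
            labels.append(("", token[j - 1]))           # no source letter
            j -= 1
    labels.reverse()
    return labels


def apply_labels(labels):
    """Regenerate the nonstandard token from its letter labels."""
    return "".join(out for _, out in labels if out is not None)


if __name__ == "__main__":
    for token in ["tgthr", "t0gether", "2gether"]:
        labels = letter_labels("together", token)
        print(token, labels, apply_labels(labels) == token)
```

For the pair "together" and "tgthr", the only minimum-cost alignment removes "o" and both occurrences of "e" and retains every other letter. Pairs such as "together" and "2gether" admit more than one minimum-cost alignment, so the labels returned there depend on the tie-breaking order in the traceback.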
