Scientific report: "Combining Trigram and Winnow in Thai OCR Error Correction"

Surapant Meknavin
National Electronics and Computer Technology Center
73/1 Rama VI Road, Rajthevi, Bangkok, Thailand
surapan@

Boonserm Kijsirikul, Ananlada Chotimongkol and Cholwich Nuttee
Department of Computer Engineering
Chulalongkorn University, Thailand
fengbks@chulkn.chula.ac.th

Abstract

For languages that have no explicit word boundary, such as Thai, Chinese and Japanese, correcting words in text is harder than in English because of additional ambiguities in locating erroneous words. The traditional method handles this by hypothesizing that every substring in the input sentence could be an erroneous word and trying to correct all of them. In this paper, we propose reducing the scope of spelling correction by focusing only on dubious areas in the input sentence. The boundaries of these dubious areas can be obtained approximately by applying a word segmentation algorithm and finding word sequences with low probability. To generate the candidate correction words, we use a modified edit distance that reflects the characteristics of Thai OCR errors. Finally, a part-of-speech trigram model and the Winnow algorithm are combined to determine the most probable correction.

1 Introduction

Optical character recognition (OCR) is useful in a wide range of applications, such as office automation and information retrieval systems. However, OCR in Thailand is still not widely used, partly because existing Thai OCR systems are not quite satisfactory in terms of accuracy. Recently, several research projects have focused on spelling correction for many types of errors, including those from OCR (Kukich, 1992). Nevertheless, the strategy differs slightly from language to language, since each language has its own characteristics. Two characteristics of Thai make error correction different from that of other languages: (1) there is no explicit word boundary, and (2) characters are written in three levels.
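
The abstract names Winnow as one of the two components used to select the most probable correction, but this excerpt does not describe how it is configured. As a rough illustration only, the sketch below implements the textbook multiplicative-update form of Winnow over binary features. The class name `Winnow`, the helper `disjunction_examples`, the promotion factor of 2 and the threshold equal to the number of features are all assumptions made for the example; they are not taken from the paper, and the toy task (learning a small disjunction) stands in for the real context features of Thai text.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple

# An example is (indices of active binary features, label).
Example = Tuple[Tuple[int, ...], bool]


class Winnow:
    """Textbook Winnow with multiplicative updates (illustrative sketch)."""

    def __init__(self, n_features: int, alpha: float = 2.0,
                 threshold: Optional[float] = None) -> None:
        if alpha <= 1.0:
            raise ValueError("alpha must be greater than 1")
        self.weights: List[float] = [1.0] * n_features  # all weights start at 1
        self.alpha = alpha
        # Assumed default: threshold equal to the number of features.
        self.threshold = float(n_features) if threshold is None else threshold

    def score(self, active: Sequence[int]) -> float:
        """Sum of the weights of the features that are switched on."""
        return sum(self.weights[i] for i in active)

    def predict(self, active: Sequence[int]) -> bool:
        return self.score(active) >= self.threshold

    def update(self, active: Sequence[int], label: bool) -> bool:
        """Process one example; return True if the prediction was wrong."""
        if self.predict(active) == label:
            return False  # weights change only on mistakes
        # Promote active weights on a missed positive, demote on a false alarm.
        factor = self.alpha if label else 1.0 / self.alpha
        for i in active:
            self.weights[i] *= factor
        return True

    def fit(self, examples: Sequence[Example], max_epochs: int = 100) -> int:
        """Cycle through the data until an epoch has no mistakes."""
        for epoch in range(1, max_epochs + 1):
            mistakes = sum(self.update(active, label) for active, label in examples)
            if mistakes == 0:
                return epoch
        return max_epochs


def disjunction_examples(n_features: int, relevant: Sequence[int]) -> List[Example]:
    """Toy data: the label is True when any 'relevant' feature is active."""
    examples: List[Example] = []
    for bits in product((0, 1), repeat=n_features):
        active = tuple(i for i, bit in enumerate(bits) if bit)
        label = any(bits[i] for i in relevant)
        examples.append((active, label))
    return examples
```

For instance, `Winnow(n_features=6).fit(disjunction_examples(6, (0, 3)))` trains on all 64 binary vectors until the weights for features 0 and 3 alone reach the threshold. In a correction setting, each feature index would presumably stand for a context indicator around a candidate word, but that mapping is a guess about usage rather than something stated in this excerpt.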
