Scientific report: "Japanese OCR Error Correction using Character Shape Similarity and Statistical Language Model"

Masaaki NAGATA
NTT Information and Communication Systems Laboratories
1-1 Hikari-no-oka, Yokosuka-Shi, Kanagawa, 239-0847 Japan
nagata@

Abstract

We present a novel OCR error correction method for languages without word delimiters that have a large character set, such as Japanese and Chinese. It consists of a statistical OCR model, an approximate word matching method using character shape similarity, and a word segmentation algorithm using a statistical language model. By using a statistical OCR model and character shape similarity, the proposed error corrector outperforms the previously published method. When the baseline character recognition accuracy is 90%, it achieves [...] character recognition accuracy.

1 Introduction

As our society becomes more computerized, people are getting enthusiastic about entering everything into computers. So the need for OCR in areas such as office automation and information retrieval is growing, contrary to our expectation. In Japanese, although the accuracy of printed character OCR is about 98%, sources such as old books, poor quality photocopies, and faxes are still difficult to process and cause many errors. The accuracy of handwritten OCR is still about 90% (Hildebrandt and Liu, 1993), and it worsens dramatically when the input quality is poor.
If NLP techniques could be used to boost the recognition accuracy for handwriting and poor quality documents, we could enjoy a very large market for OCR related applications.

OCR error correction can be thought of as a spelling correction problem. Although spelling correction has been studied for several decades (Kukich, 1992), the traditional techniques are implicitly based on English and cannot be used for Asian languages such as Japanese and Chinese. The traditional strategy for English spelling correction is called isolated word error correction. Word boundaries are marked by white space. If the ...
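
As a rough illustration of the isolated-word strategy mentioned in the introduction, the following Python sketch splits text on white space and replaces each unknown token with its nearest lexicon entry. It is not the method proposed in this paper. The use of Levenshtein distance, the function names `edit_distance` and `correct_isolated_words`, the `max_dist` threshold, and the toy `LEXICON` are all assumptions made for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct_isolated_words(text: str, lexicon: set[str], max_dist: int = 1) -> str:
    """Correct each white-space-delimited token independently (hypothetical helper)."""
    corrected = []
    for token in text.split():  # word boundaries come from white space
        if token in lexicon or not lexicon:
            corrected.append(token)
            continue
        # Nearest lexicon entry; sorting makes tie-breaking deterministic.
        best = min(sorted(lexicon), key=lambda w: edit_distance(token, w))
        # Accept the candidate only if it is close enough to the token.
        corrected.append(best if edit_distance(token, best) <= max_dist else token)
    return " ".join(corrected)


# Toy lexicon, purely illustrative.
LEXICON = {"character", "recognition", "accuracy"}

print(correct_isolated_words("charactor recognitiom accuracy", LEXICON))
print(correct_isolated_words("characterrecognition", LEXICON))
```

The second call leaves its input unchanged: without a delimiter, the whole string is treated as one token and no lexicon entry is close enough. This is the situation that arises in text written without word delimiters.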
