Scientific report: "Japanese OCR Error Correction using Character Shape Similarity and Statistical Language Model"

Masaaki NAGATA
NTT Information and Communication Systems Laboratories
1-1 Hikari-no-oka, Yokosuka-Shi, Kanagawa, 239-0847 Japan
nagata@

Abstract

We present a novel OCR error correction method for languages without word delimiters that have a large character set, such as Japanese and Chinese. It consists of a statistical OCR model, an approximate word matching method using character shape similarity, and a word segmentation algorithm using a statistical language model. By using a statistical OCR model and character shape similarity, the proposed error corrector outperforms the previously published method. When the baseline character recognition accuracy is 90%, it achieves character recognition accuracy.

1 Introduction

As our society becomes more computerized, people are getting enthusiastic about entering everything into computers. As a result, the need for OCR in areas such as office automation and information retrieval is growing, contrary to our expectation. In Japanese, although the accuracy of printed character OCR is about 98%, sources such as old books, poor-quality photocopies, and faxes are still difficult to process and cause many errors. The accuracy of handwritten OCR is still about 90% (Hildebrandt and Liu, 1993), and it worsens dramatically when the input quality is poor.

If NLP techniques could be used to boost the recognition accuracy for handwriting and poor-quality documents, we could enjoy a very large market for OCR-related applications. OCR error correction can be thought of as a spelling correction problem. Although spelling correction has been studied for several decades (Kukich, 1992), the traditional techniques are implicitly based on English and cannot be used for Asian languages such as Japanese and Chinese. The traditional strategy for English spelling correction is called isolated word error correction. Word boundaries are marked by white space. If the ...
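As a rough illustration of the isolated word strategy described above, the following Python sketch splits the input on white space and replaces each token that is not in a dictionary with its closest dictionary entry. The function names, the toy dictionary, and the use of plain Levenshtein distance as the similarity measure are assumptions made for illustration; they are not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct_isolated_words(text: str, dictionary: set[str]) -> list[str]:
    """Hypothetical isolated-word corrector: each white-space token is
    checked against the dictionary on its own, ignoring its context."""
    candidates = sorted(dictionary)  # sorted so that ties break deterministically
    corrected = []
    for token in text.split():       # word boundaries come from white space
        if token in dictionary:
            corrected.append(token)
        else:
            corrected.append(min(candidates,
                                 key=lambda w: edit_distance(token, w)))
    return corrected


# Toy dictionary, assumed for illustration only.
toy_dictionary = {"character", "recognition", "accuracy"}
print(correct_isolated_words("charactor recogniti0n accuracy", toy_dictionary))
```

The sketch depends entirely on `text.split()` to find word boundaries. A Japanese sentence contains no white space, so the same call returns the whole sentence as a single token, which is one way to see why this strategy does not carry over directly to languages without word delimiters.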

