Scientific report: "Learning-Based Named Entity Recognition for Morphologically-Rich, Resource-Scarce Languages"

Kazi Saidul Hasan, Md. Altaf ur Rahman, and Vincent Ng
Human Language Technology Research Institute, University of Texas at Dallas, Richardson, TX 75083-0688
saidul altaf vince @

Abstract

Named entity recognition for morphologically rich, case-insensitive languages, including the majority of Semitic languages, Iranian languages, and Indian languages, is inherently more difficult than its English counterpart. Worse still, progress on machine learning approaches to named entity recognition for many of these languages is currently hampered by the scarcity of annotated data and the lack of an accurate part-of-speech tagger. While it is possible to rely on manually constructed gazetteers to combat data scarcity, this gazetteer-centric approach has the potential weakness of creating irreproducible results, since these name lists are generally not publicly available. Motivated in part by this concern, we present a learning-based named entity recognizer that does not rely on manually constructed gazetteers, using Bengali as our representative resource-scarce, morphologically rich language. Our recognizer achieves a relative improvement in F-measure over a baseline recognizer. Improvements arise from (1) using induced affixes, (2) extracting information from online lexical databases, and (3) jointly modeling part-of-speech tagging and named entity recognition.
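To make the evaluation terms in the abstract concrete, the following is a minimal sketch of how an F-measure and a relative improvement over a baseline are conventionally computed. The function names and the precision and recall values are hypothetical and chosen only for illustration; they are not the paper's results, and the paper's own evaluation script is not shown in this excerpt.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def relative_improvement(baseline: float, system: float) -> float:
    """Fractional gain of `system` over `baseline` (0.10 means 10%)."""
    if baseline <= 0.0:
        raise ValueError("baseline must be positive")
    return (system - baseline) / baseline


# Hypothetical scores, for illustration only (not taken from the paper).
baseline_f = f_measure(precision=0.60, recall=0.60)
system_f = f_measure(precision=0.66, recall=0.66)
gain = relative_improvement(baseline_f, system_f)
print(f"baseline F = {baseline_f:.3f}, system F = {system_f:.3f}, gain = {gain:.1%}")
```

A relative improvement is measured against the baseline score, so it differs from the absolute difference in F-measure points between the two systems.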
1 Introduction

While research in natural language processing has gained a lot of momentum in the past several decades, much of this research effort has focused on only a handful of politically important languages such as English, Chinese, and Arabic. On the other hand, despite being the fifth most spoken language, with more than 200 million native speakers residing mostly in Bangladesh and the Indian state of West Bengal, Bengali has far fewer electronic resources than the aforementioned languages. In fact, a major obstacle to the .
