TAILIEUCHUNG - Scientific report: "Bilingual Terminology Mining – Using Brain, not brawn comparable corpora"

E. Morin, B. Daille
Université de Nantes, LINA FRE CNRS 2729
2, rue de la Houssinière, BP 92208, F-44322 Nantes Cedex 03
morin-e daille-b @

K. Takeuchi
Okayama University
3-1-1 Tsushimanaka, Okayama-shi, Okayama 700-8530, Japan
koichi@

K. Kageura
Graduate School of Education, The University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
kyo@

Abstract

Current research in text mining favours the quantity of texts over their quality. But for bilingual terminology mining, and for many language pairs, large comparable corpora are not available. More importantly, as terms are defined vis-à-vis a specific domain with a restricted register, it is expected that the quality rather than the quantity of the corpus matters more in terminology mining. Our hypothesis, therefore, is that the quality of the corpus is more important than the quantity and ensures the quality of the acquired terminological resources. We show how important the type of discourse is as a characteristic of the comparable corpus.

1 Introduction

Two main approaches exist for compiling corpora: "Big is beautiful" or "Insecurity in large collections". Text mining research commonly adopts the first approach and favors data quantity over quality.
This is normally justified, on the one hand, by the need for large amounts of data in order to make use of statistical or stochastic methods (Manning and Schütze, 1999) and, on the other, by the lack of operational methods to automate the building of a corpus that meets selected criteria such as domain, register, media, style or discourse.

For lexical alignment from comparable corpora, good results on single words can be obtained from large corpora (several million words): the accuracy of the proposed translations is about 80% for the top 10-20 candidates (Fung, 1998; Rapp, 1999; Chiao and Zweigenbaum, 2002). Cao and Li (2002) have achieved 91% accuracy for the top three candidates
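
The accuracy figures quoted above are measured over the top N ranked candidates: a source word counts as correctly translated when at least one reference translation appears among its N best candidates. The Python sketch below illustrates that measure under stated assumptions. It is not the evaluation code of any of the cited authors; the function name `top_n_accuracy`, the dictionary layout, and the French-English toy entries are all hypothetical and invented for illustration.

```python
from typing import Dict, List, Set


def top_n_accuracy(
    candidates: Dict[str, List[str]],
    reference: Dict[str, Set[str]],
    n: int,
) -> float:
    """Share of reference source words with a correct translation in their top n.

    candidates maps a source word to its candidate translations, best first.
    reference maps a source word to its set of accepted translations.
    Source words with no candidate list count as misses.
    """
    if not reference:
        return 0.0
    hits = 0
    for source_word, gold in reference.items():
        ranked = candidates.get(source_word, [])
        # A hit if any accepted translation is among the n best candidates.
        if gold.intersection(ranked[:n]):
            hits += 1
    return hits / len(reference)


# Invented toy data (hypothetical, not taken from the paper).
toy_candidates = {
    "diabète": ["sugar", "diabetes", "insulin"],
    "glycémie": ["glycemia", "glucose"],
}
toy_reference = {
    "diabète": {"diabetes"},
    "glycémie": {"glycemia"},
    "rein": {"kidney"},  # no candidates were proposed for this word
}

for n in (1, 2, 3):
    print(n, top_n_accuracy(toy_candidates, toy_reference, n))
```

Dividing by the size of the reference list, rather than by the number of words that received candidates, means that a source word with no candidate at all lowers the score. Evaluations differ on this choice, so it is an assumption of the sketch.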
