Scientific report: "Language-independent bilingual terminology extraction from a multilingual parallel corpus"

Els Lefever (1,2), Lieve Macken (1,2) and Veronique Hoste (1,2)
(1) LT3, School of Translation Studies, University College Ghent, Groot-Brittannielaan 45, 9000 Gent, Belgium
(2) Department of Applied Mathematics and Computer Science, Ghent University, Krijgslaan 281-S9, 9000 Gent, Belgium

Abstract

We present a language-pair independent terminology extraction module that is based on a sub-sentential alignment system that links linguistically motivated phrases in parallel texts. Statistical filters are applied on the bilingual list of candidate terms that is extracted from the alignment output. We compare the performance of both the alignment and terminology extraction module for three different language pairs (French-English, French-Italian and French-Dutch) and highlight language-pair specific problems (e.g. different compounding strategy in French and Dutch). Comparisons with standard terminology extraction programs show an improvement of up to 20% for bilingual terminology extraction and competitive results (85% to 90% accuracy) for monolingual terminology extraction, and reveal that the linguistically based alignment module is particularly well suited for the extraction of complex multiword terms.

1 Introduction

Automatic Term Recognition (ATR) systems are usually categorized into two main families.
On the one hand, the linguistically-based or rule-based approaches use linguistic information such as PoS tags, chunk information, etc. to filter out stop words and restrict candidate terms to predefined syntactic patterns (Ananiadou, 1994; Dagan and Church, 1994). On the other hand, the statistical corpus-based approaches select n-gram sequences as candidate terms that are filtered by means of statistical measures. More recent ATR systems use hybrid approaches that combine both linguistic and statistical information (Frantzi and Ananiadou, 1999). Most bilingual terminology
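
The two families described in the introduction can be illustrated with a minimal sketch of a hybrid filter. This is not the authors' system: the hand-tagged input, the adjective/noun pattern, the function names and the raw frequency threshold (a stand-in for the statistical measures mentioned in the text) are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical hand-tagged input: (token, coarse PoS tag) pairs.
TAGGED = [
    ("the", "DET"), ("parallel", "ADJ"), ("corpus", "NOUN"),
    ("contains", "VERB"), ("the", "DET"), ("parallel", "ADJ"),
    ("corpus", "NOUN"), ("and", "CONJ"), ("a", "DET"),
    ("term", "NOUN"), ("list", "NOUN"),
]


def candidate_terms(tagged):
    """Linguistic filter: maximal runs of ADJ/NOUN tokens that end in a NOUN."""
    candidates, run = [], []
    # A sentinel tag flushes the last run at the end of the input.
    for token, tag in tagged + [("", "END")]:
        if tag in ("ADJ", "NOUN"):
            run.append((token, tag))
            continue
        # Trim trailing adjectives so the candidate ends in a noun.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if run:
            candidates.append(" ".join(t for t, _ in run))
        run = []
    return candidates


def filter_by_frequency(candidates, min_count=2):
    """Statistical filter (assumed): keep candidates seen at least min_count times."""
    counts = Counter(candidates)
    return {term: n for term, n in counts.items() if n >= min_count}


candidates = candidate_terms(TAGGED)
terms = filter_by_frequency(candidates)
print(candidates)
print(terms)
```

In this toy input, the syntactic pattern proposes multiword candidates and the frequency threshold then discards those that occur only once. A real system would replace the threshold with a proper statistical association or termhood measure.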
