A Term Recognition Approach to Acronym Recognition

Naoaki Okazaki
Graduate School of Information Science and Technology, The University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
okazaki@mi.ci.i.u-tokyo.ac.jp

Sophia Ananiadou
National Centre for Text Mining, School of Informatics, Manchester University
PO Box 88, Sackville Street, Manchester M60 1QD, United Kingdom
Sophia.Ananiadou@manchester.ac.uk

Abstract

We present a term recognition approach to extract acronyms and their definitions from a large text collection. Parenthetical expressions appearing in a text collection are identified as potential acronyms. Assuming terms appearing frequently in the proximity of an acronym to be the expanded forms (definitions) of the acronyms, we apply a term recognition method to enumerate such candidates and to measure the likelihood scores of the expanded forms. Based on the list of the expanded forms and their likelihood scores, the proposed algorithm determines the final acronym-definition pairs. The proposed method, combined with a letter matching algorithm, achieved 78% precision and 85% recall on an evaluation corpus with 4,212 acronym-definition pairs.

1 Introduction

In the biomedical literature, the amount of terms (names of genes, proteins, chemical compounds, drugs, organisms, etc.) is increasing at an astounding rate.
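The candidate-collection step outlined in the abstract (parenthetical expressions as potential acronyms, nearby word sequences as candidate expanded forms) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the regular expression `PAREN`, the function name `collect_candidates`, the `max_words` window, and the use of raw co-occurrence counts in place of the likelihood scores, whose definition is not given in this section.

```python
import re
from collections import Counter, defaultdict

# Assumed pattern for a potential acronym: a short token in parentheses,
# starting with a letter, 2 to 10 characters long (not taken from the paper).
PAREN = re.compile(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)")


def collect_candidates(sentences, max_words=5):
    """Map each parenthesised short form to counts of preceding word sequences."""
    candidates = defaultdict(Counter)
    for sentence in sentences:
        for match in PAREN.finditer(sentence):
            short_form = match.group(1)
            # Words immediately to the left of the opening parenthesis.
            left_words = sentence[:match.start()].lower().split()
            window = left_words[-max_words:]
            # Every suffix of the window is one candidate expanded form.
            for start in range(len(window)):
                candidates[short_form][" ".join(window[start:])] += 1
    return candidates


sentences = [
    "We measured hormone receptor (HR) levels.",
    "The hormone receptor (HR) was expressed.",
    "Each hormone receptor (HR) binds a ligand.",
]
counts = collect_candidates(sentences)
for candidate, frequency in counts["HR"].most_common(3):
    print(frequency, candidate)
```

In this toy input, raw counts tie "receptor" with "hormone receptor", which shows why frequency alone is not enough to pick a definition. The abstract's likelihood scores and the final pair selection are what the approach relies on for that decision, and they are not reproduced in this sketch.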
Existing terminological resources and scientific databases (such as Swiss-Prot [1], SGD [2], FlyBase [3], and UniProt [4]) cannot keep up-to-date with the growth of neologisms (Pustejovsky et al., 2001). Although curation teams maintain terminological resources, integrating neologisms is very difficult if not based on systematic extraction and collection of terminology from literature.

Author note: Research Fellow of the Japan Society for the Promotion of Science (JSPS)
[1] http://www.ebi.ac.uk/swissprot
[2] http://www.yeastgenome.org
[3] http://www.flybase.org
[4] http://www.ebi.ac.uk/GOA

Term identification in literature is one of the major bottlenecks in processing information in biology as it faces many challenges