Scientific report: "An Endogeneous Corpus-Based Method for Structural Noun Phrase Disambiguation"

Didier Bourigault
Centre d'Analyse et de Mathématiques Sociales, EHESS - Paris Sorbonne - CNRS
and
Electricité de France - Direction des Etudes et Recherches, Service Informatique et Mathématiques Appliquées
1 avenue du Général de Gaulle, 92141 Clamart Cedex, FRANCE

Abstract

In this paper, we describe a method for structural noun phrase disambiguation which mainly relies on the examination of the text corpus under analysis and doesn't need to integrate any domain-dependent lexico- or syntactico-semantic information. This method is implemented in the Terminology Extraction Software LEXTER. We first explain why the integration of LEXTER in the LEXTER-K project, which aims at building a tool for knowledge extraction from large technical text corpora, requires improving the quality of the terminology extracted by LEXTER. Then we briefly describe the way LEXTER works and show what kind of disambiguation it has to perform when parsing maximal-length noun phrases.
We introduce a method of disambiguation which relies on a very simple idea: whenever LEXTER has to choose among several competing noun sub-groups in order to disambiguate a maximal-length noun phrase, it checks, for each of these sub-groups, whether it occurs anywhere else in the corpus in a non-ambiguous situation, and then it makes a choice. The analysis of a half-a-million-word corpus resulted in an efficient disambiguation strategy. The average rates are: 27% no disambiguation, 70% correct disambiguation, 3% wrong disambiguation.

1 The LEXTER-K project: knowledge extraction from large technical text corpora

LEXTER is a Terminology Extraction Software (Bourigault 1992a, 1992b). A corpus of French-language texts on any technical subject is fed in. LEXTER performs a grammatical analysis of this corpus and yields a list of noun phrases which are likely to be terminological units representing the concepts of the subject field. This list together with the corpus it has ...
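
The corpus-lookup idea stated in the abstract can be illustrated with a minimal Python sketch. It is not LEXTER's implementation: the function names, the tuple representation of noun sub-groups, the sample phrases, and in particular the decision rule (choose only when exactly one competing sub-group is attested in a non-ambiguous situation, otherwise leave the phrase undisambiguated) are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Optional, Sequence, Tuple

# A noun sub-group is represented as a tuple of word forms (an assumption).
SubGroup = Tuple[str, ...]


def count_unambiguous(phrases: Iterable[Sequence[str]]) -> Counter:
    """Count noun groups found in non-ambiguous situations in the corpus."""
    return Counter(tuple(p) for p in phrases)


def choose_subgroup(
    candidates: Sequence[SubGroup], attested: Counter
) -> Optional[SubGroup]:
    """Pick the one competing sub-group attested elsewhere in the corpus.

    Hypothetical rule: return a sub-group only if it is the single
    candidate attested unambiguously; otherwise return None, which
    corresponds to the "no disambiguation" outcome.
    """
    found = [c for c in candidates if attested[c] > 0]
    return found[0] if len(found) == 1 else None


# Hypothetical corpus evidence and competing sub-groups for a phrase such as
# "ligne d'alimentation electrique" (accents omitted for simplicity).
attested = count_unambiguous([
    ("alimentation", "electrique"),
    ("circuit", "primaire"),
])
candidates = [("alimentation", "electrique"), ("ligne", "electrique")]
choice = choose_subgroup(candidates, attested)
print(choice)
```

Returning None rather than guessing mirrors the three outcomes reported in the abstract (no disambiguation, correct disambiguation, wrong disambiguation); how LEXTER actually arbitrates when several sub-groups are attested is not specified in this excerpt.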
