The Manually Annotated Sub-Corpus: A Community Resource For and By the People

Nancy Ide
Department of Computer Science, Vassar College
Poughkeepsie, NY, USA
ide@cs.vassar.edu

Christiane Fellbaum
Princeton University
Princeton, New Jersey, USA
fellbaum@princeton.edu

Collin Baker
International Computer Science Institute
Berkeley, California, USA
collinb@icsi.berkeley.edu

Rebecca Passonneau
Columbia University
New York, New York, USA
becky@cs.columbia.edu

Abstract

The Manually Annotated Sub-Corpus (MASC) project provides data and annotations to serve as the base for a community-wide annotation effort of a subset of the American National Corpus. The MASC infrastructure enables the incorporation of contributed annotations into a single, usable format that can then be analyzed as it is or ported to any of a variety of other formats. MASC includes data from a much wider variety of genres than existing multiply-annotated corpora of English, and the project is committed to a fully open model of distribution, without restriction, for all data and annotations produced or contributed. As such, MASC is the first large-scale, open, community-based effort to create much needed language resources for NLP. This paper describes the MASC project, its corpus and annotations, and serves as a call for contributions of data and annotations from the language processing community.

1 Introduction

The need for corpora annotated for multiple phenomena across a variety of linguistic layers is keenly recognized in the computational linguistics community. Several multiply-annotated corpora exist, especially for Western European languages and for spoken data, but, interestingly, broad-based English language corpora with robust annotation for diverse linguistic phenomena are relatively rare. The most widely-used corpus of English, the British National Corpus, contains only part-of-speech annotation; and although it contains a wider range of annotation types, the fifteen million word Open American National Corpus annotations .