Morphological Analysis of a Large Spontaneous Speech Corpus in Japanese

Kiyotaka Uchimoto, Chikashi Nobata, Atsushi Yamada, Satoshi Sekine, Hitoshi Isahara

Communications Research Laboratory, 3-5 Hikari-dai, Seika-cho, Soraku-gun, Kyoto 619-0289, Japan
{uchimoto, nova, ark, isahara}@crl.go.jp

New York University, 715 Broadway, 7th floor, New York, NY 10003, USA
sekine@cs.nyu.edu

Abstract

This paper describes two methods for detecting word segments and their morphological information in a Japanese spontaneous speech corpus, and describes how to tag a large spontaneous speech corpus accurately by using the two methods. The first method is used to detect any type of word segment. The second method is used when there are several definitions for word segments and their POS categories, and when one type of word segment includes another type of word segment. In this paper, we show that by using semiautomatic analysis we achieve a precision of better than 99% for detecting and tagging short words and 97% for long words, the two types of words that comprise the corpus. We also show that better accuracy is achieved by using both methods than by using only the first.

1 Introduction

The Spontaneous Speech Corpus and Processing Technology project is sponsoring the construction of a large spontaneous Japanese speech corpus, the Corpus of Spontaneous Japanese (CSJ) (Maekawa et al., 2000). The CSJ is a collection of monologues and dialogues, the majority being monologues such as academic presentations and simulated public speeches.
Simulated public speeches are short speeches presented specifically for the corpus by paid non-professional speakers. The CSJ includes transcriptions of the speeches as well as audio recordings of them. One of the goals of the project is to detect two types of word segments and corresponding morphological information in the transcriptions. The two types of word segments were defined by the members of The National Institute for Japanese Language and are called short word and long word.