Text Segmentation by Language Using Minimum Description Length

Hiroshi Yamaguchi
Graduate School of Information Science and Technology, University of Tokyo
yamaguchi.hiroshi@ci.i.u-tokyo.ac.jp

Kumiko Tanaka-Ishii
Faculty and Graduate School of Information Science and Electrical Engineering, Kyushu University
kumiko@ait.kyushu-u.ac.jp

Abstract

The problem addressed in this paper is to segment a given multilingual document into segments for each language and then identify the language of each segment. The problem was motivated by an attempt to collect a large amount of linguistic data for non-major languages from the web. The problem is formulated in terms of obtaining the minimum description length of a text, and the proposed solution finds the segments and their languages through dynamic programming. Empirical results demonstrating the potential of this approach are presented for experiments using texts taken from the Universal Declaration of Human Rights and Wikipedia, covering more than 200 languages.

1 Introduction

For the purposes of this paper, a multilingual text means one containing text segments, limited to those longer than a clause, written in different languages. We can often find such texts in linguistic resources collected from the World Wide Web for many non-major languages, which tend to also contain portions of text in a major language. In automatic processing of such multilingual texts, they must first be segmented by language, and the language of each segment must be identified, since many state-of-the-art NLP applications are built by learning a gold standard for one specific language.
Moreover, segmentation is useful for other objectives, such as collecting linguistic resources for non-major languages and automatically removing portions written in major languages, as noted above. The study reported here was motivated by this objective. The problem addressed in this article is thus to segment a multilingual text by language and identify the language of each segment.
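To make the problem statement concrete, the following is a minimal sketch of segmentation by description length with dynamic programming. It is not the model proposed in this paper: the character unigram language models with add-one smoothing, the fixed per-segment overhead `boundary_bits`, and the function names `train_unigram` and `segment_by_language` are all assumptions chosen for illustration. The sketch looks for the split and language labels that minimize the total number of bits needed to encode the text.

```python
import math
from collections import Counter


def train_unigram(sample, alphabet):
    """Per-character code lengths (bits) under an add-one-smoothed unigram model.

    Illustrative stand-in for a real language model (an assumption of this sketch).
    """
    counts = Counter(sample)
    total = len(sample) + len(alphabet)
    return {c: -math.log2((counts[c] + 1) / total) for c in alphabet}


def segment_by_language(text, samples, boundary_bits=8.0):
    """Split `text` into segments and label each with a language.

    `samples` maps a language name to a training string. `boundary_bits` is a
    hypothetical fixed cost charged per segment (for its boundary and label).
    Returns (list of (segment, language) pairs, total description length in bits).
    """
    # Shared alphabet so that every character of `text` has a code length.
    alphabet = set(text)
    for sample in samples.values():
        alphabet |= set(sample)

    langs = sorted(samples)
    costs = {lang: train_unigram(samples[lang], alphabet) for lang in langs}

    n = len(text)
    # prefix[lang][k] = bits needed to encode text[:k] with that language's model
    prefix = {}
    for lang in langs:
        acc = [0.0]
        for ch in text:
            acc.append(acc[-1] + costs[lang][ch])
        prefix[lang] = acc

    # best[j] = minimum description length of text[:j]
    best = [0.0] + [math.inf] * n
    back = [None] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            for lang in langs:
                cand = best[i] + (prefix[lang][j] - prefix[lang][i]) + boundary_bits
                if cand < best[j]:
                    best[j] = cand
                    back[j] = (i, lang)

    # Recover the segments by following the back-pointers from the end.
    segments = []
    j = n
    while j > 0:
        i, lang = back[j]
        segments.append((text[i:j], lang))
        j = i
    segments.reverse()
    return segments, best[n]


samples = {
    "en": "the quick brown fox jumps over the lazy dog and all human beings are born free",
    "ru": "все люди рождаются свободными и равными в своем достоинстве и правах",
}
text = "human beings are born free все люди рождаются свободными"
segments, total_bits = segment_by_language(text, samples)
for segment, lang in segments:
    print(lang, repr(segment))
```

The double loop over start and end positions costs time quadratic in the text length, which is acceptable for a short illustration. The per-segment overhead is what discourages the search from switching language on every character.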