Learning to Identify Fragmented Words in Spoken Discourse

Piroska Lendvai
ILK Research Group, Tilburg University, The Netherlands
p.lendvai@uvt.nl

Abstract

Disfluent speech adds to the difficulty of processing spoken language utterances. In this paper we concentrate on identifying one disfluency phenomenon: fragmented words. Our data, from the Spoken Dutch Corpus, samples nearly 45,000 sentences of human discourse, ranging from spontaneous chat to media broadcasts. We classify each lexical item in a sentence either as a completely or an incompletely uttered, i.e. fragmented, word. The task is carried out both by the IB1 and RIPPER machine learning algorithms, trained on a variety of features with an extensive optimization strategy. Our best classifier has a 74.9 F-score, which is a significant improvement over the baseline. We discuss why memory-based learning has more success than rule induction in correctly classifying fragmented words.

1 Introduction

Although human listeners are good at handling disfluent items (self-corrections, repetitions, hesitations, incompletely uttered words and the like, cf. Shriberg (1994)) in spoken language utterances, these are likely to cause confusion when used as input to automatic natural language processing (NLP) systems, resulting in poor human-computer interaction (Nakatani and Hirschberg, 1994; Eklund and Shriberg, 1998). Detecting disfluent passages can help clean the spoken input and improve further processing such as parsing.

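To make the evaluation measure quoted in the abstract concrete, the following minimal sketch computes precision, recall and F-score for the fragmented class over per-word labels. It assumes the balanced F-score (beta = 1), which the text above does not specify; the label names `FRAG` and `OK` and the toy label sequences are invented for illustration and are not taken from the Spoken Dutch Corpus.

```python
def f_score(gold, predicted, positive="FRAG", beta=1.0):
    """Return (precision, recall, F-beta) for one class over parallel label lists.

    Assumption: beta = 1 (balanced F-score); the label name "FRAG" is hypothetical.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0

    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Toy example: one label per word, invented for illustration only.
gold      = ["OK", "FRAG", "OK", "OK",   "FRAG", "OK", "FRAG", "OK"]
predicted = ["OK", "FRAG", "OK", "FRAG", "OK",   "OK", "FRAG", "OK"]

precision, recall, f1 = f_score(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} F={f1:.3f}")
```

Scoring only the fragmented class matters here because complete words vastly outnumber fragments, so plain accuracy would look high even for a classifier that never predicts a fragment.
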
By treating fragments, we cover a considerable portion of the occurring disfluencies, as incompletely uttered words often occur as part of a speaker's self-repair (Bear et al., 1992; Nakatani and Hirschberg, 1994). Moreover, if an incompletely pronounced item is identified, we thereby determine the interruption point, a central phenomenon in disfluencies (Bear et al., 1992; Heeman, 1999; Shriberg et al., 2001). The surroundings of this disfluency element are to be treated with greater care, as before an interruption point there might be word