The utility of parse-derived features for automatic discourse segmentation

Seeger Fisher and Brian Roark
Center for Spoken Language Understanding, OGI School of Science & Engineering
Oregon Health & Science University, Beaverton, Oregon, 97006 USA
{fishers,roark}@cslu.ogi.edu

Abstract

We investigate different feature sets for performing automatic sentence-level discourse segmentation within a general machine learning approach, including features derived from either finite-state or context-free annotations. We achieve the best reported performance on this task, and demonstrate that our SPADE-inspired context-free features are critical to achieving this level of accuracy. This counters recent results suggesting that purely finite-state approaches can perform competitively.

1 Introduction

Discourse structure annotations have been demonstrated to be of high utility for a number of NLP applications, including automatic text summarization (Marcu, 1998; Marcu, 1999; Cristea et al., 2005), sentence compression (Sporleder and Lapata, 2005), natural language generation (Prasad et al., 2005), and question answering (Verberne et al., 2006). These annotations include sentence segmentation into discourse units, along with the linking of discourse units, both within and across sentence boundaries, into a labeled hierarchical structure.
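To make the two parts of such an annotation concrete (segmentation into discourse units, and linking those units into a labeled hierarchy), the following is a minimal Python sketch. It is an illustration only, not the representation used by the authors or by any of the annotated corpora. The class `DiscourseNode`, the helper `segment_boundaries`, the segment boundaries, the bracketing, and the Nucleus/Satellite assignments are all assumptions chosen for the example; only the sentence itself comes from this introduction.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DiscourseNode:
    """Hypothetical node type for a sentence-level discourse tree."""
    label: str                      # "Root", "Nucleus", or "Satellite"
    text: Optional[str] = None      # set only on leaves (discourse segments)
    children: List["DiscourseNode"] = field(default_factory=list)

    def segments(self) -> List[str]:
        """Return the leaf segments in left-to-right order."""
        if not self.children:
            return [self.text] if self.text is not None else []
        result: List[str] = []
        for child in self.children:
            result.extend(child.segments())
        return result


def segment_boundaries(segments: List[str]) -> List[int]:
    """Whitespace-token offsets at which each non-initial segment begins."""
    offsets: List[int] = []
    position = 0
    for segment in segments[:-1]:
        position += len(segment.split())
        offsets.append(position)
    return offsets


# Assumed bracketing and labels, for illustration only.
tree = DiscourseNode("Root", children=[
    DiscourseNode("Nucleus", children=[
        DiscourseNode("Nucleus", text="Prices have dropped"),
        DiscourseNode("Satellite", text="but remain quite high"),
    ]),
    DiscourseNode("Satellite", text="according to CEO Smith"),
])

segments = tree.segments()
print(segments)                      # the three discourse segments
print(segment_boundaries(segments))  # token offsets of segment starts
```

Under this view, the segmentation task addressed in the paper amounts to predicting the boundary offsets alone; building the labeled tree over the resulting segments is a separate step.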
For example, the tree in Figure 1 shows a sentence-level discourse tree for the string "Prices have dropped but remain quite high according to CEO Smith", which has three discourse segments, each labeled with either Nucleus or Satellite, depending on how central the segment is to the coherence of the text.

[Figure 1: Example Nucleus/Satellite labeled discourse tree. Node labels: Root, Nucleus, Satellite.]

There are a number of corpora annotated with discourse structure, including the well-known RST Treebank (Carlson et al., 2002), the Discourse GraphBank (Wolf and Gibson, 2005), and the Penn Discourse Treebank (Miltsakaki et al., 2004). While the annotation approaches differ across these corpora, the requirement of sentence segmentation into