Profiting from Mark-Up: Hyper-Text Annotations for Guided Parsing

Valentin I. Spitkovsky, Computer Science Department, Stanford University and Google Inc., valentin@google.com
Daniel Jurafsky, Departments of Linguistics and Computer Science, Stanford University, jurafsky@stanford.edu
Hiyan Alshawi, Google Inc., hiyan@google.com

Abstract

We show how web mark-up can be used to improve unsupervised dependency parsing. Starting from raw bracketings of four common HTML tags (anchors, bold, italics and underlines), we refine approximate partial phrase boundaries to yield accurate parsing constraints. Conversion procedures fall out of our linguistic analysis of a newly available million-word hyper-text corpus. We demonstrate that derived constraints aid grammar induction by training Klein and Manning's Dependency Model with Valence (DMV) on this data set: parsing accuracy on Section 23 (all sentences) of the Wall Street Journal corpus jumps to 50.4%, beating previous state-of-the-art by more than 5%. Web-scale experiments show that the DMV, perhaps because it is unlexicalized, does not benefit from orders of magnitude more annotated but noisier data. Our model, trained on a single blog, generalizes to 53.3% accuracy out-of-domain, against the Brown corpus, nearly 10% higher than the previous published best.
The fact that web mark-up strongly correlates with syntactic structure may have broad applicability in NLP.

1 Introduction

Unsupervised learning of hierarchical syntactic structure from free-form natural language text is a hard problem whose eventual solution promises to benefit applications ranging from question answering to speech recognition and machine translation. A restricted version of this problem that targets dependencies and assumes partial annotation, namely sentence boundaries and part-of-speech (POS) tagging, has received much attention. Klein and Manning (2004) were the first to beat a simple parsing heuristic, the right-branching baseline; today's state-of-the-art systems (Headden et al., 2009; Cohen