An Unsupervised Model for Statistically Determining Coordinate Phrase Attachment

Miriam Goldberg
Central High School
Dept. of Computer and Information Science
200 South 33rd Street
Philadelphia, PA 19104-6389
University of Pennsylvania
miriamg@unagi.cis.upenn.edu

Abstract

This paper examines the use of an unsupervised statistical model for determining the attachment of ambiguous coordinate phrases (CP) of the form n1 p n2 cc n3. The model presented here is based on [AR98], an unsupervised model for determining prepositional phrase attachment. After training on unannotated 1988 Wall Street Journal text, the model performs at 72% accuracy on a development set from sections 14 through 19 of the WSJ TreeBank [MSM93].

1 Introduction

The coordinate phrase (CP) is a source of structural ambiguity in natural language. For example, take the phrase:

box of chocolates and roses

"Roses" attaches either high to "box" or low to "chocolates". In this case, attachment is high, yielding:

H-attach: ((box (of chocolates)) (and roses))

Consider, then, the phrase:

salad of lettuce and tomatoes

Here "tomatoes" attaches low to "lettuce", giving:

L-attach: (salad (of ((lettuce) and (tomatoes))))

Previous work has used corpus-based approaches to solve the similar problem of prepositional phrase attachment. These have included backed-off [CB95], maximum entropy [RRR94], rule-based [HR94], and unsupervised [AR98] models. In addition, a corpus-based model for PP-attachment that uses information from a semantic dictionary has been reported [SN97].
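The two readings of a phrase of the form n1 p n2 cc n3 can be made concrete with a small sketch. The Python below is only an illustration of the ambiguity, not the model described in this paper: the class name, the helper `parse_cp`, and the parenthesized output format are all assumptions introduced here, and the code does not decide which attachment is correct.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoordinatePhrase:
    """Hypothetical container for a phrase of the form n1 p n2 cc n3."""
    n1: str
    p: str
    n2: str
    cc: str
    n3: str

    def high_attach(self) -> str:
        # n3 is coordinated with the whole "n1 p n2" phrase (attaches to n1).
        return f"(({self.n1} ({self.p} {self.n2})) ({self.cc} {self.n3}))"

    def low_attach(self) -> str:
        # n3 is coordinated with n2 inside the prepositional phrase.
        return f"({self.n1} ({self.p} (({self.n2}) {self.cc} ({self.n3}))))"


def parse_cp(phrase: str) -> CoordinatePhrase:
    """Split a five-token phrase into its slots (illustrative helper only)."""
    tokens = phrase.split()
    if len(tokens) != 5:
        raise ValueError("expected exactly five tokens: n1 p n2 cc n3")
    return CoordinatePhrase(*tokens)


if __name__ == "__main__":
    for text in ("box of chocolates and roses", "salad of lettuce and tomatoes"):
        cp = parse_cp(text)
        print("H-attach:", cp.high_attach())
        print("L-attach:", cp.low_attach())
```

Both bracketings are always available for any such five-token phrase; the disambiguation task is to choose between them, which is what a statistical model would be trained to do.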
Sparse data can be a major concern in corpus-based disambiguation. Supervised models are limited by the amount of annotated data available for training. Such a model is useful only for languages in which annotated corpora are available. Because an unsupervised model does not rely on such corpora, it may be modified for use in multiple languages, as in [AR98].

The unsupervised model presented here trains from an unannotated version of the 1988 Wall Street Journal. After tagging and chunking the text, a rough heuristic