Effective Measures of Domain Similarity for Parsing

Barbara Plank, University of Groningen, The Netherlands, b.plank@rug.nl
Gertjan van Noord, University of Groningen, The Netherlands, G.J.M.van.Noord@rug.nl

Abstract

It is well known that parsing accuracy suffers when a model is applied to out-of-domain data. It is also known that the most beneficial data to parse a given domain is data that matches the domain (Sekine, 1997; Gildea, 2001). Hence, an important task is to select appropriate domains. However, most previous work on domain adaptation relied on the implicit assumption that domains are somehow given. As more and more data becomes available, automatic ways to select data that is beneficial for a new (unknown) target domain are becoming attractive. This paper evaluates various ways to automatically acquire related training data for a given test set. The results show that an unsupervised technique based on topic models is effective: it outperforms random data selection on both languages examined, English and Dutch. Moreover, the technique works better than manually assigned labels gathered from meta-data that is available for English.

1 Introduction and Motivation

Previous research on domain adaptation has focused on the task of adapting a system trained on one domain, say newspaper text, to a particular new domain, say biomedical data. Usually, some amount of (labeled or unlabeled) data from the new domain was given, which has been determined by a human.
However, with the growth of the web, more and more data is becoming available, where each document is potentially its own domain (McClosky et al., 2010). It is not straightforward to determine which data or model (in case we have several source domain models) will perform best on a new (unknown) target domain. Therefore, an important issue that arises is how to measure domain similarity, i.e., whether we can find a simple yet effective method to determine which model or data is most beneficial for an arbitrary piece of new text.