Scientific report: "Minimized models and grammar-informed initialization for supertagging with highly ambiguous lexicons"

Sujith Ravi (1), Jason Baldridge (2), Kevin Knight (1)
(1) University of Southern California, Information Sciences Institute, Marina del Rey, California 90292, sravi knight @
(2) Department of Linguistics, The University of Texas at Austin, Austin, Texas 78712, jbaldrid@

Abstract

We combine two complementary ideas for learning supertaggers from highly ambiguous lexicons: grammar-informed tag transitions and models minimized via integer programming. Each strategy on its own greatly improves performance over basic expectation-maximization training with a bitag Hidden Markov Model, which we show on the CCGbank and CCG-TUT corpora. The strategies provide further error reductions when combined. We describe a new two-stage integer programming strategy that efficiently deals with the high degree of ambiguity on these datasets while obtaining the full effect of model minimization.

1 Introduction

Creating accurate part-of-speech (POS) taggers using a tag dictionary and unlabeled data is an interesting task with practical applications. It has been explored at length in the literature since Merialdo (1994), though the task setting as usually defined in such experiments is somewhat artificial, since the tag dictionaries are derived from tagged corpora. Nonetheless, the methods proposed apply to realistic scenarios in which one has an electronic part-of-speech tag dictionary or a hand-crafted grammar with limited coverage.

Most work has focused on POS-tagging for English using the Penn Treebank (Marcus et al., 1993), such as Banko and Moore (2004), Goldwater and Griffiths (2007), Toutanova and Johnson (2008), Goldberg et al. (2008), and Ravi and Knight (2009). This generally involves working with the standard set of 45 POS-tags employed in the Penn Treebank. The most ambiguous word has 7 different POS tags associated with it. Most methods have employed some variant of Expectation Maximization (EM) to learn parameters for a bigram or trigram Hidden Markov Model (HMM). Ravi and Knight (2009) achieved the best results thus far in word token accuracy via a Minimum Description Length approach using an integer program (IP) that finds a minimal bigram grammar that obeys the tag dictionary constraints and covers the observed data.
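The notion of a minimal bigram grammar that obeys the tag dictionary and covers the observed data can be illustrated on a toy example. The sketch below is not the integer program of Ravi and Knight (2009): it swaps in an exhaustive search, which is only feasible for a handful of short sentences, and the function name `minimal_bigram_grammar`, the toy sentences, and the toy tag dictionary are all assumptions made for illustration.

```python
from itertools import product


def minimal_bigram_grammar(sentences, tag_dict):
    """Brute-force the smallest set of tag bigrams that can tag every sentence.

    sentences: list of word lists.
    tag_dict:  maps each word to the set of tags the dictionary allows.
    Returns (set of tag bigrams, tuple with one tag sequence per sentence).
    """
    # Every tagging of each sentence that the tag dictionary permits.
    options = [
        list(product(*(sorted(tag_dict[word]) for word in sent)))
        for sent in sentences
    ]
    best_bigrams, best_tagging = None, None
    # Try every combination of per-sentence taggings (exponential cost,
    # so this stands in for the IP only on toy inputs).
    for tagging in product(*options):
        bigrams = set()
        for tags in tagging:
            bigrams.update(zip(tags, tags[1:]))
        if best_bigrams is None or len(bigrams) < len(best_bigrams):
            best_bigrams, best_tagging = bigrams, tagging
    return best_bigrams, best_tagging


# Hypothetical toy data: "can" and "rusts" are ambiguous, the rest are not.
toy_tag_dict = {
    "the": {"DT"},
    "can": {"MD", "NN", "VB"},
    "rusts": {"NNS", "VBZ"},
    "dog": {"NN"},
    "barks": {"VBZ"},
}
toy_sentences = [["the", "can", "rusts"], ["the", "dog", "barks"]]

grammar, tagging = minimal_bigram_grammar(toy_sentences, toy_tag_dict)
print(sorted(grammar))  # [('DT', 'NN'), ('NN', 'VBZ')]
print(tagging[0])       # ('DT', 'NN', 'VBZ')
```

In this toy case the unambiguous second sentence forces the bigrams DT NN and NN VBZ into the grammar, so the smallest grammar that also covers the first sentence reuses them, which resolves "can" to NN and "rusts" to VBZ. An integer program optimizes the same kind of objective without enumerating every tagging.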
