Scientific report: "Joint Hebrew Segmentation and Parsing using a PCFG-LA Lattice Parser"

Yoav Goldberg and Michael Elhadad
Ben Gurion University of the Negev, Department of Computer Science
POB 653 Be'er Sheva, 84105, Israel
yoavg elhadad @

Abstract

We experiment with extending a lattice parsing methodology for parsing Hebrew (Goldberg and Tsarfaty, 2008; Goldberg et al., 2009) to make use of a stronger syntactic model: the PCFG-LA Berkeley Parser. We show that the methodology is very effective: using a small training set of about 5500 trees, we construct a parser which parses and segments unsegmented Hebrew text with an F-score of almost 80%, an error reduction of over 20% over the best previous result for this task. This result indicates that lattice parsing with the Berkeley parser is an effective methodology for parsing over uncertain inputs.

1 Introduction

Most work on parsing assumes that the lexical items in the yield of a parse tree are fully observed and correspond to space-delimited tokens, perhaps after a deterministic preprocessing step of tokenization. While this is mostly the case for English, the situation is different in languages such as Chinese, in which word boundaries are not marked, and in the Semitic languages of Hebrew and Arabic, in which various particles corresponding to function words are agglutinated as affixes to content-bearing words, sharing the same space-delimited token.

For example, the Hebrew token bcl [1] can be interpreted as the single noun meaning "onion", or as a sequence of a preposition and a noun, b-cl, meaning "in (the) shadow". In such languages, the sequence of lexical items corresponding to an input string is ambiguous and cannot be determined using a deterministic procedure. In this work, we focus on constituency parsing of Modern Hebrew (henceforth Hebrew) from raw, unsegmented text. A common method of approaching the discrepancy between input strings and space-delimited tokens is using a ...

[1] We adopt here the transliteration scheme of Sima'an et al. (2001).
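
The ambiguity of the token bcl can be pictured as a small lattice in which each path from the start state to the end state is one candidate segmentation. The sketch below is an illustration only and is not taken from the paper: the arc representation, the function names build_lattice and enumerate_paths, and the tag labels NN and IN are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# An arc is (start_state, end_state, lexical_item, tag).
# The tag labels are illustrative placeholders, not the treebank's tag set.
Arc = Tuple[int, int, str, str]


def build_lattice() -> List[Arc]:
    """Hypothetical lattice for the single token 'bcl'."""
    return [
        (0, 2, "bcl", "NN"),  # one noun: "onion"
        (0, 1, "b", "IN"),    # preposition: "in"
        (1, 2, "cl", "NN"),   # noun: "shadow"
    ]


def enumerate_paths(arcs: List[Arc], start: int, end: int) -> List[List[Tuple[str, str]]]:
    """Return every sequence of (item, tag) pairs leading from start to end."""
    outgoing: Dict[int, List[Arc]] = {}
    for arc in arcs:
        outgoing.setdefault(arc[0], []).append(arc)

    paths: List[List[Tuple[str, str]]] = []

    def walk(state: int, prefix: List[Tuple[str, str]]) -> None:
        if state == end:
            paths.append(prefix)
            return
        for _, nxt, form, tag in outgoing.get(state, []):
            walk(nxt, prefix + [(form, tag)])

    walk(start, [])
    return paths


if __name__ == "__main__":
    for path in enumerate_paths(build_lattice(), 0, 2):
        print(" ".join(f"{form}/{tag}" for form, tag in path))
```

Enumerating paths is only practical for a toy example like this one. A lattice parser would instead take the whole lattice as input and choose the segmentation together with the tree, which is what "joint" refers to in the title.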
