Corpus representativeness for syntactic information acquisition

The question we address here is how to define the size and composition of the corpus we would need in order to get necessary and sufficient information for Machine Learning techniques to induce that type of information. The representativeness of a corpus is a topic that has been dealt with at length, especially in corpus linguistics. One of the standard references is Biber (1993), where the author offers guidelines for corpus design to characterize a language.

Núria Bel
IULA, Universitat Pompeu Fabra
La Rambla 30-32, 08002 Barcelona, Spain

Abstract

This paper refers to part of our research in the area of automatic acquisition of computational lexicon information from corpora. It reports on ongoing research on corpus representativeness: for the task of inducing information from text, we wanted to fix a certain degree of confidence in the size and composition of the collection of documents to be observed. The results show that it is possible to work with a relatively small corpus of texts if it is tuned to a particular domain. Moreover, a small tuned corpus appears to be more informative for real parsing than a general corpus.

1 Introduction

The coverage of the computational lexicon used in deep Natural Language Processing (NLP) is crucial for parsing success. But rather frequently, the absence of particular entries, or the fact that the information encoded for them does not cover very specific syntactic contexts (such as those found in technical texts), makes highly informative grammars unsuitable for real applications. Moreover, this poses a real problem when porting an application from domain to domain, as the lexicon has to be re-encoded in the light of the new domain. In fact, in order to minimize ambiguities and possible over-generation, application-based lexicons tend to be tuned for every specific domain addressed by a particular application.

Tuning lexicons to different domains is a real delaying factor in the deployment of NLP applications, as it raises their cost, not only in terms of money but also, and crucially, in terms of time. A desirable solution would be a plug-and-play system that, given a collection of documents supplied by the customer, could induce a tuned lexicon. By tuned we mean full coverage both in terms of (1) entries, detecting new items and assigning them a syntactic behavior pattern, and (2) syntactic …
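Since the central question is how much domain text counts as necessary and sufficient, a minimal sketch may help fix the intuition. The Python fragment below is our illustration, not the method used in the paper: given a hypothetical reference lexicon, it counts how many lexicon-unknown word types each added document contributes, and a heuristic stopping rule declares the collection large enough once that curve flattens. The function names, the toy lexicon, and the thresholds are all assumptions made for this example.

    # Illustrative sketch only -- not the procedure used in the paper.
    # Given an existing lexicon and a stream of tokenized domain documents,
    # track how many lexicon-unknown word types each new document adds.
    from typing import Iterable, List, Set

    def unknown_type_curve(documents: Iterable[List[str]],
                           lexicon: Set[str]) -> List[int]:
        """Per document, count word types absent from the lexicon and not
        yet seen in earlier documents (candidate new entries)."""
        seen: Set[str] = set(lexicon)
        curve: List[int] = []
        for doc in documents:
            new = {tok.lower() for tok in doc} - seen
            curve.append(len(new))
            seen |= new
        return curve

    def saturated(curve: List[int], window: int = 10,
                  avg_threshold: float = 1.0) -> bool:
        """Heuristic stopping rule: the corpus is 'large enough' once the
        last `window` documents contribute fewer than `avg_threshold`
        new candidate entries on average."""
        return (len(curve) >= window
                and sum(curve[-window:]) / window < avg_threshold)

    # Toy usage: a three-document "domain corpus" against a tiny lexicon.
    lexicon = {"the", "controls", "and"}
    docs = [["the", "valve", "controls", "flow"],
            ["the", "pump", "controls", "pressure"],
            ["flow", "and", "pressure", "interact"]]
    print(unknown_type_curve(docs, lexicon))  # -> [2, 2, 1]

Under this view, a corpus tuned to a single domain can saturate quickly, which is consistent with the paper's observation that a relatively small domain-tuned corpus may suffice where a much larger general corpus would not.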
