Scientific report: "Generating Usable Formats for Metadata and Annotations in a Large Meeting Corpus"

Andrei Popescu-Belis and Paula Estrella
ISSCO/TIM/ETI, University of Geneva
40, bd. du Pont-d'Arve, 1211 Geneva 4, Switzerland

Abstract

The AMI Meeting Corpus is now publicly available, including manual annotation files generated in the NXT XML format, but lacking explicit metadata for the 171 meetings of the corpus. To increase the usability of this important resource, a representation format based on relational databases is proposed, which maximizes informativeness, simplicity and reusability of the metadata and annotations. The annotation files are converted to a tabular format using an easily adaptable XSLT-based mechanism, and their consistency is verified in the process. Metadata files are generated directly in the IMDI XML format from implicit information, and converted to tabular format using a similar procedure. The results and tools will be freely available with the AMI Corpus. Sharing the metadata using the Open Archives network will contribute to increasing the visibility of the AMI Corpus.

1 Introduction

The AMI Meeting Corpus (Carletta et al., 2006) is one of the largest and most extensively annotated data sets of multimodal recordings of human interaction. The corpus contains 171 meetings, in English, for a total duration of ca. 100 hours. The meetings either follow the remote control design scenario or are naturally occurring meetings. In both cases, they have between 3 and 5 participants.

Perhaps the most valuable resources in this corpus are the high-quality annotations, which can be used to train and test NLP tools. The existing annotation dimensions include, besides transcripts, forced temporal alignment, named entities, topic segmentation, dialogue acts, abstractive and extractive summaries, as well as hand and head movement and posture. However, these dimensions, as well as the implicit metadata for the corpus, are difficult to exploit.
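
To make the idea of converting XML annotations to a tabular format more concrete, the following is a minimal sketch, not the authors' tool. The paper describes an XSLT-based mechanism; this sketch swaps in Python's standard `xml.etree.ElementTree` and `csv` modules instead. The XML layout (a `words` root with `w` children carrying `id`, `starttime` and `endtime` attributes), the meeting identifier `M001`, the column names, and the functions `xml_to_rows` and `rows_to_tsv` are all assumptions made for illustration and do not reproduce the actual NXT schema. The consistency check shown (both times present, start not after end) is likewise only one plausible example of what verification during conversion might look like.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, simplified annotation file: element and attribute names
# are assumptions for illustration, not the actual NXT schema.
SAMPLE_XML = """\
<words meeting="M001" speaker="A">
  <w id="w1" starttime="0.50" endtime="0.82">okay</w>
  <w id="w2" starttime="0.82" endtime="1.10">right</w>
  <w id="w3" starttime="1.40">um</w>
</words>
"""

# Hypothetical column layout of the target table.
COLUMNS = ["meeting", "speaker", "id", "starttime", "endtime", "text"]


def xml_to_rows(xml_text):
    """Flatten each <w> element into one row; collect ids of inconsistent ones."""
    root = ET.fromstring(xml_text)
    rows, problems = [], []
    for w in root.iter("w"):
        start, end = w.get("starttime"), w.get("endtime")
        # Basic consistency check: both times present and start <= end.
        if start is None or end is None or float(start) > float(end):
            problems.append(w.get("id"))
            continue
        rows.append({
            "meeting": root.get("meeting"),
            "speaker": root.get("speaker"),
            "id": w.get("id"),
            "starttime": start,
            "endtime": end,
            "text": (w.text or "").strip(),
        })
    return rows, problems


def rows_to_tsv(rows):
    """Serialize rows as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows, problems = xml_to_rows(SAMPLE_XML)
print(rows_to_tsv(rows), end="")
print("skipped:", problems)
```

Tab-separated output of this kind can typically be bulk-loaded into a relational database table, which is the direction the abstract points to; the exact table design used for the AMI Corpus is not specified in this excerpt.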
