Scientific report: "Grounded Language Modeling for Automatic Speech Recognition of Sports Video"

Michael Fleischman, Massachusetts Institute of Technology, Media Laboratory, mbf@
Deb Roy, Massachusetts Institute of Technology, Media Laboratory, dkroy@

Abstract

Grounded language models represent the relationship between words and the non-linguistic context in which they are said. This paper describes how they are learned from large corpora of unlabeled video and are applied to the task of automatic speech recognition of sports video. Results show that grounded language models improve perplexity and word error rate over text-based language models and, further, support video information retrieval better than human-generated speech transcriptions.

1 Introduction

Recognizing speech in broadcast video is a necessary precursor to many multimodal applications such as video search and summarization (Snoek and Worring, 2005). Although performance is often reasonable in controlled environments (such as studio news rooms), automatic speech recognition (ASR) systems have significant difficulty in noisier settings (such as those found in live sports broadcasts) (Wactlar et al., 1996). While many researchers have examined how to compensate for such noise using acoustic techniques, few have attempted to leverage information in the visual stream to improve speech recognition performance (for an exception see Murkherjee and Roy, 2003). In many types of video, however, visual context can provide valuable clues as to what has been said.

For example, in video of Major League Baseball games, the likelihood of the phrase "home run" increases dramatically when a home run has actually been hit. This paper describes a method for incorporating such visual information in an ASR system for sports video. The method is based on the use of grounded language models to represent the relationship between words and the non-linguistic context to which they refer (Fleischman and Roy, 2007). Grounded language models are based on .
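The "home run" example can be made concrete with a toy sketch. This is not the authors' model, whose description is cut off in this excerpt. It assumes a made-up four-sentence corpus, hypothetical event labels (`home_run`, `strikeout`) standing in for the visual context, add-one smoothed unigram models, and a simple linear interpolation with an arbitrary weight `lam`. It only illustrates how conditioning on context can raise the probability of matching words and lower perplexity.

```python
import math
from collections import Counter


def unigram(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


# Toy, invented data: commentary words paired with a hypothetical event label
# that stands in for the non-linguistic (visual) context.
corpus = [
    ("home_run", "that ball is gone home run".split()),
    ("home_run", "home run to deep left field".split()),
    ("strikeout", "swing and a miss strike three".split()),
    ("strikeout", "strike three he is out".split()),
]

vocab = sorted({w for _, words in corpus for w in words})

# Text-only model: ignores the event labels entirely.
text_lm = unigram([w for _, words in corpus for w in words], vocab)

# One context-conditioned model per event label.
event_lm = {
    event: unigram([w for e, ws in corpus if e == event for w in ws], vocab)
    for event in {e for e, _ in corpus}
}


def grounded_prob(word, event, lam=0.5):
    """Interpolate the event-conditioned and text-only probabilities."""
    return lam * event_lm[event][word] + (1 - lam) * text_lm[word]


def perplexity(words, prob):
    """Per-word perplexity of `words` under the probability function `prob`."""
    log_sum = sum(math.log2(prob(w)) for w in words)
    return 2 ** (-log_sum / len(words))


test_words = "home run".split()
ppl_text = perplexity(test_words, lambda w: text_lm[w])
ppl_grounded = perplexity(test_words, lambda w: grounded_prob(w, "home_run"))

print(f"P(home) text-only:        {text_lm['home']:.4f}")
print(f"P(home | home_run event): {grounded_prob('home', 'home_run'):.4f}")
print(f"perplexity text-only: {ppl_text:.2f}, grounded: {ppl_grounded:.2f}")
```

In this toy setup the word "home" becomes more probable when the context is a home run event and less probable when it is a strikeout, which mirrors the intuition in the paragraph above. A real system would need context features extracted from video and a far larger corpus.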
