TAILIEUCHUNG - Scientific report: "Lexicalized phonotactic word segmentation"

Lexicalized phonotactic word segmentation

Margaret M. Fleck
Department of Computer Science
University of Illinois
Urbana, IL 61801, USA
mfleck@

Abstract

This paper presents a new unsupervised algorithm (WordEnds) for inferring word boundaries from transcribed adult conversations. Phone ngrams before and after observed pauses are used to bootstrap a simple discriminative model of boundary marking. This fast algorithm delivers high performance even on morphologically complex words in English and Arabic, and promising results on accurate phonetic transcriptions with extensive pronunciation variation. Expanding training data beyond the traditional miniature datasets pushes performance numbers well above those previously reported. This suggests that WordEnds is a viable model of child language acquisition and might be useful in speech understanding.

1 Introduction

Words are essential to most models of language and speech understanding. Word boundaries define the places at which speakers can fluently pause and limit the application of most phonological rules. Words are a key constituent in structural analyses: the output of morphological rules and the constituents in syntactic parsing. Most speech recognizers are word-based. And words are entrenched in the writing systems of many languages.

Therefore, it is generally accepted that children learning their first language must learn how to segment speech into a sequence of words. Similar, but more limited, learning occurs when adults hear speech containing unfamiliar words. These words must be accurately delimited so that they can be added to the lexicon and nearby familiar words recognized correctly. Current speech recognizers typically misinterpret such speech.

This paper will consider algorithms which segment phonetically transcribed speech into words. For example, Figure 1 shows a transcribed phrase from the Buckeye corpus (Pitt et al., 2005; Pitt et al., 2007) and the automatically segmented output. Like almost all .
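
The abstract's idea of bootstrapping from pauses can be illustrated with a small sketch. This is not the WordEnds algorithm itself, whose details are not given in this excerpt. It is a simplified stand-in that assumes each utterance is a list of phone symbols bounded by pauses, counts the n-grams seen next to those pauses, and marks a boundary wherever the surrounding n-grams frequently occur at utterance edges. The function names (`edge_counts`, `boundary_score`, `segment`), the averaging rule, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from collections import Counter


def edge_counts(utterances, n=2):
    """Count n-grams before a pause, after a pause, and anywhere in the data.

    Assumption: every utterance starts and ends at an observed pause.
    """
    before_pause, after_pause, anywhere = Counter(), Counter(), Counter()
    for phones in utterances:
        if len(phones) < n:
            continue
        after_pause[tuple(phones[:n])] += 1    # n-gram right after a pause
        before_pause[tuple(phones[-n:])] += 1  # n-gram right before a pause
        for i in range(len(phones) - n + 1):
            anywhere[tuple(phones[i:i + n])] += 1
    return before_pause, after_pause, anywhere


def boundary_score(left, right, counts):
    """Average how often `left` ends an utterance and `right` begins one."""
    before_pause, after_pause, anywhere = counts
    p_left = before_pause[left] / anywhere[left] if anywhere[left] else 0.0
    p_right = after_pause[right] / anywhere[right] if anywhere[right] else 0.0
    return (p_left + p_right) / 2


def segment(phones, counts, n=2, threshold=0.5):
    """Split `phones` wherever the boundary score reaches `threshold`."""
    words, start = [], 0
    for i in range(n, len(phones) - n + 1):
        left = tuple(phones[i - n:i])
        right = tuple(phones[i:i + n])
        if boundary_score(left, right, counts) >= threshold:
            words.append(phones[start:i])
            start = i
    words.append(phones[start:])
    return words


# Toy data: letters stand in for phones (an assumption for readability).
training = [list("dog"), list("cat"), list("catdog"), list("dogcat")]
counts = edge_counts(training)
result = segment(list("dogdog"), counts)
print(["".join(word) for word in result])
```

On this toy data the only position in "dogdog" where both neighbouring bigrams are common at utterance edges is between the two syllables, so the sketch splits it there. A real system would need far more data and the discriminative model the paper describes, which this sketch does not attempt to reproduce.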
