Splitting Long or Ill-formed Input for Robust Spoken-language Translation

Osamu FURUSE*, Setsuo YAMADA, Kazuhide YAMAMOTO
ATR Interpreting Telecommunications Research Laboratories
2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0288, Japan
furuse@cslab.kecl.ntt.co.jp, {syamada, yamamoto}@itl.atr.co.jp

Abstract

This paper proposes an input-splitting method for translating spoken-language which includes many long or ill-formed expressions. The proposed method splits input into well-balanced translation units based on a semantic distance calculation. The splitting is performed during left-to-right parsing, and does not degrade translation efficiency. The complete translation result is formed by concatenating the partial translation results of each split unit. The proposed method can be incorporated into frameworks like TDMT, which utilize left-to-right parsing and a score for a substructure. Experimental results show that the proposed method gives TDMT the following advantages: (1) elimination of null outputs, (2) splitting of utterances into sentences, and (3) robust translation of erroneous speech recognition results.

1 Introduction

A spoken-language translation system requires the ability to treat long or ill-formed input. An utterance as input of a spoken-language translation system is not always one well-formed sentence.
Also, when treating an utterance in speech translation, the speech recognition result, which is the input of the translation component, might be corrupted even though the input utterance is well-formed. Such a misrecognized result can cause a parsing failure, and consequently no translation output would be produced. Furthermore, we cannot expect that a speech recognition result includes punctuation marks, such as a comma or a period between words, which are useful information for parsing.[1]

As a solution for treating long input, long-sentence splitting techniques such as that of

* Current affiliation is NTT Communication Science Laboratories.
[1] Punctuation marks are not used