A formula to calculate pruning threshold for the part of speech tagging problem
Journal of Science and Technology 54 (3A) (2016) 64-73

A FORMULA TO CALCULATE PRUNING THRESHOLD FOR THE PART-OF-SPEECH TAGGING PROBLEM

Nguyen Chi Hieu

Industrial University of Ho Chi Minh City, 12 Nguyen Van Bao, Ward 4, Go Vap District, Ho Chi Minh City

Email: nchieu@

Received: 1 May 2016; Accepted for Publication: 15 July 2016

ABSTRACT

Accurate tagging of the words in a text is a very important task in natural language processing. It can support parsing, contribute to resolving polysemous words, and help to access semantic information. One of the crucial factors in statistical part-of-speech (POS) tagging approaches is processing time. In this paper, we propose an approach to calculate the pruning threshold, which can be applied to the Viterbi algorithm of the Hidden Markov model for tagging natural-language texts. Experiments on the tagged words of the Wall Street Journal corpus showed that our proposed solution is satisfactory.

Keywords: Hidden Markov model, Part-of-speech tagging, Viterbi algorithm, Beam search.

1. INTRODUCTION

Tagging is defined as the automatic assignment of descriptors (or tags) to input tokens. Part-of-speech (POS) tagging is the process of selecting the most likely sequence of syntactic categories for the words in a sentence. It is a very important problem in natural language processing.
Several approaches have been developed [1], including taggers based on handwritten rules, n-grams automatically derived from tagged text corpora, Hidden Markov models, symbolic language models, machine learning, and hybrid taggers [2]. Among these approaches, the one based on the Hidden Markov model (HMM) can offer prominent results [3]. In particular, when using the Viterbi algorithm, it can achieve an accuracy of over 95 percent [4]. However, its complexity is a challenge. For a problem involving T words and K lexical categories, the algorithm which is
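The excerpt ends before the paper's threshold formula is stated, so the following is only a minimal sketch of the general idea of pruning inside Viterbi decoding (beam search), not the author's method. It assumes a bigram HMM in log space and a fixed, hand-chosen beam: at each word, any tag whose best path score falls more than `beam_width` below the best score at that position is dropped. The function names, the `beam_width` parameter, and the toy probabilities are all hypothetical and chosen for illustration.

```python
import math

NEG_INF = float("-inf")


def prune(scores, beam_width):
    """Keep only tags whose log score is within beam_width of the best one."""
    best = max(scores.values())
    return {tag: s for tag, s in scores.items() if s >= best - beam_width}


def viterbi_beam(words, tags, log_init, log_trans, log_emit, beam_width):
    """Most likely tag sequence under a bigram HMM, with beam pruning.

    beam_width is a fixed log-probability margin (an assumption of this
    sketch; the paper derives its own threshold, not reproduced here).
    """
    if not words:
        return []
    # scores[tag] = best log-probability of any path ending in `tag`
    scores = {}
    for tag in tags:
        s = log_init.get(tag, NEG_INF) + log_emit.get((tag, words[0]), NEG_INF)
        if s > NEG_INF:
            scores[tag] = s
    if not scores:
        raise ValueError("no tag can emit the first word")
    scores = prune(scores, beam_width)

    backpointers = []
    for word in words[1:]:
        new_scores, pointers = {}, {}
        for tag in tags:
            emit = log_emit.get((tag, word), NEG_INF)
            if emit == NEG_INF:
                continue
            best_prev, best = None, NEG_INF
            # Only tags that survived pruning are considered as predecessors.
            for prev, prev_score in scores.items():
                s = prev_score + log_trans.get((prev, tag), NEG_INF)
                if s > best:
                    best_prev, best = prev, s
            if best_prev is not None:
                new_scores[tag] = best + emit
                pointers[tag] = best_prev
        if not new_scores:
            raise ValueError("no surviving path can emit: " + word)
        scores = prune(new_scores, beam_width)
        backpointers.append(pointers)

    # Follow the back-pointers from the best final tag.
    path = [max(scores, key=scores.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def logs(table):
    """Convert a table of probabilities to log-probabilities."""
    return {key: math.log(p) for key, p in table.items()}


# Toy model with made-up probabilities (illustration only).
TAGS = ["DT", "NN", "VB"]
LOG_INIT = logs({"DT": 0.8, "NN": 0.1, "VB": 0.1})
LOG_TRANS = logs({("DT", "NN"): 0.9, ("DT", "VB"): 0.1,
                  ("NN", "VB"): 0.7, ("NN", "NN"): 0.3,
                  ("VB", "NN"): 0.5, ("VB", "DT"): 0.5})
LOG_EMIT = logs({("DT", "the"): 1.0,
                 ("NN", "dog"): 0.8, ("NN", "barks"): 0.2,
                 ("VB", "barks"): 0.7, ("VB", "dog"): 0.3})

if __name__ == "__main__":
    sentence = ["the", "dog", "barks"]
    print(viterbi_beam(sentence, TAGS, LOG_INIT, LOG_TRANS, LOG_EMIT, 2.0))
```

With a narrow beam, fewer predecessor tags survive at each position, so the inner loop does less work than the full search over all K tags; the trade-off is that the true best path may be pruned if the beam is too tight, which is why the choice of threshold matters.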