Automatic Construction of Polarity-tagged Corpus from HTML Documents

Nobuhiro Kaji and Masaru Kitsuregawa
Institute of Industrial Science, the University of Tokyo
4-6-1 Komaba, Meguro-ku, Tokyo 153-8505 Japan
{kaji,kitsure}@tkl.iis.u-tokyo.ac.jp

Abstract

This paper proposes a novel method of building a polarity-tagged corpus from HTML documents. The distinguishing characteristics of this method are that it is fully automatic and that it can be applied to arbitrary HTML documents. The idea behind our method is to utilize certain layout structures and linguistic patterns. By using them, we can automatically extract sentences that express opinions. In our experiment, the method constructed a corpus consisting of 126,610 sentences.

1 Introduction

Recently, there has been increasing interest in applications that deal with opinions (a.k.a. sentiment, reputation, etc.). For instance, Morinaga et al. developed a system that extracts and analyzes reputations on the Internet (Morinaga et al., 2002). Pang et al. proposed a method of classifying movie reviews into positive and negative ones (Pang et al., 2002). In these applications, one of the most important issues is how to determine the polarity (or semantic orientation) of a given text. In other words, it is necessary to decide whether a given text conveys positive or negative content. In order to solve this problem, we intend to take a statistical approach.
More specifically, we plan to learn the polarity of texts from a corpus in which phrases, sentences, or documents are tagged with labels expressing the polarity (polarity-tagged corpus). So far, this approach has been taken by many researchers (Pang et al., 2002; Dave et al., 2003; Wilson et al., 2005). In these previous works, the polarity-tagged corpus was built in one of two ways: it was either built manually or created from review sites such as AMAZON.COM. In some review sites, each review is associated with metadata indicating its polarity. Those reviews can be used as a polarity-tagged corpus. In case