Reformatting Web Documents via Header Trees

Minoru Yoshida and Hiroshi Nakagawa
Information Technology Center, University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
CREST, JST
mino@r.dl.itc.u-tokyo.ac.jp, nakagawa@dl.itc.u-tokyo.ac.jp

Abstract

We propose a new method for reformatting web documents by extracting semantic structures from web pages. Our approach is to extract trees that describe hierarchical relations in documents. We developed an algorithm for this task by employing the EM algorithm and clustering techniques. Preliminary experiments showed that our approach was more effective than baseline methods.

1 Introduction

This paper proposes a novel method for reformatting (i.e., changing visual representations of) web documents. Our final goal is to implement a system that appropriately reformats layouts of web documents by separating semantic aspects (like XML) from layout aspects (like CSS) of web documents, and changing the layout aspects while retaining the semantic aspects.

We propose a header tree, which is a reasonable choice as a semantic representation of web documents for this goal. Header trees can be seen as variants of XML trees where each internal node is not an XML tag but a header, which is a part of a document that can be regarded as a tag annotated to other parts of the document. Titles, headlines, and attributes are examples of headers. The left part of Figure 1 shows an example web document. In this document, the headers are "About Me", which is a title, and "NAME" and "AGE", which are attributes. For example, "NAME" can be seen as a tag annotated to "John Smith".
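The header-tree idea can be illustrated with a small data structure. The following is a minimal sketch, not the authors' implementation: the names Node, example_tree, headers, and render_outline are hypothetical, the tree shape is one plausible reading of the example document, and the value under AGE is a placeholder because the text does not give it. Internal nodes hold header text taken from the page itself, leaves hold the content the headers annotate, and rendering the same tree in a different way changes the layout without touching the semantic structure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A piece of page text; a node with children acts as a header."""
    text: str
    children: List["Node"] = field(default_factory=list)


def example_tree() -> Node:
    # Assumed shape for the "About Me" example document.
    # "(age value)" is a placeholder: the source text does not state it.
    return Node("About Me", [
        Node("NAME", [Node("John Smith")]),
        Node("AGE", [Node("(age value)")]),
    ])


def headers(node: Node) -> List[str]:
    """Collect the text of every internal node (i.e. every header), top-down."""
    found = [node.text] if node.children else []
    for child in node.children:
        found.extend(headers(child))
    return found


def render_outline(node: Node, depth: int = 0) -> List[str]:
    """One possible layout: an indented outline, two spaces per level."""
    lines = ["  " * depth + node.text]
    for child in node.children:
        lines.extend(render_outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    tree = example_tree()
    print(headers(tree))
    print("\n".join(render_outline(tree)))
```

A different renderer over the same Node tree (for instance one that emits "NAME: John Smith" pairs) would produce another layout, which is the sense in which reformatting keeps the semantic aspects fixed.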
Figure 2 shows a header tree for the example document. It should be noted that each node is labeled with parts of HTML pages, not abstract categories such as XML tags.

Figure 1: An Example Web Document and Conversion from HTML Documents to Block Lists.

Therefore, the required task is to extract header trees from given web documents. Web documents can be reformatted by converting their