Scientific report: "Reformatting Web Documents via Header Trees"

Minoru Yoshida and Hiroshi Nakagawa
Information Technology Center, University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
CREST, JST
mino@ nakagawa@

Abstract

We propose a new method for reformatting web documents by extracting semantic structures from web pages. Our approach is to extract trees that describe hierarchical relations in documents. We developed an algorithm for this task by employing the EM algorithm and clustering techniques. Preliminary experiments showed that our approach was more effective than baseline methods.

1 Introduction

This paper proposes a novel method for reformatting (i.e., changing the visual representations of) web documents. Our final goal is to implement a system that appropriately reformats the layouts of web documents by separating their semantic aspects (like XML) from their layout aspects (like CSS), and then changing the layout aspects while retaining the semantic aspects.

We propose the header tree, which is a reasonable choice of semantic representation of web documents for this goal. Header trees can be seen as variants of XML trees in which each internal node is not an XML tag but a header: a part of the document that can be regarded as a tag annotated to other parts of the document. Titles, headlines, and attributes are examples of headers. The left part of Figure 1 shows an example web document. In this document, the headers are "About Me", which is a title, and "NAME" and "AGE", which are attributes. For example, "NAME" can be seen as a tag annotated to "John Smith". Figure 2 shows a header tree for the example document.
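The header tree described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Node` class and the `headers` and `render` helpers are hypothetical names introduced here, and the age value is a placeholder because the excerpt does not give it. The sketch only shows the idea that internal nodes hold header text taken from the page, and that the same tree can be rendered in a layout of one's choosing.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of a header tree; `text` is a fragment of the page itself."""
    text: str
    children: List["Node"] = field(default_factory=list)

    @property
    def is_header(self) -> bool:
        # Assumption of this sketch: a node with children acts as a header
        # (a "tag") annotated to the parts of the document below it.
        return bool(self.children)


def headers(node: Node) -> List[str]:
    """Collect the text of every header node in document order."""
    found = [node.text] if node.is_header else []
    for child in node.children:
        found.extend(headers(child))
    return found


def render(node: Node, depth: int = 0) -> List[str]:
    """Render the tree as an indented outline (one possible layout)."""
    lines = ["  " * depth + node.text]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# Example document from the text: a title with two attributes.
# "(age value)" is a placeholder; the excerpt does not state the actual value.
tree = Node("About Me", [
    Node("NAME", [Node("John Smith")]),
    Node("AGE", [Node("(age value)")]),
])

if __name__ == "__main__":
    print("\n".join(render(tree)))
```

Because the layout lives entirely in `render`, a different presentation (for instance a table or a flat list) could be produced from the same tree without touching the semantic structure.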
It should be noted that each node is labeled with parts of HTML pages, not with abstract categories such as XML tags.

[Figure 1: An Example Web Document and Conversion from HTML Documents to Block Lists.]

Therefore, the required task is to extract header trees from given web documents. Web documents can be reformatted by converting their
