Scientific report: "Reformatting Web Documents via Header Trees"


Minoru Yoshida and Hiroshi Nakagawa
Information Technology Center, University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
CREST, JST
mino@r.dl.itc.u-tokyo.ac.jp, nakagawa@dl.itc.u-tokyo.ac.jp

Abstract

We propose a new method for reformatting web documents by extracting semantic structures from web pages. Our approach is to extract trees that describe hierarchical relations in documents. We developed an algorithm for this task by employing the EM algorithm and clustering techniques. Preliminary experiments showed that our approach was more effective than baseline methods.

1 Introduction

This paper proposes a novel method for reformatting web documents, i.e., changing their visual representations. Our final goal is to implement a system that appropriately reformats the layouts of web documents by separating their semantic aspects (like XML) from their layout aspects (like CSS), and then changing the layout aspects while retaining the semantic aspects.

For this goal, we propose the header tree as a reasonable choice of semantic representation for web documents. Header trees can be seen as variants of XML trees in which each internal node is not an XML tag but a header: a part of the document that can be regarded as a tag annotating other parts of the document. Titles, headlines, and attributes are examples of headers.

The left part of Figure 1 shows an example web document. In this document, the headers are "About Me", which is a title, and "NAME" and "AGE", which are attributes. For example, "NAME" can be seen as a tag annotating "John Smith".
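To make the idea concrete, the header tree for the example document can be sketched as a small recursive data structure. This is an illustrative sketch only, not the authors' implementation: the `Node` class, the `render` helper, and the placeholder used for the age value (which the text does not give) are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node of a header tree; `text` is a fragment of the page itself."""
    text: str
    children: List["Node"] = field(default_factory=list)

    def is_header(self) -> bool:
        # In this sketch, any node that has children acts as a header.
        return bool(self.children)


def render(node: Node, depth: int = 0) -> List[str]:
    """Return one indented line per node, in depth-first order."""
    lines = ["  " * depth + node.text]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# The example document. The AGE value is a placeholder, not from the source.
tree = Node("About Me", [
    Node("NAME", [Node("John Smith")]),
    Node("AGE", [Node("<age value>")]),
])

print("\n".join(render(tree)))
```

Because the tree stores only which text annotates which, a different layout could be produced by swapping `render` for another traversal while leaving `tree` untouched.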
Figure 2 shows a header tree for the example document. Note that each node is labeled with a part of the HTML page, not with an abstract category such as an XML tag.

[Figure 1: An Example Web Document and Conversion from HTML Documents to Block Lists]

Therefore, the required task is to extract header trees from given web documents. Web documents can be reformatted by converting their
