Scientific report: "Reformatting Web Documents via Header Trees"

Minoru Yoshida and Hiroshi Nakagawa
Information Technology Center, University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
CREST, JST
mino@ nakagawa@

Abstract

We propose a new method for reformatting web documents by extracting semantic structures from web pages. Our approach is to extract trees that describe hierarchical relations in documents. We developed an algorithm for this task by employing the EM algorithm and clustering techniques. Preliminary experiments showed that our approach was more effective than baseline methods.

1 Introduction

This paper proposes a novel method for reformatting (i.e., changing visual representations of) web documents. Our final goal is to implement a system that appropriately reformats the layouts of web documents by separating their semantic aspects (like XML) from their layout aspects (like CSS), and then changing the layout aspects while retaining the semantic aspects.

We propose the header tree, which is a reasonable choice of semantic representation of web documents for this goal. Header trees can be seen as variants of XML trees in which each internal node is not an XML tag but a header: a part of the document that can be regarded as a tag annotated to other parts of the document. Titles, headlines, and attributes are examples of headers. The left part of Figure 1 shows an example web document. In this document, the headers are "About Me", which is a title, and "NAME" and "AGE", which are attributes. For example, "NAME" can be seen as a tag annotated to "John Smith". Figure 2 shows a header tree for the example document.
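The header tree described above can be sketched as a small data structure. This is only an illustration, not the authors' implementation: the names `Node`, `headers`, and `render` are hypothetical, the exact shape of the tree in Figure 2 is assumed from the prose, and the value under "AGE" is a placeholder because the excerpt does not state it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node labeled with a piece of the page text (hypothetical type)."""
    text: str
    children: List["Node"] = field(default_factory=list)

    def is_header(self) -> bool:
        # In a header tree, internal nodes are headers; leaves are plain content.
        return bool(self.children)


def headers(node: Node) -> List[str]:
    """Collect the text of every header (internal node) in pre-order."""
    found = [node.text] if node.is_header() else []
    for child in node.children:
        found.extend(headers(child))
    return found


def render(node: Node, depth: int = 0) -> List[str]:
    """Return one indented line per node, as one possible re-layout of the tree."""
    lines = ["  " * depth + node.text]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# Assumed tree for the example document: a title with two attributes.
tree = Node("About Me", [
    Node("NAME", [Node("John Smith")]),
    Node("AGE", [Node("(age value)")]),  # placeholder: value not given in the excerpt
])

print("\n".join(render(tree)))
```

Because the tree keeps only the header relations, a different `render` function could produce another layout (for example a table) from the same tree, which is the separation of semantics and layout that the introduction aims for.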
Note that each node is labeled with parts of HTML pages, not with abstract categories such as XML tags.

Figure 1: An Example Web Document and Conversion from HTML Documents to Block Lists.

Therefore, the required task is to extract header trees from given web documents. Web documents can be reformatted by converting their




