Google's Part in an Information Collection Framework • Chapter 5

[Figure 5.20: The LinkedIn Profile of the Author of a Government Document]

Can this process of grabbing documents and analyzing them be automated? Of course! As a start we can build a scraper that will find the URLs of Office documents (.doc, .ppt, .xls, .pps). We then need to download the document and push it through the meta information parser. Finally, we can extract the interesting bits and do some post-processing on it. We already have a scraper (see the previous section), and thus we just need something that will extract the meta information from the file. Thomas Springer at ServerSniff.net was kind enough to provide me with the source of his document information script. After some slight changes, it looks like this:

#!/usr/bin/perl
# File-analyzer 0.1, 07/08/2007, thomas springer
# stripped-down version, slightly modified by roelof temmingh @ paterva.com
# this code is public domain - use at own risk
# this code is using phil harveys ExifTool - THANK YOU PHIL
# http://www.ebv4linux.de/images/articles/Phil1.jpg

use strict;
use Image::ExifTool;

# passed parameter is a URL
my ($url) = @ARGV;

# get file and make a nice filename
my $file = get_page($url);
my $time = time;
my $frand = rand(10000);
my $fname = "/tmp/".$time.$frand;

# write stuff to a file
open(FL, ">".$fname);
print FL $file;
close(FL);

# Get EXIF-INFO
my $exifTool = new Image::ExifTool;
$exifTool->Options(FastScan => '1');
$exifTool->Options(Binary => '1');
$exifTool->Options(Unknown => '2');
$exifTool->Options(IgnoreMinorErrors => '1');
# feed standard info into a hash
my $info = $exifTool->ImageInfo($fname);

# delete tempfile
unlink($fname);

my @names;
print "Author:".$$info{"Author"}."\n";
print "LastSaved:".$$info{"LastSavedBy"}."\n";
print "Creator:".$$info{"creator"}."\n";
print "Company:".$$info{"Company"}."\n";
print "Email:".$$info{"AuthorEmail"}."\n";

exit; # comment to see more fields
foreach (keys %$info) {
    print $_."=".$$info{$_}."\n";
}

sub get_page {
    my ($url) = @_;
    # use curl to get it - you
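The pipeline described above starts with a scraper that finds the URLs of Office documents in a page. As a rough illustration of that first step only, here is a minimal, stdlib-only Python sketch (the book's own code is Perl; the function name, the regular expression, and the example page fragment are my own assumptions, not from the original):

```python
import re

# File extensions of interest, as listed in the text
OFFICE_EXTS = ("doc", "ppt", "xls", "pps")

def find_office_urls(html):
    """Return all http(s) URLs in `html` that end in an Office extension."""
    pattern = re.compile(
        r'https?://[^\s"\'<>]+\.(?:' + "|".join(OFFICE_EXTS) + r')\b',
        re.IGNORECASE,
    )
    # No capturing groups, so findall() returns the full matched URLs.
    return pattern.findall(html)

# Example: a fragment of a search-result page (hypothetical URLs)
page = '''
<a href="http://example.gov/reports/budget.doc">Budget</a>
<a href="http://example.gov/slides/briefing.ppt">Briefing</a>
<a href="http://example.gov/index.html">Home</a>
'''
print(find_office_urls(page))
# ['http://example.gov/reports/budget.doc', 'http://example.gov/slides/briefing.ppt']
```

Each URL found this way would then be fed to a downloader and a metadata parser such as the ExifTool-based script shown above; a real scraper would of course work against live search results rather than a static string.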