Chapter 6: Example LWP Programs, Part 2

Then the scan() method does all the real work. The scan() method accepts a URL as a parameter. In a nutshell, here's what happens:

The scan() method pushes the first URL into a queue. For any URL pulled from the queue, the links on that page are extracted and pushed onto the queue. To keep track of which URLs have already been visited (and to avoid pushing them back onto the queue), we use an associative array called %touched and associate any URL that has been visited with a value of 1. Other variables track which document points to what, the content type of each document, which links are bad, which links are local, which links are remote, and so on.

For a more detailed look at how this works, let's step through it.

First, the initial URL is pushed onto a queue:

    push (@urls, $root_url);

The URL is then checked with a HEAD method. If we can determine that the URL is not an HTML document, we can skip it. Otherwise, we follow that with a GET method to get the HTML:

    my $request = new HTTP::Request('HEAD', $url);
    my $response = $self->{'ua'}->request($request);

    # if not HTML, don't bother to search it for URLs
    next if ($response->header('Content-Type') !~ m@text/html@);

    # it is text/html, get the entity-body this time
    $request->method('GET');
    $response = $self->{'ua'}->request($request);

Then we extract the links from the HTML page. Here, we use our own function to extract the links. There is a similar function in the LWP library that extracts links, but we opted not to use it, since it is less likely to find links in slightly malformed HTML:

    my @rel_urls = grab_urls($data);
    foreach $verbose_link (@rel_urls) {
        ...
    }

With each iteration of the foreach loop, we process one link.
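The queue-and-%touched traversal described above can be tried without a network connection. What follows is only a minimal sketch under stated assumptions, not the book's implementation: the %site table is a hypothetical in-memory stand-in for the HEAD and GET requests, the example.com URLs are made up, and scan() here is a plain function that returns the visit order rather than a method on an object.

```perl
use strict;
use warnings;

# Hypothetical in-memory "site": each URL maps to a content type and the
# links found on that page. This replaces real HEAD/GET requests.
my %site = (
    'http://example.com/' => {
        type  => 'text/html',
        links => ['http://example.com/a', 'http://example.com/logo.gif'],
    },
    'http://example.com/a' => {
        type  => 'text/html',
        links => ['http://example.com/', 'http://example.com/b'],
    },
    'http://example.com/b'        => { type => 'text/html', links => [] },
    'http://example.com/logo.gif' => { type => 'image/gif', links => [] },
);

# Visit every URL reachable from $root_url; return them in visit order.
sub scan {
    my ($root_url) = @_;
    my @urls    = ($root_url);         # queue of URLs still to process
    my %touched = ($root_url => 1);    # URLs already pushed onto the queue
    my @visited;

    while (defined(my $url = shift @urls)) {
        push @visited, $url;
        my $page = $site{$url} or next;           # unknown URL: a bad link
        next if $page->{type} !~ m{text/html};    # not HTML, skip link extraction

        foreach my $child (@{ $page->{links} }) {
            next if $touched{$child};             # seen before, do not repush
            push @urls, $child;
            $touched{$child} = 1;
        }
    }
    return @visited;
}

print "$_\n" for scan('http://example.com/');
```

Because every URL is recorded in %touched as soon as it is queued, the link from /a back to the root page does not cause a second visit, and the GIF is visited but never searched for links.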
If we haven't seen a link before, we add it to the queue:

    foreach $verbose_link (@rel_urls) {
        if (! defined $self->{'touched'}{$full_child}) {
            push (@urls, $full_child);
        }

        # remember which url we just pushed, to avoid repushing
        $self->{'touched'}{$full_child} = 1;
    }

While all of this is going on, we keep track of which documents don't exist, what their content types are