Apress - Smart Home Automation with Linux (2010) - P42: Linux users can now control their homes remotely! Are you a Linux user who has ever wanted to turn on the lights in your house, or open and close the curtains, while away on holiday? Want to be able to play the same music in every room, controlled from your laptop or mobile phone? Do you want to do these things without an expensive off-the-shelf kit

CHAPTER 6 DATA SOURCES

Once you are able to describe the location of the data in human terms, you can start writing the code. The process involves a mechanized agent that is able to load the web page and traverse links, and a stream processor that skips over the HTML tags. You begin the scraping with a fairly common loading block like this:

    #!/usr/bin/perl -w
    use strict;
    use WWW::Mechanize;
    use HTML::TokeParser;

    my $agent = WWW::Mechanize->new();
    $agent->get("http://www.minervahome.net/news.htm");

    my $stream = HTML::TokeParser->new(\$agent->content());

Given the stream, you can now skip to the fourth table, for example, by jumping over four of the opening table tags using the following:

    foreach (1..4) {
        $stream->get_tag("table");
    }

Notice that get_tag positions the stream point immediately after the opening tag given, in this case table. Consequently, the stream point is now inside the fourth table. Since our data is on the first row, you don't need to worry about skipping the tr tag, so you can jump straight into the second column with this:

    $stream->get_tag("td");
    $stream->get_tag("td");

since skipping the td tag will automatically skip the preceding tr. The stream is now positioned exactly where you want it. The HTML structure of this block is as follows:

    <a href="url">Main title</a></td><td valign="top">Main story text

So far, I have been using get_tag to skip elements, but it also sports a return value describing the tag it found.
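The skipping steps so far can be tried offline before pointing the script at a live page. The following is a minimal sketch, not the book's own listing: the inline HTML string, the story.htm link, and the $html and $next variables are assumptions made for illustration, and the parser reads from a string reference rather than from content fetched by WWW::Mechanize.

```perl
#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;

# Hypothetical stand-in for the news page: four tables, with the
# story link in the second column of the fourth table's first row.
my $html = join '',
    '<table><tr><td>one</td></tr></table>',
    '<table><tr><td>two</td></tr></table>',
    '<table><tr><td>three</td></tr></table>',
    '<table><tr><td>date</td>',
    '<td><a href="story.htm">Main title</a></td>',
    '<td valign="top">Main story text</td></tr></table>';

# Parse from a string reference instead of fetched content
my $stream = HTML::TokeParser->new(\$html);

# Skip past four opening table tags
$stream->get_tag("table") foreach (1..4);

# Step into the second column of the first row
$stream->get_tag("td");
$stream->get_tag("td");

# With no argument, get_tag returns the next tag of any kind;
# element 0 of the returned array reference is the tag name
my $next = $stream->get_tag();
print "Next tag: $next->[0]\n";
```

With this sample markup, the tag reported after the second td is the anchor, which confirms that the stream point sits just before the link.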
So, you'd retrieve the information from the anchor with the following. get_tag returns a reference to an array describing the single tag it found, so assigning the result to an array stores that reference as the only element:

    my @link = $stream->get_tag("a");

It is therefore $link[0] that is of interest. Inside this is another array containing the following:

    $link[0][0]   # tag
    $link[0][1]   # attributes
    $link[0][2]   # attribute sequence
    $link[0][3]   # text

Therefore, you can extract the link information with the following:

    my $href = $link[0][1]{href};

And since get_tag only retrieves the information about the tag, you must return to the stream to extract all the data between this <a> and