Here we are, back at the scene of the crime. Yes, I know it’s been a while. And the task of the day is:
- GOAL: make an HTML scraper, i.e. a script that grabs another URL and outputs the result to the screen
- TOOL: let’s say … Perl (in my case: Perl 5.8 on RedHat)
- INPUT: a URL
- OUTPUT: the HTML code of that URL
The actual HTML retrieval is easy: you need get() from the LWP::Simple module:
use LWP::Simple;
my $page = get($url);
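One thing to keep in mind: get() returns undef when the retrieval fails, so it pays to check the result before using it. Here is a minimal sketch of such a check. The fetch_or_message() helper is a name I made up for illustration, and the script assumes the URL is passed on the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);

# fetch_or_message() is a hypothetical helper, not part of LWP::Simple.
sub fetch_or_message {
    my ($url) = @_;
    my $page = get($url);    # undef when the request fails
    return defined $page ? $page : "Could not retrieve $url";
}

# Assumption: the URL to fetch is the first command-line argument.
my $url = shift @ARGV;
print fetch_or_message($url), "\n" if defined $url;
```

Run it as perl fetch.pl http://www.example.com/ (the file name is just an example) to see either the page or the error message.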
Some remarks:
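- The script runs as a CGI script called scrape.pl and takes the address to fetch in a url parameter, e.g. scrape.pl?url=http://www.example.com/path/page.htm. When no url parameter is given, it generates a form where one can be filled in.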
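#!/usr/bin/perl -w
use strict;
use CGI qw(:standard);
# CGI's :standard set and LWP::Simple both export head(), so skip LWP's version
use LWP::Simple qw(!head);

my $query = CGI->new;
my $url = $query->param('url');
my $debug = 0; # set to 1 to print timing and size info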
print header();
# param() returns undef when no url parameter was passed
if (defined $url && length($url) > 0) {
    print getpage($url);
} else {
    showform();
}
sub getpage {
    my $url = shift;
    my $time1 = time();
    debuginfo("Scraping <a target=_blank href='" . $url . "'>link</a> ...");
    my $page = get($url);
    $page = '' unless defined $page; # get() returns undef when the fetch fails
    my $time2 = time();
    debuginfo("Time taken was <b>" . ($time2 - $time1) . "</b> seconds");
    debuginfo("Total bytes scraped: <b>" . length($page)/1000 . "KB</b>");
    return $page;
}
sub debuginfo {
    if ($debug > 0) {
        my $text = shift;
        print "<small>", $text, "</small><br />\n";
    }
}
sub showform {
    print("<html><head>");
    print("<title>SCRAPER</title>");
    print("<link rel=stylesheet type=text/css href=http://www.forret.com/blog/style.css>");
    print("</head><body><center>\n");
    print("<form method=GET action='scrape.pl'>");
    print("URL: <input name=url type=text size=60 value=http://www.forret.com>");
    print("<input type=submit></form>\n");
    print("</center></body></html>\n");
}
Next step: making sure image src= and hyperlink href= attributes keep working (so convert relative links to absolute links!).
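A possible starting point for that step, assuming the URI module is installed (LWP itself depends on it): URI->new_abs() resolves a relative link against a base URL. The absolutize() helper and the sample links below are my own illustration, and this sketch does not yet find or rewrite the links inside a scraped page.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# absolutize() is a hypothetical helper name used for this sketch only.
sub absolutize {
    my ($link, $base) = @_;
    # new_abs() resolves $link relative to $base;
    # links that are already absolute pass through unchanged
    return URI->new_abs($link, $base)->as_string;
}

# Sample base URL and links, chosen for illustration.
my $base = 'http://www.example.com/path/page.htm';
for my $link ('img/logo.gif', '../style.css', '/index.htm', 'http://www.forret.com/') {
    print "$link => ", absolutize($link, $base), "\n";
}
```

The base URL would be the address that was scraped, so the same $url passed to get() can be reused here.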