Perl HTML scraping part #1
21 Jan 2005

Here we are, back at the scene of the crime. Yes, I know it’s been a while. And the task of the day is:
- GOAL:
- make an HTML scraper, i.e. a script that grabs another URL and outputs the results to the screen
- TOOL:
- let’s say … Perl (in my case: Perl 5.8 on RedHat)
- INPUT:
- a URL
- OUTPUT:
- the HTML code of that URL
The actual HTML retrieval is easy: you need get() from the LWP::Simple module:

use LWP::Simple;
my $page = get($url);
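One caveat worth knowing: get() returns undef when the retrieval fails, so it pays to check the result before using it. A minimal sketch (the example.com URL is just a placeholder):

```perl
use strict;
use warnings;
use LWP::Simple;

my $url  = 'http://www.example.com/';
my $page = get($url);

# get() gives back undef on any failure (DNS, timeout, 404, ...)
if (defined $page) {
    print "Fetched ", length($page), " bytes\n";
} else {
    warn "Could not retrieve $url\n";
}
```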
Some remarks:
- Since you are generating a web page, you need the CGI module (to take care of the HTTP headers and stuff).
- The URL input parameter will be given as an HTTP querystring: ?url=http://www.example.com/path/page.htm. When no url parameter is given, we will generate a form where it can be filled in.
- We calculate the time it takes to retrieve the original page.
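Note that the built-in time() only ticks once per second, so fast pages will show up as "0 seconds". If you want finer timing, the core Time::HiRes module gives sub-second resolution; a quick sketch:

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Take a high-resolution timestamp before the work...
my $t0 = [gettimeofday];

# ...do something that takes a fraction of a second
# (a 4-argument select() is a portable sub-second sleep)
select(undef, undef, undef, 0.25);

# ...and measure the elapsed time with sub-second precision
my $elapsed = tv_interval($t0);
printf "Elapsed: %.3f seconds\n", $elapsed;
```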
```
#!/usr/bin/perl -w
use strict;
use CGI qw(:standard);
use LWP::Simple qw(!head);

my $query = new CGI;
my $url   = $query->param('url');
my $debug = 0;

print header();
if (defined $url && length($url) > 0) {
    print getpage($url);
} else {
    showform();
}

sub getpage {
    my $url   = shift;
    my $time1 = time();
    debuginfo("Scraping <a target=_blank href='" . $url . "'>link</a> ...");
    my $page  = get($url);
    my $time2 = time();
    unless (defined $page) {
        debuginfo("Retrieval of $url failed");
        return "";
    }
    debuginfo("Time taken was " . ($time2 - $time1) . " seconds");
    debuginfo("Total bytes scraped: " . length($page)/1000 . " KB");
    return $page;
}

sub debuginfo {
    my $text = shift;
    if ($debug > 0) {
        print $text, "<br />\n";
    }
}

sub showform {
    print("<html><head>");
    print("<title>HTML scraper</title>");
    print("<link rel=stylesheet type=text/css href=http://www.forret.com/blog/style.css>");
    print("</head><body>");
    print("<form method=\"get\">URL: <input type=\"text\" name=\"url\" size=\"50\"> ");
    print("<input type=\"submit\" value=\"Scrape\"></form>");
    print("</body></html>");
}
```
Next step: making sure image src= and hyperlink href attributes keep on working (so convert relative links to absolute links!).
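That conversion is exactly what the URI module's new_abs() does: resolve a relative link against the URL of the scraped page. A small sketch (the URLs are made-up examples):

```perl
use strict;
use warnings;
use URI;

# The page we scraped, and a relative link found inside it
my $base = 'http://www.example.com/path/page.htm';
my $rel  = '../images/logo.gif';

# Resolve the relative link against the page's own URL
my $abs = URI->new_abs($rel, $base);
print $abs, "\n";   # http://www.example.com/images/logo.gif
```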