Here we are, back at the scene of the crime. Yes, I know it’s been a while. And the task of the day is:
- GOAL:
- make an HTML scraper, i.e. a script that grabs another URL and outputs the results to the screen
- TOOL:
- let’s say … Perl (in my case: Perl 5.8 on RedHat)
- INPUT:
- a URL
- OUTPUT:
- the HTML code of that URL
The actual HTML retrieval is easy: you need get()
from the LWP::Simple module:
use LWP::Simple;
my $page = get($url);
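Note that get() returns undef when the fetch fails (bad DNS, 404, timeout), so it is worth checking the result before using it. A minimal sketch (the URL here is just an example):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(!head);   # !head: don't import head(), to avoid clashing with CGI's head()

my $url  = 'http://www.example.com/';
my $page = get($url);        # undef on any failure
if (defined $page) {
    print length($page), " bytes fetched\n";
} else {
    warn "Could not fetch $url\n";
}
```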
Some remarks:
- the URL to scrape is passed as a CGI parameter, e.g. ?url=http://www.example.com/path/page.htm
- when no url parameter is given, we generate a form where it can be filled in
#!/usr/bin/perl -w
use strict;
use CGI qw(:standard);
use LWP::Simple qw(!head);   # !head: don't import head(), CGI already exports one

my $query = new CGI;
my $url   = $query->param('url');
my $debug = 0;

print header();
if (defined $url && length($url) > 0) {
    print getpage($url);
} else {
    showform();
}

sub getpage {
    my $url   = shift;
    my $time1 = time();
    debuginfo("Scraping <a target=_blank href='" . $url . "'>link</a> ...");
    my $page  = get($url);
    my $time2 = time();
    debuginfo("Time taken was <b>" . ($time2 - $time1) . "</b> seconds");
    debuginfo("Total bytes scraped: <b>" . length($page)/1000 . "KB</b>");
    return $page;
}

sub debuginfo {
    if ($debug > 0) {
        my $text = shift;
        print "<small>", $text, "</small><br />\n";
    }
}

sub showform {
    print("<html><head>");
    print("<title>SCRAPER</title>");
    print("<link rel=stylesheet type=text/css href=http://www.forret.com/blog/style.css>");
    print("</head><body><center>\n");
    print("<form method=GET action='scrape.pl'>");
    print("URL: <input name=url type=text size=60 value=http://www.forret.com>");
    print("<input type=submit></form>\n");
    print("</center></body></html>\n");
}
Next step: making sure image src= and hyperlink href= attributes keep on working (so convert relative links to absolute links!).
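That conversion could be sketched with the standard URI module, whose new_abs() resolves a relative link against the page's own URL (the sample paths below are made up):

```perl
use strict;
use warnings;
use URI;

# The URL of the page we scraped
my $base = 'http://www.example.com/path/page.htm';

# Turn a relative href/src value into an absolute one
my $abs = URI->new_abs('../img/logo.gif', $base);
print $abs, "\n";   # http://www.example.com/img/logo.gif
```

In the real scraper you would run every src= and href= attribute of $page through such a call before printing it.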