Here we are, back at the scene of the crime. Yes, I know it’s been a while. And the task of the day is:
- GOAL:
- make an HTML scraper, i.e. a script that grabs another URL and outputs the results to the screen
- TOOL:
- let’s say … Perl (in my case: Perl 5.8 on RedHat)
- INPUT:
- a URL
- OUTPUT:
- the HTML code of that URL
The actual HTML retrieval is easy: you need get()
from the LWP::Simple module:
use LWP::Simple;
my $page = get($url);
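Note that get() returns undef when the fetch fails (bad DNS, 404, timeout), so it is worth checking the result before using it. A minimal sketch (the URL here is just an example):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(!head);   # !head: don't import head(), to avoid clashing with CGI's head()

my $url  = 'http://www.example.com/';
my $page = get($url);        # undef on any failure
if (defined $page) {
    print length($page), " bytes fetched\n";
} else {
    warn "Could not fetch $url\n";
}
```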
Some remarks:
- the URL to scrape is passed as a CGI parameter, e.g. ?url=http://www.example.com/path/page.htm
- when no url parameter is given, we generate a form where it can be filled in
#!/usr/bin/perl -w
use strict;
use CGI qw(:standard);
use LWP::Simple qw(!head);   # !head: don't import head(), CGI already exports one

my $query = new CGI;
my $url   = $query->param('url');
my $debug = 0;

print header();
if (defined $url && length($url) > 0) {
    print getpage($url);
} else {
    showform();
}

sub getpage {
    my $url   = shift;
    my $time1 = time();
    debuginfo("Scraping <a target=_blank href='" . $url . "'>link</a> ...");
    my $page  = get($url);
    my $time2 = time();
    debuginfo("Time taken was <b>" . ($time2 - $time1) . "</b> seconds");
    debuginfo("Total bytes scraped: <b>" . length($page)/1000 . "KB</b>");
    return $page;
}

sub debuginfo {
    if ($debug > 0) {
        my $text = shift;
        print "<small>", $text, "</small><br />\n";
    }
}

sub showform {
    print("<html><head>");
    print("<title>SCRAPER</title>");
    print("<link rel=stylesheet type=text/css href=http://www.forret.com/blog/style.css>");
    print("</head><body><center>\n");
    print("<form method=GET action='scrape.pl'>");
    print("URL: <input name=url type=text size=60 value=http://www.forret.com>");
    print("<input type=submit></form>\n");
    print("</center></body></html>\n");
}
Next step: making sure image src= and hyperlink href= attributes keep on working (so convert relative links to absolute links!).
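That conversion could be sketched with the standard URI module, whose new_abs() resolves a relative link against the page's own URL (the sample paths below are made up):

```perl
use strict;
use warnings;
use URI;

# The URL of the page we scraped
my $base = 'http://www.example.com/path/page.htm';

# Turn a relative href/src value into an absolute one
my $abs = URI->new_abs('../img/logo.gif', $base);
print $abs, "\n";   # http://www.example.com/img/logo.gif
```

In the real scraper you would run every src= and href= attribute of $page through such a call before printing it.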