Perl HTML scraping part #1

Here we are, back at the scene of the crime. Yes, I know it’s been a while. And the task of the day is:

GOAL:
make an HTML scraper, i.e. a script that grabs another URL and outputs the results to the screen
TOOL:
let’s say … Perl (in my case: Perl 5.8 on RedHat)
INPUT:
a URL
OUTPUT:
the HTML code of that URL

The actual HTML retrieval is easy: you need get() from the LWP::Simple module:
use LWP::Simple;
my $page = get($url);
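
One thing to keep in mind: get() returns undef when the retrieval fails, so it is worth checking the result before using it. A minimal sketch of that check follows; the helper name fetch_or_message and its fallback text are made up for illustration and are not part of LWP::Simple.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);

# Hypothetical wrapper: return the page body, or a short fallback
# message when get() fails (it returns undef on any error).
sub fetch_or_message {
    my ($url) = @_;
    my $page = get($url);
    return defined $page ? $page : "Could not retrieve $url\n";
}

# Only fetch when a URL is passed on the command line.
print fetch_or_message($ARGV[0]) if @ARGV;
```

Run it as `perl fetch.pl http://www.example.com/` (the script name is an assumption). If you need the actual HTTP status code, LWP::UserAgent is probably the better fit, since get() only tells you success or failure.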

Some remarks:

  • Since you are generating a web page, you need the CGI module (to take care of the HTTP headers and stuff).
  • The URL input parameter is passed in the HTTP query string: ?url=http://www.example.com/path/page.htm. When no url parameter is given, we generate a form where it can be filled in.
  • We measure the time it takes to retrieve the original page.

The complete script:

    #!/usr/bin/perl -w
    use strict;
    use CGI qw(:standard);
    use LWP::Simple qw(!head);   # CGI already exports head(), so skip LWP's

    my $query = CGI->new;
    my $url   = $query->param('url');
    my $debug = 0;               # set to 1 to print timing and size info

    print header();
    if (defined $url && length($url) > 0) {
        print getpage($url);
    } else {
        showform();
    }

    # Fetch the page and return its HTML (empty string on failure).
    sub getpage {
        my $url   = shift;
        my $time1 = time();
        debuginfo("Scraping <a target=_blank href='" . $url . "'>link</a> ...");
        my $page  = get($url);
        $page = '' unless defined $page;   # get() returns undef on failure
        my $time2 = time();
        debuginfo("Time taken was <b>" . ($time2 - $time1) . "</b> seconds");
        debuginfo("Total bytes scraped: <b>" . length($page) / 1000 . " KB</b>");
        return $page;
    }

    # Print a debug line, but only when $debug is switched on.
    sub debuginfo {
        my $text = shift;
        print "<small>", $text, "</small><br />\n" if $debug > 0;
    }

    # Show a form asking for the URL to scrape.
    sub showform {
        print("<html><head>");
        print("<title>SCRAPER</title>");
        print("<link rel=stylesheet type=text/css href=http://www.forret.com/blog/style.css>");
        print("</head><body><center>\n");
        print("<form method=GET action='scrape.pl'>");
        print("URL: <input name=url type=text size=60 value=http://www.forret.com>");
        print("<input type=submit></form>\n");
        print("</center></body></html>\n");
    }
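
A side note on the timing: the built-in time() counts whole seconds, so a fast page will usually show up as 0 seconds. If you want finer resolution, the core Time::HiRes module can replace time() with a floating-point version. The sketch below shows the idea; the helper name timed() is hypothetical, and the sleep call just stands in for the page retrieval.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);   # time() now returns fractional seconds

# Hypothetical helper: run a code reference and return its result
# together with the elapsed wall-clock time in seconds.
sub timed {
    my ($code) = @_;
    my $start  = time();
    my $result = $code->();
    return ($result, time() - $start);
}

# The sleep stands in for a slow operation such as get($url).
my ($value, $elapsed) = timed(sub { sleep(0.25); return 'done' });
printf "%s after %.3f seconds\n", $value, $elapsed;
```

In the scraper you would wrap the get($url) call the same way and pass the elapsed value to debuginfo().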

Next step: making sure that image src= and hyperlink href= attributes keep working, which means converting relative links to absolute links.
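
As a rough preview of that conversion, here is one possible approach using the URI module (installed as a dependency of LWP). This is only a sketch under some assumptions: the helper name absolutize() is made up, and the regular expression only handles quoted attribute values, so unquoted ones such as href=page.htm are left alone. A real HTML parser would be sturdier.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: rewrite quoted src= and href= values in $html
# so that they are absolute with respect to $base.
sub absolutize {
    my ($html, $base) = @_;
    $html =~ s{\b(src|href)\s*=\s*(["'])(.*?)\2}{
        my ($attr, $quote, $link) = ($1, $2, $3);
        $attr . '=' . $quote . URI->new_abs($link, $base)->as_string . $quote;
    }gies;
    return $html;
}

my $base = 'http://www.example.com/path/page.htm';
my $html = '<img src="img/logo.png"><a href="/about.htm">About</a>';
print absolutize($html, $base), "\n";
# prints <img src="http://www.example.com/path/img/logo.png"><a href="http://www.example.com/about.htm">About</a>
```

Links that are already absolute pass through URI->new_abs unchanged, so it is safe to run this over every matching attribute.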
