Archive for the 'Linux' Category

Perl HTML scraping part #1

Here we are, back at the scene of the crime. Yes, I know it’s been a while. And the task of the day is:

GOAL:
make an HTML scraper, i.e. a script that grabs another URL and outputs the results to the screen
TOOL:
let’s say … Perl (in my case: Perl 5.8 on RedHat)
INPUT:
a URL
OUTPUT:
the HTML code of that URL

The actual HTML retrieval is easy: you need get() from the LWP::Simple module:
use LWP::Simple;
my $page = get($url);

Some remarks:

  • Since you are generating a web page, you need the CGI module (to take care of the HTTP headers and stuff).
  • The URL input parameter will be given as an HTTP query string: ?url=http://www.example.com/path/page.htm. When no url parameter is given, we will generate a form where it can be filled in.
  • We calculate the time it takes to retrieve the original page.

    #!/usr/bin/perl -w
    use strict;
    use CGI qw(:standard);
    use LWP::Simple qw(!head);    # CGI already exports head(), so skip LWP's

    my $query = CGI->new;
    my $url   = $query->param('url');
    my $debug = 0;                # set to 1 to print timing and size info

    print header();
    if (defined $url && length($url) > 0) {
        print getpage($url);
    } else {
        showform();
    }

    # Fetch the page and (when $debug is on) report time taken and size.
    sub getpage {
        my $url   = shift;
        my $time1 = time();
        debuginfo("Scraping <a target=_blank href='" . $url . "'>link</a> ...");
        my $page = get($url);
        $page = '' unless defined $page;    # get() returns undef on failure
        my $time2 = time();
        debuginfo("Time taken was <b>" . ($time2 - $time1) . "</b> seconds");
        debuginfo("Total bytes scraped: <b>" . length($page) / 1000 . "KB</b>");
        return $page;
    }

    # Print a debug line, but only when $debug is switched on.
    sub debuginfo {
        if ($debug > 0) {
            my $text = shift;
            print "<small>", $text, "</small><br />\n";
        }
    }

    # Show the input form when no url parameter was given.
    sub showform {
        print("<html><head>");
        print("<title>SCRAPER</title>");
        print("<link rel=stylesheet type=text/css href=http://www.forret.com/blog/style.css>");
        print("</head><body><center>\n");
        print("<form method=GET action='scrape.pl'>");
        print("URL: <input name=url type=text size=60 value=http://www.forret.com>");
        print("<input type=submit></form>\n");
        print("</center></body></html>\n");
    }

    Next step: making sure image src= and hyperlink href keep on working (so convert relative links to absolute links!).

    Squid cachemgr.cgi UI hack

    Squid has a little system statistics viewer built-in:

    The cache manager (cachemgr.cgi) is a CGI utility for displaying statistics about the squid process as it runs. The cache manager is a convenient way to manage the cache and view statistics without logging into the server.
    (from Squid FAQ)

    The only thing is … it’s so ugly! It uses plain HTML and cannot be customized, the FAQ says. However, there is a way to do it:

    1. copy cachemgr.cgi to cachemgr2.cgi so if you do something wrong, the original is not lost.
    2. open the CGI file in a text editor. I used vi, but if you’re not used to working with it, use something else (emacs?).
    3. in the binary file, look for some text portions that look like HTML code
    4. while keeping in mind that the # of characters should remain the same, change the <title> and <style> to something that suits you. You will have to do this at 2 locations in the file: one for the homepage template and one for the other pages’ template.
    5. suggestion: just let the CGI use a style.css file that you drop into the same folder.
      <link rel="stylesheet" type="text/css" href="style.css" /> and fill up with spaces to keep the same # characters
    6. verify that the cachemgr and the cachemgr2 have the same # bytes
    7. now use cachemgr2 to display your statistics.
    8. I did something a bit different (I wanted to use the CSS of my own website), so I ‘ll show you the difference between the two versions.
      In order to get to the following comparison, I ran strings cachemgr.cgi > cachemgr.txt (and the same for cachemgr2.cgi into cachemgr2.txt) to extract only the text parts, and then diff cachemgr.txt cachemgr2.txt to compare both files. Running diff directly on the 2 binary files would only tell you that they differ.

      173,174c173,174
      < <HTML><HEAD><TITLE>Cache Manager Interface</TITLE>
      < <STYLE type="text/css"><!-- BODY{background-color:#ffffff;font-family:verdana,sans-serif} --></STYLE></HEAD>
      ---
      > <HTML><HEAD><TITLE>Cache Manager (pforret)</TITLE>
      > <link rel="stylesheet" type="text/css" href="http://www.forret.com/forret/forret.css" /> </HEAD>
      199c199
      < <STYLE type="text/css"><!-- BODY{background-color:#ffffff;font-family:verdana,sans-serif} TABLE{background-color:#333333;border:0pt;padding:0pt}TH,TD{background-color:#ffffff}--></STYLE>
      ---
      > <link rel="stylesheet" type=text/css href="http://www.forret.com/forret/forret.css"><!-- TABLE{background-color:#333333;border:0pt;padding:0pt} TH,TD{background-color:#ffffff}--></STYLE>
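
The "same number of characters" rule from steps 4 to 6 is easy to get wrong by hand, so here is a minimal shell sketch of the bookkeeping. It only pads a replacement string and compares file sizes; it does not patch the binary for you. The helper names pad_to_length and same_size are my own, not part of Squid.

```sh
#!/bin/sh
# Pad a replacement string with trailing spaces up to the length of the
# original, so the patched file keeps the same size.
# $1 = original text, $2 = replacement text (hypothetical helper)
pad_to_length() {
    if [ "${#2}" -gt "${#1}" ]; then
        echo "replacement is longer than the original" >&2
        return 1
    fi
    printf "%-${#1}s" "$2"
}

# Succeeds only when both files have the same number of bytes (step 6).
same_size() {
    [ $(wc -c < "$1") -eq $(wc -c < "$2") ]
}

# The original <STYLE> block from cachemgr.cgi and the style.css link of step 5.
old='<STYLE type="text/css"><!-- BODY{background-color:#ffffff;font-family:verdana,sans-serif} --></STYLE>'
new=$(pad_to_length "$old" '<link rel="stylesheet" type="text/css" href="style.css" />')

echo "original:    ${#old} characters"
echo "replacement: ${#new} characters"
```

After editing, something like same_size cachemgr.cgi cachemgr2.cgi && echo OK gives a quick check before you point your browser at the new CGI.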

      Probe disk performance (MRTG)

      hdparm can be used to measure the throughput of a hard disk:
      # hdparm -tT /dev/hda
      /dev/hda:
      Timing buffer-cache reads: 888 MB in 2.00 seconds = 444.00 MB/sec
      Timing buffered disk reads: 20 MB in 3.30 seconds = 6.06 MB/sec

      This would be an interesting performance metric to see plotted against time. So let’s convert it to a format ready for MRTG.

      • The only numbers we need are the last ones: resulting speed. This can be parsed from the output as follows:
        # hdparm -tT /dev/hda | gawk -F = '/seconds/ { print $2 }'

        440.00 MB/sec   3.30 MB/sec
      • if we could suppose that the results will always be in “MB/sec”, we could parse out the numbers with
        (...) | gawk '{print $1}'
        and then add a line to our MRTG config files to adjust the units:
        kMG[_]: M,G,T,P,X
        But let’s say that KB/sec or GB/sec speeds are possible.
      • One gawk can do the conversion trick:
        # (...) | gawk '/GB/ {print $1*1000000000} /MB/ {print $1*1000000} /KB/ {print $1*1000}'

        440000000 3300000
      • To have a complete MRTG-ready output, we also add the boot time on line 3 and the name of the MRTG output on line 4
      • Q: Do we need 2 gawks one after the other? Can’t one do it?
        A: You could do it in 1, I guess, but the parsing would be more complex. I use 2 because the FS (field separator) changes: the first gawk uses the ‘=’ character, the second uses the normal whitespace.
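
To make the Q&A above concrete, here is a rough sketch of the single-awk variant wrapped in a 4-line MRTG-style report. It is an illustration under assumptions, not the script I actually run: the function name hdparm_to_mrtg and the label "hda throughput" are made up, and the sample input is the hdparm output shown earlier so that the sketch runs without root access.

```sh
#!/bin/sh
# Read `hdparm -tT` output on stdin and print each speed in bytes/sec,
# one value per line. Splitting on "=" first and then on whitespace
# replaces the two chained gawk calls. (hypothetical helper name)
hdparm_to_mrtg() {
    awk -F= '/seconds/ {
        split($2, part, " ")          # part[1] = number, part[2] = unit
        factor = 1
        if (part[2] ~ /^GB/)          factor = 1000000000
        else if (part[2] ~ /^MB/)     factor = 1000000
        else if (part[2] ~ /^[Kk]B/)  factor = 1000
        printf "%.0f\n", part[1] * factor
    }'
}

# Sample input; on a real system pipe in `hdparm -tT /dev/hda` instead.
sample=' Timing buffer-cache reads: 888 MB in 2.00 seconds = 444.00 MB/sec
 Timing buffered disk reads: 20 MB in 3.30 seconds = 6.06 MB/sec'

printf '%s\n' "$sample" | hdparm_to_mrtg   # lines 1 and 2: the two values
uptime 2>/dev/null || echo "unknown"       # line 3: uptime (or boot time) string
echo "hda throughput"                      # line 4: target name (assumed label)
```

The price of doing it in one awk is the extra split() call; whether that is more readable than two small gawks in a pipe is a matter of taste.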

      Date formatting in GAWK: boot time

      I have one server that appears to be exceptionally stable:
      # uptime

      3:45pm  up 524 days,  1:22,  1 user,  load average: 0.44, 0.16, 0.13

      Unfortunately I know this is not correct (I remember rebooting it some weeks ago). So what are other ways to get the date/time of the last boot?

      Looking at the RedHat manuals, the following thing should work too:
      # cat /proc/stat
      cpu 33813143 210619911 30093342 59435750
      cpu0 33813143 210619911 30093342 59435749
      (...)
      btime 1096157569
      (...)

      The btime gives us the last boot time in seconds since 1 Jan 1970. I can find and convert it with gawk:
      # gawk '/btime/ { print (systime() - $2) / (3600 * 24.0), "days -", strftime("%a %b %d %H:%M:%S %Z %Y", $2) }' /proc/stat
      38.6473 days - Sun Sep 26 02:12:49 CEST 2004

      Which gives us an uptime of 38.6 days. That looks more like it!
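
If you want the same calculation in a script, the quoting gets easier when the current time is passed in as an awk variable. A minimal sketch, assuming plain awk is enough (no strftime, so it only prints the number of days); the function name days_since_boot is my own invention.

```sh
#!/bin/sh
# Print the number of days since boot.
# stdin: contents of /proc/stat, $1: current time in epoch seconds.
days_since_boot() {
    awk -v now="$1" '/^btime/ { printf "%.4f\n", (now - $2) / 86400 }'
}

# Demo with the btime value from the listing above and a "now" that is
# exactly two days (172800 seconds) later.
echo 'btime 1096157569' | days_since_boot 1096330369

# On a live Linux box you would run:
#   days_since_boot "$(date +%s)" < /proc/stat
```

Passing the time in with -v also makes the function easy to check against a known btime, as in the demo line.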

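      Another way of calculating the uptime is to add up the CPU time counters (user, nice, system and idle, counted in 1/100 s) and convert the total to days:
      # gawk '/cpu/ {print $1,($2 + $3 + $4 + $5)/(3600 * 24 * 100)}' /proc/stat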
      cpu 38.6515
      cpu0 38.6515

      Confirmation of the previous measurement!
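
The same idea as a small function, for what it is worth. This is a sketch under a few assumptions: it reads the cpu0 line (the aggregate cpu line adds up all processors, so it overshoots on multi-CPU machines), it takes the tick rate as a parameter defaulting to 100, and it stops after the eighth counter because newer kernels add guest columns that are already included in user time. The name cpu_days is made up.

```sh
#!/bin/sh
# Estimate days since boot from the cpu0 counters in /proc/stat.
# stdin: contents of /proc/stat, $1: ticks per second (default 100).
cpu_days() {
    awk -v hz="${1:-100}" '/^cpu0 / {
        ticks = 0
        # Fields 2..9: user nice system idle iowait irq softirq steal
        for (i = 2; i <= NF && i <= 9; i++) ticks += $i
        printf "%.4f\n", ticks / hz / 86400
    }'
}

# Demo with the cpu0 line from the listing above.
echo 'cpu0 33813143 210619911 30093342 59435749' | cpu_days 100

# On a live Linux box you would run:
#   cpu_days "$(getconf CLK_TCK)" < /proc/stat
```

Treat the result as an approximation: the counters and the clock are not read at exactly the same moment.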

      # cat /proc/uptime
      45282758.17 663091.26

      The first number is the # of seconds since last boot. The other one (idle time) we don’t need. What is that in days?
      # gawk '{print $1/(3600 * 24.0)}' /proc/uptime
      524.106

      This is where the wrong data is coming from! So I’ll ignore this data.

      Remark: This server is one of my oldest ones and is still running RedHat 7.2 (Enigma). It looks like this bug was fixed in later versions of RedHat, since none of my other servers have it.