Archive for the 'Linux' Category

Redirecting with Apache’s .htaccess

When you migrate web sites from one place to another and the URLs change, you don’t want to lose visitors who still use the old links. If your ‘old’ website ran on Apache, you can use its mod_alias/mod_rewrite functionality to redirect automatically to the new URL. This involves adding redirect rules to the .htaccess file in the base folder of the redirects. Some examples:

Generic structure of the .htaccess redirects

Redirect permanent /(old url) (new url)
Redirect … (add all your one-to-one redirects here)
RedirectMatch permanent ^/old_stuff/.*html$ http://www.example.com/
RedirectMatch … (add your catch-all redirects here)

RewriteEngine on
RewriteBase /blog/
RewriteRule ^([regex])$ http://blog.example.com/$1 [R,L]
RewriteRule … (add all your variable redirects here)

EXAMPLE: old Blogger site (on your own server) to new Wordpress site
I’ve done a migration from a blog published by Blogger (via FTP) onto my own webspace, to a blog run by Wordpress. I’ve used the following Rewrite rules to handle the redirections.
* HOMEPAGE:
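redirect /index.html and / to your new blog URL (the root uses RedirectMatch with ^/$, because a plain Redirect / is a prefix match and would catch every URL on the site)
RedirectMatch permanent ^/$ http://blog.example.com/
Redirect permanent /index.html http://blog.example.com/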

* FEED:
redirect e.g. /atom.xml to your Feedburner feed
Redirect permanent /atom.xml http://feeds.feedburner.com/(exampleblog)

* ARCHIVES:
redirect e.g. /archive/2005_03_posts.html to the new Wordpress archives
RedirectMatch permanent /archive/([0-9][0-9][0-9][0-9])_([0-9][0-9])_.*$ http://blog.example.com/$1/$2/

* POST PAGES:
This is tricky, because Blogger and Wordpress do not use exactly the same rules for constructing the text-like URL (the ‘post slug’). E.g. a post called how-to-podcast-with-blogger-and.html on my old Blogger site became how-to-podcast-with-blogger-and-smartcast/ on the new Wordpress one. So what I did consisted of two types of rules:
a) redirecting individual pages
Redirect permanent /2004/10/how-to-podcast-with-blogger-and.html http://blog.example.com/2004/10/how-to-podcast-with-blogger-and-smartcast/
b) a generic rule for the others (this uses Rewrite instead of RedirectMatch!): each page is redirected to a search on the Wordpress blog within the correct month with the two first words of the title:
RewriteRule ^([0-9][0-9][0-9][0-9])/([0-9][0-9])/([a-z0-9]*)-([a-z0-9]*).*$ http://blog.example.com/$1/$2/?s=$3+$4 [R,L]
This method is far from perfect, but it will bring visitors a lot closer to the right page. If you use pretty distinctive words for titles (e.g. “Myspace: bulletin and other spam”), chances are the right page shows up first. If you start all your posts with “The ten best ways to …” then you will need a more sophisticated rule, e.g. one using the 6th and 7th word:
RewriteRule ^([0-9][0-9][0-9][0-9])/([0-9][0-9])/[a-z0-9]*-[a-z0-9]*-[a-z0-9]*-[a-z0-9]*-[a-z0-9]*-([a-z0-9]*)-([a-z0-9]*).*$ http://blog.example.com/$1/$2/?s=$3+$4 [R,L]
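Before putting a rule like this into a live .htaccess, it can help to try the pattern on a few old paths from the command line. The following is only a minimal sketch that mimics the generic "first two words" rule with sed -E. The function name old_to_new and the sample path are illustrations, and sed's regex dialect is close to, but not identical to, the one mod_rewrite uses, so treat it as a sanity check rather than proof.

```shell
#!/bin/sh
# Hypothetical helper: map an old Blogger path (without leading slash, the
# way a RewriteRule in .htaccess sees it) to the Wordpress search URL.
old_to_new() {
    printf '%s\n' "$1" | sed -E \
        's#^([0-9]{4})/([0-9]{2})/([a-z0-9]*)-([a-z0-9]*).*$#http://blog.example.com/\1/\2/?s=\3+\4#'
}

old_to_new "2004/10/how-to-podcast-with-blogger-and.html"
```

This prints http://blog.example.com/2004/10/?s=how+to. A path that does not match the pattern is printed unchanged, which makes it easy to spot URLs the rule would miss.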

Not losing the querystring
Redirect and RedirectMatch only look at the URL path, so they give you no control over the querystring (e.g. newpage.php?param1=val1&param2=val2). When you need to handle it explicitly, use a RewriteRule. An example: redirect all links like test.asp?param=value on the old domain to the new domain while keeping all querystring parameters:
RewriteRule ^tools/test\.asp$ http://web.example.com/tools/test.asp [L,QSA]
where QSA (query string append) keeps the existing querystring, and L (last rule) stops looking for further rule matches. Note that the RewriteRule pattern is matched against the path only, never against the querystring, so there is no point in trying to match the ‘?’ in the pattern.


Convert Bind DNS zone into PTR records

I made the following script to convert the forward DNS records in a /var/named/db.[domain] file into the correct format for a reverse DNS db.[subnet prefix] file.

#!/bin/sh
(...)
DNSROOT=/var/named
PREFIX=$1   # e.g. 192.168.110
DOMAIN=$2   # e.g. internal.example.com
shift 2
DNSPRE=$DNSROOT/db.$PREFIX
DNSDOM=$DNSROOT/db.$DOMAIN
echo "; save this in $DNSPRE"
(
# forward zone: turn every A record in this subnet into a PTR record
if [ -f "$DNSDOM" ] ; then
cat "$DNSDOM" \
| grep "$PREFIX" \
| grep -w "A" \
| sed "s/$PREFIX\.*//g" \
| gawk "BEGIN {OFS = \"\t\" ;} {print \$4,\"IN\",\"PTR\",\$1 \".$DOMAIN.\",\";; FROM `basename $DNSDOM`\" }"
fi

# existing reverse zone: keep the PTR records that are already there
if [ -f "$DNSPRE" ] ; then
cat "$DNSPRE" \
| grep -w "PTR" \
| gawk "BEGIN {OFS = \"\t\" ;} {print \$1,\$2,\$3,\$4,\";; FROM `basename $DNSPRE` \"; }"
fi ) \
| sort -n \
| uniq --check-chars=3

You would call it as follows:
revdns.sh 192.168.110 internal.example.com > new.db.192.168.110
and then replace the records of the original db.192.168.110 with the records of the new file. The script still requires manual intervention (you cannot pipe the result straight into a live Bind config file), but it saves a lot of typing!

Example of the output:

201 IN PTR james.internal.example.com. ;; FROM db.internal.example.com
202 IN PTR wilbur.internal.example.com. ;; FROM db.internal.example.com
216 IN PTR appprd1.internal.example.com. ;; FROM db.192.168.110
217 IN PTR appprd2.internal.example.com. ;; FROM db.192.168.110
218 IN PTR appprd3.internal.example.com. ;; FROM db.192.168.110
219 IN PTR appprd4.internal.example.com. ;; FROM db.192.168.110
220 IN PTR appprd5.internal.example.com. ;; FROM db.192.168.110
221 IN PTR appprd6.internal.example.com. ;; FROM db.192.168.110
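To see what the forward-to-reverse step does without touching /var/named, here is a small sketch that runs the same idea on an inline sample zone. It is not the script above: it uses plain awk instead of the grep/sed/gawk chain, the function name a_to_ptr and the sample records are made up, and it assumes records of the form "name IN A address".

```shell
#!/bin/sh
# Sketch: convert "name IN A a.b.c.d" lines into PTR lines for one prefix.
a_to_ptr() {   # usage: a_to_ptr PREFIX DOMAIN < zonefile
    awk -v prefix="$1" -v domain="$2" '
        $2 == "IN" && $3 == "A" && index($4, prefix ".") == 1 {
            host = substr($4, length(prefix) + 2)   # the last octet
            printf "%s\tIN\tPTR\t%s.%s.\n", host, $1, domain
        }' | sort -n
}

# Made-up sample zone: only the two 192.168.110.x A records should survive.
a_to_ptr 192.168.110 internal.example.com <<'EOF'
james   IN A 192.168.110.201
wilbur  IN A 192.168.110.202
mail    IN A 10.0.0.5
www     IN CNAME james
EOF
```

Addresses outside the prefix and non-A records are silently dropped, which is the same filtering the two grep calls do in the original script.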

Installing NTP (time synchronisation)

Set timezone (optional)
create a symbolic link /etc/localtime that points to the right file under /usr/share/zoneinfo/...:
ln -sf /usr/share/zoneinfo/Europe/Brussels /etc/localtime
Set UTC mode (optional)
if your hardware clock runs in UTC (Coordinated Universal Time) mode, add
UTC=true
to the /etc/sysconfig/clock file
Make sure ntpd is not running
Use service ntpd stop to stop it.
Choose the NTP server you will get your time from
it can be an internal server that has the NTP service open for clients, or a public NTP server. To be safe, use two servers. To check whether you can reach a server, run ntpdate timeserver.ntp.ch
Edit the /etc/ntp.conf file
Rename the current file to ntp.bak.conf and make a small new one:
restrict default ignore
server timeserver.ntp.ch # Swiss time
server ntp.ucsd.edu # Univ of California, San Diego
restrict timeserver.ntp.ch mask 255.255.255.255 nomodify notrap noquery
restrict ntp.ucsd.edu mask 255.255.255.255 nomodify notrap noquery
server 127.127.1.0 # local clock
fudge 127.127.1.0 stratum 10 #so it only takes over if the rest fails
restrict 127.0.0.1
driftfile /etc/ntp/drift
broadcastdelay 0.008
authenticate no
Set your system clock right
Run the following command a couple of times:
ntpdate -u timeserver.ntp.ch # or whatever server you want to use
You will see the initial difference in time go away after the 2nd or 3rd time.
Set hardware clock
/sbin/hwclock --systohc
Run the ntpd daemon
service ntpd start
Add ntpd to the services started at boot time
chkconfig ntpd on
Check the NTP results
ntpq -p
will show you what the difference is between your clock and that of the servers you added. You are looking for lines like

remote refid st t when poll reach delay offset jitter
==========================================================================
LOCAL LOCAL 10 l 30 64 377 0.000 0.000 0.004 *
192.168.246.107 192.168.246.88 3 u 41 128 177 0.313 5.598 0.345

and not lines like

remote refid st t when poll reach delay offset jitter
==========================================================================
192.168.246.126 LOCAL 11 u 37 128 375 0.204 6082.02 6069.84

Jitter is too high!
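If you want to automate that eyeball check, something along these lines could work. It is only a sketch: the function name check_peers and the 100 ms limit are my own assumptions, and it relies on the column order shown above (offset in column 9, jitter in column 10, both in milliseconds).

```shell
#!/bin/sh
# Sketch: flag peers whose offset or jitter (both in ms) exceeds a limit.
# Reads "ntpq -p" style output on stdin; the default limit is arbitrary.
check_peers() {   # usage: ntpq -p | check_peers [LIMIT_MS]
    awk -v limit="${1:-100}" '
        NR <= 2 { next }                       # skip the two header lines
        {
            off = ($9 < 0) ? -$9 : $9          # absolute value of the offset
            status = (off > limit || $10 > limit) ? "BAD" : "ok"
            printf "%s %s offset=%s jitter=%s\n", status, $1, $9, $10
        }'
}

# Demo on the two sample lines from this post.
check_peers 100 <<'EOF'
remote refid st t when poll reach delay offset jitter
==========================================================================
192.168.246.107 192.168.246.88 3 u 41 128 177 0.313 5.598 0.345
192.168.246.126 LOCAL 11 u 37 128 375 0.204 6082.02 6069.84
EOF
```

With the sample input, the first peer is reported as ok and the second one as BAD.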

Perl HTML scraping part #1

Here we are, back at the scene of the crime. Yes, I know it’s been a while. And the task of the day is:

GOAL:
make an HTML scraper, i.e. a script that grabs another URL and outputs the results to the screen
TOOL:
let’s say … Perl (in my case: Perl 5.8 on RedHat)
INPUT:
a URL
OUTPUT:
the HTML code of that URL

The actual HTML retrieval is easy: you need get() from the LWP::Simple module:
use LWP::Simple;
my $page = get($url);

Some remarks:

  • Since you are generating a web page, you need the CGI module (to take care of the HTTP headers and stuff).
  • The URL input parameter will be given as an HTTP querystring: ?url=http://www.example.com/path/page.htm. When no url parameter is given, we generate a form where it can be filled in.
  • We calculate the time it takes to retrieve the original page
    #!/usr/bin/perl -w
    use strict;
    use CGI qw(:standard);
    use LWP::Simple qw(!head);   # CGI already exports its own head()

    my $query = CGI->new;
    my $url = $query->param('url');
    my $debug = 0;

    print header();
    if (defined $url && length($url) > 0) {
        print getpage($url);
    } else {
        showform();
    }

    # fetch the page and (optionally) report timing and size
    sub getpage {
        my $url = shift;
        my $time1 = time();
        debuginfo("Scraping <a target=_blank href='" . $url . "'>link</a> …");
        my $page = get($url);
        $page = '' unless defined $page;   # get() returns undef on failure
        my $time2 = time();
        debuginfo("Time taken was <b>" . ($time2 - $time1) . "</b> seconds");
        debuginfo("Total bytes scraped: <b>" . length($page)/1000 . "KB</b>");
        return $page;
    }

    # print a line of debug output when $debug is switched on
    sub debuginfo {
        if ($debug > 0) {
            my $text = shift;
            print "<small>", $text, "</small><br />\n";
        }
    }

    # show a form where the URL can be filled in
    sub showform {
        print("<html><head>");
        print("<title>SCRAPER</title>");
        print("<link rel=stylesheet type=text/css href=http://www.forret.com/blog/style.css>");
        print("</head><body><center>\n");
        print("<form method=GET action='scrape.pl'>");
        print("URL: <input name=url type=text size=60 value=http://www.forret.com>");
        print("<input type=submit></form>\n");
        print("</center></body></html>\n");
    }

    Next step: making sure image src= and hyperlink href keep on working (so convert relative links to absolute links!).

    Squid cachemgr.cgi UI hack

    Squid has a little system statistics viewer built-in:

    The cache manager (cachemgr.cgi) is a CGI utility for displaying statistics about the squid process as it runs. The cache manager is a convenient way to manage the cache and view statistics without logging into the server.
    (from Squid FAQ)

    The only thing is … it’s so ugly! It uses plain HTML and cannot be customized, the FAQ says. However, there is a way to do it:

    1. copy cachemgr.cgi to cachemgr2.cgi so if you do something wrong, the original is not lost.
    2. open the CGI file in a text-editor. I used vi, but if you’re not used to working with it, use something else (emacs?).
    3. in the binary file, look for some text portions that look like HTML code
    4. while keeping in mind that the # of characters should remain the same, change the <title> and <style> to something that suits you. You will have to do this at 2 locations in the file: one for the homepage template and one for the other pages’ template.
    5. suggestion: just let the CGI use a style.css file that you drop into the same folder:
      <link rel="stylesheet" type="text/css" href="style.css" /> and fill up with spaces to keep the same # of characters
    6. verify that cachemgr.cgi and cachemgr2.cgi have the same # of bytes (e.g. compare their sizes with ls -l)
    7. now use cachemgr2 to display your statistics.
    8. I did something a bit different (I wanted to use the CSS of my own website), so I’ll show you the difference between the two versions.
      To get the following comparison, I ran strings cachemgr.cgi > cachemgr.txt (and likewise strings cachemgr2.cgi > cachemgr2.txt) to extract only the text parts, and then diff cachemgr.txt cachemgr2.txt to compare them. A plain diff of the two binary files would only tell you that they differ.

      173,174c173,174
      < <HTML><HEAD><TITLE>Cache Manager Interface</TITLE>
      < <STYLE type="text/css"><!-- BODY{background-color:#ffffff;font-family:verdana,sans-serif} --></STYLE></HEAD>
      ---
      > <HTML><HEAD><TITLE>Cache Manager (pforret)</TITLE>
      > <link rel="stylesheet" type="text/css" href="http://www.forret.com/forret/forret.css" /> </HEAD>
      199c199
      < <STYLE type="text/css"><!-- BODY{background-color:#ffffff;font-family:verdana,sans-serif} TABLE{background-color:#333333;border:0pt;padding:0pt}TH,TD{background-color:#ffffff}--></STYLE>
      ---
      > <link rel="stylesheet" type=text/css href="http://www.forret.com/forret/forret.css"><!-- TABLE{background-color:#333333;border:0pt;padding:0pt} TH,TD{background-color:#ffffff}--></STYLE>

      Probe disk performance (MRTG)

      The hdparm tool can be used to measure the throughput of a hard disk:
      # hdparm -tT /dev/hda
      /dev/hda:
      Timing buffer-cache reads: 888 MB in 2.00 seconds = 444.00 MB/sec
      Timing buffered disk reads: 20 MB in 3.30 seconds = 6.06 MB/sec

      This would be an interesting performance metric to see plotted against time. So let’s convert it to a format ready for MRTG.

      • The only numbers we need are the last ones: resulting speed. This can be parsed from the output as follows:
        # hdparm -tT /dev/hda | gawk -F = '/seconds/ { print $2 }'

        440.00 MB/sec   3.30 MB/sec
      • if we could assume that the results will always be in “MB/sec”, we could parse out the numbers with
        (...) | gawk '{print $1}'
        and then add a line to our MRTG config files to adjust the units:
        kMG[_]: M,G,T,P,X
        But let’s say that KB/sec or GB/sec speeds are possible.
      • One gawk can do the conversion trick:
        # (...) | gawk '/GB/ {print $1*1000000000} /MB/ {print $1*1000000} /KB/ {print $1*1000}'

        440000000 3300000
      • To have a complete MRTG-ready output, we also add the boot time on line 3 and the name of the MRTG output on line 4
      • Q: Do we need 2 gawks one after the other? Can’t one do it?
        A: You could do it in 1, I guess, but the parsing would be more complex. I use 2 because the FS (field separator) changes: the first gawk uses the ‘=’ character, the second uses the normal whitespace.
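For what it's worth, here is one way the whole thing could be folded into a single awk call. Take it as a sketch under assumptions, not as the script I actually run: the function name hdparm_to_mrtg is made up, the uptime string and target name are simply passed in as arguments, and it uses the same decimal multipliers (1000, 1000000, ...) as the one-liners above.

```shell
#!/bin/sh
# Sketch: turn "hdparm -tT" output (on stdin) into the 4 lines MRTG expects:
# value 1, value 2, uptime string, target name.
hdparm_to_mrtg() {   # usage: hdparm -tT /dev/hda | hdparm_to_mrtg UPTIME NAME
    awk -F= '/seconds/ {
        split($2, part, " ")                # part[1] = number, part[2] = unit
        mult = 1
        if (part[2] ~ /^[kK]B/) mult = 1000
        if (part[2] ~ /^MB/)    mult = 1000000
        if (part[2] ~ /^GB/)    mult = 1000000000
        printf "%.0f\n", part[1] * mult     # bytes/sec, never in exponent notation
    }'
    printf '%s\n%s\n' "$1" "$2"
}

# Demo on the sample output from this post; "38 days" and "hda throughput"
# are placeholder values.
hdparm_to_mrtg "38 days" "hda throughput" <<'EOF'
/dev/hda:
 Timing buffer-cache reads: 888 MB in 2.00 seconds = 444.00 MB/sec
 Timing buffered disk reads: 20 MB in 3.30 seconds = 6.06 MB/sec
EOF
```

The printf with %.0f is deliberate: a plain print can switch to exponent notation (6.06e+06) when the multiplication does not come out as an exact integer, and MRTG would not know what to do with that.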

      Date formatting in GAWK: boot time

      I have one server with apparently exceptional stability:
      # uptime

      3:45pm  up 524 days,  1:22,  1 user,  load average: 0.44, 0.16, 0.13

      Unfortunately I know this is not correct (I remember rebooting it some weeks ago). So what are other ways to get the date/time of the last boot?

      Looking at the RedHat manuals, the following thing should work too:
      # cat /proc/stat
      cpu 33813143 210619911 30093342 59435750
      cpu0 33813143 210619911 30093342 59435749
      (…)
      btime 1096157569
      (…)

      The btime gives us the last boot time in seconds since 1 Jan 1970. I can find and convert it with gawk:
      # gawk '/btime/ { print (systime() - $2) / (3600 * 24.0), "days -", strftime("%a %b %d %H:%M:%S %Z %Y", $2) }' /proc/stat
      38.6473 days - Sun Sep 26 02:12:49 CEST 2004

      Which gives us an uptime of 38.6 days – that looks more like it!

      Another way of calculating the uptime:
      # gawk '/cpu/ {print $1,($2 + $3 + $4 + $5)/(3600 * 24 * 100)}' /proc/stat
      cpu 38.6515
      cpu0 38.6515

      Confirmation of the previous measurement!

      # cat /proc/uptime
      45282758.17 663091.26

      The first number is the # of seconds since last boot. The other one (idle time) we don’t need. What is that in days?
      # gawk '{print $1/(3600 * 24.0)}' /proc/uptime
      524.106

      This is where the wrong data is coming from! So I’ll ignore this data.

      Remark: This server is one of my oldest ones and is still running Redhat 7.2 (Enigma). Looks like this bug was fixed in later versions of RedHat, since none of my other servers have it.
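If you need the btime calculation in a script (e.g. for the uptime line of an MRTG probe), it could be wrapped up roughly like this. It is a sketch only: the function name uptime_days is made up, and it uses plain awk plus date +%s so it does not depend on gawk's time functions.

```shell
#!/bin/sh
# Sketch: days elapsed between a boot time and "now", both in epoch seconds.
uptime_days() {   # usage: uptime_days BTIME [NOW]
    awk -v boot="$1" -v now="${2:-$(date +%s)}" \
        'BEGIN { printf "%.1f\n", (now - boot) / 86400 }'
}

# On Linux, take the boot time from the btime line in /proc/stat.
if [ -r /proc/stat ]; then
    uptime_days "$(awk '/^btime/ { print $2 }' /proc/stat)"
fi
```

The optional second argument is only there to make the function easy to check with fixed numbers: uptime_days 0 86400 prints 1.0.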

      Probe average cpu utilisation (MRTG)

      There are two main tools to keep track of your CPU usage: top and vmstat.

      • top is an interactive tool: it shows you the CPU usage of each process, as well as overall statistics, updated every 5 seconds. It’s good for hands-on checking.

        # top
        17:18:34 up 2 days, 8:14, 3 users, load average: 0.00, 0.00, 0.00
        47 processes: 46 sleeping, 1 running, 0 zombie, 0 stopped
        CPU states: 0.1% user 0.1% system 0.0% nice 0.0% iowait 99.6% idle
        Mem: 1030872k av, 1022256k used, 8616k free,
        0k shrd, 104844k buff
        777088k actv, 12k in_d, 22296k in_c
        Swap: 2048276k av, 8120k used, 2040156k free
        640080k cached
        PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
        30776 root 19 0 1140 1140 852 R 0.9 0.1 0:00 0 top
        1 root 15 0 504 464 436 S 0.0 0.0 0:03 0 init (...)

        But say you want to get just one number (percentage) back, so you can use it for logging.
      • vmstat will give you the following output:

        #vmstat
        procs memory swap io system cpu
        r b w swpd free buff cache si so bi bo in cs us sy id
        0 0 0 7964 8804 104712 640224 0 0 2 16 129 27 0 0 100

        You can run vmstat 1 5 to get 5 consecutive measurements (1 second apart). The number we want is the average CPU usage, or (100% – idle). In the layout above, the idle percentage (id) is the 16th column. The following command will do the job:
        # vmstat 1 5 | gawk '/0/ {tot=tot+1; id=id+$16} END {print 100 - id/tot}'
        gives
        0.4
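The position of the id column is not the same in every vmstat version, so a slightly more defensive variant could look up the column by its header name. Again just a sketch: the function name cpu_usage is made up, and it assumes that every line after the column header is a measurement.

```shell
#!/bin/sh
# Sketch: average CPU usage (100 - idle) from "vmstat" output on stdin.
# The idle column is located by its header name "id", not by position.
cpu_usage() {   # usage: vmstat 1 5 | cpu_usage
    awk '
        col == 0 {                          # still looking for the header line
            for (i = 1; i <= NF; i++) if ($i == "id") col = i
            next
        }
        { idle += $col; n++ }               # every later line is a sample
        END { if (n > 0) printf "%.1f\n", 100 - idle / n }'
}

# Demo on two sample lines in the layout shown above.
cpu_usage <<'EOF'
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 0 7964 8804 104712 640224 0 0 2 16 129 27 0 0 100
0 0 0 7964 8804 104712 640224 0 0 0 0 110 20 1 1 98
EOF
```

For the two sample lines (100% and 98% idle) this prints 1.0.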