Archive for the 'Linux' Category



Estimate # of lines in a log file

Let’s say you need an (approximate) count of the number of lines in a huge file. The most obvious way of calculating this would be using wc, but this actually can be quite slow:
# time wc -l /var/log/squid/access.log
2812824 /var/log/squid/access.log
real 0m43.988s

(counting is done at about 64,000 lines/sec)

Running wc without -l (which counts only lines) would be even slower, because it would also count the words instead of just the LF (linefeed) characters. But using wc -c is very fast! This is because the filesystem keeps track of each file’s size in bytes, so the file does not even have to be read to give this number. Can we estimate the number of lines from the number of bytes?

For the type of file we are talking about here (a Squid log file) there actually is a way. The file is more or less ‘square’, meaning that every line is about the same length (it contains date, status, URL, …).
If we take the beginning of the file (the first 10000 lines):
# head -10000 /var/log/squid/access.log | wc
10000 100000 1775257

we see that every line is about 177 chars long.

The end of the file (the last 10000 lines):
# tail -10000 /var/log/squid/access.log | wc
10000 100000 2047887

gives us a number of 204 chars/line.

Let’s take some more data and combine both:
# ( head -50000 /var/log/squid/access.log ; tail -50000 /var/log/squid/access.log ) | wc
100000 1000000 19488905

which gives us an average of 195 chars/line.

With a file size of 533,229,920 bytes (533 MB), dividing by 195 bytes per line gives an estimate of 2,734,512 lines, where the actual number of lines is 2,818,184 (a 3% difference). That is: we lose 3% accuracy, but the calculation takes almost no CPU time instead of 45 seconds. This might be a trade-off you are willing to accept!
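The estimate above can be wrapped in a small shell function. This is only a sketch under the same assumption as the text (lines of roughly equal length); `estimate_lines` and its `SAMPLE` parameter are names made up for this example.

```shell
#!/bin/sh
# estimate_lines FILE [SAMPLE]   (hypothetical helper, not from the original post)
# Estimates the number of lines in FILE from its size in bytes and the
# average line length of its first and last SAMPLE lines (default 50000).
estimate_lines() {
    file=$1
    sample=${2:-50000}

    # Total size in bytes, obtained with wc -c as in the text.
    size=$(wc -c < "$file")

    # Sample both ends of the file and count lines and bytes in one pass.
    # wc prints the line count first, then the byte count.
    { head -n "$sample" "$file"; tail -n "$sample" "$file"; } | wc -l -c |
        awk -v size="$size" '
            $2 > 0 { printf "%d\n", size * $1 / $2; next }
                   { print 0 }   # empty file: nothing to estimate
        '
}

# Usage: estimate_lines /var/log/squid/access.log 50000
```

If the file has fewer than SAMPLE lines, head and tail both return the whole file; the lines are then counted twice, but the average length (and so the estimate) is unaffected. With the figures from the text (533,229,920 bytes, 100,000 sampled lines totalling 19,488,905 bytes) the function would print roughly 2.74 million, slightly different from the number above because the average is not rounded to 195 first.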


Calculate hit rate from a log file

You have a huge file that contains one line per request/transaction. Some of the lines are of one type (e.g. ‘HIT’), some of another (e.g. ‘MISS’). Let’s say you want to calculate the hit rate, and as fast as possible.
We take a Squid log file of about 140MB. How long does it take to count how many lines it has?
# time wc -l /var/log/squid/access.log
845212 /var/log/squid/access.log
real 0m6.523s
(about 21.4 MB/s or 130,000 lines/s)

And now let’s just filter out the lines containing ‘HIT’ and count those:
# time sh -c "grep -i HIT /var/log/squid/access.log | wc -l"
Wow! This takes ages (I stopped it after 15 minutes) and the grep takes 100% CPU all the time. So let’s look for another solution.

Maybe gawk? First let’s see if it is much slower than wc -l for counting lines:
# time gawk "END {print NR}" /var/log/squid/access.log
845907
real 0m26.129s
(5.3 MB/s or 32,000 lines/s – 4 times slower)
And now let it count the hits too:
# time gawk "BEGIN {hit=0} /HIT/ {hit = hit+1} END {print hit/NR*100}" '/var/log/squid/access.log'
84.5023
real 0m32.836s
(4 MB/s or 25,000 lines/s – slow but acceptable)

Do we actually need a count on the whole file? What if we just took the last (i.e. most recent) 100,000 lines? The result would be a better indication of what the current hit rate is, and the speed of calculation would be more predictable.
# time sh -c "tail -100000 /var/log/squid/access.log | gawk 'BEGIN {hit=0} /HIT/ {hit = hit+1} END {print hit/NR*100}'"
92.305
real 0m3.332s
(30,000 lines/s)

It is actually a bit slower the first time you run it, probably because the data is not yet in the disk or filesystem cache. So if you want your hit rate calculation to take less than 2 seconds, you could take the last 50,000 lines. Done!
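The tail-plus-awk approach can be kept around as a reusable function. A minimal sketch: `recent_hit_rate` is a hypothetical name, plain `awk` is assumed to behave like gawk here, and, as in the commands above, any line containing ‘HIT’ anywhere counts as a hit.

```shell
#!/bin/sh
# recent_hit_rate FILE [LINES]   (hypothetical helper, not from the original post)
# Prints the percentage of the last LINES lines of FILE (default 100000)
# that contain the string HIT.
recent_hit_rate() {
    file=$1
    lines=${2:-100000}

    tail -n "$lines" "$file" |
        awk '
            /HIT/ { hit++ }                    # count matching lines
            END {
                if (NR > 0) print hit / NR * 100
                else        print 0            # empty input: avoid dividing by zero
            }
        '
}

# Usage: recent_hit_rate /var/log/squid/access.log 50000
```

Because only the tail of the file is read, the running time depends on the number of lines requested rather than on the size of the log.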

Squid: list top X referers

If your Squid server logs the referers of its requests, i.e.
1. you’ve configured squid-cache with --enable-referer-log before compiling and
2. you’ve included a referer_log /var/log/squid/referer.log line in your squid.conf file,
you can easily show a top 50 of the most popular referers with a simple shell script:
#!/bin/bash
# this script is ‘top_referers.sh’
# (c) 2004 Peter Forret - Open Source
REFERERS=/var/log/squid/referer.log
OUTPUT=/var/www/html/stats/referer.txt
MAXLINES=50
(
echo REPORT MADE AT `date`
echo =============================
# (…)
) > $OUTPUT
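The counting step itself is not shown in the listing above. The following is a minimal sketch of one way to do it, not the author's code. It assumes the referer URL is the third whitespace-separated field of each log line (timestamp, client, referer, request); check your own referer.log before relying on that. `top_referers` and its parameters are hypothetical names.

```shell
#!/bin/sh
# top_referers FILE [MAXLINES] [FIELD]   (hypothetical helper)
# Prints the MAXLINES most frequent values (default 50) of the given
# whitespace-separated FIELD of FILE, most frequent first, each prefixed
# with its count.
# ASSUMPTION: the referer URL is field 3 of the referer log.
top_referers() {
    file=$1
    max=${2:-50}
    field=${3:-3}

    awk -v f="$field" 'NF >= f { print $f }' "$file" |
        sort |          # group identical referers together
        uniq -c |       # collapse each group into "count referer"
        sort -rn |      # highest count first
        head -n "$max"
}

# Usage: top_referers /var/log/squid/referer.log 50
```

The sort | uniq -c | sort -rn chain sorts the whole log on every run, which is usually acceptable for an hourly cron job.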

Then add it to your crontab:
10 * * * * /(path)/top_referers.sh
and you have an hourly updated stat!
Add a little HTML formatting if you’re aesthetically demanding!

Redhat versions: what am I running?

If you manage multiple RedHat servers, or if you just stumble on a Linux server, and you have no idea what kind of machine it is, nor what the version of the OS is, try the following commands:

# more /proc/version
Linux version 2.4.20-24.9 (bhcompile@porky.devel.redhat.com)
(gcc version 3.2.2 20030222 (Red Hat Linux 3.2.2-5)) #1
Mon Dec 1 11:35:51 EST 2003
# more /proc/cpuinfo
vendor_id : GenuineIntel
model name : Intel(R) Pentium(R) 4 CPU 2.00GHz
cpu MHz : 1992.653
cache size : 512 KB
(…)
bogomips : 3971.48
# more /proc/meminfo
MemTotal: 1030872 kB
(…)
# cat /etc/redhat-release (only for RedHat distributions)
Red Hat Linux release 9 (Shrike)

So now you know: a 2GHz Pentium 4 with 1GB of memory, running RedHat 9 ‘Shrike’.
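If you have to do this on many machines, the same lookups can be collected in one function. This is a sketch only: `summarize_host` is a made-up name, and the three file paths are parameters (defaulting to the files used above) so that it can also be tried on saved copies.

```shell
#!/bin/sh
# summarize_host [CPUINFO] [MEMINFO] [RELEASE]   (hypothetical helper)
# Prints CPU model, total memory and distribution release, read from the
# same files as the commands above.
summarize_host() {
    cpuinfo=${1:-/proc/cpuinfo}
    meminfo=${2:-/proc/meminfo}
    release=${3:-/etc/redhat-release}

    # First "model name" line, with everything up to the colon stripped.
    awk '/^model name/ { sub(/^[^:]*:[ \t]*/, ""); print "CPU: " $0; exit }' "$cpuinfo"

    # MemTotal is reported in kB; convert to whole MB.
    awk '/^MemTotal:/ { printf "Memory: %d MB\n", $2 / 1024; exit }' "$meminfo"

    # The release file only exists on Red Hat style distributions.
    if [ -r "$release" ]; then
        echo "OS: $(head -n 1 "$release")"
    else
        echo "OS: unknown (no $release)"
    fi
}

# Usage: summarize_host
```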
For more info on RedHat versions: Taroon, Shrike, Enigma, …