Calculate hit rate from a log file

You have a huge file that contains one line per request/transaction. Some lines are of one type (e.g. ‘HIT’), some of another (e.g. MISS). Let’s say you want to calculate the hit-rate, but as fast as possible.
We take a Squid log file of about 140MB. How long does it take to count how many lines it has?

$ time wc -l /var/log/squid/access.log
845212 /var/log/squid/access.log
real 0m6.523s

(about 21.4 MB/s or 130.000 lines/s)

And now let’s just filter out the lines containing ‘HIT’ and count those:

$ time sh -c "grep -i HIT /var/log/squid/access.log | wc -l"

Wow! This takes ages (I stopped it after 15 minutes) and the grep takes 100% CPU all the time. So let’s look for another solution.

Maybe gawk? First let’s see if it is much slower than wc -l for counting lines:

$ time gawk "END {print NR}" /var/log/squid/access.log
real 0m26.129s

(5.3 MB/s or 32.000 lines/s – 4 times slower)

And now let it count the hits too:

$ time gawk "BEGIN {hit=0} /HIT/ {hit = hit+1} END {print hit/NR*100}" '/var/log/squid/access.log'
real 0m32.836s

(4MB/s or 25.000 lines/s – slow but acceptable)

Do we actually need a count on the whole file? What if we just took the last (i.e. most recent) 100.000 lines? The result would be a better indication of what the current hit rate is, and the speed of calculation would be more predictable.

$ time sh -c "tail -100000 /var/log/squid/access.log | gawk 'BEGIN {hit=0} /HIT/ {hit = hit+1} END {print hit/NR*100}'"
real 0m3.332s

(30.000 lines/s)

It is actually a bit slower the first time you run it, probably due to disk or filesystem caching. So if you want your hit rate calculation to take less than 2 seconds, you could take the last 50.000 lines. Done!

💬 bash 🏷 benchmark