Calculate hit rate from a log file

You have a huge file that contains one line per request or transaction. Some of the lines are of one type (e.g. ‘HIT’), some of another (e.g. ‘MISS’). Let’s say you want to calculate the hit rate, and do it as fast as possible.
We take a Squid log file of about 140MB. How long does it take to count how many lines it has?
# time wc -l /var/log/squid/access.log
845212 /var/log/squid/access.log
real 0m6.523s (about 21.4 MB/s or 130,000 lines/s)

And now let’s just filter out the lines containing ‘HIT’ and count those:
# time sh -c "grep -i HIT /var/log/squid/access.log | wc -l"
Wow! This takes ages (I stopped it after 15 minutes), and grep uses 100% CPU the whole time. So let’s look for another solution.
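As a side note, a frequent cause of this kind of slowdown on older GNU grep builds is case-insensitive matching in a UTF-8 locale. Whether that is what happened here is only a guess; it was not measured. A minimal sketch of a byte-wise variant, where the function name count_hits is made up for illustration:

```shell
# count_hits FILE: count the lines of FILE that contain "hit" in any letter case.
# LC_ALL=C asks grep to compare plain bytes instead of multibyte characters
# (assumption: the locale is what makes the original command slow).
# grep -c prints the number of matching lines, so no pipe to wc -l is needed.
count_hits() {
    LC_ALL=C grep -c -i HIT "$1"
}
```

It would be called as count_hits /var/log/squid/access.log. Note that -i also matches lowercase "hit" anywhere in the line, for example inside a URL, so the count can be higher than the number of real cache hits.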

Maybe gawk? First let’s see if it is much slower than wc -l for counting lines:
# time gawk "END {print NR}" /var/log/squid/access.log
845907
real 0m26.129s (5.3 MB/s or 32,000 lines/s – 4 times slower)
And now let it count the hits too:
# time gawk "BEGIN {hit=0} /HIT/ {hit = hit+1} END {print hit/NR*100}" '/var/log/squid/access.log'
84.5023
real 0m32.836s (4 MB/s or 25,000 lines/s – slow but acceptable)

Do we actually need to count the whole file? What if we just took the last (i.e. most recent) 100,000 lines? The result would be a better indication of the current hit rate, and the calculation time would be more predictable.
# time sh -c "tail -n 100000 /var/log/squid/access.log | gawk 'BEGIN {hit=0} /HIT/ {hit = hit+1} END {print hit/NR*100}'"
92.305
real 0m3.332s (30,000 lines/s)

The first run is actually a bit slower, probably because the file is not yet in the disk or filesystem cache. So if you want your hit rate calculation to take less than 2 seconds, you could take just the last 50,000 lines. Done!
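To make the line count easy to tune, the tail and awk pipeline can be wrapped in a small shell function. This is only a sketch: the name hit_rate and the default of 100000 lines are made up for illustration, and it calls plain awk rather than gawk on the assumption that both behave the same for this program.

```shell
# hit_rate FILE [LINES]: print the percentage of the last LINES lines of FILE
# that contain the string HIT (case-sensitive, like the gawk one-liner).
# LINES defaults to 100000; both the name and the default are illustrative.
hit_rate() {
    file=$1
    lines=${2:-100000}
    tail -n "$lines" "$file" |
        awk '/HIT/ { hit++ }
             END {
                 # Guard against an empty file to avoid dividing by zero.
                 if (NR > 0) print hit / NR * 100
                 else print "no lines"
             }'
}
```

For the roughly 2-second target it would be called as hit_rate /var/log/squid/access.log 50000.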
