Calculate hit rate from a log file

You have a huge file that contains one line per request/transaction. Some of the lines are of one type (e.g. ‘HIT’), some of another (e.g. MISS). Let’s say you want to calculate the hitrate, but as fast as possible.
We take a Squid log file of about 140MB. How long does it take to count how many lines it has?
# time wc -l /var/log/squid/access.log
845212 /var/log/squid/access.log
real 0m6.523s
(about 21.4 MB/s or 130.000 lines/s)

And now let’s just filter out the lines containing ‘HIT’ and count those:
#time sh -c "grep -i HIT /var/log/squid/access.log | wc -l"
Wow! This takes ages (I stopped it after 15 minutes) and the grep takes 100% CPU all the time. So let’s look for another solution.

Maybe gawk? First let’s see if it is much slower than wc -l for counting lines:
# time gawk "END {print NR}" /var/log/squid/access.log
845907
real 0m26.129s
(5.3 MB/s or 32.000 lines/s – 4 times slower)
And now let it count the hits too:
]# time gawk "BEGIN {hit=0} /HIT/ {hit = hit+1} END {print hit/NR*100}" '/var/log/squid/access.log'
84.5023
real 0m32.836s
(4MB/s or 25.000 lines/s – slow but acceptable)

Do we actually need a count on the whole file? What if we just took the last (i.e. most recent) 100.000 lines? The result would be a better indication of what the current hit rate is, and the speed of calculation would be more predictable.
# time sh -c "tail -100000 /var/log/squid/access.log | gawk 'BEGIN {hit=0} /HIT/ {hit = hit+1} END {print hit/NR*100}'"
92.305
real 0m3.332s
(30.000 lines/s)

It is actually a bit slower the first time you run it, probably due to disk or filesystem caching. So if you want your hit rate calculation to take less than 2 seconds, you could take the last 50.000 lines. Done!

Related posts:

  1. Estimate # of lines in a log file Let’s say you need an (approximate) count of the number...
  2. Squid: list top X referers If your Squid server logs the referers of its request...
  3. Probe disk performance (MRTG) The hdparam can be used to monitor the throughput speed...
  4. Date formatting in GAWK: boot time I have one server with apparently an exceptional stability: #...

1 Response to “Calculate hit rate from a log file”


  • thanx for the information, it is good article… can u explain me how to know the cpu service time for every hit.
    thanx b4.

Comments are currently closed.