Calculate hit rate from a log file

You have a huge file that contains one line per request/transaction. Some of the lines are of one type (e.g. ‘HIT’), some of another (e.g. ‘MISS’). Let’s say you want to calculate the hit rate, and you want to do it as fast as possible.
We take a Squid log file of about 140MB. How long does it take to count how many lines it has?
# time wc -l /var/log/squid/access.log
845212 /var/log/squid/access.log
real 0m6.523s
(about 21.4 MB/s or 130,000 lines/s)

And now let’s just filter out the lines containing ‘HIT’ and count those:
# time sh -c "grep -i HIT /var/log/squid/access.log | wc -l"
Wow! This takes ages (I stopped it after 15 minutes), and grep uses 100% CPU the whole time. So let’s look for another solution.
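
A possible explanation, offered as a guess and not a diagnosis: some older GNU grep releases were known to be very slow with -i in a multibyte (UTF-8) locale. Whether that is what happened here is unconfirmed. The sketch below sidesteps the issue by forcing the C locale, dropping -i (assuming the result codes in the log are uppercase, as in TCP_HIT), and letting grep count by itself with -c. The function name count_hits is made up for this example.

```shell
# Count the lines containing HIT in a log file.
# Assumptions: result codes are uppercase, so -i is not needed;
# LC_ALL=C makes grep treat the input as plain bytes.
count_hits() {
    LC_ALL=C grep -c 'HIT' "$1"
}

# Usage, with the path from this article:
#   time count_hits /var/log/squid/access.log
```

Using grep -c also saves the extra wc -l process and the pipe between the two.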

Maybe gawk? First let’s see if it is much slower than wc -l for counting lines:
# time gawk "END {print NR}" /var/log/squid/access.log
845907
real 0m26.129s
(5.3 MB/s or 32,000 lines/s, about 4 times slower than wc -l)
And now let it count the hits too:
# time gawk "BEGIN {hit=0} /HIT/ {hit = hit+1} END {print hit/NR*100}" '/var/log/squid/access.log'
84.5023
real 0m32.836s
(4 MB/s or 25,000 lines/s, slow but acceptable)
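
One caveat with /HIT/ is that it matches the text anywhere on the line, so a MISS whose URL happens to contain ‘HIT’ is counted as a hit. A minimal sketch of a stricter variant follows, assuming the log uses Squid’s native format, where the fourth field is the result code (for example TCP_HIT/200). The function name hit_rate is hypothetical, and plain awk is used here; gawk should work the same way.

```shell
# Print the hit rate (in percent) of a Squid access log.
# Assumption: native log format, where field 4 looks like TCP_HIT/200.
hit_rate() {
    awk '$4 ~ /HIT/ { hit++ }
         END {
             # Guard against an empty file to avoid dividing by zero.
             if (NR > 0) printf "%.2f\n", hit / NR * 100
             else print "no lines"
         }' "$1"
}

# Usage, with the path from this article:
#   time hit_rate /var/log/squid/access.log
```

If your log format differs, adjust the field number or fall back to matching the whole line.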

Do we actually need a count on the whole file? What if we just took the last (i.e. most recent) 100,000 lines? The result would be a better indication of what the current hit rate is, and the speed of calculation would be more predictable.
# time sh -c "tail -n 100000 /var/log/squid/access.log | gawk 'BEGIN {hit=0} /HIT/ {hit = hit+1} END {print hit/NR*100}'"
92.305
real 0m3.332s
(30,000 lines/s)

It is actually a bit slower the first time you run it, probably because the end of the file is not yet in the disk or filesystem cache. At roughly 30,000 lines/s, 50,000 lines take about 1.7 seconds, so if you want your hit rate calculation to take less than 2 seconds, you could take the last 50,000 lines. Done!
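
To make the number of lines easy to tune, the tail and awk steps can be wrapped in a small function. This is only a sketch: the name recent_hit_rate and the default of 100,000 lines are choices made for this example, and plain awk stands in for gawk.

```shell
# Print the hit rate (in percent) over the last N lines of a log file.
#   $1 = log file, $2 = number of lines (optional, defaults to 100000)
recent_hit_rate() {
    tail -n "${2:-100000}" "$1" |
        awk '/HIT/ { hit++ }
             END { if (NR > 0) print hit / NR * 100 }'
}

# Usage, with the path from this article:
#   time recent_hit_rate /var/log/squid/access.log 50000
```

Because tail only reads the end of the file, the running time depends on the number of lines you ask for and not on how large the log has grown.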


1 Response to “Calculate hit rate from a log file”


  • Thanks for the information, it is a good article. Can you explain how to find the CPU service time for every hit?
    Thanks in advance.
