You have a huge file that contains one line per request/transaction. Some of the lines are of one type (e.g. ‘HIT’), some of another (e.g. ‘MISS’). Let’s say you want to calculate the hit rate, and you want to do it as fast as possible.
We take a Squid log file of about 140MB. How long does it take to count how many lines it has?
# time wc -l /var/log/squid/access.log (about 21.4 MB/s or 130,000 lines/s)
845212 /var/log/squid/access.log
real 0m6.523s
And now let’s just filter out the lines containing ‘HIT’ and count those:
# time sh -c "grep -i HIT /var/log/squid/access.log | wc -l"
Wow! This takes ages (I stopped it after 15 minutes) and the grep takes 100% CPU all the time. So let’s look for another solution.
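As an aside, one variation that might be worth timing before giving up on grep is sketched here. It assumes (I did not verify this on the server above) that the slowness comes from case-insensitive matching in a multibyte locale, so it drops -i, forces the C locale and lets grep count the matching lines itself with -c. The function name count_hits is just a placeholder.

```shell
# count_hits FILE: print the number of lines in FILE that contain "HIT".
# Assumption: the result codes are always upper case, so -i is not needed.
count_hits() {
    # LC_ALL=C makes grep compare plain bytes; -c prints the match count.
    LC_ALL=C grep -c HIT "$1"
}
```

Usage would be something like: time count_hits /var/log/squid/access.log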
Maybe gawk? First let’s see if it is much slower than wc -l for counting lines:
# time gawk "END {print NR}" /var/log/squid/access.log (5.3 MB/s or 32,000 lines/s, 4 times slower)
845907
real 0m26.129s
And now let it count the hits too:
# time gawk "BEGIN {hit=0} /HIT/ {hit = hit+1} END {print hit/NR*100}" '/var/log/squid/access.log' (4 MB/s or 25,000 lines/s, slow but acceptable)
84.5023
real 0m32.836s
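Note that /HIT/ matches the text anywhere on the line, so a URL that happens to contain ‘HIT’ would be counted as well. A possible refinement, assuming the log is in Squid’s native format where the fourth field holds the result code (e.g. TCP_HIT/200), is to test only that field. This is a sketch, not something I timed, and hitrate_by_field is a made-up name. It uses plain awk; gawk should behave the same.

```shell
# hitrate_by_field FILE: percentage of lines whose 4th field contains "HIT".
# Assumption: native Squid access.log layout, with the result code in field 4.
hitrate_by_field() {
    awk '$4 ~ /HIT/ { hit++ }
         END { if (NR > 0) print hit / NR * 100 }' "$1"
}
```

Usage would be something like: time hitrate_by_field /var/log/squid/access.log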
Do we actually need a count on the whole file? What if we just took the last (i.e. most recent) 100,000 lines? The result would be a better indication of what the current hit rate is, and the speed of calculation would be more predictable.
# time sh -c "tail -100000 /var/log/squid/access.log | gawk 'BEGIN {hit=0} /HIT/ {hit = hit+1} END {print hit/NR*100}'" (30,000 lines/s)
92.305
real 0m3.332s
It is actually a bit slower the first time you run it, probably because the end of the file is not yet in the disk or filesystem cache. At about 30,000 lines/s, if you want your hit rate calculation to take less than 2 seconds, you could take the last 50,000 lines. Done!
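The tail-and-count approach can be wrapped in a small shell function so the number of lines is easy to tune. This is only a sketch: the name hitrate and the default of 100000 lines are my own choices, it uses the POSIX form tail -n, and it calls plain awk (gawk should work the same way).

```shell
# hitrate FILE [LINES]: hit percentage over the last LINES lines of FILE.
# LINES defaults to 100000 (an arbitrary default for this sketch).
hitrate() {
    file=$1
    lines=${2:-100000}
    tail -n "$lines" "$file" |
        awk '/HIT/ { hit++ }
             END { if (NR > 0) print hit / NR * 100 }'
}
```

Usage would be something like: hitrate /var/log/squid/access.log 50000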




Thanks for the information, it is a good article. Can you explain how to find the CPU service time for every hit?
Thanks in advance.