Benchmark: lowercase conversion in bash

Post #2 in this bash benchmark series, measuring the speed of common bash text manipulations.

Update [2022-04-08]: my project now benchmarks bash commands for both throughput (MB/s) and invocation speed (ops/sec), along with all kinds of other improvements, so I have updated the content of this article.

Converting text to lowercase


Proper lowercase conversion handles not only plain Latin characters, but also accénted letters and diaçritics.

Let’s go through some different methods:

using tr

The easiest method to remember, and the one promoted most on how-to sites like Stack Exchange, is tr. The command syntax is quite simple and elegant:

tr "[:upper:]" "[:lower:]"

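If we test this technique, we see the conversion is perfect (at least with the BSD tr that ships with macOS; GNU tr works on single bytes, so it may leave accented letters unchanged):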

Input:  'ŁORÈM ÎPSÙM DÔLÕR SIT AMÉT ŒßÞ' 
Output: 'łorèm îpsùm dôlõr sit amét œßþ'
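As a minimal sketch of how this fits into a script: tr reads standard input and writes standard output, so it slots into a pipeline or a redirect. The function name lower_tr and the file names are hypothetical, chosen only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: lowercase whatever arrives on stdin.
lower_tr() {
  tr '[:upper:]' '[:lower:]'
}

# Convert a string in a pipeline
printf '%s\n' 'LOREM IPSUM' | lower_tr

# Convert a whole file (file names are placeholders)
# lower_tr < input.txt > output.txt
```

Single quotes around the character classes keep the shell from treating the brackets as a glob pattern.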

using awk

This technique also has perfect lowercase conversion:

Input:  'ŁORÈM ÎPSÙM DÔLÕR SIT AMÉT ŒßÞ' 
Output: 'łorèm îpsùm dôlõr sit amét œßþ'
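The exact awk command is not shown above. A minimal sketch, assuming it relies on awk's built-in tolower() function, could look like this; the function name lower_awk is hypothetical. Whether accented letters are converted depends on the awk implementation and the active locale.

```shell
#!/usr/bin/env bash
# Assumption: lowercase each input line with awk's built-in tolower().
lower_awk() {
  awk '{ print tolower($0) }'
}

printf '%s\n' 'LOREM IPSUM' | lower_awk
```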

using sed

sed is very fast, but its y (transliterate) command doesn’t accept the ‘[:lower:]’-type character classes that tr has. So you have to specify every uppercase letter and its lowercase counterpart:

upper="ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÄÆÃÅĀǍÇĆČÈÉÊËĒĖĘĚÎÏÍÍĪĮÌǏŁÑŃÔÖÒÓŒØŌǑÕẞŚŠÛÜǓÙǕǗǙǛÚŪŸŽŹŻ"
lower="abcdefghijklmnopqrstuvwxyzàáâäæãåāǎçćčèéêëēėęěîïííīįìǐłñńôöòóœøōǒõßśšûüǔùǖǘǚǜúūÿžźż"
sed "y/$upper/$lower/"

But this is only effective if your input text doesn’t include weird alphabets like the Greek (αβγδ), Cyrillic (абвг) or the Armenian (աբգդ) one. If it does, you would need to use a humongous list of funny characters and still not be sure if it’s complete. (e.g. äàâαáåąăãāǎա бբ çćčћцծչց δđðђдďդ éèêëεηęēėěеёэեէը ѓфֆ γгģգղ հ ιíîïīįìǐиıի йջ κќкķկք λłлļĺľլ μмմ ñνńнņňն öôοωóòøōǒõоոօ πпպփ ρрŕřռր σšśсս τтťթտ üùûúǔǖǘǚǜūуųů βвվ ўւ ξխ ÿýыüյ ζžźżзզժ)

So the accuracy depends on the completeness of the list above.
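Since sed's y command requires both lists to have the same length, a quick length check can catch a typo in the lists before sed complains. This is only a sketch: the helper name lower_sed is hypothetical, and it assumes neither list contains a slash or backslash.

```shell
#!/usr/bin/env bash
# Hypothetical helper: lowercase stdin with sed's y command, using two
# caller-supplied lists that must line up character by character.
# Assumes the lists contain no "/" or "\" characters.
lower_sed() {
  local upper=$1 lower=$2
  # ${#var} counts characters in the current locale (bytes in the C locale)
  if (( ${#upper} != ${#lower} )); then
    echo "upper and lower lists differ in length" >&2
    return 1
  fi
  sed "y/$upper/$lower/"
}

printf '%s\n' 'LOREM IPSUM' |
  lower_sed 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' 'abcdefghijklmnopqrstuvwxyz'
```

The check only verifies that the lists are equally long, not that each pair is actually correct.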

using ${} variable expansion

In my first version of this post I overlooked a super simple technique that is built into bash (version 4.0 or later): ${variable,,} expands to the content of that variable in lowercase. If we want to convert a whole file with this technique, we have to wrap it in a while/do loop:

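# IFS= keeps leading/trailing whitespace; the quotes prevent word splitting and globbing
while IFS= read -r line ; do
    printf '%s\n' "${line,,}"
done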

This technique also has perfect conversion:

Input:  'ŁORÈM ÎPSÙM DÔLÕR SIT AMÉT ŒßÞ' 
Output: 'łorèm îpsùm dôlõr sit amét œßþ'
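For a single value that is already in a variable, no loop is needed at all. The sketch below shows both uses; the function name lower_bash is hypothetical, and the extra test in the loop condition is my addition to keep a final line that has no trailing newline.

```shell
#!/usr/bin/env bash
# Requires bash 4.0 or later for the ${var,,} expansion.

# Single value: no external process, no loop.
name='LOREM IPSUM'
echo "${name,,}"

# Hypothetical helper for a stream: lowercase stdin line by line.
lower_bash() {
  local line
  # The extra test keeps a last line that lacks a trailing newline.
  while IFS= read -r line || [[ -n $line ]]; do
    printf '%s\n' "${line,,}"
  done
}

printf 'DOLOR\nSIT AMET' | lower_bash
```

Because no external program is started, this form is cheap to invoke, but the line-by-line loop runs inside the shell itself.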

using php

If we can use awk or tr as an external program to do lowercase conversion, then why not php? Of course, you would first have to check if php is installed on your system (command -v php). This is what the code looks like in php:

php -r 'while($f = fgets(STDIN)){ print strtolower($f); }'

However, strtolower only converts the plain ASCII letters and leaves the accented uppercase characters untouched:

Input:  'ŁORÈM ÎPSÙM DÔLÕR SIT AMÉT ŒßÞ' 
Output: 'ŁorÈm ÎpsÙm dÔlÕr sit amÉt ŒßÞ'

PHP has a multibyte/Unicode compatible alternative called mb_strtolower. This one does what we expect it to do.

Input:  'ŁORÈM ÎPSÙM DÔLÕR SIT AMÉT ŒßÞ' 
Output: 'łorèm îpsùm dôlõr sit amét œßþ'
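The mb_strtolower command itself is not shown above. A minimal sketch, assuming the same fgets loop as the strtolower version, an explicit "UTF-8" encoding argument, and a php build with the mbstring extension, might look like this. The wrapper name lower_php is hypothetical; its first step is the command -v php check mentioned earlier.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper; assumes php with the mbstring extension enabled.
lower_php() {
  # Bail out early when php is not on the PATH
  if ! command -v php > /dev/null 2>&1; then
    echo "php is not installed" >&2
    return 1
  fi
  php -r 'while (($f = fgets(STDIN)) !== false) { print mb_strtolower($f, "UTF-8"); }'
}

# Usage:
# printf '%s\n' 'LOREM IPSUM' | lower_php
```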

using perl

And then there’s perl, which may be old skool/fashioned but is still installed by default on a lot of machines:

perl -ne 'print lc'

This one-liner is a bit too simple, though: it does not handle accented capital letters. Making the script Unicode compatible takes some adjustments. I admit, I had to look this up.

perl -CSA -ne 'use utf8; binmode STDOUT, ":utf8"; print lc'

Benchmark via pforret/bash_benchmarks

Now let’s see how all these methods compare in throughput (MB/s: you start the command once and let it process a big file in one go) and in invocation speed (operations/sec, which gives you an idea of the startup time a program needs). Both are of the ‘more-is-better’ type.

I will focus here on the speeds relative to each other. The absolute speeds depend on your machine, and my 2021 16-inch MacBook Pro (M1) is quite fast. I’ve also run these benchmarks in an Ubuntu-on-Windows WSL1 environment, and that is way slower.
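To make the two metrics concrete, here is a rough sketch of how they could be measured with nothing but bash's SECONDS counter. This is not how pforret/bash_benchmarks does it; the function names ops_per_sec and mb_per_sec, the fixed test string and big.txt are all hypothetical.

```shell
#!/usr/bin/env bash
# Rough sketch, not the pforret/bash_benchmarks implementation.
# SECONDS only ticks in whole seconds, so use windows of several seconds.

# ops_per_sec DURATION command [args...]
# Counts how often the command completes on a tiny input.
ops_per_sec() {
  local duration=$1; shift
  local count=0
  local deadline=$(( SECONDS + duration ))
  while (( SECONDS < deadline )); do
    "$@" <<< 'LOREM IPSUM' > /dev/null
    count=$(( count + 1 ))
  done
  echo $(( count / duration ))
}

# mb_per_sec FILE DURATION command [args...]
# Feeds the same file to the command repeatedly and reports whole MB/s.
mb_per_sec() {
  local file=$1 duration=$2; shift 2
  local runs=0
  local size=$(( $(wc -c < "$file") ))
  local deadline=$(( SECONDS + duration ))
  while (( SECONDS < deadline )); do
    "$@" < "$file" > /dev/null
    runs=$(( runs + 1 ))
  done
  echo $(( size * runs / duration / 1000000 ))
}

# Example (big.txt is a placeholder for a large test file):
# ops_per_sec 5 tr '[:upper:]' '[:lower:]'
# mb_per_sec big.txt 5 tr '[:upper:]' '[:lower:]'
```

Both helpers run the command for a fixed time window rather than a fixed number of iterations, so slow and fast methods take equally long to measure. A shell function (for example a ${line,,} loop) can be passed in place of an external command.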

| method           | throughput   | invocation       |
|------------------|--------------|------------------|
| awk              | 98 MB/s      | 256 ops/sec      |
| perl (print lc)  | 645 MB/s (!) | 356 ops/sec      |
| perl (Unicode)   | 98 MB/s      | 341 ops/sec      |
| php (strtolower) | 249 MB/s     | 61 ops/sec       |
| php (Unicode)    | 73 MB/s      | 61 ops/sec       |
| sed              | 241 MB/s     | 909 ops/sec      |
| tr               | 25 MB/s      | 844 ops/sec      |
| ${line,,}        | 9 MB/s       | 9091 ops/sec (!) |

Some lessons from these benchmarks:

So what is my recommendation for lowercase conversion in bash?

More info: pforret.github.io/bash_benchmarks/lowercase.html
