Benchmark: romanization in bash

08 Apr 2022

Post #4 in this bash benchmark series, measuring the speed of common bash text manipulations.

Romanization: removing accents/diacritics

Romanization is the conversion of characters outside the basic Latin alphabet (letters with accents or diacritics, or letters from other alphabets) into plain Latin letters. It is a form of transliteration (conversion between alphabets).

So é becomes e, ç becomes c, ô becomes o. But also б (Cyrillic be) becomes b, σ (Greek sigma) becomes s, д (Cyrillic de) becomes d, յ (Armenian yi) becomes y.

And of course things are never that easy. Some characters are romanized / transliterated to two or more characters, like œ -> oe, æ -> ae, я -> ya, щ -> sch. This makes it hard to accomplish exhaustive romanization with tools that map one character to exactly one character, such as tr or sed’s y command.

using `awk`

If we include Romance languages (éèà), Slavic languages and Romanian (ăž), Greek (αβγ), Cyrillic (жзи), Turkish (ğış) and Armenian (կհձ), we can construct the following daunting awk program:

awk '{ gsub(/[ğЪЬъь]/,""); gsub(/[ÀÁÂÃÄÅĀĂĄǍΑԱ]/,"A"); gsub(/[Æ]/,"AE"); 
  gsub(/[БԲ]/,"B"); gsub(/[ÇĆČЦԾՉՑ]/,"C"); gsub(/[ČΧЧՃ]/,"CH"); 
  gsub(/[ÐĎΔДԴ]/,"D"); gsub(/[ЏՁ]/,"DZ"); gsub(/[ÈÉÊËĒĖĘĚΕΗЁЕЭԵԷԸ]/,"E"); 
  gsub(/[ԵՒ]/,"EW"); gsub(/[ЃФՖ]/,"F"); gsub(/[ĢΓГԳՂ]/,"G"); 
  gsub(/[Հ]/,"H"); gsub(/[ÌÍÎÏĪĮǏΙИԻ]/,"I"); gsub(/[ЙՋ]/,"J"); 
  gsub(/[ĶΚЌКԿՔ]/,"K"); gsub(/[Х]/,"KH"); gsub(/[ĻŁΛЛԼ]/,"L"); 
  gsub(/[ΜМՄ]/,"M"); gsub(/[ÑŅŇΝНՆ]/,"N"); gsub(/[ÒÓÔÕÖØŌǑΟΩОՈՕ]/,"O"); 
  gsub(/[ØŒ]/,"OE"); gsub(/[ΠПՊՓ]/,"P"); gsub(/[Φ]/,"PH"); 
  gsub(/[Ψ]/,"PS"); gsub(/[ŘΡРՌՐ]/,"R"); gsub(/[ŠΣСՍ]/,"S"); 
  gsub(/[Щ]/,"SCH"); gsub(/[ŠȘШՇ]/,"SH"); gsub(/[ẞ]/,"SS"); 
  gsub(/[ŤΤТԹՏ]/,"T"); gsub(/[ÞΘ]/,"TH"); gsub(/[Ț]/,"TS"); 
  gsub(/[ÙÚÛÜŪŮŲǓǕǗǙǛУՈՒ]/,"U"); gsub(/[ΒВՎ]/,"V"); gsub(/[ЎՒ]/,"W"); 
  gsub(/[ΞԽ]/,"X"); gsub(/[ÝŸЫՅ]/,"Y"); gsub(/[Я]/,"YA"); 
  gsub(/[Ю]/,"YU"); gsub(/[ŹŻŽΖЗԶԺ]/,"Z"); gsub(/[ŽЖ]/,"ZH"); 
  gsub(/[àáâãäåāăąǎαա]/,"a"); gsub(/[æ]/,"ae"); gsub(/[бբ]/,"b"); 
  gsub(/[çćčцћծչց]/,"c"); gsub(/[čχчճ]/,"ch"); gsub(/[ðďđδдђդ]/,"d"); 
  gsub(/[џձ]/,"dz"); gsub(/[èéêëēėęěεηеэёեէը]/,"e"); gsub(/[և]/,"ew"); 
  gsub(/[фѓֆ]/,"f"); gsub(/[ģγгգղ]/,"g"); gsub(/[հ]/,"h"); 
  gsub(/[ìíîïīįıǐιиի]/,"i"); gsub(/[йջ]/,"j"); gsub(/[ķκкќկք]/,"k"); 
  gsub(/[х]/,"kh"); gsub(/[ĺļľłλлլ]/,"l"); gsub(/[љ]/,"lj"); 
  gsub(/[μмմ]/,"m"); gsub(/[ñńņňνнն]/,"n"); gsub(/[њ]/,"nj"); 
  gsub(/[òóôõöøōǒοωоոօ]/,"o"); gsub(/[øœ]/,"oe"); gsub(/[πпպփ]/,"p"); 
  gsub(/[φ]/,"ph"); gsub(/[ψ]/,"ps"); gsub(/[ŕřρрռր]/,"r"); 
  gsub(/[śšσсս]/,"s"); gsub(/[щ]/,"sch"); gsub(/[şšșшշ]/,"sh"); 
  gsub(/[ß]/,"ss"); gsub(/[ťτтթտ]/,"t"); gsub(/[þθ]/,"th"); 
  gsub(/[čț]/,"ts"); gsub(/[ùúûüūůųǔǖǘǚǜуու]/,"u"); gsub(/[βвվ]/,"v"); 
  gsub(/[ўւ]/,"w"); gsub(/[ξխ]/,"x"); gsub(/[üýÿыյ]/,"y"); 
  gsub(/[я]/,"ya"); gsub(/[ю]/,"yu"); gsub(/[źżžζзզժ]/,"z"); 
  gsub(/[žж]/,"zh"); print $0; }'

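The gsub calls run in order, so a character that appears in two classes (for example Č, listed under both C and CH) is always handled by the first rule that mentions it. On the sample text it does a pretty good job of romanization: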

Command: 'awk { gsub(/[ğЪЬъь]/,""); gsub(/[ÀÁÂÃÄÅĀĂĄǍΑԱ]/,"A"); gsub(/...'
Before: 'ŁORÈM ÎPSÙM dôlõr sit amét œßþ'
After : 'LOREM IPSUM dolor sit amet oessth'
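
If you only need a handful of characters, the same idea can be written as a lookup table instead of one gsub per target. The following is a minimal sketch, not the program benchmarked here: the function name romanize_small and its ten mappings are illustrative assumptions, chosen to cover the sample text. Because every key is a literal string rather than a bracket expression, it should also behave in awk implementations that work on bytes.

```shell
# Hypothetical helper: romanize a small, known set of characters with awk.
romanize_small() {
  awk '
    BEGIN {
      # Illustrative mapping only; extend it with the characters you need.
      map["Ł"] = "L";  map["È"] = "E";  map["Î"] = "I";  map["Ù"] = "U"
      map["ô"] = "o";  map["õ"] = "o";  map["é"] = "e"
      map["œ"] = "oe"; map["ß"] = "ss"; map["þ"] = "th"
    }
    {
      # Iteration order is unspecified, which is fine here because no
      # replacement text contains a character that is itself a key.
      for (from in map) gsub(from, map[from])
      print
    }
  '
}

printf '%s\n' 'ŁORÈM ÎPSÙM dôlõr sit amét œßþ' | romanize_small
# → LOREM IPSUM dolor sit amet oessth
```

Characters that are not in the table pass through unchanged, so this only makes sense when you know your input alphabet up front.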

using `iconv`

A web search will often suggest iconv as a solution for this problem but, as you will see, the iconv on my Mac maps most accented characters to a letter preceded by an ASCII mark such as ` ^ ~ or '. For example, È is mapped to `E, which is probably not what you want.

Command: 'iconv -f utf-8 -t ascii//TRANSLIT'
Before: 'ŁORÈM ÎPSÙM dôlõr sit amét œßþ'
After : 'LOR`EM ^IPS`UM d^ol~or sit am'et oessth'
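
One possible cleanup is to delete those marks in a second step. This is only a sketch under the assumption that your iconv emits the same ASCII marks as in the output above; strip_translit_marks and romanize_iconv are hypothetical names. Be aware that it also removes every legitimate apostrophe, quote, caret, tilde and backtick from the text.

```shell
# Delete the ASCII marks that a transliterating iconv may put before letters.
# Caution: legitimate apostrophes and quotes in the text are deleted too.
strip_translit_marks() {
  tr -d "~^\`\"'"
}

# Assumed pipeline: transliterate first, then strip the leftover marks.
romanize_iconv() {
  iconv -f utf-8 -t ascii//TRANSLIT | strip_translit_marks
}

# The iconv output quoted above, fed through the cleanup step only.
printf '%s\n' "LOR\`EM ^IPS\`UM d^ol~or sit am'et oessth" | strip_translit_marks
# → LOREM IPSUM dolor sit amet oessth
```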

using `sed`

If we use the same technique as we did for the lower case conversion (sed 'y/___/___/'), we can only translate single characters to single characters.

Command:
from="ÄÀÂΑÁÅĂÃĀǍĄԱБԲÇĆČЦԾՉՑΔÐДĎԴÉÈÊËΕΗĒĖĘĚЕЁЭԵԷԸЃФՖΓГĢԳՂՀΙÍÎÏĪĮÌǏИԻЙՋΚЌКĶԿՔΛŁЛĻԼΜМՄÑΝНŅŇՆÖÔΟΩÓÒØŌǑÕОՈՕΠПՊՓΡРŘՌՐΣСŠՍΤТŤԹՏÜÙÛÚǓǕǗǙǛŪУŲŮΒВՎЎՒΞԽŸÝЫՅΖŽŹŻЗԶԺäàâαáåąăãāǎաбբçćčћцծչցδđðђдďդéèêëεηęēėěеёэեէըѓфֆγгģգղհιíîïīįìǐиıիйջκќкķկքλłлļĺľլμмմñνńнņňնöôοωóòøōǒõоոօπпպփρрŕřռրσšśсսτтťթտüùûúǔǖǘǚǜūуųůβвվўւξխÿýыüյζžźżзզժ"
to="AAAAAAAAAAAABBCCCCCCCDDDDDEEEEEEEEEEEEEEEEFFFGGGGGHIIIIIIIIIIJJKKKKKKLLLLLMMMNNNNNNOOOOOOOOOOOOOPPPPRRRRRSSSSTTTTTUUUUUUUUUUUUUVVVWWXXYYYYZZZZZZZaaaaaaaaaaaabbccccccccdddddddeeeeeeeeeeeeeeeefffggggghiiiiiiiiiiijjkkkkkklllllllmmmnnnnnnnooooooooooooopppprrrrrrssssstttttuuuuuuuuuuuuuvvvwwxxyyyyyzzzzzzz"
sed "y/$from/$to/"
Before: 'ŁORÈM ÎPSÙM dôlõr sit amét œßþ'
After : 'LOREM IPSUM dolor sit amet œßþ' ## the œ is not romanized
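
sed’s y command expects both strings to contain exactly the same number of characters, which is easy to get wrong with lists this long. A small sanity check along these lines can help before running it. This is a sketch: utf8_len and same_length are hypothetical helper names, and the counting trick assumes the strings are valid UTF-8.

```shell
# Count UTF-8 characters independently of the current locale by deleting
# continuation bytes (0x80-0xBF) and counting the bytes that remain.
utf8_len() {
  printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' '
}

# Succeeds when both strings contain the same number of characters.
same_length() {
  [ "$(utf8_len "$1")" -eq "$(utf8_len "$2")" ]
}

from='éèçô'
to='eeco'
if same_length "$from" "$to"; then
  echo "lengths match: $(utf8_len "$from") characters each"
else
  echo "from and to differ in length" >&2
fi
```

The same check also shows why œ cannot be handled this way: 'œ' is one character and 'oe' is two, so the lengths never match.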

However, we can use sed in a way that resembles the strategy we took for awk:

Command:
sed -e 's/[ğЪЬъь]//g' -e 's/[ÀÁÂÃÄÅĀĂĄǍΑԱ]/A/g' -e 's/[Æ]/AE/g' -e 's/[БԲ]/B/g' \
    -e 's/[ÇĆČЦԾՉՑ]/C/g' -e 's/[ČΧЧՃ]/CH/g' -e 's/[ÐĎΔДԴ]/D/g' -e 's/[ЏՁ]/DZ/g' \
    -e 's/[ÈÉÊËĒĖĘĚΕΗЁЕЭԵԷԸ]/E/g' -e 's/[ԵՒ]/EW/g' -e 's/[ЃФՖ]/F/g' \
    -e 's/[ĢΓГԳՂ]/G/g' -e 's/[Հ]/H/g' -e 's/[ÌÍÎÏĪĮǏΙИԻ]/I/g' -e 's/[ЙՋ]/J/g' \
    -e 's/[ĶΚЌКԿՔ]/K/g' -e 's/[Х]/KH/g' -e 's/[ĻŁΛЛԼ]/L/g' -e 's/[ΜМՄ]/M/g' \
    -e 's/[ÑŅŇΝНՆ]/N/g' -e 's/[ÒÓÔÕÖØŌǑΟΩОՈՕ]/O/g' -e 's/[ØŒ]/OE/g' \
    -e 's/[ΠПՊՓ]/P/g' -e 's/[Φ]/PH/g' -e 's/[Ψ]/PS/g' -e 's/[ŘΡРՌՐ]/R/g' \
    -e 's/[ŠΣСՍ]/S/g' -e 's/[Щ]/SCH/g' -e 's/[ŠȘШՇ]/SH/g' -e 's/[ẞ]/SS/g' \
    -e 's/[ŤΤТԹՏ]/T/g' -e 's/[ÞΘ]/TH/g' -e 's/[Ț]/TS/g' -e 's/[ÙÚÛÜŪŮŲǓǕǗǙǛУՈՒ]/U/g' \
    -e 's/[ΒВՎ]/V/g' -e 's/[ЎՒ]/W/g' -e 's/[ΞԽ]/X/g' -e 's/[ÝŸЫՅ]/Y/g' \
    -e 's/[Я]/YA/g' -e 's/[Ю]/YU/g' -e 's/[ŹŻŽΖЗԶԺ]/Z/g' -e 's/[ŽЖ]/ZH/g' \
    -e 's/[àáâãäåāăąǎαա]/a/g' -e 's/[æ]/ae/g' -e 's/[бբ]/b/g' -e 's/[çćčцћծչց]/c/g' \
    -e 's/[čχчճ]/ch/g' -e 's/[ðďđδдђդ]/d/g' -e 's/[џձ]/dz/g' \
    -e 's/[èéêëēėęěεηеэёեէը]/e/g' -e 's/[և]/ew/g' -e 's/[фѓֆ]/f/g' \
    -e 's/[ģγгգղ]/g/g' -e 's/[հ]/h/g' -e 's/[ìíîïīįıǐιиի]/i/g' -e 's/[йջ]/j/g' \
    -e 's/[ķκкќկք]/k/g' -e 's/[х]/kh/g' -e 's/[ĺļľłλлլ]/l/g' -e 's/[љ]/lj/g' \
    -e 's/[μмմ]/m/g' -e 's/[ñńņňνнն]/n/g' -e 's/[њ]/nj/g' -e 's/[òóôõöøōǒοωоոօ]/o/g' \
    -e 's/[øœ]/oe/g' -e 's/[πпպփ]/p/g' -e 's/[φ]/ph/g' -e 's/[ψ]/ps/g' \
    -e 's/[ŕřρрռր]/r/g' -e 's/[śšσсս]/s/g' -e 's/[щ]/sch/g' -e 's/[şšșшշ]/sh/g' \
    -e 's/[ß]/ss/g' -e 's/[ťτтթտ]/t/g' -e 's/[þθ]/th/g' -e 's/[čț]/ts/g' \
    -e 's/[ùúûüūůųǔǖǘǚǜуու]/u/g' -e 's/[βвվ]/v/g' -e 's/[ўւ]/w/g' -e 's/[ξխ]/x/g' \
    -e 's/[üýÿыյ]/y/g' -e 's/[я]/ya/g' -e 's/[ю]/yu/g' -e 's/[źżžζзզժ]/z/g' \
    -e 's/[žж]/zh/g'
Before: 'ŁORÈM ÎPSÙM dôlõr sit amét œßþ'
After : 'LOREM IPSUM dolor sit amet oessth'
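
Typing that many -e rules by hand is error-prone. One option is to generate them from a small mapping table. The sketch below is an illustration, not the command that was benchmarked: the rules table, build_sed_script and romanize_sed are all hypothetical, and it assumes that no source or target contains characters that are special to sed (such as /, &, . or *).

```shell
# Hypothetical mapping table: one "source target" pair per line.
rules='œ oe
æ ae
ß ss
þ th'

# Turn every pair into an "s/source/target/g" line.
build_sed_script() {
  printf '%s\n' "$1" | awk 'NF == 2 { printf "s/%s/%s/g\n", $1, $2 }'
}

# Apply the generated script; sed accepts newline-separated commands in -e.
romanize_sed() {
  sed -e "$(build_sed_script "$rules")"
}

printf '%s\n' 'œßþ' | romanize_sed
# → oessth
```

This makes the rule set easier to maintain; it does not by itself change how fast sed runs.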

using `tr`

tr can only replace single characters with other single characters, and there’s no trick around that. If your use case can live with that (e.g. only French accents and diacritics), then this method is OK.

Command:
tr \
  'ÄÀÂΑÁÅĂÃĀǍĄԱБԲÇĆČЦԾՉՑΔÐДĎԴÉÈÊËΕΗĒĖĘĚЕЁЭԵԷԸЃФՖΓГĢԳՂՀΙÍÎÏĪĮÌǏИԻЙՋΚЌКĶԿՔΛŁЛĻԼΜМՄÑΝНŅŇՆÖÔΟΩÓÒØŌǑÕОՈՕΠПՊՓΡРŘՌՐΣСŠՍΤТŤԹՏÜÙÛÚǓǕǗǙǛŪУŲŮΒВՎЎՒΞԽŸÝЫՅΖŽŹŻЗԶԺäàâαáåąăãāǎաбբçćčћцծչցδđðђдďդéèêëεηęēėěеёэեէըѓфֆγгģգղհιíîïīįìǐиıիйջκќкķկքλłлļĺľլμмմñνńнņňնöôοωóòøōǒõоոօπпպփρрŕřռրσšśсսτтťթտüùûúǔǖǘǚǜūуųůβвվўւξխÿýыüյζžźżзզժ' \
  'AAAAAAAAAAAABBCCCCCCCDDDDDEEEEEEEEEEEEEEEEFFFGGGGGHIIIIIIIIIIJJKKKKKKLLLLLMMMNNNNNNOOOOOOOOOOOOOPPPPRRRRRSSSSTTTTTUUUUUUUUUUUUUVVVWWXXYYYYZZZZZZZaaaaaaaaaaaabbccccccccdddddddeeeeeeeeeeeeeeeefffggggghiiiiiiiiiiijjkkkkkklllllllmmmnnnnnnnooooooooooooopppprrrrrrssssstttttuuuuuuuuuuuuuvvvwwxxyyyyyzzzzzzz'
Before: 'ŁORÈM ÎPSÙM dôlõr sit amét œßþ'
After : 'LOREM IPSUM dolor sit amet œßþ'
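
Whether this works at all depends on the tr implementation: some versions operate on bytes rather than characters and may mangle multi-byte UTF-8 input. A quick probe such as the following can tell you which kind you have. It is only a sketch, and tr_handles_utf8 is a hypothetical name.

```shell
# Succeeds only if tr maps the two-byte character é to a single e.
# A byte-oriented tr translates both bytes separately and prints "ee".
tr_handles_utf8() {
  [ "$(printf 'é' | tr 'é' 'e' 2>/dev/null)" = "e" ]
}

if tr_handles_utf8; then
  echo "tr maps whole characters here"
else
  echo "tr looks byte-oriented here"
fi
```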

using `uni2ascii`

I have also found a specialized transliteration program for Linux: uni2ascii. However, I can’t get it to romanize correctly (this might be a macOS problem; I haven’t tested it on Ubuntu yet).

Command: 'uni2ascii -B'
Before: 'ŁORÈM ÎPSÙM dôlõr sit amét œßþ'
After : '0x0141OREM IPSUM dolor sit amet oessy' ## problem with Ł and þ

Benchmark via pforret/bash_benchmarks

I will focus here on the speeds relative to each other. The absolute speeds depend on your machine, and my 2021 MacBook Pro M1 16” is quite fast. I’ve also run these benchmarks in an Ubuntu-on-Windows WSL1 environment, and that is wayyyyyy slower.

method	throughput	invocations
awk (multi-char)	11 MB/s	204 ops/sec
sed (1 char only)	240 MB/s	926 ops/sec
sed (multi-char)	1 MB/s	601 ops/sec
tr (1 char only)	25 MB/s	899 ops/sec
iconv	146 MB/s	989 ops/sec
uni2ascii	2 MB/s	764 ops/sec
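
For a rough idea of how a throughput number like the ones above can be obtained, here is a minimal sketch. It is not the code from pforret/bash_benchmarks, and measure_mbs is a hypothetical helper. It only has one-second resolution, so it assumes an input file large enough to keep the command busy for several seconds.

```shell
# Hypothetical helper: run a filter over a file and print its throughput
# in whole MB/s (1 MB = 1,000,000 bytes). Runs in a subshell so that its
# variables do not leak into the caller.
measure_mbs() (
  file=$1
  shift
  bytes=$(( $(wc -c < "$file") ))
  start=$(date +%s)
  "$@" < "$file" > /dev/null
  elapsed=$(( $(date +%s) - start ))
  # Runs shorter than one second are counted as one second.
  [ "$elapsed" -gt 0 ] || elapsed=1
  echo $(( bytes / elapsed / 1000000 ))
)

# Usage, assuming a large UTF-8 test file called sample.txt exists:
#   measure_mbs sample.txt iconv -f utf-8 -t ascii//TRANSLIT
```

Because short runs are rounded up to a full second, results for small files will be understated, so this is only useful for comparing methods on the same large input.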

Some lessons from these benchmarks:

awk is a little slower to start up with such a big set of instructions (200 ops/sec vs normally 250 ops/sec)
sed with 85 small instructions runs dramatically slower than sed with 1 large instruction (1 MB/s vs 240 MB/s)
iconv could maybe be fixed by deleting all the ~’¨^` characters afterwards, although that would also delete any legitimate apostrophes and quotes in the text.

So what is my recommendation for romanizing text?

if you know all the accents, diacritics or alternative alphabets you need to support:
- if all your characters can be replaced by a single character: use sed
- if some of your characters should be transliterated to 2 characters, use awk
if you don’t know up front what types of input you need to support: use awk with a huge set of rules.