Benchmark: slugify text in Bash

Post #5 in this bash benchmark series, measuring the speed of common bash text manipulations.

Slugify: clean up text to use as a filename/url

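This consists of the following 3 operations: removing every character that is not a letter, digit, space, dot or hyphen; trimming leading and trailing whitespace; and replacing each run of spaces with a single hyphen.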

It is even better to precede this with romanization (é => e) and conversion to lowercase, but I’m now just going to talk about the pure slugification.
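As a rough sketch of that optional pre-processing step (not part of the benchmarks below), lowercasing can be done with tr and romanization with iconv. The function names to_lower and romanize are made up for this example, and the //TRANSLIT behaviour is an assumption: its output differs between iconv implementations.

```bash
# Lowercase letters; reads stdin, writes stdout.
to_lower() {
  tr '[:upper:]' '[:lower:]'
}

# Romanize accented characters.
# Assumption: the local iconv supports //TRANSLIT. The exact result
# (for example "e" versus "'e" for "é") depends on the implementation.
romanize() {
  iconv -f UTF-8 -t ASCII//TRANSLIT
}

# Usage: printf '%s\n' 'Demain, dès l’aube' | romanize | to_lower
```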

Bash benchmarks

using awk

This is of course something that is easy to do in awk. It is just a sequence of gsub regular expression replacements.

Command: awk '{
              gsub(/[^0-9a-zA-Z .-]/,"");
              gsub(/^[ \t\r\n]+/, "");
              gsub(/[ \t\r\n]+$/, "");
              gsub(/[ ]+/,"-");
              print }'
Before: '  (Demain, dès l’aube)     '
After : 'Demain-ds-laube'

using sed

Regular expression replacement is what sed does very well too, and probably faster.

I first tried this

Command: sed -e 's/[^0-9a-zA-Z .-]*//g' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e 's/  */-/g'
Before: '  (Demain, dès l’aube)     '
After : 'Demain-ds-laube'

and then, by deleting just one character, made it faster. Can you spot the character? Check the speeds for both methods in the benchmark table below.

Command: sed -e 's/[^0-9a-zA-Z .-]//g' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e 's/  */-/g'
Before: '  (Demain, dès l’aube)     '
After : 'Demain-ds-laube'
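The difference between the two commands is the * after the first bracket expression. With the *, the pattern can also match the empty string, so sed attempts a substitution at every position in the line, even where there is nothing to remove. The output is the same either way, which the following small check illustrates (the function names sed_star and sed_nostar are only for this example, and the input is an ASCII-only variant of the test string):

```bash
# Variant 1: bracket expression followed by * (also matches the empty string)
sed_star() {
  sed -e 's/[^0-9a-zA-Z .-]*//g' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e 's/  */-/g'
}

# Variant 2: same command without the *
sed_nostar() {
  sed -e 's/[^0-9a-zA-Z .-]//g' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e 's/  */-/g'
}

input='  (Demain, des laube)     '
a="$(printf '%s\n' "$input" | sed_star)"
b="$(printf '%s\n' "$input" | sed_nostar)"

# Both variants produce the same slug
[ "$a" = "$b" ] && echo "same output: $b"   # prints: same output: Demain-des-laube
```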

using tr

tr can almost slugify, but not entirely. The -s option squeezes each run of non-allowed characters into a single replacement character, but leading and trailing whitespace still ends up as a hyphen. This could be solved by piping the result through a second command. It’s just that, in this series, I try to do every task in just 1 statement.

Command: tr -cs '[:alnum:].-' '-'
Before: '  (Demain, dès l’aube)     '
After : '-Demain-d-s-l-aube-'
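For completeness, here is a minimal sketch of a two-stage pipeline that removes the leftover leading and trailing hyphens. It was not part of the benchmark, and it uses sed for the trimming stage, which is an assumption of this sketch; slugify_tr is a made-up name.

```bash
slugify_tr() {
  # Stage 1: replace every run of non-allowed characters (including the
  #          trailing newline) with a single hyphen.
  # Stage 2: strip the hyphens left at the start and end of the result.
  tr -cs '[:alnum:].-' '-' | sed -e 's/^-*//' -e 's/-*$//'
}

printf '%s\n' '  (Demain, des laube)     ' | slugify_tr   # prints: Demain-des-laube
```

Note that, unlike the awk and sed versions, this keeps a hyphen wherever a character was removed, so accented input such as 'dès' still becomes 'd-s'.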

using ${} variable expansion

As we have seen in previous benchmarks, it is interesting to also try to do everything in one (long) ${} statement, since these typically have an unbeatable invocation speed. This was my first trial:

Command: ${line//[^a-zA-Z0-9]/-}
Before: '  (Demain, dès l’aube)     '
After : '---Demain--d--s-l-aube------'

That’s a lot of extra hyphens we still need to get rid of. Let’s try a series of ${} to replace characters and take care of leading/trailing whitespace. Our code readability drops dramatically.

Command: $(line="${line//[^a-zA-Z0-9 ]/}"; 
            echo "${line// /-}")
Before: '  (Demain, dès l’aube)     '
After : 'Demain-ds-laube'
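The command as shown lists two expansions. A version that also trims leading and trailing spaces, as described in the text, could take four steps. The following is one possible sketch, not necessarily the exact code that was benchmarked; slugify_expand is a made-up name and the input is an ASCII-only variant of the test string.

```bash
slugify_expand() {
  local line="$1"
  line="${line//[^a-zA-Z0-9 ]/}"     # 1. drop everything except letters, digits and spaces
  line="${line#"${line%%[! ]*}"}"    # 2. strip leading spaces
  line="${line%"${line##*[! ]}"}"    # 3. strip trailing spaces
  printf '%s\n' "${line// /-}"       # 4. turn each remaining space into a hyphen
}

slugify_expand '  (Demain, des laube)     '   # prints: Demain-des-laube
```

Step 4 replaces spaces one by one, so a run of several inner spaces would still give several hyphens. Collapsing those would need an extra step (for example with extglob patterns).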

Benchmark via pforret/bash_benchmarks

I will focus here on the speeds relative to each other. The absolute speeds depend on your machine, and my 2021 MacBookPro M1 16” is quite fast. I’ve also run these benchmarks in an Ubuntu-on-Windows WSL1 environment, and that is wayyyyyy slower.

method      throughput     invocation
awk         95 MB/s (!)    268 ops/sec
sed /[]*/   7 MB/s         1021 ops/sec
sed /[]/    18 MB/s        1022 ops/sec (!)
tr          27 MB/s        1063 ops/sec
${} (1x)    8 MB/s         9174 ops/sec
${} (4x)    8 MB/s         1869 ops/sec (!!)
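To get a feel for what an "invocation" number means, here is a very rough sketch that calls a command repeatedly on one short line and counts calls per second. It is not taken from pforret/bash_benchmarks: invocations_per_sec is a hypothetical helper, and it assumes that invocation speed means the number of separate calls per second on a short input. Because SECONDS only has whole-second resolution, it needs a large repeat count (thousands) to give a usable figure.

```bash
# Hypothetical helper: invocations_per_sec <count> <command> [args...]
invocations_per_sec() {
  local n="$1"; shift
  local start=$SECONDS i elapsed
  for ((i = 0; i < n; i++)); do
    # Each loop iteration starts the command once on a short line
    printf '%s\n' '  (Demain, des laube)     ' | "$@" > /dev/null
  done
  elapsed=$((SECONDS - start))
  ((elapsed > 0)) || elapsed=1   # avoid dividing by zero on very fast runs
  echo $((n / elapsed))
}

# Usage: invocations_per_sec 5000 tr -cs '[:alnum:].-' '-'
```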

Some lessons from these benchmarks:

So what is my recommendation for slugification?
