Database war stories: DB vs _square_ files

Plug and PlayI’ve been following the Database War Stories of O’Reilly Radar: how companies use text-based alternatives to classic relational database systems in order to cope with huge volumes. Check out the stories of Findory/Amazon, Google File System, Flickr and Second Life. Anyway, this seemed like a good moment to share some of my database war stories. Let me take you back to the early nineties.

1993 @ Ukkel
I arrive at Sopres, one of the larger direct marketing / database management companies in Belgium. Fresh from university (and 1 year of military service), I expect to see RDBMS everywhere and dive into SQL. Imagine my surprise when I see that, yes, there are a lot of Sybase SQLServer databases around, but the bulk of the work is done with something they call ‘square files’ (see below). They have built a whole set of tools to work with those and by using them myself, I learn to appreciate the advantanges of the system (speed, mainly) and grow a fairly accurate intuition for things like queries, indexes and outer joins.

Square files
What is a square file – or as we called it in French: “fichier carré“? It is a plain-text file with fixed record length. It looks square (actually: rather rectangular) when you open it in a text editor. E.g. a file XYZ2006.288, containing 20.000 customer records of each 288 bytes, would be exactly 5.760.000 bytes (5,76 MB). If the file was called XYZ2006.customers, each program that processed such a file would look for a ‘descriptor’ file in the same folder customers.d that would not only indicate the record length (288 bytes) but also define the fields within a record (e.g. fullname: 40 bytes, address: 60 bytes, postal code: 4 bytes, …). To make the square file easy editable, the last field could be “LF: 1 byte” that contains a line feed, so that each record is on a new line.
The advantage of this system:

The sp-tools
Looks like a bug
Through the years they has also developed a set of tools to work with square files. They all started with ‘sp’, hence the “sp-tools”. I quickly realised they had an alternative for each SQL statement you could make:

The most interesting tool was sp-match, that was used to do joins: merge the contents of two files by matching on certain fields. E.g. you have a table of purchases (customer_id, date, amount) and you want to add the postal-code from the customers file: you match on customer_id and get (customer_id,date_amount,postal-code) as output. Using this tool and its variants (sorted/non-sorted match table, inner/outer/left outer) really taught me how joins work, what has to be sorted (indexed), what has to be in memory or on disk, and that is still the basis of my intuitive SQL optimisation skills. We basically did by hand all the stuff that SQL does behind the scenes.

Disk space
Watch that tape spin
We were working on Sun ‘dumb’ workstations and on our servers we had an enormous disk capacity at that time: several GIGABYTES! Hard disk sizes were still measured in megabytes back then. I remember a discussion on whether to buy one ‘huge’ 500MB disk or rather ten 50MB disks, because the last option would be more expensive, but faster (divided over multiple disk controllers).
A typical project required up to 100MB storage space (*gasp*) so all temporary files had to be cleaned up ASAP. That’s why there was a system of max-age suffixes: a file that was called would be automatically deleted after 5 days. “.2t” after 2 days. If you left large files sitting around without .Nt suffix, you got spanked when they detected it in the weekly storage report.
The output of our work would be delivered on floppy disks, if it was smaller than 1.4MB or else on tape. I’m not talking DLT or DAT here, I mean tapes on spindles, the real thing. No email, no USB-sticks, no ZIP-drives, no CD-ROMs. Those were the days.

💬 bandwidth 🏷 database 🏷 sopres 🏷 text 🏷 unix 🏷 bandwidth