How do you move a terabyte?


I recently discovered Brewster Kahle’s speech on the NotCon ’04 podcast about the ambition of The Internet Archive to archive absolutely everything (all books, all movies, all music, …). (There is an excellent transcript on www.hotales.org.) They are currently setting up a second datacentre in Amsterdam, as an off-site copy of the original archive.org. They use massively parallel storage nodes grouped together in a PetaBox rack. You actually need 10 PetaBoxes to get to 1 petabyte (1 rack = 80 servers x 4 disks x 300 GB/disk = roughly 100 TB). Since the rack uses node-to-node replication (every node has a sister node that holds a copy of all its data, so that if one of the two nodes crashes, the data is still available), the net storage is 50 TB.
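As a quick sanity check on those capacity figures, here is the arithmetic in a few lines of Python – just a sketch, using decimal units and the 80/4/300 numbers quoted above:

# PetaBox rack capacity, with the figures quoted above (decimal units)
servers_per_rack = 80
disks_per_server = 4
gb_per_disk = 300

raw_tb = servers_per_rack * disks_per_server * gb_per_disk / 1000.0
net_tb = raw_tb / 2   # node-to-node replication keeps two copies of everything

print("raw: %.0f TB, net: %.0f TB per rack" % (raw_tb, net_tb))
# -> raw: 96 TB, net: 48 TB (roughly the 100 TB / 50 TB quoted)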
So this got me thinking: how do you ‘copy’ the contents of PetaBox A to PetaBox B? How do you move 50 TB?
Let’s try some numbers from my bandwidth calculator:

  • ISP: a regular ADSL/cable connection sustains about 1.3 TB/month. 50 TB would take 38 months, almost 3.3 years – a bit slow.
  • LAN: a 100 Mbps connection can theoretically deliver 32 TB/month, but let’s take 15 TB/month as a reasonable real-world throughput: 50 TB would take 3.3 months. Gigabit Ethernet can go to about 10 TB/day at most, so it might realistically enable a full transfer in 7 days.
  • WAN: a dedicated OC12 optical line delivers 6.7 TB/day in theory. Even if in practice this were only 3 TB/day, it would cut the copy time to 17 days. With OC48, the theoretical maximum goes up to 26.4 TB/day, so the transfer is possible in something like 3-4 days (a quick sketch of this arithmetic follows the list).
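Here is the sketch mentioned above: a few lines of Python that reproduce this arithmetic. The line rates are the nominal ones; the utilisation factors are just my own assumptions, chosen to land close to the ‘realistic’ figures in the list.

# How long does 50 TB take over a given line? (nominal rates, decimal units)
DATA_TB = 50.0

def days_to_transfer(mbps, utilisation=1.0):
    bytes_per_day = mbps * 1e6 / 8 * 86400 * utilisation
    return DATA_TB * 1e12 / bytes_per_day

links = [
    ("ADSL/cable (~4 Mbps sustained)", 4, 1.0),
    ("100 Mbps LAN", 100, 0.46),
    ("Gigabit Ethernet", 1000, 0.65),
    ("OC12 (622 Mbps)", 622, 0.45),
    ("OC48 (2488 Mbps)", 2488, 0.5),
]
for name, mbps, util in links:
    print("%-32s %7.1f days" % (name, days_to_transfer(mbps, util)))
# -> roughly 1157, 100, 7, 17 and 4 days respectively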

Can this be done without a network connection? Can we tap the data out of one system, put it on some kind of transport and reload it at the new location? (See Microsoft’s Jim Gray, who calls this a kind of ‘sneakernet’.)

  • use a third PetaBox C, set it up next to PetaBox A, connect them via Gigabit Ethernet, let them synchronize for 7 days, put PetaBox C on a truck/boat/plane, hope not too many disks are damaged during transport (this is the tricky bit: if both copies of a file are lost, you have to start over), set it up next to PetaBox B and again let them replicate for a week. If the total procedure takes three weeks, you’ve just moved data at about 2.4 TB/day, or roughly 220 Mbps (see the comparison sketch after this list).
  • copy everything to Apple Xserve RAID systems. These have 14 disks of 400 GB, which is 5.6 TB of unprotected storage per system, or (using RAID-5 per set of 7 disks) 4.8 TB. Since it uses a 400 MB/s (3.2 Gbps) Fibre Channel interface, disk speed should not be the bottleneck. A system is filled over Gigabit Ethernet in a bit more than half a day (let’s say 1 week for all data), and you need 11 systems to store all the data. Luckily, you can start shipping the first Xserve RAID right after it’s filled, while the second one is still being filled. A fully equipped Xserve RAID weighs 45 kg/100 pounds, so you’ll need more than an envelope, but let’s say you could ship one anywhere in 2 days. Then the whole procedure takes 7 + 2 + 1 days = 10 days, which is 5 TB/day, or a 463 Mbps transfer rate.
  • You could do something similar with LaCie Bigger Disk Extreme 1.6 TB drives (although in my experience, this type of disk does not handle continuous writing very well). Their bottleneck is probably the FireWire 800 write speed, which can be estimated at 25 MB/s or 90 GB/hour. This means it takes almost 18 hours to fill a 1.6 TB Bigger Disk. You could probably fill several disks at the same time, since Gigabit Ethernet can easily deliver that. In total you would need at least 32 full disks, but since there is no redundancy on the disks, you would need a way to check that all objects were copied correctly to the target system. This you could do by exchanging lists of object identifiers, file sizes and hashes, probably files of ‘only’ megabytes. So let’s say you need 40 disks (some objects will be transferred a second or third time if they arrive in a bad state). We can ship them in packages of 5 disks – that’s 8 TB at a time. Each package of 5 takes something like 30 hours to fill (if we can always write to 3 disks simultaneously). Total procedure: 8 x 30 hrs + 2 days of shipping + 30 hrs to load the last pack = 13.25 days, or 3.8 TB/day (350 Mbps).
  • There is the Sun StorEdge L500 tape library, which could back up the complete 50 TB on up to 400 LTO cartridges. Its speed is 126 GB/hour, or about 3 TB/day, so it would easily take over a month to back up PetaBox A, ship the StorEdge and restore the data onto PetaBox B. That’s less than 150 Mbps.
  • Just for fun: you would need over 60,000 CD-ROMs to hold those 50 TB. Don’t even think about how long it would take to actually burn them, or who would write their unique numbers on the sleeves. There are double-sided writable DVDs of 8.75 GB each; with about 5,800 of them, you could do the job.
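And the comparison sketch promised above: the same back-of-the-envelope math in Python, turning the estimated end-to-end duration of each sneakernet option into an effective transfer rate. The day counts are just the estimates from the list, nothing more.

# Effective throughput of the sneakernet scenarios, given total duration in days
DATA_TB = 50.0

def effective_rate(total_days):
    tb_per_day = DATA_TB / total_days
    mbps = tb_per_day * 1e12 * 8 / 86400 / 1e6
    return tb_per_day, mbps

scenarios = [
    ("PetaBox C on a truck", 21),       # 7 days sync + transport + 7 days sync
    ("11 Xserve RAIDs", 10),            # 7 days filling + 2 days shipping + 1 day
    ("40 LaCie 1.6 TB disks", 13.25),   # 8 packs x 30 hrs + shipping + last pack
    ("StorEdge tape library", 35),      # ~17 days backup + shipping + ~17 days restore
]
for name, days in scenarios:
    tb_day, mbps = effective_rate(days)
    print("%-24s %5.1f TB/day  ~%3.0f Mbps" % (name, tb_day, mbps))
# -> about 2.4 TB/day (220 Mbps), 5 TB/day (463 Mbps), 3.8 TB/day (350 Mbps), 1.4 TB/day (132 Mbps)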

This exercise is only half of the picture, of course. I did not take into account bandwidth, system, media and shipping prices. But since the PetaBox has no public price tag, I didn’t bother looking up the other ones. Maybe later.

It’s the latency, stupid!

While working on some bandwidth-related stuff (my bandwidth calculator), I came across an excellent article on “latency vs. bandwidth” by Stuart Cheshire. It was originally written in 1996, so it focuses a lot on modems, but Facts 1, 2 and 4 are still valid.

His points:

Fact One: Making more bandwidth is easy

You can just put enough slow connections in parallel to get a fast one.

Fact Two: Once you have bad latency you’re stuck with it

Parallel devices, compression, … nothing helps!

Fact Three: Current consumer devices have appallingly bad latency

Modems are evil (but now, with cable and ADSL, this is less of an issue)

Fact Four: Making limited bandwidth go further is easy

Compression and caching help a lot. (This article was written about the time MP3 was invented, but long before it became hugely popular. DivX came later, in 1999)

The following calculation is eye-opening:

  1. The distance from Stanford to Boston is 4320 km.
  2. The speed of light in vacuum is 300 x 10^6 m/s.
  3. The speed of light in fibre is roughly 66% of the speed of light in vacuum.
  4. The speed of light in fibre is 300 x 10^6 m/s * 0.66 = 200 x 10^6 m/s.
  5. The one-way delay to Boston is 4320 km / 200 x 10^6 m/s = 21.6 ms.
  6. The round-trip time to Boston and back is 43.2 ms.
  7. The current ping time from Stanford to Boston over today’s Internet is about 85 ms:

[cheshire@nitro]$ ping -c 1 lcs.mit.edu
PING lcs.mit.edu (18.26.0.36): 56 data bytes
64 bytes from 18.26.0.36: icmp_seq=0 ttl=238 time=84.5 ms

  8. So: the hardware of the Internet can currently achieve within a factor of two of the speed of light.
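For the record, the same calculation in a few lines of Python, using the distance and ping time quoted above:

# Speed-of-light lower bound on the Stanford-Boston round trip
distance_m = 4320e3            # Stanford to Boston, ~4320 km
c_vacuum = 3.0e8               # speed of light in vacuum, m/s
c_fibre = 0.66 * c_vacuum      # roughly 2 x 10^8 m/s in fibre

one_way_ms = distance_m / c_fibre * 1000
round_trip_ms = 2 * one_way_ms
measured_ms = 84.5             # the ping result above

print("theoretical RTT: %.1f ms, measured: %.1f ms, ratio: %.1f" %
      (round_trip_ms, measured_ms, measured_ms / round_trip_ms))
# -> theoretical RTT: 43.6 ms, measured: 84.5 ms, ratio: 1.9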

Definitions of latency:

  • “Latency, a synonym for delay, is an expression of how much time it takes for a packet of data to get from one designated point to another.” (techtarget.com)
  • “Latency is the time a message takes to traverse a system.” (wikipedia.org)