Monthly Archive for November, 2004

M-Audio FastTrack vs. Line6 GuitarPort


M-Audio just released the FastTrack USB, a break-out box for recording guitar and voice. Looks like they made a GuitarPort clone, but with a more ‘serious’ design. That is: a basic A/D converter with a jack input and a USB output, and a bunch of effect software for your computer.
A quick comparison:

            | FastTrack                | GuitarPort
Price:      | $129                     | $99
Input:      | 1/4″ jack, line, XLR     | 1/4″ jack
Controller: | gain, input, output      | output
Software:   | GT Player Express (1.0?) | GuitarPort 2.5 + RiffWorks
Platform:   | Windows/Mac              | Windows


I currently own a GuitarPort and I love it. It does an awesome ‘Smoke on the water’ and ‘Long train running’, and it works without a glitch. The FX software shows every effect (noise gate, delay, reverb, phaser, …) as an amp frontpanel. I just upgraded to the 2.5 software. At first sight, they added a parametric equalizer, a metronome and they finally figured out it made sense to group the presets in genres. You gotta love Line6: they sell people great hardware and keep giving them better software for free every couple of months.
I also have an M-Audio Ozone and that has been a bit of a disappointment. The latency of the MIDI controller is way too big, and the keyboard seems to lose its USB connection every couple of minutes, which makes for a stack of warning windows if you leave the PC alone for an hour. Anyway, I’m going to upgrade my DAW PC to Windows XP and see if it works any better.

How do you move a terabyte?


I recently discovered Brewster Kahle’s speech on the NotCon ’04 podcast about the ambition of The Internet Archive to archive absolutely everything (all books, all movies, all music, …). (There is an excellent transcript on www.hotales.org .) They are currently setting up a second datacentre in Amsterdam, as an off-site copy of the original archive.org. They use massively parallel storage nodes grouped together in a PetaBox rack. You actually need 10 PetaBoxes to get to 1 petabyte (1 rack = 80 servers x 4 disks x 300 GB/disk = 96 TB, or roughly 100 TB). Since the rack uses node-to-node replication (every node has a sister node that holds a copy of all its data, so that if either node crashes, the data is still available), the net storage is about 50 TB.
So this got me thinking: how do you ‘copy’ the contents of PetaBox A to PetaBox B, how do you move 50TB?
Let’s try some numbers from my bandwidth calculator:

  • ISP: a regular ADSL/cable throughput is 1,3 TB/month. 50TB would take 38 months, just over 3 years, which is a bit slow.
  • LAN: a 100Mbps connection can theoretically deliver 32 TB/month, but let’s take 15 TB/month as a reasonable real throughput: 50TB would take 3,3 months. Gigabit Ethernet tops out at about 10 TB/day, so a full transfer in 7 days looks realistic.
  • WAN: a dedicated OC12 optical line delivers a theoretical 6,72 TB/day. Even if in practice this would be only 3 TB/day, this cuts the copy time to 17 days. With OC48 (four times the OC12 rate), this goes up to a theoretical 26,9 TB/day, so the transfer is possible in something like 3-4 days.
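
The theoretical figures in this list come down to a unit conversion. Here is a minimal Python sketch of that arithmetic, assuming 1 TB = 10^12 bytes, a nominal OC12 line rate of 622.08 Mbps, and no protocol overhead. The function names and the `efficiency` factor are made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def tb_per_day(megabits_per_second: float) -> float:
    """Theoretical throughput in TB/day, with 1 TB = 10**12 bytes."""
    bytes_per_second = megabits_per_second * 1_000_000 / 8
    return bytes_per_second * SECONDS_PER_DAY / 1e12


def days_to_move(terabytes: float, megabits_per_second: float,
                 efficiency: float = 1.0) -> float:
    """Days needed to move `terabytes`; `efficiency` is an assumed
    fraction of the line rate that is actually achieved (hypothetical)."""
    return terabytes / (tb_per_day(megabits_per_second) * efficiency)


links = [("100 Mbps LAN", 100), ("Gigabit Ethernet", 1000), ("OC12", 622.08)]
for name, mbps in links:
    print(f"{name}: {tb_per_day(mbps):.2f} TB/day, "
          f"50 TB in {days_to_move(50, mbps):.1f} days at full line rate")
```

With an assumed 70% efficiency, `days_to_move(50, 1000, efficiency=0.7)` comes out at a bit under 7 days, which is in line with the Gigabit Ethernet estimate.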

Can this be done without a network connection? Can we tap the data out of one system, put it on some kind of transport and reload it at the new location? (See Microsoft’s Jim Gray, who calls this a kind of ‘sneakernet’.)

  • use a third PetaBox C, set it up next to PetaBox A, connect them via Gigabit Ethernet, let them synchronize for 7 days, put PetaBox C on a truck/boat/plane, hope not too many disks are damaged during transport (this is the tricky bit: if both copies of a file are lost, you can start all over again), set it up next to PetaBox B and again let them replicate for a week. If the total procedure takes three weeks, you’ve just moved 50 TB in 21 days, which is about 2,4 TB/day or roughly 220 Mbps.
  • copy everything to Apple Xserve RAID systems. These have 14 disks of 400 GB, which is 5,6 TB/system of unprotected storage, or (using RAID-5 per set of 7 disks) 4,8 TB. Since it uses a 400MB/s (3.2 Gbps) Fibre-Channel interface, the disk speed should not be a bottleneck. A system is filled over Gigabit Ethernet in a bit more than half a day (let’s say 1 week for all data), and you need 11 systems to store all data. Luckily, you can start shipping the first Xserve RAID right after it’s filled, while the 2nd one is still being filled. A fully equipped Xserve RAID weighs 45kg/110pounds, so you’ll need more than an envelope, but let’s say you could ship it anywhere in 2 days. Then the whole procedure will take 7 + 2 + 1 days = 10 days, which is a transfer rate of 5 TB/day or 463 Mbps.
  • You could do something similar with Lacie Bigger Disk Extreme 1.6TB disks (although in my experience, this type of disk does not support continuous writing very well). Their bottleneck is probably the FireWire-800 write speed, which can be estimated at 25 MB/s or 90GB/hour. This means that it takes about 18 hours to fill a Bigger Disk 1.6TB. You could probably fill several disks at the same time, since the Gigabit Ethernet can easily deliver that. In total you would need at least 32 full disks, but since there is no redundancy on the disks, you would need a system to check whether all objects were copied correctly on the target system. This you could do by exchanging lists of object identifiers, file sizes and hashes, probably in files that are ‘only’ megabytes. So let’s say you need 40 disks (some objects will be transferred a 2nd or 3rd time if they arrived in a bad state). We can ship them in packages of 5 disks, which is 8TB at a time. These 5 disks take something like 30 hours to load (if we can always load 3 disks simultaneously). Total procedure: 8 * 30hrs + 2 days shipping + 30hrs to load the last pack: 13.25 days or 3,7 TB/day (350 Mbps).
  • There is the Sun StorEdge L500 Tape Library that could back up the complete 50TB, using up to 400 LTO cartridges. Its speed is 126 GB/hour or about 3 TB/day. So it would easily take over a month to back up PetaBox A, ship the StorEdge and restore the data to PetaBox B. That’s less than 150 Mbps.
  • just for fun: you would need over 60.000 CD-ROMs to pack those 50 TB. Don’t even think about how long it would take to actually write them, or who would write their unique number on the sleeves. There are double-sided writable DVDs of 8,75 GB each. With about 5800 of them, you could do the job.
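
The "effective bandwidth" of a shipment is just the total payload divided by the door-to-door time. Here is a small Python sketch of that conversion, assuming 1 TB = 10^12 bytes and reusing the 10-day and 13.25-day totals from the Xserve RAID and Bigger Disk scenarios above. The function name is made up.

```python
def effective_rate(terabytes: float, total_days: float) -> tuple[float, float]:
    """Return (TB/day, Mbps) for moving `terabytes` in `total_days`,
    counting copying, shipping and reloading as one long 'transfer'."""
    tb_per_day = terabytes / total_days
    megabits_per_second = tb_per_day * 1e12 * 8 / 86_400 / 1e6
    return tb_per_day, megabits_per_second


for label, days in [("Xserve RAID", 10), ("Bigger Disk", 13.25)]:
    tb_day, mbps = effective_rate(50, days)
    print(f"{label}: {tb_day:.1f} TB/day, about {mbps:.0f} Mbps")
```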

This exercise is only half of the picture, of course. I did not take into account bandwidth, system, media and shipping prices. But since the PetaBox has no public pricetag, I didn’t bother searching for the other ones. Maybe later.

Idea: using a URI for sending email

In order to send an email over SMTP, you need 2 sets of information:

  • WHAT: the content of the email, i.e. from-address, subject, to/cc addresses and the actual message (in TXT and/or in HTML)
  • HOW: you also need the name/address of an SMTP server (and its SMTP port), and optionally a username/password to authenticate to the server (ESMTP) – these would typically be fixed for the email sending application.

On the other hand, you have the HTTP GET format where you can put everything you need to execute the request in 1 string:
http://james:password@www.example.com/hr/request/?type=holiday&start=2004-11-22&end=2004-11-24
No need to save the ‘server/port’ data separate from the ‘request’ location or from the actual content.

Which inspired me to a similar format for sending email:

smtp://[username[:password]]@server[:port] /from:address [/subject:text] [/to:address] [/cc:address] [/bcc:address] [?[subject=text] [&message=text] [&to=text] [&cc=text] [&bcc=text] [&attachment=file]]

Some examples:

  • smtp://relay.example.com/from:me@example.com/subject:Test+was+OK%21/to:james@destination.com/?message=Here+I+am%21
  • smtp://localhost:2525/from:monitor@example.com/subject:webserver+3+is+down?to=support@example.com&message=webserver+3+is+down+(time-out+after+10+seconds)
  • smtp://james:password@localhost/from:james@example.com/subject:this+is+authenticated/to:test@example.com
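
To check that the format can be parsed at all, here is a minimal Python sketch that splits such a URI into its parts using the standard `urllib.parse` module. It assumes the URI is written as one unbroken string. The function name `parse_smtp_uri`, the shape of the returned dictionary and the default port 25 are my own choices for illustration, not part of any standard.

```python
from urllib.parse import urlsplit, unquote_plus, parse_qs


def parse_smtp_uri(uri: str) -> dict:
    """Split a proposed smtp:// URI into server settings and mail fields."""
    parts = urlsplit(uri)
    if parts.scheme != "smtp":
        raise ValueError("not an smtp:// URI")

    fields = {}
    # Pseudo-folders such as /from:address or /subject:text
    for segment in parts.path.split("/"):
        if not segment:
            continue
        key, sep, value = segment.partition(":")
        if not sep:
            raise ValueError(f"expected key:value, got {segment!r}")
        fields.setdefault(key, []).append(unquote_plus(value))

    # Querystring entries such as ?message=text&to=address
    for key, values in parse_qs(parts.query).items():
        fields.setdefault(key, []).extend(values)

    return {
        "server": parts.hostname,
        "port": parts.port or 25,  # assumed default when no port is given
        "username": parts.username,
        "password": parts.password,
        "fields": fields,
    }


mail = parse_smtp_uri(
    "smtp://relay.example.com/from:me@example.com"
    "/subject:Test+was+OK%21/to:james@destination.com/?message=Here+I+am%21"
)
print(mail["server"], mail["fields"])
```

Every field is kept as a list, so repeated /cc:address entries need no special handling.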

Keep in mind: URL length limits are not standardized and depend on the implementation. As a rule of thumb I assume here that the part before the “?” should stay under 255 characters, and that a querystring (the part after the “?”) is limited to 4KB.

  • As a matter of principle, the from address should always be specified as the 1st level pseudo-folder /from:address/, because it should never be longer than 100 characters, and there should always be one, and only one.
  • If your subject is more than 200 characters, you need the ?subject=text notation. Otherwise the /subject:text/ is preferable because you would be less inclined to specify more than 1 subject.
  • If your message content is more than 200 characters, you need the ?message=text. If you need more than 4KB, you could use the equivalent of an HTTP POST, i.e. not specify it in the URL string, but stream the whole thing after sending the request.
  • You could allow multiple /cc:address/ entries, or just use a /cc:addr1;addr2;addr3/
  • to allow ‘pretty’ email addresses, you could allow /from:peter@example.com:Peter+Forret/. (Any better suggestions for that?)
  • You could allow a smtp://default/... if you still want your email sending application to choose its SMTP server.
  • Attachments can be specified with an address on a local disk, or on the Internet (with another URI)

The advantages of this system?

  1. you can specify a complete email in 1 single text string (with size limitations)
  2. one could easily specify a list of emails to be sent (e.g. a text file with 1 email per line), and use different SMTP servers for some of them (e.g. send the Hotmail/MSN ones directly to mx?.hotmail.com for speedy delivery)
  3. it can be interpreted by the SMTP client (email sending application), that translates it into a regular SMTP conversation (HELO ... MAIL FROM ... RCPT TO ... DATA)
  4. it could easily be accepted by a web server that accepts the URI with the smtp:// replaced by http://. One could send an email by just clicking on a link in a browser. The server would of course not accept these requests from just anyone, otherwise it would be an easy spam machine. But one could send emails with cURL or wget. How about that?
  5. if a company stores all sent messages on a special server, you could have a search-within-your-emails-site with +- the same URI format:
    http://peter:password@searchmail.example.com/from:peter@example.com/?subject=*contract*&notbefore=2004-11-15&notafter=2004-11-20&to=*robert*
  6. One could also adopt smtps:// for SMTP over TLS.
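
Point 3 can be sketched in a few lines of Python. The sketch assumes the URI has already been split into a dictionary of field lists (from, to, cc, bcc, subject, message), and it only builds the message and lists the SMTP commands a client would issue; nothing is sent. The function names and the `client.example.com` HELO name are hypothetical.

```python
from email.message import EmailMessage


def build_message(fields: dict) -> EmailMessage:
    """Build the mail itself; bcc addresses are deliberately left out of the headers."""
    msg = EmailMessage()
    msg["From"] = fields["from"][0]
    msg["Subject"] = fields.get("subject", [""])[0]
    for header in ("to", "cc"):
        if fields.get(header):
            msg[header.capitalize()] = ", ".join(fields[header])
    msg.set_content(fields.get("message", [""])[0])
    return msg


def smtp_dialogue(fields: dict, client_name: str = "client.example.com") -> list:
    """List the envelope commands a client would send before the message body."""
    recipients = [addr for key in ("to", "cc", "bcc") for addr in fields.get(key, [])]
    lines = [f"HELO {client_name}", f"MAIL FROM:<{fields['from'][0]}>"]
    lines += [f"RCPT TO:<{addr}>" for addr in recipients]
    lines.append("DATA")
    return lines


fields = {
    "from": ["me@example.com"],
    "to": ["james@destination.com"],
    "subject": ["Test was OK!"],
    "message": ["Here I am!"],
}
print("\n".join(smtp_dialogue(fields)))
print(build_message(fields))
```

In a real client you would hand the message to something like Python’s `smtplib.SMTP(...).send_message(msg)` rather than writing the dialogue yourself.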

If anyone can do something useful with this, be my guest!

ID3 metatags for podcast MP3s

There are 2 kinds of MP3 files I regularly download and throw on my iPod: podcasts and DJ mixes. Both suffer from the same problem: chaos and inconsistency in the usage of the ID3 metatags:

  • ‘Artist’ and ‘Album’ are not filled in
  • the ‘Album’ tag is used for the free-text description of the content
  • Some podcasters keep the ‘Artist’ field constant, some the ‘Album’ field, some change their logic every now and then
  • The ‘Title’ field always starts with the same 50 characters, so that if you see a bunch of them listed on your iPod, there’s no telling them apart

ID3 was clearly developed for CDs/albums and the podcasts/mixes above don’t really fit into that mould. Let’s elaborate on this.
There are actually two main types of albums:

the ‘Artist’ paradigm

A group or artist (e.g. “U2”) makes a new album (e.g. “Achtung Baby”) at some point in time (e.g. 1991). All the songs on the album would have the same ‘Artist’, ‘Album’, ‘Genre’ and ‘Year’ tag. Each would have a different song title.

the ‘Compilation’ paradigm

Here the ‘Album’ tag is constant, but all other tags can change. Services like GraceNote CDDB or FreeDB actually link the main ‘Genre’ and ‘Year’ to the album, not the song, so those would be the same for all the songs. But the ‘Song’ and ‘Artist’ tag can change for each track. (CDDB actually stores a bit field to indicate whether an album is a compilation)

Then you have the special cases:

‘Mixed/chosen by’ paradigm

“A night at the Playboy Mansion” is a compilation album (because it features songs by a bunch of people) but the album is released as a Dimitri from Paris one. I would consider this a compilation, just to get the correct (different) artist for each song. For this type of compilation (e.g. the LateNightTales series), CDDB often gets it wrong: it gives all songs the same ‘Artist’ field.

the ‘Remixes’ album

Contains songs by one group/artist, but they have been remixed. Who is the ‘Artist’ of a remix? I’m not even gonna mention (legal) copyright issues here. For the “Depeche Mode: Remixes 81-04”, the DM fans will claim that the artist is still the same, which means that all the remixer information has to go in the ‘Title’ field, and leads to titles like “Master And Servant (An ON-Usound Science Fiction Dance Hall Classic – Adrian Sherwood)”. Do you want these tracks to show up when you select ‘Artist’ = “Depeche Mode” as your playlist? I guess you would.

PODCASTS
So in what mould would podcasts fit? Let’s take three examples:

  • Daily Sourcecode
    Adam is a podcasting pioneer so he has already figured out how to use the metatags on his MP3s:

    Title: “DSC-2004-11-20”

    only 14 characters, to make sure you see the whole title on your iPod. Also, when you sort all songs alphabetically on ‘Title’, they are also sorted (reverse) chronologically. For a daily/weekly podcast, this makes a lot of sense. If the title contained a list of topics, it would be very hard to recall which ones you have already heard, whereas now you just need to remember how many days backlog you have.

    Artist: “Adam Curry”

    Well, the SourceCode is a one-man-show, so this is the only right ‘Artist’ field!

    Album: “1st cast from the cottage”

    This is the tricky one. The ‘Album’ field is the only one left to throw in some information on the contents, so that it shows up on the iPod. But having free-text ‘Album’ names in iTunes means that the ‘Album’ drill-down becomes very cluttered. A solution would be to use “Daily Sourcecode: …” as ‘Album’, but unfortunately the iPod only lets the ‘Title’ field scroll when it’s too long, not the other fields. So you could get “Daily Sourcecode: some ranting abou…” as ‘Album’ and that wouldn’t explain much. If Adam had several different podcasts, the ‘Album’ should simply be “Daily Sourcecode”, and all variable information should go into the ‘Title’ field. The way Adam did it works better on the iPod and less well in iTunes.

    Genre: “Podcast”

    Most podcasts set the ‘Genre’ to “Podcast” or “Speech”. This is a very good practice, since you can use a Smart Playlist on your iPod that shows you all the files with Genre = “Podcast” and Play Count = “0”, i.e. “all new podcasts”.
  • ITConversations
    This is a great podcast, but not a one-man show (so there are different ‘Artists’). Doug Kaye has been publishing since 2002, so here too the ID3 tags are well thought through:

    Title: “November 18, 2004”/“Elections 2004”

    For recurring programs with various topics (Gillmor Gang) just the date, and for events a short topic description. Best of both worlds!

    Artist: “The Gillmor Gang”/“Ed Cone”

    The ‘Artist’ refers to the speaker(s) of the program.

    Album: “IT Conversations”/“Bloggercon III”

    ITConversations can be compared to a ‘Record Label’ that produces ‘Compilations’. There is a ‘Gnomedex 4.0’ compilation, a ‘Bloggercon III’ compilation, and a general ‘IT Conversations’ album that includes the Gillmor Gang.

    Genre: “Speech”

    Consistently used in all podcasts.

    Copyright: “RDS Strategies LLC”

    Interesting information, but no way to find out who RDS is, or how to contact them in case you would want to redistribute the content
  • WeFunk Radio
    Great music, but the MP3 tags leave something to be desired:

    Title: “WeFunk_Show_354_2004-11-06”

    Just too long, and clearly the same as the filename (so underscores instead of spaces). Better would be: “WeFunk 2004-11-06” or “WeFunk Show #354”.

    The rest: empty

    Obviously not good
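
The “all new podcasts” Smart Playlist idea, combined with date-based titles, boils down to a filter plus a sort. Here is a minimal Python sketch of that logic. The track list and its play counts are made-up sample data, and `new_podcasts` is a hypothetical helper, not something iTunes exposes.

```python
tracks = [
    {"title": "DSC-2004-11-20", "artist": "Adam Curry", "genre": "Podcast", "play_count": 0},
    {"title": "DSC-2004-11-19", "artist": "Adam Curry", "genre": "Podcast", "play_count": 1},
    {"title": "November 18, 2004", "artist": "The Gillmor Gang", "genre": "Speech", "play_count": 0},
    {"title": "DSC-2004-11-18", "artist": "Adam Curry", "genre": "Podcast", "play_count": 0},
]


def new_podcasts(tracks, genres=("Podcast", "Speech")):
    """Unplayed tracks in the given genres, sorted alphabetically on title."""
    unplayed = [t for t in tracks if t["genre"] in genres and t["play_count"] == 0]
    return sorted(unplayed, key=lambda t: t["title"])


for track in new_podcasts(tracks):
    print(track["title"], "/", track["artist"])
```

The “DSC-YYYY-MM-DD” titles end up in chronological order simply because they sort alphabetically, while a title like “November 18, 2004” lands wherever its first letter puts it.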

Some more remarks:

  • If you have a website and you want visitors, include the URL in the metatags. The MP3 might start leading its own life (get copied, transferred) and an interested listener might not have seen the RSS/site the MP3 was published on.
  • Why not include a link to a Creative Commons license?
  • ID3v2 tags should sit at the beginning of the MP3 file, so a cache/proxy could start streaming the MP3 and adapt e.g. the ‘Copyright’ and ‘Genre’ fields in the first 5KB.
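
To see how much of the start of a file an ID3v2 tag occupies, you only need its first 10 bytes: the marker “ID3”, two version bytes, one flags byte and a 4-byte “syncsafe” size (7 useful bits per byte). Here is a minimal Python sketch. The function name is made up, and the optional 10-byte footer that ID3v2.4 allows is ignored.

```python
def id3v2_tag_size(header: bytes):
    """Return the total tag length in bytes (10-byte header included),
    or None if `header` does not start with a valid ID3v2 header."""
    if len(header) < 10 or header[:3] != b"ID3":
        return None
    size = 0
    for byte in header[6:10]:
        if byte & 0x80:  # the top bit must be clear in a syncsafe integer
            return None
        size = (size << 7) | byte
    return size + 10


# Hand-built ID3v2.3 header announcing 257 bytes of tag data
sample = b"ID3\x03\x00\x00\x00\x00\x02\x01"
print(id3v2_tag_size(sample))  # 267
```

A proxy could read these 10 bytes, buffer that many bytes of the stream, rewrite the frames it cares about, and pass the audio data through untouched.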