Mp3d:Unicode

From Definitelynotsafe Wiki

The way mp3d handles Unicode is changing.

If you copy your music to ceylon, pay attention!

There is a new requirement for mp3d's music: all filenames must be NFC-normalized UTF-8 Unicode. If you copy music from a Mac, you will have to change the way you do the copying. You shouldn't have to change anything on your own machine.

1. Ensure that during the copy, filenames will be translated.

If you use rsync to copy from Mac OS X,

  • Upgrade rsync to 3.0.0 or higher, and
  • Add the switch --iconv=UTF-8-MAC,UTF-8 to your recipe (the first charset is the local Mac side, the second is the remote side)

If you use Cyberduck to copy from Mac OS X or Windows,

  • Make sure to run a build higher than 8110, which includes a fix (8110 based on bug 5162) for normalization. You may also have to run a Terminal command on Mac OS X.

If you use FileZilla,

  • Make sure to run the latest version. Version 3.3.5.1 has been tested and handles normalization natively.

If you use something else,

  • contact me for guidance
  • or, if you figure it out, add it to this page

2. Switching over might require recopying a very large amount of music. To avoid double-copying, I can save time by fixing all the existing filenames on ceylon in place.
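For the curious, fixing the names in place could look roughly like the sketch below. This is an illustration only, not the script actually run on ceylon; the function names nfc_renames and normalize_tree are made up for this example, and it assumes a Linux filesystem that treats NFC and NFD names as different files.

```python
import os
import unicodedata


def nfc_renames(root):
    """Yield (old_path, new_path) for every entry under root whose name is not NFC."""
    # Walk bottom-up so that a directory's contents are handled
    # before the directory itself gets a new name.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            fixed = unicodedata.normalize("NFC", name)
            if fixed != name:
                yield os.path.join(dirpath, name), os.path.join(dirpath, fixed)


def normalize_tree(root, dry_run=True):
    """Rename non-NFC entries under root to NFC; return the list of (old, new) pairs."""
    renamed = []
    for old, new in nfc_renames(root):
        if os.path.lexists(new):
            # An NFC twin already exists; leave both alone for a human to sort out.
            print(f"skipping {old!r}: {new!r} already exists")
            continue
        if not dry_run:
            os.rename(old, new)
        renamed.append((old, new))
    return renamed
```

Calling normalize_tree on the music directory with the default dry_run=True only reports what would change; passing dry_run=False performs the renames.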

Exhaustive technical explanation of the problem

I'm going to move quickly here. If I leave you behind, Google will lead you to better explanations than I could give. Go ahead, I'll wait.

Pretty much every computer system these days stores strings (such as filenames) as sequences of Unicode codepoints. Unicode is a character set that contains all the letters and symbols used by virtually every human language. To be stored or transmitted, a Unicode string must be encoded into a particular sequence of byte values. The two most popular encodings are UTF-16 and UTF-8. Linux, and most programs originally written for Linux, use UTF-8 almost exclusively. Mac OS X and Windows both use primarily UTF-16. This is all well and good, and more or less everyone has figured out how to make it work. For a good reference up to this point, I suggest Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
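To make the encoding step concrete, here is a small Python sketch that encodes the same string both ways. The sample string is just an illustration; nothing here is specific to mp3d.

```python
# One string, i.e. one sequence of codepoints: c, a, f, U+00E9.
text = "caf\u00e9"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no byte-order mark

print(utf8_bytes.hex(" "))   # 63 61 66 c3 a9
print(utf16_bytes.hex(" "))  # 63 00 61 00 66 00 e9 00

# Decoding with the matching encoding recovers the same codepoints.
assert utf8_bytes.decode("utf-8") == text
assert utf16_bytes.decode("utf-16-le") == text
```

The bytes differ completely, but both decode back to the same four codepoints, which is why the UTF-8 versus UTF-16 split is not the problem here.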

Many characters can be represented in more than one way in Unicode. Consider é, the fifth letter of the English alphabet with an acute accent. It can be written as either of these codepoint sequences:

Codepoints      Characters   Codepoint names
U+0065 U+0301   e ´          LATIN SMALL LETTER E, COMBINING ACUTE ACCENT
U+00E9          é            LATIN SMALL LETTER E WITH ACUTE

The first way is called decomposed (aka NFD): the complex character é is broken down into its simpler components e and acute-accent. The second way is called precomposed (aka NFC): the separate components e and acute-accent are combined into the whole character é.

A decomposed string and a precomposed string may "say" the same thing, but to a naive program they look different. Whether treated as a sequence of codepoints or a sequence of bytes, they are not the same. It takes a great amount of sophistication to understand when decomposed and precomposed strings truly mean the same thing.
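A quick way to see this for yourself is Python's standard unicodedata module. The sketch below compares the two forms of é naively, then again after normalizing; the variable names are only for illustration.

```python
import unicodedata

decomposed = "e\u0301"   # U+0065 U+0301 (NFD)
precomposed = "\u00e9"   # U+00E9 (NFC)

# Naive comparisons: different codepoints, different bytes.
print(decomposed == precomposed)           # False
print(len(decomposed), len(precomposed))   # 2 1
print(decomposed.encode("utf-8"))          # b'e\xcc\x81'
print(precomposed.encode("utf-8"))         # b'\xc3\xa9'

# After normalizing both to the same form, they compare equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```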

Three pieces of software involved in mp3d have different ways of handling normalization:

  • Apache, being Linux software, treats filenames as a sequence of meaningless independent bytes. When Apache receives a URL, it looks for a file to serve with exactly that name. If there is a file with the same name in a different normalization, Apache will not see the file, and will give a 404 error. This behavior may actually be part of Apache, APR, or the Linux kernel. For our (mp3d team's) purposes the distinction is irrelevant.
  • Web browsers, especially Safari, treat URLs with non-ASCII characters as UTF-8-encoded sequences of Unicode codepoints. The W3C says URLs are expected to be composed (NFC), and Safari (the only browser I tried) assumes this. When given an NFD URL, it may on its own renormalize the URL as NFC.
  • Mac OS X's filesystem HFS+ stores all filenames normalized as decomposed (NFD) strings. If a Mac user copies a file to a Linux system hosting mp3d, the resultant filename will be in NFD. Normally, this is okay, because as mentioned above most Linux software treats filenames as inscrutable byte strings not to be tampered with.

The problem arises when all three meet, as they do in mp3d.

A file copied from a Mac system will have a name in NFD. mp3d, following Linux convention, will generate a URL based on its filename without changing normalization. Web browsers expect that URL to be NFC. They may convert the NFD URL to NFC before passing it back to Apache. Apache searches for a file with the NFC name, and fails, because the file has a different name in NFD.
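The chain of events above can be simulated in a few lines of Python. This is a toy model under stated assumptions: the filename is invented, the set files_on_disk stands in for the directory Apache serves, and lookup is a hypothetical stand-in for Apache's byte-exact matching, not its real code.

```python
import unicodedata
from urllib.parse import quote, unquote

# A file copied from a Mac arrives with an NFD name.
name_on_disk = unicodedata.normalize("NFD", "Caf\u00e9.mp3")
files_on_disk = {name_on_disk}  # stand-in for the served directory

# mp3d builds the link from the filename without changing normalization.
link = quote(name_on_disk)
print(link)  # Cafe%CC%81.mp3

# The browser may renormalize the URL to NFC before requesting it.
requested = quote(unicodedata.normalize("NFC", unquote(link)))
print(requested)  # Caf%C3%A9.mp3


def lookup(url_path):
    """Byte-exact lookup, the way a normalization-unaware server behaves."""
    return 200 if unquote(url_path) in files_on_disk else 404


print(lookup(link))       # 200: the untouched NFD link would have worked
print(lookup(requested))  # 404: the NFC request matches no file
```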

Confused yet? I am.

So how do we fix it?

  • Change mp3d to generate all links as NFC. This doesn't help, because when the link goes to Apache, it still won't find the file!
  • Change Apache to search for files under different normalizations. This seems difficult, and it may be philosophically wrong. mod_speling and mod_rewrite don't know how to fix this problem. It would be a great undertaking to change the innards of Apache to suddenly understand Unicode.
  • Stop using Apache and have mp3d resolve all requests and Do The Right Thing all the time with Unicode. Well that's just a big pain. I sure don't want to be on the hook for avoiding the security issues here. Directory traversal with user-supplied paths is a hard enough thing to deal with, and I think I actually have something that works now. However, I recognize that this is probably the platonic ideal solution. Patches gladly accepted!
  • Demand that all files mp3d serves must be named in NFC. This would not be enforced by mp3d, except by means of bugs.

I like the sound of that last one, because it means I can write shell instead of Perl, and also instead of doing things myself I can delegate them to other people.
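If you want to check a directory against the rule yourself, the test is simple enough to sketch. The Python below is an illustration rather than anything mp3d ships; is_nfc_utf8 and offending_names are hypothetical names. It works on raw bytes so that names which are not valid UTF-8 at all are caught too.

```python
import os
import unicodedata


def is_nfc_utf8(raw_name: bytes) -> bool:
    """True if raw_name is valid UTF-8 and already NFC-normalized."""
    try:
        text = raw_name.decode("utf-8")  # strict: invalid bytes raise
    except UnicodeDecodeError:
        return False
    return unicodedata.normalize("NFC", text) == text


def offending_names(root: bytes):
    """Yield the path of every entry under root that breaks the rule."""
    # Passing bytes to os.walk makes it hand back raw, undecoded names.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_nfc_utf8(name):
                yield os.path.join(dirpath, name)
```

For example, is_nfc_utf8(b"caf\xc3\xa9.mp3") is True, while both the NFD spelling b"cafe\xcc\x81.mp3" and the Latin-1 spelling b"caf\xe9.mp3" are rejected.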

Read more about it
