Do not fear Unicode

Although highly informative, Dave Shea’s advice on foreign languages is a bit confusing when it comes to Unicode.

First of all, Unicode is not a file format. It is mainly a long list of names and numbers for each character known to man (and then some more). Read Tim Bray’s excellent On the Goodness of Unicode to find out exactly what Unicode is, how it works, and why it exists.
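To make the "names and numbers" idea concrete, here is a small Python sketch that looks up the number (code point) and official name of a few characters. The helper name describe is my own invention for illustration; ord and unicodedata.name come from Python's standard library.

```python
import unicodedata


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Escapes are used so the example does not depend on how this file is saved.
for ch in ["A", "\u00e9", "\u2192"]:
    print(describe(ch))
# U+0041 LATIN CAPITAL LETTER A
# U+00E9 LATIN SMALL LETTER E WITH ACUTE
# U+2192 RIGHTWARDS ARROW
```

Note that nothing here says anything about bytes or files yet. That is the job of an encoding.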

Unicode characters are stored in files using one of the Unicode character encodings, the most well-known of which are utf-8 and utf-16. On the web, always use utf-8.
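As a rough illustration of what "stored using an encoding" means, the Python sketch below turns the same characters into bytes with utf-8 and with utf-16. The function name to_hex is hypothetical, and I picked the little-endian flavour of utf-16 without a byte order mark purely to keep the output short.

```python
def to_hex(text, encoding):
    """Encode text with the given encoding and return the bytes as hex."""
    return text.encode(encoding).hex()


e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
sun = "\u65e5"      # a common Japanese character

print(to_hex(e_acute, "utf-8"))      # c3a9
print(to_hex(e_acute, "utf-16-le"))  # e900
print(to_hex(sun, "utf-8"))          # e697a5
print(to_hex(sun, "utf-16-le"))      # e565
```

Same characters, same Unicode numbers, different bytes on disk. That is why a program has to be told (or has to guess) which encoding a file uses.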

The first 128 characters in utf-8 are encoded in exactly the same way as the first 128 characters in most western and east-european single-byte encodings. When you open a utf-8 encoded file in a text editor, and the text editor does not know how to handle utf-8 (or does not know what it is looking at), you can still see all of the basic latin characters. The foreign characters, however, will show up as weird multi-character garbage. When this happens, make sure you are using a text or html editor that supports utf-8, and tell your editor to treat the file as such.
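The garbage effect is easy to reproduce. The Python sketch below encodes a word as utf-8 and then reads the bytes back the wrong way, as latin-1, which is roughly what a confused editor does. The helper misread_as is hypothetical, and latin-1 is only one example of a single-byte encoding an editor might assume.

```python
def misread_as(text, wrong_encoding):
    """Encode text as utf-8, then decode the bytes with the wrong encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


word = "caf\u00e9"

# Basic latin characters have the same bytes in ASCII and utf-8.
assert "cafe".encode("utf-8") == "cafe".encode("ascii")

# The accented letter takes two bytes in utf-8, so it shows up
# as two unrelated latin-1 characters.
print(misread_as(word, "latin-1"))  # cafÃ©

# Reading the bytes as utf-8 gives the original text back.
print(word.encode("utf-8").decode("utf-8") == word)  # True
```

The "caf" part survives untouched, which matches what you see in an editor: the plain letters look fine and only the accented ones fall apart.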

IMHO there is no reason to fear Unicode. Once you make sure you are using tools that support Unicode and you understand how to make them read and write utf-8, life becomes a lot easier.

Because Unicode covers virtually every character in use, you can just copy any foreign text straight from your web browser or word processor into your editor. You don’t have to remember any character entities anymore. You can use decent punctuation and nice arrows. Hell, you can even decide to post weblog items in Japanese!
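To see that an entity and the literal character are the same thing in the end, here is a tiny Python sketch using html.unescape from the standard library. The sample string is made up for the occasion.

```python
import html

# The old way: spell out every special character as an entity.
with_entities = "&ldquo;caf&eacute;&rdquo; &rarr;"

# The utf-8 way: just type (or paste) the characters themselves.
# Escapes are used here only so the example survives any file encoding.
literal = "\u201ccaf\u00e9\u201d \u2192"

print(html.unescape(with_entities) == literal)  # True
```

A browser ends up with exactly the same characters either way, so in a utf-8 document the entities buy you nothing (apart from the few that html syntax itself needs, such as the ones for the ampersand and the less-than sign).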

(Non-asian Windows users will probably have to install some fonts if they want to see it. An easy way to do this is to visit Microsoft Japan. Mac owners and people running recent Linux distros have all the needed fonts installed.)