Unicode in Python
I keep forgetting when to encode and when to decode, so here is a short reminder on how to convert between plain string objects and Unicode string objects in Python.
Encoding and decoding
s.encode([encoding[, errors]])
Returns a plain string encoded from the (plain or Unicode) string s using the given encoding (for example 'ascii', 'latin-1', 'iso-8859-1' to 'iso-8859-15' or 'utf-8') and error handling ('strict', 'replace' or 'ignore'). The encoding defaults to the system default encoding, normally 'ascii', and the error handling defaults to 'strict'.
s.decode([encoding[, errors]])
Returns a Unicode string decoded from the plain string s using the given encoding and error handling. This method is equivalent to the unicode function.
unicode(s[, encoding[, errors]])
Returns a Unicode string decoded from the given plain string s using the given encoding and error handling, as the short example below shows.
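In short (a minimal round trip of my own, using only the standard codecs):
>>> t = u'caf\xe9'
>>> s = t.encode('utf-8')
>>> s
'caf\xc3\xa9'
>>> s.decode('utf-8') == unicode(s, 'utf-8') == t
True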
Examples
Printing a Unicode string
>>> test = u"Make \u0633\u0644\u0627\u0645, not war."
>>> print test
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
A Unicode string literal is assigned to test and printed to the console. The Unicode string contains a few Arabic characters, escaped using \u followed by the character code as four hexadecimal digits.
When printing, Python implicitly encodes the Unicode string using the default codec. An error is raised because the default 'ascii' codec only accepts characters with codes between 0 and 127.
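The default codec can be inspected with sys.getdefaultencoding(); on a standard installation it is 'ascii', unless it has been changed in site.py (a quick check of my own, not part of the session above):
>>> import sys
>>> sys.getdefaultencoding()
'ascii'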
Below, the same encoding is done explicitly:
>>> print test.encode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
Use a different error handling method to suppress the error:
>>> print test.encode('ascii', 'replace')
Make ????, not war.
>>> print test.encode('ascii', 'ignore')
Make , not war.
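As an aside, Python 2.3 adds two more error handling methods for encoding, 'xmlcharrefreplace' and 'backslashreplace' (they are not available in earlier versions):
>>> print test.encode('ascii', 'xmlcharrefreplace')
Make &#1587;&#1604;&#1575;&#1605;, not war.
>>> print test.encode('ascii', 'backslashreplace')
Make \u0633\u0644\u0627\u0645, not war.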
Use the 'utf-8' codec to actually display the Arabic characters. This only works on a console that supports utf-8 encoding.
>>> print test.encode('utf-8')
Make سلام, not war.
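If the console cannot display utf-8, the encoded bytes can still be inspected without printing them, since the interpreter shows the high bytes escaped (a small addition to the session above):
>>> test.encode('utf-8')
'Make \xd8\xb3\xd9\x84\xd8\xa7\xd9\x85, not war.'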
Files and character encodings
Besides basic Latin characters, the following file uses the accented letters ô, é and î and the left- and right-pointing guillemets from the Latin-1 Supplement block, and the right single quotation mark and horizontal ellipsis from the General Punctuation block. The file is encoded as utf-8.
Use the urllib module to download the file, then read it into a string object:
>>> import urllib
>>> petit = urllib.urlopen(
... "http://stuff.vandervossen.net/external/2003/petit.utf8")
>>> test = petit.read()
>>> test
'Alors vous imaginez ma surprise, au lever du jour, quand une dr\xc3\xb4le de petit voix m\xe2\x80\x99a r\xc3\xa9veill\xc3\xa9. Elle disait: \xc2\xab S\xe2\x80\x99il vous pla\xc3\xaet\xe2\x80\xa6 dessine-moi un mouton! \xc2\xbb\n'
In the last line, the interactive interpreter calls test.__repr__() to display the test expression. This results in a complete string representation of test, such that eval(repr(test)) == test.
In this representation, the content of the file is displayed byte by byte. All byte values below 128 are shown as ASCII characters; the values 128 to 255 are escaped using \x followed by the byte value as two hexadecimal digits.
As can be seen, the utf-8 encoding for the accented letter ô is two bytes long (\xc3\xb4), the encoding of the horizontal ellipsis is three bytes long (\xe2\x80\xa6).
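This is easy to check on the individual characters (not part of the original session):
>>> u'\xf4'.encode('utf-8')
'\xc3\xb4'
>>> u'\u2026'.encode('utf-8')
'\xe2\x80\xa6'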
Use the decode method to decode the byte sequence to a Unicode string object:
>>> testu = test.decode('utf-8')
>>> testu
u'Alors vous imaginez ma surprise, au lever du jour, quand une dr\xf4le de petit voix m\u2019a r\xe9veill\xe9. Elle disait: \xab S\u2019il vous pla\xeet\u2026 dessine-moi un mouton! \xbb\n'
In this representation, characters with codes from 128 to 255 are escaped using \x followed by the character code as two hexadecimal digits. Characters with codes 256 and higher are escaped using \u followed by the character code as four hexadecimal digits.
The decimal character code for the accented letter ô is 244 (\xf4), the character code of the horizontal ellipsis is 8230 (\u2026).
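The ord() function returns these character codes, and hex() shows them in the notation used by the escapes (a small check of my own):
>>> ord(u'\xf4'), ord(u'\u2026')
(244, 8230)
>>> hex(244), hex(8230)
('0xf4', '0x2026')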
Use the len() function to return the number of items in each object:
>>> len(test)
164
>>> len(testu)
152
With utf-8, these 152 Unicode characters are encoded using 164 bytes.
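Encoding the Unicode string back to utf-8 gives the original 164 bytes again (a quick round-trip check that was not part of the original session):
>>> len(testu.encode('utf-8'))
164
>>> testu.encode('utf-8') == test
True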
Using the encode method, the Unicode string can be encoded to an iso-8859-1 encoded single-byte string:
>>> testu.encode('iso-8859-1', 'replace')
'Alors vous imaginez ma surprise, au lever du jour, quand une dr\xf4le de petit voix m?a r\xe9veill\xe9. Elle disait: \xab S?il vous pla\xeet? dessine-moi un mouton! \xbb\n'
The 'replace' error handling method replaces Unicode characters that cannot be represented in iso-8859-1 with a question mark.
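For comparison, the 'ignore' error handling method simply drops the characters that cannot be represented (my own variation on the example above):
>>> testu.encode('iso-8859-1', 'ignore')
'Alors vous imaginez ma surprise, au lever du jour, quand une dr\xf4le de petit voix ma r\xe9veill\xe9. Elle disait: \xab Sil vous pla\xeet dessine-moi un mouton! \xbb\n'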
The Unicode string can also be encoded to utf-16:
>>> testu.encode('utf-16')
'\xfe\xff\x00A\x00l\x00o\x00r\x00s\x00 \x00v\x00o\x00u\x00s\x00 \x00i\x00m\x00a\x00g\x00i\x00n\x00e\x00z\x00 \x00m\x00a\x00 \x00s\x00u\x00r\x00p\x00r\x00i\x00s\x00e\x00,\x00 \x00a\x00u\x00 \x00l\x00e\x00v\x00e\x00r\x00 \x00d\x00u\x00 \x00j\x00o\x00u\x00r\x00,\x00 \x00q\x00u\x00a\x00n\x00d\x00 \x00u\x00n\x00e\x00 \x00d\x00r\x00\xf4\x00l\x00e\x00 \x00d\x00e\x00 \x00p\x00e\x00t\x00i\x00t\x00 \x00v\x00o\x00i\x00x\x00 \x00m \x19\x00a\x00 \x00r\x00\xe9\x00v\x00e\x00i\x00l\x00l\x00\xe9\x00.\x00 \x00E\x00l\x00l\x00e\x00 \x00d\x00i\x00s\x00a\x00i\x00t\x00:\x00 \x00\xab\x00 \x00S \x19\x00i\x00l\x00 \x00v\x00o\x00u\x00s\x00 \x00p\x00l\x00a\x00\xee\x00t &\x00 \x00d\x00e\x00s\x00s\x00i\x00n\x00e\x00-\x00m\x00o\x00i\x00 \x00u\x00n\x00 \x00m\x00o\x00u\x00t\x00o\x00n\x00!\x00 \x00\xbb\x00\n'
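The leading \xfe\xff is the byte order mark, here in big-endian order; on a little-endian machine it would be \xff\xfe and every pair of bytes would be swapped. The 'utf-16-be' and 'utf-16-le' codecs encode without a byte order mark (a short example of my own, on a fragment rather than the whole string):
>>> u'dr\xf4le'.encode('utf-16-be')
'\x00d\x00r\x00\xf4\x00l\x00e'
>>> u'dr\xf4le'.encode('utf-16-le')
'd\x00r\x00\xf4\x00l\x00e\x00'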
Note: This is a work in progress. I will correct errors and add more on easy file handling and the new Unicode features in Python 2.3 soon.
Update: You might also want to read effbot.org’s Observations on Working With Non-ASCII Character Sets and Joel on software’s The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.