Understanding UTF-8 and Unicode in Perl

Hi! If any of these seem familiar to you, you're in the right place to figure out what to do:

I'm assuming that you want to use UTF-8, and your system is set up to use UTF-8, and your only problem is that something is messing it all up. Let's have a look at it, then, shall we?

Test subject

Let's use an example string to explain everything. My example string is: "äδДښ". Those characters are fairly likely to be included in most modern fonts yet sufficiently spread across various alphabets that you really need Unicode to be able to see them all.

The basics

First off, what's Unicode and what's UTF-8? Well, glossing over some nasty details, Unicode is essentially a huge set of characters, each of which has a number ("code point") associated with it. To keep things simple, the code points for all ASCII characters are the same as the corresponding ASCII codes, and the first 256 code points (U+0000 through U+00FF) likewise match the corresponding ISO-8859-1 codes.

Let's take the second character from our example string: "δ" (lowercase delta). Its code point is 0x03B4, which by Unicode convention is written as U+03B4.

Obviously that code point doesn't fit in a byte. Enter UTF-8, a way to encode all possible Unicode code points (ranging from 0x0 to 0x10FFFF) as sequences of bytes of varying length, with the convenient property that all ASCII characters are encoded in exactly the same way as in ASCII. That is to say: "A" is 0x41 in both ASCII and UTF-8. Non-ASCII characters are encoded in UTF-8 as sequences of two to four bytes, each of which has its most significant bit set (i.e. is 0x80 or larger). For example, our friend "δ" is encoded as the UTF-8 sequence 0xCE 0xB4.

For the sake of completeness, here's our example string with both its Unicode code points and the UTF-8 encoding:

        ä          δ          Д          ښ         ← literal characters
   U+00E4     U+03B4     U+0414     U+069A         ← Unicode code points
0xC3 0xA4  0xCE 0xB4  0xD0 0x94  0xDA 0x9A         ← UTF-8 encoding
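
If you'd like to reproduce that table yourself, here's a minimal sketch using the core Encode module. The variable names are mine, and the string is built from \x{...} escapes so the encoding of the script file itself doesn't matter:

```perl
use strict;
use warnings;
use Encode qw(encode);

# The example string, spelled out as code points.
my $string = "\x{E4}\x{3B4}\x{414}\x{69A}";

my @rows;
for my $char (split //, $string) {
    # ord() on a character string yields the Unicode code point.
    my $code_point = sprintf 'U+%04X', ord($char);

    # encode() returns the UTF-8 bytes; unpack 'C*' lists their values.
    my $utf8_bytes = join ' ',
        map { sprintf '0x%02X', $_ } unpack 'C*', encode('UTF-8', $char);

    push @rows, "$code_point => $utf8_bytes";
}

print "$_\n" for @rows;
```

Each printed row pairs one code point with its UTF-8 byte sequence, in the same order as the table.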

The Perl-specific basics

Perl gained Unicode support in version 5.6, and it matured considerably in 5.8. To preserve backward compatibility, it was decided to do that by adding a second type of string.

The big difference between those types of strings is that the old strings are byte-oriented (each index of the string refers to a byte; this means that, in UTF-8, characters can take up several indexes in the string) whereas the new strings are character-oriented (every Unicode character takes up a single index in the string; in other words, the string is a list of Unicode code points). For clarity:
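
A small sketch of that difference, assuming the example string arrives as UTF-8 bytes (the variable names are just for illustration):

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of the example string, written as explicit bytes.
my $bytes = "\xC3\xA4\xCE\xB4\xD0\x94\xDA\x9A";

# The same text as a character (Unicode) string.
my $chars = decode('UTF-8', $bytes);

# In the byte string, index 1 is the second byte of "ä".
printf "byte string:      length %d, index 1 holds 0x%02X\n",
    length($bytes), ord(substr($bytes, 1, 1));

# In the character string, index 1 is the whole "δ".
printf "character string: length %d, index 1 holds U+%04X\n",
    length($chars), ord(substr($chars, 1, 1));
```

Same text, but the byte string has eight indexes and the character string has four.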

If you don't do anything special, you'll only ever use byte strings in your code. String constants in your source containing literal UTF-8 (e.g. if you typed a literal "ä" into your editor and subsequently saved the file as UTF-8) will simply become UTF-8 encoded byte strings. Assuming you always read and write UTF-8, that's all your work done (except you can run into nasty-looking output if, for example, you accidentally truncate strings in the middle of a UTF-8 sequence).
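
To make the truncation problem concrete, here's a sketch of what happens when you try to keep "the first three characters" of the example string while it's still a byte string:

```perl
use strict;
use warnings;

# UTF-8 encoding of the example string, as a plain byte string.
my $bytes = "\xC3\xA4\xCE\xB4\xD0\x94\xDA\x9A";

# substr counts bytes here, so the cut lands in the middle of the
# two-byte sequence for "δ".
my $truncated = substr($bytes, 0, 3);

my $hex = join ' ', map { sprintf '0x%02X', $_ } unpack 'C*', $truncated;
print "kept bytes: $hex\n";

# utf8::decode returns false when its argument is not well-formed UTF-8.
# Work on a copy, because it converts in place when it succeeds.
my $copy     = $truncated;
my $is_valid = utf8::decode($copy);
print "still valid UTF-8? ", ($is_valid ? "yes" : "no"), "\n";
```

The result is a complete "ä" followed by a lone 0xCE, which is exactly the kind of dangling half-sequence that shows up as garbage on screen.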

If you don't care about that but need to convert strings between various encodings, you can use Perl's Encode module, particularly its from_to function. No big deal.
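
For instance, converting ISO-8859-1 data to UTF-8 with from_to might look like this (a minimal sketch; the input here is a single hard-coded byte):

```perl
use strict;
use warnings;
use Encode qw(from_to);

# "ä" as a single ISO-8859-1 byte.
my $data = "\xE4";

# from_to converts the byte string in place and returns the new
# length in bytes (or undef on failure).
my $new_length = from_to($data, 'iso-8859-1', 'UTF-8');

my $hex = join ' ', map { sprintf '0x%02X', $_ } unpack 'C*', $data;
print "converted to $new_length bytes: $hex\n";
```

Note that from_to works on byte strings on both ends; it never hands you a Unicode string.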

However, if running into messed up UTF-8 sequences bothers you, you'll want to transition to Unicode strings. So, let's talk about ways in which Perl starts using those.

Creating Unicode strings

There are many ways to go about that.
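
As a rough sketch, here are three common ones (not necessarily an exhaustive list): decoding a byte string explicitly, spelling out code points with escapes, and reading through a decoding I/O layer. The in-memory filehandle below just stands in for a real file. Putting "use utf8;" at the top of a script saved as UTF-8 has the same effect for string literals in the source.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for the example string, e.g. as received from a socket.
my $utf8_bytes = "\xC3\xA4\xCE\xB4\xD0\x94\xDA\x9A";

# 1. Decode a byte string that you know contains UTF-8.
my $from_decode = decode('UTF-8', $utf8_bytes);

# 2. Spell out the code points with \x{...} escapes (chr works too).
my $from_escapes = "\x{E4}\x{3B4}\x{414}\x{69A}";

# 3. Read through an I/O layer that decodes for you.
open my $in, '<:encoding(UTF-8)', \$utf8_bytes
    or die "Cannot open in-memory file: $!";
my $from_layer = <$in>;
close $in;

# All three are the same four-character string.
print "all equal\n"
    if $from_decode eq $from_escapes && $from_escapes eq $from_layer;
```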

Outputting Unicode strings

At some point you'll probably want to output your data. This is where it gets interesting.

With byte strings, and if you do nothing special, they'll simply be output byte by byte, with no changes. So far, so good. With Unicode strings, it depends on whether you are using any modifiers (as described above). If you are, the encoding specified by the modifier will be used. Otherwise, depending on whether you're using Perl's locale features, you'll either get the data transparently encoded into the system's native encoding, or Perl will wing it and probably get it wrong (and give you the fancy "Wide character in print" warning to let you know). Our example string will most likely show up as "��" or something like that.
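
Here's a sketch of the well-behaved case: a Unicode string written through an explicit encoding layer. I'm writing to an in-memory file so the resulting bytes can be inspected; for real output you would put the same layer on your actual handle, for example with binmode(STDOUT, ':encoding(UTF-8)').

```perl
use strict;
use warnings;

# The example string as a character (Unicode) string.
my $chars = "\x{E4}\x{3B4}\x{414}\x{69A}";

# The layer encodes every character to UTF-8 on its way out.
my $output = '';
open my $out, '>:encoding(UTF-8)', \$output
    or die "Cannot open in-memory file: $!";
print {$out} $chars;
close $out;

my $hex = join ' ', map { sprintf '0x%02X', $_ } unpack 'C*', $output;
print "$hex\n";
```

The bytes that come out are the UTF-8 encoding from the table earlier on, and there's no "Wide character" warning.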

On the other hand, if you're using UTF-8 in a byte string, the same modifiers will screw things up in exactly the opposite way: the encoding layer will treat the byte string as if each byte were a Unicode code point. This means that each byte of a multi-byte character is output as an individual character. In that case, what you see when outputting our example string is something like "Ã¤Î´ÐÚ" (give or take a few invisible control characters). Nasty!
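
You can watch this double encoding happen with a small sketch like the following, which pushes the UTF-8 byte string through a UTF-8 encoding layer and then does it properly by decoding first (again using in-memory files so the bytes can be inspected):

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 for the example string, held in a byte string.
my $bytes = "\xC3\xA4\xCE\xB4\xD0\x94\xDA\x9A";

# Wrong: the layer takes each byte for a code point and encodes it again.
my $doubled = '';
open my $bad, '>:encoding(UTF-8)', \$doubled
    or die "Cannot open in-memory file: $!";
print {$bad} $bytes;
close $bad;

# Right: decode to a Unicode string first, then let the layer encode it.
my $correct = '';
open my $good, '>:encoding(UTF-8)', \$correct
    or die "Cannot open in-memory file: $!";
print {$good} decode('UTF-8', $bytes);
close $good;

printf "double-encoded: %d bytes, correct: %d bytes\n",
    length($doubled), length($correct);
```

The double-encoded output is twice as long as it should be, because every one of the original eight bytes became a two-byte sequence of its own.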

The rules