Library of The Week: Punycode (and Other Unicode Tips)

Unicode support in JavaScript is far from perfect. Consider the following:

> '𝗝𝗦 forever'.substring(0,2)
'𝗝'

What's wrong?

First, let's quickly review some basic facts about Unicode. Unicode identifies individual characters by numbers called "code points", conventionally written in hexadecimal, in the range from U+0000 to U+10FFFF. Various character encodings are available, but for our purposes UTF-16 is the most important. The Unicode character space is divided into 17 planes of 65536 code points each.

Characters in the first plane, called the Basic Multilingual Plane, are represented in UTF-16 using 2 bytes. Any other character needs 4 bytes. You may also come across the term "code unit", which refers to an atomic 2-byte chunk. So you can say that in UTF-16 each character is represented by 1 or 2 code units. The name UTF-16 refers to the fact that each code unit has 16 bits.
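To make the 1-or-2 code unit rule concrete, here is a minimal sketch of the UTF-16 arithmetic. The helper name toUtf16CodeUnits is made up for this illustration and is not part of any library; the sample code point U+1D5DD is the first letter of the example string.

```javascript
// Sketch: encode one Unicode code point as UTF-16 code units.
// toUtf16CodeUnits is a hypothetical helper name, not a built-in.
function toUtf16CodeUnits(codePoint) {
  if (codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a Unicode code point: ' + codePoint);
  }
  if (codePoint <= 0xFFFF) {
    // Basic Multilingual Plane: a single 16-bit code unit.
    return [codePoint];
  }
  // Supplementary planes: subtract 0x10000, then split the remaining
  // 20 bits into a high (lead) and a low (trail) surrogate.
  var offset = codePoint - 0x10000;
  var high = 0xD800 + (offset >> 10);
  var low = 0xDC00 + (offset & 0x3FF);
  return [high, low];
}

var units = toUtf16CodeUnits(0x1D5DD); // U+1D5DD is the 𝗝 from the example
console.log(units.map(function (u) { return u.toString(16); }));

// The same two code units, read directly from the string:
console.log('𝗝'.charCodeAt(0).toString(16), '𝗝'.charCodeAt(1).toString(16));
```

Both log lines should show the same pair of code units, which is exactly what JavaScript sees as two separate "characters".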

Now let's return to our example. JavaScript treats every code unit as an individual character. This is not generally a problem, because characters in the Basic Multilingual Plane are each represented by a single code unit, and characters outside of it are rarely used (unless you happen to be localizing your app into cuneiform, Old Turkic or the like).

If you look carefully, you will see that the first two characters from our example are not ASCII letters but rather symbols from the Mathematical Alphanumeric Symbols block, which lies in a supplementary plane.

> '𝗝𝗦'.length
4

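Each of the two Unicode letters needs 2 code units (2 JavaScript "characters"), which is why substring doesn't work correctly: it cuts after the second code unit, which is the end of the first letter.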

When you really need to handle supplementary Unicode characters, you can use Punycode (which is bundled with Node.js). It converts strings to and from arrays of their Unicode code points.

> punycode = require('punycode');
> punycode.ucs2.decode('𝗝𝗦')
[ 120285, 120294 ]
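If you are curious what such a decoding step involves, the following is a rough sketch of the idea, written by hand for this article rather than taken from the Punycode source. The function name decodeCodePoints is hypothetical.

```javascript
// Sketch: split a string into code points by pairing up surrogates.
// decodeCodePoints is a hypothetical name used for illustration only.
function decodeCodePoints(string) {
  var output = [];
  var i = 0;
  while (i < string.length) {
    var value = string.charCodeAt(i++);
    if (value >= 0xD800 && value <= 0xDBFF && i < string.length) {
      // High surrogate: check whether a low surrogate follows.
      var extra = string.charCodeAt(i);
      if (extra >= 0xDC00 && extra <= 0xDFFF) {
        i++;
        // Combine the pair back into a single code point.
        value = ((value - 0xD800) << 10) + (extra - 0xDC00) + 0x10000;
      }
    }
    // Unpaired surrogates are kept as-is.
    output.push(value);
  }
  return output;
}

var codePoints = decodeCodePoints('𝗝𝗦 forever');
console.log(codePoints.length);      // counts code points, not code units
console.log(codePoints.slice(0, 2)); // the two math letters
```

Once the string is an array of code points, length and slicing behave the way you would expect for "real" characters.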

This is fine for working with individual characters, getting string length, etc. One area where it doesn't help is matching regular expressions. Here it's the same story: the dot (.) doesn't match a real Unicode character but only a single code unit. Continuing our example using math symbols from the Supplementary Multilingual Plane:

> '𝖠𝖑𝖒'.match(/𝖠.𝖒/)
null
> '𝖠𝖑𝖒'.match(/𝖠./)
[ '𝖠�', index: 0, input: '𝖠𝖑𝖒' ]

Nothing matched on the first attempt, and only half a character was matched the second time. The best solution is to use ES6, which comes with some nice Unicode improvements. One of them is a Unicode flag (u) for regular expressions. Here is the same example run through Babel, which compiles ES6 into ES5 for current Node implementations:

$ echo "console.log('𝖠𝖑𝖒'.match(/𝖠.𝖒/u))" | babel | nodejs
[ '𝖠𝖑𝖒', index: 0, input: '𝖠𝖑𝖒' ]
$ echo "console.log('𝖠𝖑𝖒'.match(/𝖠./u))" | babel | nodejs
[ '𝖠𝖑', index: 0, input: '𝖠𝖑𝖒' ]

Another Unicode-related ES6 feature is String.prototype.codePointAt, which fills basically the same role as Punycode's code point decoding.
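As a quick illustration, here is a small sketch that assumes an engine with ES6 string support. Besides codePointAt, it relies on two other standard ES6 additions that are not mentioned above: Array.from (which uses the code-point-aware string iterator) and String.fromCodePoint.

```javascript
// Assumes an engine with ES6 (ES2015) string support.
var text = '𝗝𝗦 forever';

// codePointAt reads a whole code point, even when it spans two code units.
console.log(text.codePointAt(0)); // code point of 𝗝

// The string iterator walks code points, so Array.from splits on them.
var chars = Array.from(text);
console.log(chars.length);               // code points; text.length counts code units
console.log(chars.slice(0, 2).join('')); // a code-point-aware "substring(0, 2)"

// String.fromCodePoint goes the other way, from numbers back to a string.
console.log(String.fromCodePoint(120285, 120294));
```

Note that codePointAt takes an index in code units, so calling it with index 1 lands in the middle of the first letter and returns only the low surrogate.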

If you can't use ES6, you can check out CSET. It converts supplementary characters in regular expressions into individual code units that today's JavaScript engines can understand.
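To give a feel for what "individual code units" means in a pattern, here is a hand-written sketch. It does not use CSET and does not claim to reproduce its API or output; the name ANY_CHAR is made up for this example.

```javascript
// Hand-written sketch; this is not CSET's API or its generated output.
// ANY_CHAR is a hypothetical name for a "dot" that treats a surrogate
// pair as one character and otherwise matches a single non-surrogate
// code unit. Unlike the real dot, it also matches line terminators.
var ANY_CHAR = '(?:[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]|[^\\uD800-\\uDFFF])';

// Equivalent in spirit to /𝖠.𝖒/u, but without the u flag.
var pattern = new RegExp('𝖠' + ANY_CHAR + '𝖒');
var match = '𝖠𝖑𝖒'.match(pattern);
console.log(match && match[0]);
```

The literal symbols in the pattern are simply sequences of two code units each, so only the wildcard needs special treatment here.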

If you are interested in reading more about advanced use of Unicode in JavaScript, stop by Mathias Bynens's blog. Mathias is the author of the Punycode library and has been thinking about and writing on this topic for a long time.

Roman Krejčík
