Library of The Week: Punycode (and Other Unicode Tips)
> '𝗝𝗦 forever'.substring(0,2)
'𝗝'
First, let's quickly review some basic facts about Unicode. Unicode identifies individual characters by numbers called "code points", conventionally written in hexadecimal, in the range from U+0000 to U+10FFFF. Various character encodings are available, but for our purposes UTF-16 is the most important, because it is the encoding that JavaScript strings expose. The Unicode character space is divided into 17 planes of 65536 code points each.
Characters in the first plane, called the Basic Multilingual Plane, are represented in UTF-16 using 2 bytes. Any other character needs 4 bytes. You may also come across the term "code unit", which refers to an atomic 2-byte chunk. So you can say that in UTF-16 each character is represented by 1 or 2 code units. The name UTF-16 refers to the fact that each code unit has 16 bits.
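The arithmetic behind those 1 or 2 code units can be sketched in a few lines. The helper names below (planeOf, toCodeUnits) are made up for illustration and belong to no library, and the sketch assumes its input is a valid, non-surrogate code point.

```javascript
// Hypothetical helpers for illustration only; plain ES5, no dependencies.

// Plane number: each plane holds 0x10000 (65536) code points.
function planeOf(codePoint) {
  return codePoint >> 16;
}

// UTF-16 code units for one code point (assumes 0 <= codePoint <= 0x10FFFF).
function toCodeUnits(codePoint) {
  if (codePoint < 0x10000) {
    return [codePoint];                 // Basic Multilingual Plane: one 16-bit unit
  }
  var offset = codePoint - 0x10000;     // 20 bits remain
  var high = 0xD800 + (offset >> 10);   // top 10 bits -> high surrogate
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits -> low surrogate
  return [high, low];
}

console.log(planeOf(0x1D5DD));      // 1
console.log(toCodeUnits(0x4A));     // [ 74 ]  (ASCII 'J')
console.log(toCodeUnits(0x1D5DD));  // [ 55349, 56797 ], i.e. 0xD835 0xDDDD
```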
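If you look carefully, you will see that the first two characters from our example are not ASCII letters but rather symbols from the Mathematical Alphanumeric Symbols block, which is located in a supplementary plane. Each of them therefore takes up two code units:
> '𝗝𝗦'.length
4
This is why substring doesn't work correctly: it counts code units, so substring(0,2) returns only the first character.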
When you really need to handle supplementary Unicode characters, you can use Punycode (which is bundled with Node.js). Its ucs2 helpers convert strings to and from arrays of Unicode code points.
> punycode = require('punycode');
> punycode.ucs2.decode('𝗝𝗦')
[ 120285, 120294 ]
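To see what such a decoder has to do, here is a minimal hand-written sketch of the idea. It is not Punycode's actual source: decodeCodePoints is a hypothetical name, and the function simply recombines surrogate pairs and passes every other code unit through unchanged.

```javascript
// Hypothetical stand-in for a code point decoder; plain ES5, no dependencies.
function decodeCodePoints(str) {
  var result = [];
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    var isHigh = unit >= 0xD800 && unit <= 0xDBFF;
    if (isHigh && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        // Valid surrogate pair: recombine into one supplementary code point.
        result.push(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
        i++; // skip the low surrogate that was just consumed
        continue;
      }
    }
    result.push(unit); // BMP character or unpaired surrogate, kept as-is
  }
  return result;
}

var points = decodeCodePoints('𝗝𝗦 forever');
console.log(points.length);      // 10 characters, although .length reports 12
console.log(points.slice(0, 2)); // [ 120285, 120294 ]
```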
Punycode is fine for working with individual characters, getting string length, etc. One area where it doesn't help is matching regular expressions. Here it's the same story: the dot (.) doesn't match a real Unicode character but only a single code unit. Continuing our example using math symbols from the Supplementary Multilingual Plane:
> '𝖠𝖡𝖢'.match(/𝖠.𝖢/)
null
> '𝖠𝖡𝖢'.match(/𝖠./)
[ '𝖠�', index: 0, input: '𝖠𝖡𝖢' ]
Nothing matched on the first attempt, and only half a character was matched the second time. The best solution is to use ES6, which comes with some nice Unicode improvements. One of them is a Unicode flag (u) for regular expressions. Here it is in action with Babel, which compiles ES6 into ES5 for current Node implementations:
$ echo "console.log('𝖠𝖡𝖢'.match(/𝖠.𝖢/u))" | babel | nodejs
[ '𝖠𝖡𝖢', index: 0, input: '𝖠𝖡𝖢' ]
$ echo "console.log('𝖠𝖡𝖢'.match(/𝖠./u))" | babel | nodejs
[ '𝖠𝖡', index: 0, input: '𝖠𝖡𝖢' ]
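On a runtime that understands the u flag natively, the same comparison runs without Babel. This is a small sketch under that assumption (recent Node.js releases qualify); the variable names are arbitrary.

```javascript
const sample = '𝖠𝖡𝖢';

// Without the flag, the dot consumes a single code unit.
const plain = sample.match(/𝖠.𝖢/);
// With the flag, the dot consumes a whole code point.
const unicode = sample.match(/𝖠.𝖢/u);

console.log(plain);       // null
console.log(unicode[0]);  // 𝖠𝖡𝖢

// Counting what the dot sees makes the difference explicit.
console.log(sample.match(/./g).length);   // 6 code units
console.log(sample.match(/./gu).length);  // 3 code points
```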
Another useful ES6 feature is String.prototype.codePointAt, which fills basically the same role as Punycode.
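As a rough illustration of how codePointAt can stand in for the decoder, the sketch below walks a string by code point. The helper names are invented for this example. It also leans on for...of iteration and String.fromCodePoint, two standard ES6 features beyond the one named above.

```javascript
// Hypothetical helpers built only on ES6 built-ins.

// Collect the code points of a string.
function codePoints(str) {
  const result = [];
  for (const ch of str) {            // ES6 string iteration goes by code point
    result.push(ch.codePointAt(0));
  }
  return result;
}

// substring() measured in code points instead of code units.
function substringByCodePoint(str, start, end) {
  return String.fromCodePoint(...codePoints(str).slice(start, end));
}

console.log('𝗝𝗦'.codePointAt(0));                       // 120285
console.log(codePoints('𝗝𝗦'));                          // [ 120285, 120294 ]
console.log(substringByCodePoint('𝗝𝗦 forever', 0, 2));  // 𝗝𝗦
```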
If you can't use