EDIT: Wow, much to my surprise this really blew up on Hacker News. There are some pretty interesting discussions happening. (Thanks Stan!)
Yup, it’s true. In JavaScript, "💩".length === 2. You can open up a Chrome debug console or a Node.js REPL and see for yourself. But why?! And why does '⛳'.length only equal 1?
It all comes down to codepoints and our friend, Unicode. If you’re a little rusty on the details of Unicode and character sets, stop now, and read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). It’s excellent, and I read it from time-to-time to refresh myself on details.
The next few paragraphs are summaries from this superb Javascript Unicode post by Mathias Bynens. It’s 5 years old, and sadly still true.
Anyway, the Unicode codepoint range goes from U+0000 to U+10FFFF, which is over 1 million codepoints (1,114,112 to be exact), and these are divided into 17 groups called planes. Each plane holds 65,536 codepoints (16^4). The first plane is the Basic Multilingual Plane (U+0000 through U+FFFF) and contains all the common symbols we use every day and then some. The rest of the planes require more than 4 hexadecimal digits and are called supplementary planes or astral planes. I have no idea if there’s a good reason for the name “astral plane.” Sometimes, I think people come up with these names just to add excitement to their lives.
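To make the plane idea concrete, here is a minimal sketch. The helper name planeOf is mine, not a built-in; it just divides the codepoint by 0x10000, since each plane spans 0x10000 codepoints.

```javascript
// Hypothetical helper: which Unicode plane does a codepoint fall in?
// Each plane spans 0x10000 (65,536) codepoints, so integer division is enough.
function planeOf(codePoint) {
  return Math.floor(codePoint / 0x10000);
}

// "A" (U+0041), flag in hole (U+26F3), and pile of poo (U+1F4A9)
const planeExamples = ["A", "\u26F3", "\u{1F4A9}"].map((s) => {
  const cp = s.codePointAt(0);
  return { hex: cp.toString(16), plane: planeOf(cp) };
});

// The first two land in plane 0 (the BMP); the last lands in plane 1.
console.log(planeExamples);
```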
The current largest codepoint? Why that would be a cheese wedge at U+1F9C0. 🧀 How did we ever communicate before this? EDIT: It turns out this isn’t quite accurate. I think cheese wedge is the highest codepoint emoji. On HN someone pointed out a few codepoints higher than this, namely those reserved for private uses, variation selector codepoints, and a CJK compatibility ideograph.
We can express characters in a few different ways: "A", "\u0041", "\x41", and "\u{41}" are all the same string. These are escape sequences. The \x form takes exactly two hexadecimal digits, so it only reaches U+0000 to U+00FF, a small slice of the Basic Multilingual Plane. The \u form with four hexadecimal digits covers the whole Basic Multilingual Plane, and with curly braces it can be used for any Unicode codepoint. The curly braces are required unless there are exactly 4 hexadecimal digits. This is for JavaScript by the way. Other languages (and HTML, with its character references) have their own sets of rules.
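Here is a quick way to check the escape forms yourself. It is only a sketch, and it compares each form against "A" separately, because a chained === in JavaScript compares the boolean result of the first comparison with the next string.

```javascript
// Four spellings of the same one-character string.
const forms = ["A", "\u0041", "\x41", "\u{41}"];

// Compare each spelling against the plain literal.
const allSame = forms.every((s) => s === "A");
console.log(allSame); // true

// Chaining does not do what it looks like:
// ("A" === "\u0041") is true, and true === "\x41" is false.
const chained = ("A" === "\u0041") === "\x41";
console.log(chained); // false
```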
And "💩" === "\u{1F4A9}". Unfortunately, this is also true: "💩" === "\uD83D\uDCA9". What is this nonsense? All astral codepoints can also be represented by “surrogate pairs”, and this is used for backwards compatibility reasons. JavaScript strings are sequences of 16-bit UTF-16 code units, and .length counts those units rather than codepoints. This is why "💩".length === 2. There’s a formula to calculate surrogates from astral codepoints, and vice versa.
Given a codepoint C greater than 0xFFFF, it corresponds to a surrogate pair <H,L>.
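The formula itself did not survive in this copy of the post, so here is a sketch of the standard UTF-16 conversion in both directions. The function names toSurrogatePair and fromSurrogatePair are my own, not built-ins.

```javascript
// Hypothetical helper: astral codepoint C -> surrogate pair [H, L].
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("only astral codepoints have surrogate pairs");
  }
  const offset = codePoint - 0x10000;                    // 20 bits remain
  const high = Math.floor(offset / 0x400) + 0xD800;      // top 10 bits
  const low = (offset % 0x400) + 0xDC00;                 // bottom 10 bits
  return [high, low];
}

// Hypothetical helper: surrogate pair <H, L> -> astral codepoint C.
function fromSurrogatePair(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}
```

In everyday code you rarely need this by hand: "💩".codePointAt(0) and String.fromCodePoint(0x1F4A9) do the same conversions.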
So, in our case:
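The worked numbers are missing from this copy, so the following is a reconstruction under the standard UTF-16 formula, using the names C, H, and L from the text.

```javascript
// Pile of poo is U+1F4A9.
const C = 0x1F4A9;

// High surrogate: top 10 bits of (C - 0x10000), offset into 0xD800.
const H = Math.floor((C - 0x10000) / 0x400) + 0xD800;

// Low surrogate: bottom 10 bits of (C - 0x10000), offset into 0xDC00.
const L = ((C - 0x10000) % 0x400) + 0xDC00;

console.log(H.toString(16)); // d83d
console.log(L.toString(16)); // dca9
```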
The .toString(16) converts the number to a hexadecimal string. You can see that the answers correspond to the "\uD83D\uDCA9" I had written above.
The whole reason I ran into this was because I was enforcing minimum password lengths, and I noticed that emoji counted as more than one character. In fact, if you paste the 💩 in a password field, you’ll see two bullets instead of one.
There’s an open Chromium bug for “Emoji in password fields appear as two bullets”. I filed the analogous Firefox and Safari bugs.
As an aside, here’s an article discussing the merits of emoji passwords.
So, is there a solution that counts symbols correctly? Bynens lists a couple possibilities. Array.from shows some promise. It’s succinct, works in Node, and is generally well supported across browsers, except IE11 and below, I think.
Array.from("💩").length === 1; //hooray!
however
Array.from("❤️").length === 2; //boooo!
From what I understand, ❤️ is made up of two codepoints: U+2764 and U+FE0F. The first is Heavy Black Heart. The second is Variation Selector 16, which changes the appearance of the preceding character! Ugh. In fact, U+FE00 through U+FE0F can all change the appearance of the previous character. In the case of this heart, only U+FE0F does. Other hearts (💙, 💚, 💛, 💜) each have their own codepoint, but the red heart requires two. I’m not sure why, but I asked on Stack Overflow.
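You can inspect the codepoints yourself. This is a small sketch that writes the red heart with explicit escapes, so the variation selector can't get lost in copy and paste.

```javascript
// Red heart: U+2764 followed by Variation Selector 16 (U+FE0F).
const redHeart = "\u2764\uFE0F";

// Array.from iterates by codepoint; map each one to its hex value.
const heartCodePoints = Array.from(redHeart, (ch) =>
  ch.codePointAt(0).toString(16)
);
console.log(heartCodePoints); // the two values are 2764 and fe0f

// Blue heart is a single astral codepoint, U+1F499.
const blueHeart = "\u{1F499}";
console.log(Array.from(blueHeart).length); // 1
```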
To accommodate this I can change my code to something horrible like this:
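The original snippet is missing from this copy. Going by the description in the text, it was probably close to this sketch; the name fancyCount is my guess, based on the later mention of fancyCount2.

```javascript
// Reconstruction, not the original code: count codepoints after
// stripping the variation selectors U+FE00 through U+FE0F.
function fancyCount(str) {
  // split on any variation selector, then glue the pieces back together
  const withoutSelectors = str.split(/[\ufe00-\ufe0f]/).join("");
  // Array.from iterates by codepoint, so surrogate pairs count once
  return Array.from(withoutSelectors).length;
}

console.log(fancyCount("\u2764\uFE0F")); // 1 (red heart)
console.log(fancyCount("\u{1F4A9}"));    // 1 (pile of poo)
```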
In .split(/[\ufe00-\ufe0f]/).join("") I’m using a regular expression to essentially remove all the variation selectors.
So, we’re done, right? Nope! The more I wrote in this blog post, the deeper the rabbit hole got. A sequence like woman, heart, kiss, woman is frequently rendered as a single emoji.
👩❤️💋👩 is created by having a Zero Width Joiner (\u{200D}) character between the component emojis. So we could do something like this:
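The original code is missing from this copy, so this is a sketch in the same spirit rather than the author's version. It keeps the name fancyCount2 from the text and assumes every joiner sits between two symbols that should merge into one.

```javascript
// Sketch, not the original code: count symbols, treating anything
// glued together by a Zero Width Joiner (U+200D) as one symbol.
function fancyCount2(str) {
  const joiner = "\u{200D}";
  const pieces = str.split(joiner);
  let count = 0;
  for (const piece of pieces) {
    // remove the variation selectors, then count codepoints
    count += Array.from(piece.split(/[\ufe00-\ufe0f]/).join("")).length;
  }
  // ASSUMPTION: each joiner merges its two neighbours into one symbol,
  // so every joiner reduces the visible count by one.
  return count - (pieces.length - 1);
}

// woman + ZWJ + heart + VS16 + ZWJ + kiss mark + ZWJ + woman
const kiss = "\u{1F469}\u200D\u2764\uFE0F\u200D\u{1F48B}\u200D\u{1F469}";
console.log(fancyCount2(kiss)); // 1
```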
Honestly, you should probably never use this. Not all browsers, UIs, etc. even render 👩❤️💋👩 as a single symbol. The code assumes the joiners are used between characters appropriately, which could be very problematic. During my research I noticed that U+200C (Zero Width Non-Joiner) also affects ligatures, by keeping adjacent characters from joining, and fancyCount2 isn’t accounting for that. That’s an easy adjustment, but I’m sure there are even more modifiers and joiners that I’ve never heard of. I know you’re on the edge of your seat waiting for the ultimate solution, but this rabbit hole is too deep! Sorry for the disappointment, but if you know of a more robust, comprehensive character counter, I’d love to hear from you!