What is Unicode or UTF-8

Intro

Unicode is a universal character set. This means that the standard defines, in one place, the characters needed for writing the majority of living languages, along with punctuation and other symbols such as emoji 😇.

In the past, a common character set was ASCII, which was very limited because only 7 bits of information were available for a character. This meant that just 128 characters could be used, and even the 8-bit extensions built on top of ASCII stop at 256. Not enough for a World Wide Web.

A code/number associated with a character is called a code point. The actual image used to draw that character is called a glyph. ‘A’ always has the same code point, 65, in both ASCII and Unicode, but its representation on the screen differs with the font used.
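As a small illustration of code points, the sketch below uses the standard `codePointAt` and `String.fromCodePoint` methods to move between a character and its number. The helper `toUPlus` is a hypothetical name introduced here only to print the conventional `U+XXXX` notation.

```javascript
// From character to code point and back again.
const codePoint = 'A'.codePointAt(0);        // 65
const character = String.fromCodePoint(65);  // 'A'

// Hypothetical helper: format a code point in U+XXXX notation.
function toUPlus(cp) {
  return 'U+' + cp.toString(16).toUpperCase().padStart(4, '0');
}

console.log(codePoint, character, toUPlus(codePoint)); // 65 A U+0041
console.log(toUPlus('😇'.codePointAt(0)));             // U+1F607
```

The glyph never appears in this code: how ‘A’ or 😇 is drawn is decided later by whichever font renders the text.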

Unicode character set

The first 65,536 code point positions in the Unicode character set constitute the Basic Multilingual Plane (BMP). It covers only a small part of the set, but it holds the most commonly used characters. The remaining code point positions, approximately a million of them, are referred to as supplementary characters.
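To make the BMP boundary concrete, here is a minimal sketch that checks which side of it a character falls on. `planeOf` and `isSupplementary` are hypothetical helper names; the only assumption is that each plane spans 65,536 (0x10000) code points, with plane 0 being the BMP.

```javascript
// Hypothetical helper: plane number of the first character in `char`.
// Plane 0 is the BMP; every plane spans 0x10000 (65,536) code points.
function planeOf(char) {
  return Math.floor(char.codePointAt(0) / 0x10000);
}

// Hypothetical helper: anything above U+FFFF lies outside the BMP.
function isSupplementary(char) {
  return char.codePointAt(0) > 0xFFFF;
}

console.log(planeOf('A'), isSupplementary('A'));   // 0 false
console.log(planeOf('€'), isSupplementary('€'));   // 0 false
console.log(planeOf('😇'), isSupplementary('😇')); // 1 true
```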

Unicode has 3 different encoding forms: UTF-8, UTF-16, and UTF-32. UTF-8 uses one byte for the ASCII set, two bytes for several more alphabets, three bytes for the rest of the BMP, and four bytes for supplementary characters.

UTF-16 uses two bytes for the BMP and four bytes for supplementary characters.

UTF-32 uses four bytes for the BMP and four bytes for supplementary characters.
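The byte counts for the three encoding forms can be checked in Node.js with a short sketch like the one below. `byteLengths` is a hypothetical helper. Node.js ships no UTF-32 encoder as far as this sketch assumes, so the UTF-32 size is computed as four bytes per code point rather than by encoding anything.

```javascript
// Hypothetical helper: bytes needed to store `text` in each encoding form.
function byteLengths(text) {
  return {
    utf8: Buffer.byteLength(text, 'utf8'),
    utf16: Buffer.byteLength(text, 'utf16le'),
    // UTF-32 is always four bytes per code point, so count code points.
    // Array.from iterates a string by code point, not by UTF-16 unit.
    utf32: Array.from(text).length * 4,
  };
}

// A (ASCII), é (two bytes in UTF-8), € (rest of the BMP), 😇 (supplementary)
const samples = ['A', '\u00E9', '\u20AC', '\u{1F607}'];
for (const ch of samples) {
  console.log(ch, byteLengths(ch));
}
// A  { utf8: 1, utf16: 2, utf32: 4 }
// é  { utf8: 2, utf16: 2, utf32: 4 }
// €  { utf8: 3, utf16: 2, utf32: 4 }
// 😇 { utf8: 4, utf16: 4, utf32: 4 }
```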

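Endianness refers to the ordering of bytes in a word. A big-endian system stores the most significant byte in the word at the smallest memory address, while a little-endian system stores the least significant byte there. This matters for UTF-16 and UTF-32, whose code units span more than one byte.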
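A minimal sketch of the two byte orders, using the 16-bit value 0x0041 (the UTF-16 code unit for ‘A’) and the Node.js `Buffer` methods `writeUInt16BE` and `writeUInt16LE`. The variable names are illustrative only.

```javascript
const codeUnit = 0x0041; // UTF-16 code unit for 'A'

// Big-endian: most significant byte at the smallest offset.
const bigEndian = Buffer.alloc(2);
bigEndian.writeUInt16BE(codeUnit, 0);

// Little-endian: least significant byte at the smallest offset.
const littleEndian = Buffer.alloc(2);
littleEndian.writeUInt16LE(codeUnit, 0);

console.log(bigEndian);    // <Buffer 00 41>
console.log(littleEndian); // <Buffer 41 00>

// Node's 'utf16le' encoding expects the little-endian layout.
console.log(littleEndian.toString('utf16le')); // A
```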

When in doubt, use UTF-8. It is simpler this way.

In order to interpret a file correctly, a program must know the file’s encoding. Character encoding detection is not a reliable process.
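To see why the encoding has to be known rather than guessed, the sketch below decodes the same bytes twice: once with the encoding they were written in and once with a wrong guess. The sample word and variable names are illustrative only.

```javascript
// 'café' encoded as UTF-8: the é takes two bytes (c3 a9).
const bytes = Buffer.from('caf\u00E9', 'utf8');
console.log(bytes); // <Buffer 63 61 66 c3 a9>

// Decoded with the right encoding, the text round-trips.
const correct = bytes.toString('utf8');

// Decoded as Latin-1, each byte becomes its own character.
const wrongGuess = bytes.toString('latin1');

console.log(correct);    // café
console.log(wrongGuess); // cafÃ©
```

Nothing in the bytes themselves says which reading is intended, which is why a wrong guess fails silently instead of raising an error.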

JavaScript source files can use any kind of encoding, but the code will be converted to UTF-16 internally. For data structures such as the Node.js Buffer, UTF-8 is usually the default.

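 const bt = Buffer.alloc(12);            // 12 zero-filled bytes
 bt.write('abcdef');                     // defaults to 'utf8'
 console.log(bt);                        // <Buffer 61 62 63 64 65 66 00 00 00 00 00 00>
 console.log(bt.toString());             // 'abcdef' followed by six NUL characters
 bt.write('abcdef', 0, 6, 'ascii');      // same bytes: ASCII is a subset of UTF-8
 console.log(bt);
 console.log(bt.toString());
 bt.write('abcdef', 0, 12, 'utf16le');   // two bytes per character, low byte first
 console.log(bt);                        // <Buffer 61 00 62 00 63 00 64 00 65 00 66 00>
 console.log(bt.toString());             // read back as UTF-8: a NUL follows every letter
 console.log(bt.toString('utf8'));       // same result, with the encoding given explicitly
 console.log(bt.toJSON());               // the same bytes as an array of decimal numbers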

For more details you can see Nick Gammon’s take on the subject.
