Text, Unicode, UTF Encodings and Confusions

This is one of those topics you need to understand even if you are not a programmer and just use a computer daily for various tasks. Let's start with basic definitions.

What is Encoding?

The dictionary meaning of encoding is "convert into a coded form." That's not very useful – what is a "coded form"? Let's make it simpler: in a generic sense, encoding is taking something in one form and using a set of rules to convert it into another form.

Why do we need encoding anyway?

Computers are wonderful, but the only thing they can actually store under the hood is bits – effectively ones and zeros. Whether you are storing an image, text, a number, or a binary blob makes no difference; it all turns into 1s and 0s. Computers use different encoding schemes to store each of these as bits – numbers, for example, are stored in one's complement or two's complement form. What we are talking about here is mostly the encoding schemes used for text. Considering how vital text is to the everyday activity of regular computer users, this topic matters even more. When you open up your computer's memory and examine its contents, the only thing you will see is a bunch of bit sequences. Without context, these bit sequences do not make sense. (Well, the same goes for the computer.)
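
Here is a quick Python sketch of that idea – the byte values are chosen purely for illustration, and the point is only that the very same bits read completely differently depending on the rules you apply:

# The same four bytes, interpreted with two different sets of rules
data = bytes([0x48, 0x69, 0x21, 0x21])
print(data.decode('ascii'))              # Hi!!        – the bits read as ASCII text
print(int.from_bytes(data, 'little'))    # 555837768   – the same bits read as an integer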

If you were tasked with designing something pretty simple for turning text into bits, the first step would be assigning a number to each character. Let's start with the alphabet: our character set now contains 26 characters, and if we want to distinguish uppercase versions, that makes it 52. Let's also add the digits 0-9, which brings the total to 62. Add punctuation, other symbols, and some whitespace characters (tab, space, line feed, ...) and in total you are looking at 128 characters. Not sure if this is a coincidence, but 128 is a power of 2, and we always love numbers that are powers of 2! We can support 128 characters with just 7 bits.

Here is what our character-to-number mapping looks like.

'0' => 0
'1' => 1
...
'A' => 10
'B' => 11
..
'Z' => 35
... rest of the characters

And here is how we can use it to map those characters to bits. Since 7 bits do not fill a whole byte, we always set the leftover (leftmost) bit to 0. A short Python sketch of this toy scheme follows the table.

0 = 0000 0000 (0)
... numbers 
9 = 0000 1001 (9)
A = 0000 1010 (10)
B = 0000 1011 (11)
... rest of the uppercase alphabet
Z = 0010 0011 (35)
... rest of charset
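
Here is a minimal Python sketch of this toy scheme; the character order (digits, then uppercase, then lowercase) is simply the assumption we made above, and the helper names are just for illustration:

import string

# Our toy character set: digits 0-9, uppercase A-Z, then lowercase a-z
CHARSET = string.digits + string.ascii_uppercase + string.ascii_lowercase

def encode_char(ch):
    code = CHARSET.index(ch)        # the number we assigned to the character
    return format(code, '08b')      # 7 bits of value, leftover bit always 0

def decode_bits(bits):
    return CHARSET[int(bits, 2)]    # back from bits to the character

print(encode_char('A'))             # 00001010  (10)
print(decode_bits('00100011'))      # Z         (35)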

We have created our own encoding scheme with support for 128 characters. In fact, this is sort of reinventing the wheel, as a very similar encoding already exists: ASCII.

ASCII

ASCII, abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Most modern character-encoding schemes are based on ASCII, although they support many additional characters.

Let's define the basic elements of ASCII encoding: the name of the encoding is ASCII, the size of the character set is 128, and the encoding rule is simple – take the number assigned to the character in the ASCII table, convert it to binary, and those 7 bits are the encoded version of the character.
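
One way to peek at these numbers is Python's built-in ord(), which returns a character's table number; format() then shows its 7-bit binary form:

# ASCII in practice: each character's table number, shown as 7 bits
for ch in 'A', 'a', '0', ' ':
    print(repr(ch), ord(ch), format(ord(ch), '07b'))
# 'A' 65 1000001
# 'a' 97 1100001
# '0' 48 0110000
# ' ' 32 0100000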

What is the downside of ASCII?

Well, everything is great – it is easy to understand and explain – but the biggest problem is that it does not cover much of the variety we have as "text". What about accented languages such as French and German, or scripts like Chinese, Russian, Hindi, and Arabic, with all the extra characters they need? We need to extend the character set. We are already using 7 bits and wasting one bit by always setting it to 0, which means we can add another 128 characters to our charset. Let me introduce you to Extended ASCII.

Extended ASCII

Extended ASCII is almost identical to ASCII but with more characters. Is it enough, though? I don't think so; we need more bits. Nothing changed in terms of encoding rules – we are simply using that eighth bit as well, which doubles the number of characters we can represent.
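
A quick Python sketch using Latin-1, one common 8-bit extension of ASCII, shows both the benefit and the limit of that single extra bit:

# One byte per character – but only for characters the 256-slot table contains
print('é'.encode('latin-1'))        # b'\xe9' – é fits in the extra 128 slots
try:
    '€'.encode('latin-1')           # the euro sign has no slot in this table
except UnicodeEncodeError as err:
    print(err)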

[Image: Extended ASCII table]

I believe that while all this was happening, there were lots of custom encoding schemes floating around to cover languages that do not map nicely onto ASCII or Extended ASCII, and that created a whole mess. I am not old enough to have seen those, but it must have been a dark era. ref: various encoding schemes

Unicode

Someone had enough of these encoding schemes, decided to fix it once and for all, and came up with something that covers all of the languages in existence plus leaves room for future extension: Unicode. I hinted in the previous sentence that Unicode is a new encoding scheme, and that is actually not true – it is one of the biggest misconceptions.

Unicode is just a standard that defines a gigantic table mapping characters to codepoints. Since codepoint values can get very big, they are generally written in hexadecimal. The standard does not say anything about how to represent those codepoints as bits – that is the job of an encoding scheme that supports Unicode.

[Image: the letter A and its Unicode codepoint, U+0041]

The Unicode character set has room for more than a million codepoints, which is more than enough to cover every character in use and then some. To represent a million codepoints you need at least 3 bytes, but let's face it, 3 bytes complicates things and we always like nice power-of-two sizes, so sticking to 4 bytes covers them all.
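
You can look codepoints up from Python as well – ord() returns the number a character maps to in the Unicode table, shown here as hexadecimal:

# Codepoints are just table numbers, usually written in hexadecimal
for ch in 'A', 'ñ', '€', '🙂':
    print(ch, hex(ord(ch)))
# A 0x41
# ñ 0xf1
# € 0x20ac
# 🙂 0x1f642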

UTF

UTF (Unicode Transformation Format) is the family of encoding schemes used for the Unicode standard. Depending on your needs there are various UTF encodings, and most of them are variable-length. UTF-8 uses a single byte for small codepoints and extends up to 4 bytes when necessary; this makes UTF-8 really handy, especially if you are sending text over the network, since plain ASCII text stays at 1 byte per character. UTF-16 sticks with at least a 2-byte representation and can extend to 4 bytes. UTF-32 is the fixed-length one: every character is represented as 4 bytes – not very space-efficient, but simple to encode and decode.
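
A short Python sketch makes the size differences concrete (the -le variants are used only to leave the byte-order mark out of the count):

# Bytes spent per character by each UTF encoding
for ch in 'A', '€', '🙂':
    print(ch,
          len(ch.encode('utf-8')),      # 1, 3, 4 bytes – variable length
          len(ch.encode('utf-16-le')),  # 2, 2, 4 bytes – variable length
          len(ch.encode('utf-32-le')))  # 4, 4, 4 bytes – fixed length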

All in all, when you see a bit sequence without any context, you cannot know whether it is a number, a photo, text, or something else. To make things more confusing, the same character can be represented by different bit sequences depending on which encoding scheme was used to persist it. If you happen to decode those bits with the wrong encoding scheme, what you get is a garbled representation. The reason is that different encoding schemes cover character sets of different sizes, and each uses different bit sequences to encode them.
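
Here is that garbling (often called mojibake) reproduced in a few lines of Python:

# The same bytes, decoded with the right and the wrong rules
data = 'café'.encode('utf-8')     # b'caf\xc3\xa9'
print(data.decode('utf-8'))       # café
print(data.decode('latin-1'))     # cafÃ©  – same bits, wrong rules, garbled text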