Character Encodings
Computers only understand bits, right? Then how can they display and work with text? Let's find out!
Lecture material
Recommended reading
- Character encodings for beginners
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
- ascii-code.com
  - Don't learn anything by heart, but look at the table at least once.
Character Encodings
Computer programs are pretty good at displaying text on the screen. A good example of that is the text you are currently reading 😀 But even though computers are good at displaying text on the screen, they can't actually store any text/characters in their memory, only numbers (or rather combinations of bits that represent different numbers). So how do they deal with this problem? Well, each type of character (`a`, `b`, `c` etc.) is mapped to a number (like `97`, `98`, `99` etc.), and the computer stores the number the character is mapped to. Then, when the character needs to be printed on the screen, the computer prints the character the number represents. An example of this is shown below.
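As a small illustration, here is a minimal Java sketch (Java is used for the examples on this page since it's also the language mentioned further down; the class name is just made up for the example) showing that a character and its number are two views of the same stored value:

```java
public class CharAsNumber {
    public static void main(String[] args) {
        char character = 'a';
        int number = character;     // the character is stored as the number 97
        char back = (char) number;  // mapping the number back gives 'a' again

        System.out.println(number); // 97
        System.out.println(back);   // a
    }
}
```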
An interesting question is: which characters are mapped to which numbers? Unfortunately, the answer is a bit more complicated than it needs to be these days, and this is also the reason why programmers need to be aware of character encodings at all.
Charset vs. character encoding
A charset (also known as a character set) is a set of characters. For example, we could define the charset `plocko` as the set of the characters `a`, `b` and `c`.

A character encoding is a mapping of the characters in a charset to numbers. For example, the `plocko` charset could be mapped according to the table shown below.
Number | Character |
---|---|
1 | a |
2 | b |
3 | c |
For one charset, there can exist multiple different character encodings. For example, the `plocko` charset could instead be mapped according to the table shown below.
Number | Character |
---|---|
1 | b |
2 | a |
3 | c |
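To make the difference concrete, here is a small Java sketch (the class and constant names are made up for this example) where the hypothetical `plocko` charset is given the two different encodings from the tables above:

```java
import java.util.Map;

public class PlockoEncodings {
    // Two different encodings of the same charset {a, b, c}.
    static final Map<Integer, Character> ENCODING_ONE = Map.of(1, 'a', 2, 'b', 3, 'c');
    static final Map<Integer, Character> ENCODING_TWO = Map.of(1, 'b', 2, 'a', 3, 'c');

    public static void main(String[] args) {
        // The same stored number decodes to different characters
        // depending on which encoding is used.
        System.out.println(ENCODING_ONE.get(1)); // a
        System.out.println(ENCODING_TWO.get(1)); // b
    }
}
```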
It's always important to use the correct names for things to avoid confusion. Unfortunately, many programmers treat charsets and character encodings as one and the same. For example, Java has a class called `StandardCharsets`, but what it defines are really character encodings. It's the same with the `charset` parameter in HTTP's `Content-Type` header; it specifies which character encoding to use, not which charset.
Please contribute to a better world by using the correct names for things, instead of making it harder for the rest of us.
Not a big problem
In practice, it's not a big deal that some people confuse charsets with character encodings, and vice versa, because most often a charset only has one character encoding.
The problem with character encodings
Each time you write some text, it is written using a character encoding, so the computer knows which number each character you type should be mapped to. If you store the text (the numbers your typed characters have been mapped to) in a file and later want to open the file to read the text, it's crucial that the computer uses the same character encoding to map the numbers back into characters; otherwise the wrong characters will be displayed on the screen after opening the file. The same problem exists when you send a file or an email to a friend.
So it's crucial that one and the same text uses the same character encoding all the time. Unfortunately, there exist many different character encodings, and often the text/file itself doesn't contain any information about which character encoding was used to create it, so sometimes files are opened with the wrong character encoding and display characters that don't make sense. A common way to discover that the wrong character encoding has been used somewhere is that the text contains the `�` symbol or simply is unreadable.
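Here is a small Java sketch of that situation: the text is turned into numbers (bytes) with one character encoding, but turned back into text with another one:

```java
import java.nio.charset.StandardCharsets;

public class WrongEncoding {
    public static void main(String[] args) {
        // Encode some Swedish text with ISO Latin-1 (one byte per character)...
        byte[] stored = "på svenska".getBytes(StandardCharsets.ISO_8859_1);

        // ...but decode the very same bytes with UTF-8 instead.
        String opened = new String(stored, StandardCharsets.UTF_8);

        // The byte used for 'å' is not valid UTF-8 on its own,
        // so it's replaced with the � symbol.
        System.out.println(opened); // p� svenska
    }
}
```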
ASCII
As computers have evolved, so have character encodings. The earliest de facto standard character encoding was the American Standard Code for Information Interchange (ASCII). In this encoding, each character is mapped to a 7-bit number, so it can map 2⁷ = 128 different characters to numbers (between 0 and 127). The table below shows the mapping.
Number | Character |
---|---|
0-31 | Control characters (not displayed on the screen, but used to control how machines work, kind of). |
32 | (space) |
33 | ! |
34 | " |
35 | # |
36 | $ |
37 | % |
38 | & |
39 | ' |
40 | ( |
41 | ) |
42 | * |
43 | + |
44 | , |
45 | - |
46 | . |
47 | / |
48 | 0 |
49 | 1 |
50 | 2 |
51 | 3 |
52 | 4 |
53 | 5 |
54 | 6 |
55 | 7 |
56 | 8 |
57 | 9 |
58 | : |
59 | ; |
60 | < |
61 | = |
62 | > |
63 | ? |
64 | @ |
65 | A |
66 | B |
67 | C |
68 | D |
69 | E |
70 | F |
71 | G |
72 | H |
73 | I |
74 | J |
75 | K |
76 | L |
77 | M |
78 | N |
79 | O |
80 | P |
81 | Q |
82 | R |
83 | S |
84 | T |
85 | U |
86 | V |
87 | W |
88 | X |
89 | Y |
90 | Z |
91 | [ |
92 | \ |
93 | ] |
94 | ^ |
95 | _ |
96 | ` |
97 | a |
98 | b |
99 | c |
100 | d |
101 | e |
102 | f |
103 | g |
104 | h |
105 | i |
106 | j |
107 | k |
108 | l |
109 | m |
110 | n |
111 | o |
112 | p |
113 | q |
114 | r |
115 | s |
116 | t |
117 | u |
118 | v |
119 | w |
120 | x |
121 | y |
122 | z |
123 | { |
124 | | |
125 | } |
126 | ~ |
127 | DELETE (control character) |
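If you want to see the mapping in action, the following Java sketch encodes a short string with ASCII and prints the numbers that would actually be stored:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiNumbers {
    public static void main(String[] args) {
        // Each character is replaced by its number from the ASCII table.
        byte[] stored = "Hi!".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.toString(stored)); // [72, 105, 33]
    }
}
```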
Although ASCII contains most characters Americans use and can be used within America without problems, it's not optimal for the rest of the world. For example, here in Sweden we also have the characters `å`, `ä`, `ö`, `Å`, `Ä` and `Ö` in our alphabet, but we can't use these characters in this character encoding. We usually solved that by using `a`, `o`, `A` and `O` instead, and hoped the reader would understand that we actually meant the other characters, but it was not a particularly good solution.
Computers usually work in units of bytes (8 bits), so characters stored in ASCII usually waste one bit. With that extra bit, 128 additional characters could be used (mapped to the numbers 128-255). To use that extra bit, new character encodings were created that extended ASCII, meaning that the numbers 0-127 map to the same characters as in ASCII, while the numbers 128-255 map to entirely new characters.
ISO Latin-1
ISO Latin-1 is one of the character encodings that extend ASCII. It's commonly used in Europe since it contains some extra characters that are used in many European countries. The table below shows some of these extra characters.
Number | Character |
---|---|
0-127 | Same as in ASCII. |
163 | £ |
196 | Ä |
197 | Å |
214 | Ö |
220 | Ü |
223 | ß |
228 | ä |
229 | å |
246 | ö |
This character encoding has been heavily used here in Sweden, because we can use all of our special characters in it.
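A small Java sketch confirming the table above: the Swedish characters each fit in a single byte in ISO Latin-1, with exactly the numbers listed:

```java
import java.nio.charset.StandardCharsets;

public class Latin1Swedish {
    public static void main(String[] args) {
        byte[] stored = "Åäö".getBytes(StandardCharsets.ISO_8859_1);
        for (byte b : stored) {
            // Bytes are signed in Java, so mask with 0xFF to get the number 0-255.
            System.out.println(b & 0xFF); // 197, 228, 246
        }
    }
}
```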
ISO Latin-2
ISO Latin-2 is an example of another character encoding that extends ASCII. The table below shows some of the extra characters it adds.
Number | Character |
---|---|
0-127 | Same as in ASCII. |
163 | Ł |
196 | Ä |
197 | Å |
214 | Ö |
220 | Ü |
223 | ß |
228 | ä |
229 | å |
246 | ö |
Example
As you can see, ISO Latin-2 is quite similar to ISO Latin-1. One difference is that the number 163 maps to different characters. So if you save the text `Ä£Ö` in ISO Latin-1 and then open it in ISO Latin-2, it will be displayed as `ÄŁÖ`! This is a good example of why it's important to save and open text using the same character encoding.
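The same mix-up can be reproduced in a few lines of Java (assuming the JDK you use ships the ISO-8859-2 charset, which standard JDKs do):

```java
import java.nio.charset.Charset;

public class Latin1VsLatin2 {
    public static void main(String[] args) {
        // "Save" the text using ISO Latin-1...
        byte[] stored = "Ä£Ö".getBytes(Charset.forName("ISO-8859-1"));

        // ...and "open" it using ISO Latin-2 instead.
        String opened = new String(stored, Charset.forName("ISO-8859-2"));

        System.out.println(opened); // ÄŁÖ
    }
}
```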
Other extensions to ASCII
In addition to ISO Latin-1 and ISO Latin-2, there exist many other character encodings that extend ASCII, such as Windows-1252 and Windows-1250.
Unicode
It is quite hard to work with text when you need to keep track of which character encoding to use. The Unicode project was started to solve this problem. It defines a charset that contains more or less all characters in the world, as well as several different character encodings that map the characters to numbers.
UTF-32
In the character encoding UTF-32, each character is mapped to a 32-bit (4-byte) number. This makes it easy to understand how it works, but each character takes four times more space than in ASCII. Therefore this character encoding is rarely used.
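Because every character takes exactly 4 bytes, counting characters in UTF-32 is just a division, as this Java sketch shows (it uses the big-endian variant UTF-32BE to keep the byte count straightforward, and assumes your JDK ships that charset, which common JDKs do):

```java
import java.nio.charset.Charset;

public class Utf32Size {
    public static void main(String[] args) {
        // In UTF-32 every character takes 4 bytes.
        byte[] stored = "abc".getBytes(Charset.forName("UTF-32BE"));

        System.out.println(stored.length);     // 12 bytes
        System.out.println(stored.length / 4); // 3 characters
    }
}
```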
UTF-8
In the character encoding UTF-8, each character is mapped to 8 bits (1 byte), 16 bits (2 bytes), 24 bits (3 bytes) or 32 bits (4 bytes). The characters used in ASCII (the most commonly used characters) are mapped to 8 bits the very same way as in ASCII, so UTF-8 is backward compatible with ASCII (you can save a text in ASCII and then open it in UTF-8 and read it correctly). The less commonly used characters are mapped to 16, 24 or 32 bits. This way the size of the text stays quite small (since the most common characters only take 8 bits), and it's still possible to use all the less common characters in the text.
The downside of UTF-8 compared to UTF-32 is that the text is a bit harder to process, since each character is encoded with either 1, 2, 3 or 4 bytes. For example, to figure out how many characters a UTF-32 string contains, you just divide the number of bytes in it by 4. In UTF-8 you need to go through the string byte by byte and count each character you come across, which takes more time.
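The following Java sketch shows both properties: different characters take a different number of bytes in UTF-8, so counting characters means walking through the text rather than dividing the byte count (note that the source file itself must then be saved in an encoding that can represent these characters, such as UTF-8):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Size {
    public static void main(String[] args) {
        // ASCII characters take 1 byte in UTF-8; other characters take 2-4 bytes.
        System.out.println("a".getBytes(StandardCharsets.UTF_8).length);  // 1
        System.out.println("ä".getBytes(StandardCharsets.UTF_8).length);  // 2
        System.out.println("€".getBytes(StandardCharsets.UTF_8).length);  // 3
        System.out.println("😀".getBytes(StandardCharsets.UTF_8).length); // 4

        String text = "aä€😀";
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length); // 10 bytes...
        System.out.println(text.codePointCount(0, text.length()));        // ...but 4 characters
    }
}
```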
Which encoding to use?
These days you should most often use UTF-8. It's the default encoding in more and more applications. Optimally, everyone would use the same character encoding, but today there still exist many old applications that use ASCII or one of the character encodings extending ASCII, so you still need to be aware of character encodings when you work with those applications.
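In practice this means you should state the character encoding explicitly whenever your code reads or writes text, instead of relying on some default. A minimal Java sketch (Java 11+; the file name is just made up for the example):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8Files {
    public static void main(String[] args) throws Exception {
        Path file = Path.of("hello.txt"); // hypothetical file name

        // Always say which character encoding to use when writing text...
        Files.writeString(file, "Hej på dig!", StandardCharsets.UTF_8);

        // ...and use the same one when reading it back.
        String text = Files.readString(file, StandardCharsets.UTF_8);
        System.out.println(text); // Hej på dig!
    }
}
```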