Ha.nnes.dev
Why ASCII is worse and when to use it
Also, announcing roc-ascii
2024-09-04
I recently released an ASCII library for the Roc language, but why? Isn’t ASCII worse than UTF-8?
Why ASCII is worse than UTF-8
Limited to only ASCII characters:
Roc’s builtin Str type is encoded using UTF-8, which is designed to support all the world’s writing systems, whereas ASCII only supports the letters of the English alphabet (A-Z and a-z), Arabic digits (0-9), some punctuation characters and 32 control characters (more on them later). If your data is user generated, you should probably use a Str, especially if the string is a person’s or a place’s name.
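To make the limitation concrete, here is a minimal sketch in Python (used purely for illustration; it is not Roc and does not reflect roc-ascii's API). The helper name `to_ascii_or_none` is hypothetical.

```python
from typing import Optional


def to_ascii_or_none(text: str) -> Optional[bytes]:
    """Return the ASCII encoding of text, or None if it has non-ASCII characters."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return None


print(to_ascii_or_none("Zurich"))       # b'Zurich'
print(to_ascii_or_none("Z\u00fcrich"))  # None, because ü is outside the ASCII range
```

A place name as ordinary as Zürich already falls outside ASCII, which is why user-generated data is usually safer in a UTF-8 string.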
Why you might want to use ASCII
Simpler definition of a character:
When using UTF-8, the term “character” can be ambiguous, so “codepoints” and “graphemes” are used instead. Codepoints are the smallest unit of UTF-8 data, and unlike ASCII, where each character is encoded using one byte, UTF-8 is a variable-length encoding, so a single codepoint like the egg emoji (🥚) is encoded as four bytes. UTF-8 allows combining codepoints to form graphemes, which are the smallest units of meaning in a language. For example, the combining diaeresis codepoint (◌̈) combines with the codepoint before it to create a single grapheme, like in the string “Röc” which contains 5 bytes, 4 codepoints and 3 graphemes. The woman technologist: medium-dark skin tone emoji (👩🏾‍💻) is a single grapheme made of 15 bytes and four codepoints: the woman emoji (👩), the medium-dark skin tone modifier (🏾), a zero width joiner (U+200D) and the laptop emoji (💻). ASCII avoids all of this complexity by not supporting characters outside the ASCII range.
Ability to index directly into an ASCII string:
Because of codepoints like the combining characters, getting the nth character can be tricky. Getting the nth grapheme involves knowing how all the possible codepoints combine to create distinct graphemes, which can be slow. Getting the nth codepoint can split a grapheme like ö into a lowercase letter o codepoint and a combining diaeresis codepoint (◌̈), which doesn’t make sense. Because of UTF-8’s variable-length encoding, getting the nth byte might mean getting a byte from the middle of a codepoint, resulting in invalid UTF-8 data. Getting the nth character in an ASCII string is the same as getting the nth byte in the string.
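The byte and codepoint counts above, and the indexing pitfalls, can be checked with a short Python sketch. Python is used only for illustration here (it is not Roc), and the helper names `utf8_len`, `codepoint_len` and `is_valid_utf8` are hypothetical. Counting graphemes is left out because, as far as I know, Python's standard library has no grapheme segmentation.

```python
def utf8_len(text: str) -> int:
    """Number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


def codepoint_len(text: str) -> int:
    """Number of codepoints; a Python str is a sequence of codepoints."""
    return len(text)


def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


roc = "Ro\u0308c"  # R, o, combining diaeresis, c
technologist = "\U0001F469\U0001F3FE\u200D\U0001F4BB"  # woman, skin tone, ZWJ, laptop

print(utf8_len(roc), codepoint_len(roc))                    # 5 4
print(utf8_len(technologist), codepoint_len(technologist))  # 15 4

# Indexing by codepoint splits the ö grapheme into its two parts.
print(roc[1] == "o", roc[2] == "\u0308")  # True True

# Cutting after the third byte lands in the middle of the diaeresis codepoint.
print(is_valid_utf8(roc.encode("utf-8")[:3]))  # False

# In ASCII, the nth byte is the nth character.
ascii_bytes = "Roc".encode("ascii")
print(chr(ascii_bytes[1]))  # o
```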
Some functions are undefined for UTF-8 strings:
The string
‫أنا أحب‪Roc™
(I love Roc™) contains the invisible left-to-right embedding character (U+202A) to correctly show Latin letters in an Arabic string. If you tried to reverse that string, how would you handle the invisible embedding character? The answer is that the reverse function is undefined for Unicode strings. Similarly, it’s not possible to define uppercase and lowercase transformations for Unicode strings that are each other’s inverse. This is because the ı character (lowercase dotless i) is normally uppercased to I (capital i), and then lowercased to i (lowercase dotted i), changing the character. However, when using the Turkish or Azerbaijani locales, I is lowercased to ı. Some of this complexity is sidestepped when using ASCII, as it doesn’t support any of these characters, but upper and lowercase functions aren’t always well-defined when using ASCII. For example, in Dutch, the digraph IJ is treated like a single letter when changing case, so the word ijswafel (ice-cream sandwich) should be capitalised as IJswafel.
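Here is a rough Python illustration of these cases (again, not Roc; the helper name `case_round_trip` is hypothetical). It relies on Python's built-in `str.upper`, `str.lower` and `str.capitalize`, which do not take a locale into account.

```python
def case_round_trip(text: str) -> str:
    """Uppercase text and then lowercase the result."""
    return text.upper().lower()


dotless_i = "\u0131"  # ı, lowercase dotless i

# ı uppercases to I, which lowercases to i, so the round trip changes the character.
print(case_round_trip(dotless_i) == dotless_i)  # False
print(case_round_trip(dotless_i))               # i

# A lowercase ASCII word survives the same round trip.
print(case_round_trip("roc") == "roc")  # True

# Reversing by codepoint moves the diaeresis from the o onto the c.
reversed_roc = "Ro\u0308c"[::-1]
print(reversed_roc == "c\u0308oR")  # True

# A plain ASCII capitalisation does not know about the Dutch IJ digraph.
print("ijswafel".capitalize())  # Ijswafel
```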
Except for the control characters
The first 32 ASCII characters are mostly non-printable characters like the “end of transmission block” character or the “bell” character (which used to ring a physical bell on teleprinters). The most commonly found control characters today are the “null” character, which is used to terminate strings in languages like C; the “horizontal tab” character (\t), which is displayed as horizontal whitespace; the “line feed” character (\n), which starts a new line; and the “carriage return” character (\r), which returns to the start of a line on UNIX systems and on Windows is used with the line feed character to separate lines of text (\r\n). All the control characters can appear in both UTF-8 and ASCII strings, and can undermine some of the benefits of ASCII mentioned earlier. For example, when rendering the ASCII string abc␇␇␇ in the terminal, it will look like it contains three characters, but the extra ASCII bell characters bring the total length to six.
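The bell example can be reproduced with a small Python sketch (for illustration only; `printable_len` is a hypothetical helper, and `\a` is Python's escape for the ASCII bell character):

```python
def printable_len(text: str) -> int:
    """Count only the characters that Python considers printable."""
    return sum(1 for ch in text if ch.isprintable())


text = "abc\a\a\a"  # three letters followed by three bell characters

print(text.isascii())       # True: every character is in the ASCII range
print(len(text))            # 6
print(printable_len(text))  # 3
```

If the visible length matters, one option is to reject or strip control characters when the ASCII string is created, rather than at display time.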
TLDR: If you know that your data will only ever contain characters in the ASCII range, then using ASCII will probably be simpler. However, using ASCII doesn’t remove all complexity from string handling, so you still need to be careful.