Ha.nnes.dev
Fuzzing is fun
I tried fuzz testing and found a crash
2024-09-27
The grapheme splitting problem
I’ve written
before about how complex Unicode can be. For example, to split a
string into its graphemes (the smallest visual units of a string),
requires handling thousands of edge cases. Even though the
🫱🏾🫲🏻
emoji (Handshake: Medium-Dark Skin Tone, Light Skin
Tone) appears as one grapheme, it’s made of five codepoints
(🫱
the rightwards hand, 🏾
a medium-dark skin
tone modifier, an invisible zero width joiner, 🫲
a
leftwards hand and 🏻
a light skin tone modifier). The
Unicode specification goes into excruciating
(but necessary) detail about how this grapheme splitting algorithm
should work.
Roc’s Unicode
package implements functions for UTF-8 strings, including the
Grapheme.split
function. The current implementation manages
to pass all the official
Unicode tests, but there’s still a branch in the code where the
function is hardcoded to crash. One of the authors, Luke Boswell reached
out to me to try and find a test case that would trigger this code
branch so that it could be implemented properly.
Finding the crash
One way to find bugs in situations like this is fuzzing, which means sending semi-random input to your program until it crashes. Fuzzers can use existing input examples to generate new inputs, or analyse your code to craft inputs that are more likely to cause crashes.
I used the radamsa
fuzzer, which takes some example inputs and mutates using predefined
rules. Simple mutations might be things like mutating characters, adding
whitespace or repeating data. A more involved mutation could take the
example input beans123
and change the number to try and
crash a number parser by changing it to the maximum value of an unsigned
64-bit integer, so the output string would be
beans18446744073709551616
.
I used the existing Unicode test suite as input examples for the
fuzzer, these test cases include tricky edge cases like ਃ가
(a Gurmukhi combining diacritic followed by a Hangul syllable). I wrote
a short Roc app that just runs Grapheme.split
on its input
and set up a loop that pipes the fuzzer output to the Roc app and dumps
any errors to a text file. After an hour of running I managed to get 30
test cases that crashed the Roc app. Some of the test cases were short
like ใ␁ᅠऀ
(A Thai character, an invisible start of heading
character, a Hangul filler character, an invisible zero width joiner and
a Devanagari diacritic). Other test cases were megabytes long, clearly
radamsa has a mutation that tries to send way more input than the
program is expecting. 😅 Looking over all the new test cases, they were
all situations where unexpected things were being joined with a zero
width joiner character. I opened an issue on the
roc-unicode
repo explaining my findings.
What next?
Roc doesn’t have an integrated fuzzer yet, so I had to use a generic fuzzer that could communicate through stdin/stdout or through files, but other languages do have more integrated fuzzers. The more advanced fuzzers can analyse which code paths are being taken, and craft specific inputs to explore your code more thoroughly. For more information about fuzzing, check out this episode of the Software Unscripted podcast with Brendan Hansknecht, it’s fascinating how advanced this technique can get.