LLVM LibFuzzer is an in-process, coverage-guided, evolutionary fuzzing engine. It feeds a series of fuzzed inputs to a user-provided "target" function and attempts to trigger crashes, memory bugs and undefined behavior.
A fuzzer is a program that consists of the aforementioned target function linked against the LibFuzzer runtime. The LibFuzzer runtime provides the entry-point to the program and repeatedly calls the target function with generated inputs.
Support for LibFuzzer is implemented as a compiler flag in Clang. WiredTiger's build configuration checks whether the compiler in use supports -fsanitize=fuzzer-no-link and, if so, builds the fuzz tests.
Compiling with Clang's Address Sanitizer isn't mandatory but it is recommended since fuzzing often exposes memory bugs.
Each fuzzer is a program under build/test/fuzz/. WiredTiger provides the test/fuzz/fuzz_run.sh script to quickly get started using a fuzzer. It performs a limited number of runs, automatically cleans up after itself between runs and provides a sensible set of parameters that users can add to. For example:
In general the usage is:
Each fuzzer will produce a few outputs:
crash-<input-hash>: If an error occurs, a file will be produced containing the input that crashed the target.
fuzz-N.log: The LibFuzzer log for worker N, where N is an ID that LibFuzzer assigns to each worker, ranging from 0 to the number of workers minus 1.
WT_TEST_<pid>: The home directory for a given worker process.
WT_TEST_<pid>.profraw: If the fuzzer is running with Clang coverage (more on this later), files containing profiling data for a given worker will be produced. These will be used by fuzz_coverage.
LibFuzzer is a coverage-based fuzzer, meaning that it notices when a particular input hits a new code path and adds that input to a corpus of interesting inputs. It then mutates existing entries in the corpus to come up with new inputs.
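The feedback loop described above can be illustrated with a toy model. The sketch below is not how LibFuzzer is implemented: toy_target, toy_fuzz, next_rand and the fixed-size "edge" array are hypothetical names standing in for the instrumented target, the engine, its random source and the coverage map. It only shows the idea of keeping an input when it reaches a code path that no earlier input reached.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_CORPUS 16
#define INPUT_LEN 4
#define NUM_EDGES 5

/* Hypothetical target: every matched prefix byte counts as one more code path. */
static void toy_target(const uint8_t *data, bool hit[NUM_EDGES])
{
    static const uint8_t magic[INPUT_LEN] = {'F', 'U', 'Z', 'Z'};
    hit[0] = true; /* Function entry is always reached. */
    for (size_t i = 0; i < INPUT_LEN && data[i] == magic[i]; i++)
        hit[i + 1] = true;
}

/* Small deterministic generator so the sketch is reproducible. */
static uint32_t next_rand(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;
}

/* Run the loop and return the corpus size; seen[] reports the paths reached. */
static size_t toy_fuzz(size_t iterations, bool seen[NUM_EDGES])
{
    uint8_t corpus[MAX_CORPUS][INPUT_LEN];
    size_t corpus_len = 1;
    uint32_t rng = 1;

    memset(corpus[0], 0, INPUT_LEN); /* Seed corpus: one all-zero input. */
    memset(seen, 0, NUM_EDGES * sizeof(bool));

    for (size_t iter = 0; iter < iterations; iter++) {
        uint8_t candidate[INPUT_LEN];
        bool hit[NUM_EDGES] = {false};
        bool is_new = false;

        /* Pick an existing corpus entry and mutate a single byte. */
        memcpy(candidate, corpus[next_rand(&rng) % corpus_len], INPUT_LEN);
        candidate[next_rand(&rng) % INPUT_LEN] = (uint8_t)next_rand(&rng);

        toy_target(candidate, hit);

        /* Keep the candidate only if it reached a path not seen before. */
        for (size_t e = 0; e < NUM_EDGES; e++)
            if (hit[e] && !seen[e]) {
                seen[e] = true;
                is_new = true;
            }
        if (is_new && corpus_len < MAX_CORPUS)
            memcpy(corpus[corpus_len++], candidate, INPUT_LEN);
    }
    return corpus_len;
}
```

Because later mutations start from inputs that already reached deeper paths, the corpus tends to accumulate progressively more interesting inputs.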
While LibFuzzer will automatically add to its corpus when it finds an interesting input, some fuzz targets (especially those that expect data in a certain format) require an initial corpus in order to be effective. The fuzz_config fuzzer is one example: it expects its data to be sliced with a separator, so the fuzzing engine needs some examples to guide it. The corpus is supplied as the first positional argument to both fuzz_run.sh and the underlying fuzzer itself. For example:
Creating a fuzzer with LLVM LibFuzzer requires an implementation of a single function called LLVMFuzzerTestOneInput. It has the following prototype: int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size);
When supplied with the -fsanitize=fuzzer flag, Clang links in LibFuzzer's own main function, which repeatedly calls the provided LLVMFuzzerTestOneInput with various inputs. There is a lot of information online about best practices when writing fuzzing targets, but to summarize, the requirements are much like those for writing unit tests: targets should be fast, deterministic and as stateless as possible. The LLVM LibFuzzer reference page is a good place to start to learn more.
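As a minimal sketch of a target with these properties, the following passes the raw input straight to a function under test. The function checksum_bytes is a hypothetical stand-in and is not part of WiredTiger; a real target would call the WiredTiger API it wants to exercise instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical function under test: a simple byte-wise checksum. */
static uint32_t checksum_bytes(const uint8_t *data, size_t size)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < size; i++)
        sum = sum * 31u + data[i];
    return sum;
}

/* Entry point that the LibFuzzer runtime calls once per generated input. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
{
    /* No global state is touched, so every call is independent of the last. */
    uint32_t first = checksum_bytes(data, size);
    uint32_t second = checksum_bytes(data, size);

    /* A failed assertion aborts the process, which the fuzzer reports as a crash. */
    assert(first == second);

    /* LibFuzzer reserves return values other than 0 and -1. */
    return 0;
}
```

The target defines no main function of its own, since the LibFuzzer runtime provides the program's entry point.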
WiredTiger has a small fuzz utilities library containing common functionality required for writing fuzz targets. Most of the important functions here are about manipulating fuzz input into something that targets can use with the WiredTiger API.
If the function that a target is testing can accept a binary blob of data, then the target will be straightforward as it'll more or less just pass the input to the function under test. But for functions that require more structured input, this can pose a challenge. As an example, the fuzz_config target requires two inputs that can be variable length: a configuration string and a key to search for. In order to do this, the target can use an arbitrary separator sequence to split the input into multiple parts with the following API:
Using this API, the target can supply a data buffer, a separator sequence and the number of inputs it needs. If it doesn't find the right number of separators in the provided input, fuzzutil_sliced_input_init will return false and the target should return from the LLVMFuzzerTestOneInput function. While it may seem that the target will reject a lot of input, the fuzzing engine is able to figure out (especially if an initial corpus is supplied) that inputs with the right number of separators tend to yield better code coverage, and it will bias its generated inputs towards this format.
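The splitting technique itself can be sketched in a few lines of C. This is an illustration only and not WiredTiger's implementation: sliced_input, find_sep and slice_input are hypothetical names, and the choice to reject inputs that contain extra separators is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_SLICES 8

/* Hypothetical result type: slices point into the caller's buffer, nothing is copied. */
typedef struct {
    const uint8_t *slices[MAX_SLICES];
    size_t sizes[MAX_SLICES];
    size_t count;
} sliced_input;

/* Return the first occurrence of sep within data, or NULL if there is none. */
static const uint8_t *find_sep(
  const uint8_t *data, size_t size, const uint8_t *sep, size_t sep_size)
{
    if (sep_size == 0 || size < sep_size)
        return NULL;
    for (size_t i = 0; i + sep_size <= size; i++)
        if (memcmp(data + i, sep, sep_size) == 0)
            return data + i;
    return NULL;
}

/* Split data into exactly req_slices parts; return false if that is not possible. */
static bool slice_input(sliced_input *out, const uint8_t *data, size_t size,
  const uint8_t *sep, size_t sep_size, size_t req_slices)
{
    const uint8_t *pos = data, *end = data + size, *hit;

    out->count = 0;
    if (req_slices == 0 || req_slices > MAX_SLICES)
        return false;

    /* Every slice except the last one ends at a separator. */
    while (out->count + 1 < req_slices) {
        hit = find_sep(pos, (size_t)(end - pos), sep, sep_size);
        if (hit == NULL)
            return false; /* Too few separators. */
        out->slices[out->count] = pos;
        out->sizes[out->count] = (size_t)(hit - pos);
        out->count++;
        pos = hit + sep_size;
    }

    /* Assumption of this sketch: a leftover separator makes the input invalid. */
    if (find_sep(pos, (size_t)(end - pos), sep, sep_size) != NULL)
        return false;

    /* The remainder of the buffer is the final slice. */
    out->slices[out->count] = pos;
    out->sizes[out->count] = (size_t)(end - pos);
    out->count++;
    return true;
}
```

A target built on a helper like this would return early from LLVMFuzzerTestOneInput whenever the helper returns false, and otherwise hand each slice to the function under test.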
In fuzz_config, we use the | character as a separator since this cannot appear in a configuration string. A correctly separated input therefore has the form <configuration string>|<key>.
But for data where there is no distinct character that can be used as a sentinel, we can provide a byte sequence such as 0xdeadbeef. In that case, a valid input consists of the parts of binary data with the bytes 0xdeadbeef between them.
After implementing a new fuzzing target, developers typically want to validate that it's doing something useful. If the fuzzer is not producing failures, it's either because the code under test is robust or the fuzzing target isn't doing a good job of exercising different code paths.
In order to view code coverage information, the build will need to be configured with the -fprofile-instr-generate and -fcoverage-mapping flags to tell Clang to instrument WiredTiger with profiling information. It's important that these are added to both the CFLAGS and LINKFLAGS variables.
Build and invoke fuzz_run.sh for the desired fuzzer as described in the section above.
After running the fuzzer with Clang coverage switched on, there should be a number of profraw files in the working directory.
Those files contain the raw profiling data; however, some post-processing is required to get it into a readable form. WiredTiger provides a script called fuzz_coverage.sh that handles this. It expects to be called from the same directory that the fuzzer was executed in.
In general the usage is:
The fuzz_coverage.sh script produces a few outputs:
<fuzz-test-binary>_cov.txt: A coverage report in text format. It can be inspected with the less command and searched for functions of interest. The numbers on the left of each line of code indicate how many times that line was hit by the fuzzer.
<fuzz-test-binary>_cov.html: A coverage report in HTML format. If a web browser is available, this might be a nicer way to visualize the coverage.
The fuzz_coverage.sh script uses a few optional environment variables to modify its behavior.
PROFDATA_BINARY: The binary used to merge the profiling data. The script defaults to using llvm-profdata.
COV_BINARY: The binary used to generate coverage information. The script defaults to using llvm-cov.
For consistency, the script should use the llvm-profdata and llvm-cov binaries from the same LLVM release as the clang compiler used for the build. In the example above, clang-8 was used in the configuration, so the corresponding fuzz_coverage.sh invocation should look like this: