Dynamically Toggling Binary Instrumentation Sites at Runtime

Why this exists

This is not a particularly happy story.

A couple of years ago I worked with someone on a bit of binary-fuzzing-related stuff, part commercial, part research. The commercial side quietly died because of collaboration issues. On the research side, I did manage to build some working pieces, but the process was miserable and the outcome never became something I genuinely wanted to keep pushing toward a paper.

After that collaboration ended, I considered spinning the work into a standalone paper. The more I looked at it, though, the less appropriate that felt. The application was incomplete, the research flavor was not especially strong, and in some scenarios the approach could even produce incorrect patches. So instead of forcing it into a half-baked paper, I decided to write it up as a technical blog post and at least preserve the idea, the implementation, and the pitfalls before I forgot how any of it worked.

What this post is really about is finding sanitizer-related checks inside the final binary, disabling the corresponding machine code, and restoring it later when needed. It is not a mature technique I would recommend deploying everywhere with confidence, but there are a few implementation details in it that I still think are worth documenting.

The hard parts

At first glance, disabling a sanitizer check sounds easy: find the relevant __asan_report calls and NOP them out. In practice, it is nowhere near that simple. You first have to separate the machine code inserted by the sanitizer from the program’s actual logic. Then, after optimization, instruction selection, and linking are all done, you have to find that exact region again inside the final ELF. And finally, you cannot just chop out a few bytes at random, because if you cut an x86 instruction in half, the program dies on the spot.

My first thought was to cheat a little and recover the final byte range from things like function:line, debug info, or even some IR location metadata. That idea fell apart quickly. The optimizer reorders things, instruction selection expands things, and the linker keeps changing the layout. If you want stable positioning in the final binary, you cannot rely on “it used to be roughly here.” You need to leave behind a marker you can recognize later.

So I ended up with this workflow: identify the checks during compilation, plant a marker I can recover later, and then, once the final binary is produced, walk back from that marker to locate the machine code and patch the corresponding file offsets. The first half is done by an LLVM pass; the second half is handled by a small post-link tool.

IR recognition

The timing of the pass matters a lot: it has to run after the sanitizer pass. Only then have ASan and UBSan already inserted their checking logic into the IR, which gives the later analysis something concrete to inspect.

The most direct starting point is, naturally enough, the sanitizer failure path. The basic check looks like this:

if (name.startswith("__ubsan_handle") ||
    name.startswith("__asan_report")) {
    return true;
}

In other words, if a basic block eventually calls __ubsan_handle* or __asan_report*, it is first treated as a failure block. But just knowing the failure block is not enough. What I actually want to disable is not merely the final call __asan_report...; it is the entire slice of code that exists only to perform this check: the preceding comparison, the branch, the helper calculations inside the failure block, and even the data dependencies that exist solely to support the condition.

In practice, the analysis starts from those failure blocks and walks backward. Instructions in the failure block are pulled in first. Then the conditional branches that lead into those blocks are pulled in as well. If an instruction is used only by other instructions that have already been classified as checking logic, it gets absorbed too. If every instruction in a block gets absorbed, then the pass keeps walking backward through that block’s predecessors.

The most delicate part here is avoiding normal program logic. If an IR instruction serves both the sanitizer check and the program’s real behavior, it cannot be included. The analysis only continues to absorb an instruction when its sole purpose is to serve that check. That step matters a lot because it determines how large the eventual patch region becomes. If you relax this rule, it becomes very easy to wipe out code that actually belongs to user logic.
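
The absorption rule is easier to see on a toy dependency graph than on real LLVM IR. The sketch below is only a model: ToyInst, absorbCheckOnly, and the integer instruction ids are hypothetical names, and a real pass would walk llvm::Instruction users and basic-block predecessors instead. Skipping instructions that have no users at all is an assumption of this sketch, so that stores and other side-effecting instructions are never pulled in implicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy stand-in for an IR instruction: the ids of the instructions that use its result.
struct ToyInst {
    std::vector<int> users;
};

// Grow the seed set (failure-block instructions plus their guarding branches)
// until nothing else qualifies. An instruction qualifies only if it has at
// least one user and every user is already classified as check logic.
std::set<int> absorbCheckOnly(const std::vector<ToyInst>& insts, std::set<int> checkSet) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t i = 0; i < insts.size(); ++i) {
            const int id = static_cast<int>(i);
            if (checkSet.count(id) || insts[i].users.empty())
                continue;
            bool onlyFeedsCheck = true;
            for (int user : insts[i].users) {
                assert(user >= 0 && static_cast<std::size_t>(user) < insts.size());
                if (!checkSet.count(user)) {
                    onlyFeedsCheck = false;  // also serves normal program logic
                    break;
                }
            }
            if (onlyFeedsCheck) {
                checkSet.insert(id);
                changed = true;
            }
        }
    }
    return checkSet;
}
```

A value that feeds both the shadow-memory computation and a real program use stays out of the set, which is exactly the case that keeps the patch region from swallowing user logic.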

In the end, the pass attaches a shared metadata tag to every IR instruction that was marked this way. At that point the compiler knows which pieces are check logic and which are not, but nothing yet ties them to addresses in the final binary.

Planting markers

The mechanism itself is simple: insert one call at the beginning of every contiguous check region and pass it a random integer:

__mark_check(random_id);

At the same time, write that random integer and its corresponding source location into a mapping file that looks roughly like this:

1804289383 = parse_config:117
846930886 = decode_header:52

This file is just a ledger. It ties the random number to a source location. The real binary position gets reconstructed only after linking.
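
Reading the ledger back is simple enough to sketch. The following assumes exactly the "id = function:line" shape shown above; SourceLoc and parseMapping are hypothetical names, and letting a later duplicate id overwrite an earlier one is just one possible policy for an append-only file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <exception>
#include <istream>
#include <map>
#include <sstream>
#include <string>

struct SourceLoc {
    std::string function;
    unsigned line = 0;
};

// Parse lines of the form "<id> = <function>:<line>" into an id -> location map.
// Malformed lines are skipped; on a duplicate id the later entry wins.
std::map<std::uint32_t, SourceLoc> parseMapping(std::istream& in) {
    std::map<std::uint32_t, SourceLoc> out;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::uint32_t id = 0;
        std::string eq, loc;
        if (!(fields >> id >> eq >> loc) || eq != "=")
            continue;
        // Split at the last ':' so function names containing ':' still parse.
        const std::size_t colon = loc.rfind(':');
        if (colon == std::string::npos || colon + 1 >= loc.size())
            continue;
        SourceLoc entry;
        entry.function = loc.substr(0, colon);
        try {
            entry.line = static_cast<unsigned>(std::stoul(loc.substr(colon + 1)));
        } catch (const std::exception&) {
            continue;  // line number was not numeric
        }
        out[id] = entry;
    }
    return out;
}
```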

The marker function

If you just drop in an empty helper call, the compiler will probably optimize it away. So this helper has to satisfy two constraints: it needs to survive into the final binary, and it must not change the program’s semantics. Ideally it should also leave behind a stable constant near the call site. The implementation looks roughly like this:

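__attribute__((noinline, optnone))
void __mark_check(int x) {
    /* Do the arithmetic in unsigned so the product wraps instead of
       overflowing a signed int, which would be undefined behavior. */
    unsigned u = (unsigned)x;
    if ((((u ^ ~0u) * u) & 1u) == 0) {
        return;
    }
    __builtin_unreachable();
}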

That condition is always true. It multiplies x by its bitwise complement, and x and ~x always differ in their lowest bit, so exactly one of them is even and the product always has a low bit of 0. So semantically the function always just returns, but from the compiler’s perspective it is not a totally trivial shell that can be eliminated at a glance.
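
The parity argument can be checked mechanically. This is a small sketch, not part of the tooling: markerConditionHolds and checkSample are hypothetical names, and the arithmetic is done on uint32_t so that the product wraps instead of overflowing.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the marker's condition with wrapping 32-bit arithmetic.
// x ^ 0xFFFFFFFF is ~x; x and ~x differ in bit 0, so exactly one is even.
constexpr bool markerConditionHolds(std::uint32_t x) {
    return (((x ^ 0xFFFFFFFFu) * x) & 1u) == 0;
}

// Compile-time spot checks, including the edge values.
static_assert(markerConditionHolds(0u), "zero");
static_assert(markerConditionHolds(1u), "odd value");
static_assert(markerConditionHolds(0x01234567u), "example id");
static_assert(markerConditionHolds(0xFFFFFFFFu), "all ones");

// Runtime check over a contiguous sample of ids (the sum wraps around on purpose).
bool checkSample(std::uint32_t start, std::uint32_t count) {
    for (std::uint32_t i = 0; i < count; ++i) {
        if (!markerConditionHolds(start + i))
            return false;
    }
    return true;
}
```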

More importantly, under the x86_64 SysV ABI, the first int argument is passed in edi. That makes the call site naturally grow into something like this:

bf 67 45 23 01    mov    edi, 0x01234567
e8 xx xx xx xx    call   __mark_check_xxx

That immediate inside mov edi, imm32 is the random number deliberately planted earlier. That is the signature used later during relocation.

Relocation

After linking, the tool first performs a preprocessing pass. The inputs are just two things: the mapping file emitted during compilation, and the final linked ELF executable. The program locates the .text section, retrieves its file offset, size, and virtual-address base, and then performs all scanning only inside that code section so it does not accidentally mistake matching byte patterns in data sections for real code.
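
Locating .text can be sketched with nothing more than the structures from the system elf.h header. The version below assumes a little-endian ELF64 image already loaded into memory and a Linux-style elf.h; TextSection and findText are hypothetical names, and files that use extended section numbering are simply rejected.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <elf.h>
#include <vector>

struct TextSection {
    std::uint64_t fileOffset;
    std::uint64_t size;
    std::uint64_t vaddr;
};

// Find .text in an ELF64 image held in memory. Returns false if the image is not
// a little-endian ELF64 file, is truncated, or has no section named ".text".
bool findText(const std::vector<std::uint8_t>& image, TextSection& out) {
    Elf64_Ehdr eh;
    if (image.size() < sizeof eh) return false;
    std::memcpy(&eh, image.data(), sizeof eh);
    if (std::memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64 ||
        eh.e_ident[EI_DATA] != ELFDATA2LSB)
        return false;
    if (eh.e_shentsize != sizeof(Elf64_Shdr) || eh.e_shstrndx >= eh.e_shnum)
        return false;

    // Bounds-checked read of one section header.
    auto readShdr = [&](std::uint64_t index, Elf64_Shdr& sh) {
        const std::uint64_t off = eh.e_shoff + index * sizeof(Elf64_Shdr);
        if (off > image.size() || image.size() - off < sizeof sh) return false;
        std::memcpy(&sh, image.data() + off, sizeof sh);
        return true;
    };

    // The section-name string table must lie inside the image.
    Elf64_Shdr names;
    if (!readShdr(eh.e_shstrndx, names)) return false;
    if (names.sh_offset > image.size() ||
        image.size() - names.sh_offset < names.sh_size)
        return false;

    for (std::uint64_t i = 0; i < eh.e_shnum; ++i) {
        Elf64_Shdr sh;
        if (!readShdr(i, sh)) return false;
        // Need room for ".text" plus its terminating NUL (6 bytes).
        if (sh.sh_name >= names.sh_size || names.sh_size - sh.sh_name < 6)
            continue;
        const std::uint8_t* name = image.data() + names.sh_offset + sh.sh_name;
        if (std::memcmp(name, ".text", 6) == 0) {
            out = TextSection{sh.sh_offset, sh.sh_size, sh.sh_addr};
            return true;
        }
    }
    return false;
}
```

The three fields it returns are the ones the rest of the pipeline needs: where to read the bytes from, how many to scan, and which virtual address the first byte corresponds to.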

From there, it can linearly scan .text for this pattern:

0xBF <imm32> 0xE8 <rel32>

That is, mov edi, imm32; call rel32. On its own, that pattern can definitely collide with unrelated program code, because a normal binary may also happen to load an immediate into edi and then perform a call. So I add two more checks. First, match imm32 against the random numbers recorded in the mapping file. Then resolve the call target and verify that it points to the marker function in the symbol table. Only when both checks succeed do I accept the location as a genuine marker.
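
Put together, the scan and its two extra checks might look roughly like this. It is a sketch under stated assumptions: the .text bytes and base address are already extracted, the marker function's address has already been looked up in the symbol table, and MarkerHit, readLe32, and findMarkers are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

struct MarkerHit {
    std::uint32_t id;        // imm32 loaded into edi
    std::uint64_t vaddr;     // virtual address of the mov instruction
    std::size_t textOffset;  // offset of the mov inside .text
};

// Little-endian 32-bit read; the caller guarantees p[0..3] is in bounds.
static std::uint32_t readLe32(const std::uint8_t* p) {
    return std::uint32_t(p[0]) | std::uint32_t(p[1]) << 8 |
           std::uint32_t(p[2]) << 16 | std::uint32_t(p[3]) << 24;
}

// Byte-wise scan for BF <imm32> E8 <rel32>. A hit is kept only if imm32 is a
// recorded id and the call resolves to markerAddr.
std::vector<MarkerHit> findMarkers(const std::vector<std::uint8_t>& text,
                                   std::uint64_t textVaddr,
                                   const std::set<std::uint32_t>& ids,
                                   std::uint64_t markerAddr) {
    const std::size_t kPatternLen = 10;  // 5-byte mov + 5-byte call
    std::vector<MarkerHit> hits;
    for (std::size_t i = 0; i + kPatternLen <= text.size(); ++i) {
        if (text[i] != 0xBF || text[i + 5] != 0xE8)
            continue;
        const std::uint32_t imm = readLe32(&text[i + 1]);
        if (!ids.count(imm))
            continue;
        // rel32 is signed and relative to the instruction following the call.
        const std::int32_t rel = static_cast<std::int32_t>(readLe32(&text[i + 6]));
        const std::uint64_t next = textVaddr + i + kPatternLen;
        const std::uint64_t target =
            next + static_cast<std::uint64_t>(static_cast<std::int64_t>(rel));
        if (target != markerAddr)
            continue;
        hits.push_back(MarkerHit{imm, textVaddr + i, i});
    }
    return hits;
}
```

Because the scan is byte-wise rather than instruction-aligned, the id and call-target checks are doing real work here: they are what filters out accidental matches that start in the middle of some other instruction.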

Once that works, a source-level check can be mapped back onto a specific location in the final binary. For convenience in the following stages, I store the result in a JSON file. In addition to the random number and source location, this file gradually accumulates the final virtual address, file offset, and related metadata.

Patch boundaries

Finding the marker is only the beginning. The truly annoying problem is deciding how many bytes to patch.

Patching only the call __mark_check instruction is obviously not enough, because the expensive part is the sanitizer logic that follows. But chopping out a fixed number of bytes is also unsafe, because it is far too easy to cut an x86 instruction in the middle. What I use right now is a rather lazy heuristic that is good enough for the current prototype: start from the marker, disassemble forward, stop at the first call after the marker call itself, and then record the entire range starting at mov edi, imm32 and ending at the last byte of that call.
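
The boundary rule itself fits in a few lines once the instructions have been decoded. The sketch below deliberately leaves the disassembler out: it assumes some decoder has already produced a linear list of instructions starting at the marker's mov, and DecodedInsn, PatchRange, and computePatchRange are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One decoded instruction, as produced by whatever disassembler is in use.
struct DecodedInsn {
    std::uint64_t vaddr;
    std::uint8_t length;
    bool isCall;
};

struct PatchRange {
    std::uint64_t start;   // address of mov edi, imm32
    std::uint64_t length;  // bytes up to and including the closing call
};

// insns[0] must be the 5-byte mov and insns[1] the marker call. The range is
// closed by the first call after that. Returns false if the shape is
// unexpected, the list has a gap, or no closing call is found.
bool computePatchRange(const std::vector<DecodedInsn>& insns, PatchRange& out) {
    if (insns.size() < 3 || insns[0].length != 5 || !insns[1].isCall)
        return false;
    for (std::size_t i = 1; i < insns.size(); ++i) {
        // Every instruction must start where the previous one ended, so the
        // range always falls on instruction boundaries.
        if (insns[i].vaddr != insns[i - 1].vaddr + insns[i - 1].length)
            return false;
        if (i >= 2 && insns[i].isCall) {
            out.start = insns[0].vaddr;
            out.length = insns[i].vaddr + insns[i].length - insns[0].vaddr;
            return true;
        }
    }
    return false;
}
```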

So the end result is not a point but a full range like this:

bf 67 45 23 01    mov    edi, 0x01234567
e8 .. .. .. ..    call   __mark_check_xxx
...               ; comparisons, branches, helper calculations
e8 .. .. .. ..    call   __asan_report_load8

That whole range is written into the JSON file. Alongside the source location, final virtual address, and file offset, I also record the total length of the region and the original bytes themselves. Those bytes are first encoded as Base64, which makes restoration easy later: no recompilation is needed, and the original content can just be written back verbatim.
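
For completeness, here is a dependency-free sketch of the Base64 step using the standard alphabet with '=' padding. The function names are hypothetical, and a real tool would more likely lean on whatever JSON or encoding library it already uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

static const char kAlphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

std::string base64Encode(const std::vector<std::uint8_t>& in) {
    std::string out;
    out.reserve((in.size() + 2) / 3 * 4);
    for (std::size_t i = 0; i < in.size(); i += 3) {
        const std::size_t n = in.size() - i < 3 ? in.size() - i : 3;  // bytes in this group
        std::uint32_t v = std::uint32_t(in[i]) << 16;
        if (n > 1) v |= std::uint32_t(in[i + 1]) << 8;
        if (n > 2) v |= in[i + 2];
        out += kAlphabet[(v >> 18) & 63];
        out += kAlphabet[(v >> 12) & 63];
        out += n > 1 ? kAlphabet[(v >> 6) & 63] : '=';
        out += n > 2 ? kAlphabet[v & 63] : '=';
    }
    return out;
}

// Returns false on characters outside the alphabet or on bad length/padding.
bool base64Decode(const std::string& in, std::vector<std::uint8_t>& out) {
    out.clear();
    if (in.size() % 4 != 0) return false;
    for (std::size_t i = 0; i < in.size(); i += 4) {
        std::uint32_t v = 0;
        int pad = 0;
        for (std::size_t k = 0; k < 4; ++k) {
            const char c = in[i + k];
            if (c == '=') {
                // Padding may only occupy the last one or two slots of the final group.
                if (i + 4 != in.size() || k < 2) return false;
                ++pad;
                v <<= 6;
                continue;
            }
            if (pad != 0 || c == '\0') return false;
            const char* pos = std::strchr(kAlphabet, c);
            if (pos == nullptr) return false;
            v = (v << 6) | static_cast<std::uint32_t>(pos - kAlphabet);
        }
        out.push_back(static_cast<std::uint8_t>(v >> 16));
        if (pad < 2) out.push_back(static_cast<std::uint8_t>(v >> 8));
        if (pad < 1) out.push_back(static_cast<std::uint8_t>(v));
    }
    return true;
}
```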

Rewriting and restoring

Once that JSON file exists, the remaining work is fairly straightforward. To disable a chosen set of sites, replace every byte in the corresponding ranges with 0x90, the single-byte x86 NOP. To restore them, decode the saved original bytes and write them back exactly as they were.
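
On an in-memory copy of the file, the two operations could look like this. Site, disableSite, and restoreSite are hypothetical names. The check that the current bytes still match the saved originals before patching is an extra safeguard added in this sketch, not something the prototype is described as doing.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Site {
    std::size_t fileOffset;                   // where the range starts in the file
    std::vector<std::uint8_t> originalBytes;  // decoded from the saved record
};

// True if the whole range lies inside the image.
static bool fits(const std::vector<std::uint8_t>& image, const Site& s) {
    return s.fileOffset <= image.size() &&
           s.originalBytes.size() <= image.size() - s.fileOffset;
}

// Overwrite the range with NOPs. Refuses to patch if the bytes currently in the
// image are not the recorded originals (already patched, or a different build).
bool disableSite(std::vector<std::uint8_t>& image, const Site& s) {
    if (!fits(image, s)) return false;
    auto first = image.begin() + static_cast<std::ptrdiff_t>(s.fileOffset);
    if (!std::equal(s.originalBytes.begin(), s.originalBytes.end(), first))
        return false;
    std::fill_n(first, s.originalBytes.size(), std::uint8_t{0x90});
    return true;
}

// Write the saved bytes back verbatim.
bool restoreSite(std::vector<std::uint8_t>& image, const Site& s) {
    if (!fits(image, s)) return false;
    std::copy(s.originalBytes.begin(), s.originalBytes.end(),
              image.begin() + static_cast<std::ptrdiff_t>(s.fileOffset));
    return true;
}
```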

There is one small gotcha here: range checks are currently performed against file offsets, not runtime virtual addresses. So if the interface simply calls this thing an “address,” that is not entirely honest and still needs cleanup later. Another thing worth clarifying is that the so-called “runtime toggling” here is closer to switching target binaries between fuzzing rounds than to actual self-modifying code inside a live process. What it really solves is an execution-strategy question: before the next run starts, do I want this batch of sanitizer checks to exist or not?
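
Keeping the two address spaces apart is mostly a matter of converting explicitly at the boundary. A minimal sketch, assuming the site lies inside .text and that the section's file offset, size, and base virtual address are known; the struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

struct TextSection {
    std::uint64_t fileOffset;  // where .text starts in the file
    std::uint64_t size;        // size of .text in bytes
    std::uint64_t vaddr;       // virtual address of the first byte of .text
};

// Convert a virtual address inside .text to a file offset.
bool vaddrToFileOffset(const TextSection& t, std::uint64_t vaddr,
                       std::uint64_t& fileOffset) {
    if (vaddr < t.vaddr || vaddr - t.vaddr >= t.size) return false;
    fileOffset = t.fileOffset + (vaddr - t.vaddr);
    return true;
}

// Convert a file offset inside .text back to a virtual address.
bool fileOffsetToVaddr(const TextSection& t, std::uint64_t fileOffset,
                       std::uint64_t& vaddr) {
    if (fileOffset < t.fileOffset || fileOffset - t.fileOffset >= t.size) return false;
    vaddr = t.vaddr + (fileOffset - t.fileOffset);
    return true;
}
```

With helpers like these, an interface can accept either form and say which one it means, instead of calling both an "address."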

Limitations

This design already works, but it is still very obviously a research prototype. Right now it is basically hard-wired to x86_64 ELF, and it assumes the mov edi, imm32; call rel32 shape will remain stable. If the binary is stripped too aggressively and the symbol-table entry for the marker function disappears, relocation becomes much harder. Patch boundaries are currently closed off by “the first call that follows,” which is, at the end of the day, just a heuristic. __ubsan_handle* is also currently treated wholesale as a failure path, so recoverable UBSan checks are not distinguished yet. The mapping file is still append-only, so if you run the tooling repeatedly during development it is best to clean it up manually now and then. And because it only records function:line, incomplete debug info also weakens the mapping.