Debugging loadlibrary Through Space and Time
In 2017 Tavis Ormandy released loadlibrary, a "library that allows native Linux programs to load and call functions from a Windows DLL". As a showcase, the code included mpclient, a program capable of loading mpengine.dll of Windows Defender and scanning files for malware on Linux. This is an impressive feat: mpengine.dll is a notoriously complex, ~20MB library that I tend to use to stress-test static analysis tools - getting it to actually execute on a different operating system is really something!
Unfortunately mpengine.dll gets significant updates almost every month: with 5% of functions changing or unmatched in a library this size, we are talking thousands of changes monthly. It's no surprise mpclient has old unresolved issues about crashes with no easy fixes. While I know about private projects that successfully utilized loadlibrary for researching Defender and even other AV engines, the library was mostly unmaintained for years.
A couple of weeks ago, however, WaffleSec reported success with a recent mpengine.dll version after they fixed some mock APIs, sparking my interest in the project again. cube0x8, the author of a 64-bit fork (the original loadlibrary is 32-bit only), also entered the discussion and now there's a new PR promising support for mpengine.dll again.
Over the past few days I worked on merging WaffleSec's changes into the 64-bit branch and encountered a particularly nasty bug that provided a great use case for demonstrating the usefulness of loadlibrary: using Linux-based debugging tools to inspect code written for Windows. While similar tools are now available for Windows too, I think the demonstration of modern debugging techniques is educational by itself, and last but not least I secretly hope this post will bring in some more muscle for maintaining loadlibrary :)
Overview
From the loadlibrary README:
The peloader directory contains a custom PE/COFF loader derived from ndiswrapper. The library will process the relocations and imports, then provide a dlopen-like API.
With the program in memory, the main problem to solve is providing it the interfaces it expects from Windows. In loadlibrary this is solved by mocking the Windows API: the peloader/winapi directory of the project contains minimal implementations for methods exposed by standard DLLs like USER32, and these methods are used to populate the import table of the loaded Portable Executable. Since mpengine.dll is "self-contained" (in fact, its large size is mostly the result of static linking) we can get away with e.g. not implementing different behaviors for different flags, or simply doing nothing in case of more complex requests, like ones for threading.
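To make the idea of populating the import table from mocks more concrete, here is a minimal sketch of a name-to-function lookup. It is not loadlibrary's actual mechanism: the table layout, `resolve_import` and the two placeholder mocks are hypothetical names introduced only for illustration, and the mocks ignore their real Windows signatures.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical placeholder mocks; real ones take the documented Windows
 * arguments and must use the Windows calling convention. */
static int mock_get_user_default_lcid(void) { return 0x0400; }
static int mock_initialize_security_descriptor(void) { return 1; }

typedef int (*mock_fn)(void);

/* One entry per imported symbol that we are able to serve. */
struct mock_export {
    const char *name;  /* symbol name as it appears in the PE import table */
    mock_fn     fn;    /* Linux-side implementation */
};

static const struct mock_export mock_exports[] = {
    { "GetUserDefaultLCID",           mock_get_user_default_lcid },
    { "InitializeSecurityDescriptor", mock_initialize_security_descriptor },
};

/* Return the mock registered for an import, or NULL if none exists.
 * A loader would write the returned pointer into the import table slot. */
static mock_fn resolve_import(const char *name)
{
    size_t count = sizeof(mock_exports) / sizeof(mock_exports[0]);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(mock_exports[i].name, name) == 0)
            return mock_exports[i].fn;
    }
    return NULL;
}
```

An import that resolves to NULL is exactly the kind of gap that later shows up as a jump into the wilderness, so a loader built like this would probably want to log every unresolved name.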
Since we don't expect our code to run in untrusted environments, memory is not randomized, which helps analysis and debugging. In case of my tests:
- The .text section of mpengine.dll started at 0x75a101000
- mpclient's code was mapped at 0x55bf2bb99000
The Merge
So I merged cube0x8's and WaffleSec's branches, resolved conflicts and got a SEGFAULT on the first run:
Program received signal SIGSEGV, Segmentation fault.
0x00000000ffffde70 in ?? ()
(gdb)
The instruction pointer is clearly off in the wilderness, but fortunately we have a partial backtrace:
(gdb) bt
#0 0x00000000ffffde70 in ?? ()
#1 0x000000075a11e16b in ?? ()
#2 0x000000075ae21a88 in ?? ()
#3 0x00007fffffffe1f0 in ?? ()
#4 0x0000000000000000 in ?? ()
At #1 we see an address inside mpengine.dll, so let's look at it in Ghidra (here's how to use Ghidra to debug loadlibrary with symbols for mpengine.dll):
75a11e165 CALL qword ptr [->ADVAPI32.DLL::InitializeSecurityDescriptor]
75a11e16b TEST EAX,EAX
A quick grep shows that InitializeSecurityDescriptor is not present in our mock API yet, so let's create it:
STATIC BOOL WINAPI InitializeSecurityDescriptor(
PVOID pSecurityDescriptor,
DWORD dwRevision
){
DebugLog("Returning success from InitializeSecurityDescriptor");
return 1;
}
DECLARE_CRT_EXPORT("InitializeSecurityDescriptor", InitializeSecurityDescriptor);
I just expose an empty function, as there is no one to check the security descriptor anyway. By iterating this process I ended up mocking some more APIs. Some of my first attempts quickly killed mpclient as I forgot to include the WINAPI macro in the declaration: this results in mpengine.dll calling the import with a different calling convention (RCX, RDX, R8, R9, stack) than expected by my implementation (RDI, RSI, RDX, RCX, R8, R9, stack), causing quick and merciless segfaults.
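The effect of the calling-convention annotation can be sketched with GCC/Clang's `ms_abi` function attribute. This is an assumption about how such a macro can be implemented on x86-64, not a quote from loadlibrary's headers; `SKETCH_WINAPI`, `mock_sum4` and `call_like_the_dll` are hypothetical names.

```c
#include <stdint.h>

/* Assumption: on x86-64 the Windows convention is selected with ms_abi.
 * On other targets the macro expands to nothing so the sketch still builds. */
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#define SKETCH_WINAPI __attribute__((ms_abi))
#else
#define SKETCH_WINAPI
#endif

/* A mock compiled to receive its arguments in RCX, RDX, R8, R9. */
static uint64_t SKETCH_WINAPI mock_sum4(uint64_t a, uint64_t b,
                                        uint64_t c, uint64_t d)
{
    /* Distinct weights make swapped or missing arguments visible. */
    return a + 10 * b + 100 * c + 1000 * d;
}

/* The DLL calls imports through pointers that assume the Windows ABI. */
typedef uint64_t (SKETCH_WINAPI *import_fn)(uint64_t, uint64_t,
                                            uint64_t, uint64_t);

/* Stand-in for code inside the DLL calling an imported function. */
static uint64_t call_like_the_dll(import_fn fn)
{
    return fn(1, 2, 3, 4);
}
```

Because the attribute is part of the function pointer type, the compiler can flag a mock declared without it when it is assigned to `import_fn`. The loader loses that safety net once it stores everything as untyped pointers.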
Interestingly, it seems the newly implemented APIs were not needed by WaffleSec when testing the 32-bit DLL. I spent quite some time trying to figure out why the two binaries behave differently, and found that the "TDT" component of mpengine.dll does some pretty detailed platform detection which can explain divergent code paths on different architectures, but I didn't identify the point of divergence: my mock APIs worked well enough and I encountered a much more worrying bug.
The Bug
This is what our little bug looked like:
Program received signal SIGSEGV, Segmentation fault.
0x000000075a12bb96 in ?? ()
(rr) x/8i $rip
=> 0x75a12bb96: mov rcx,QWORD PTR [rax]
0x75a12bb99: call 0x75a12b5a8
0x75a12bb9e: mov DWORD PTR [rbx+0x50],eax
0x75a12bba1: mov rax,QWORD PTR [rbx+0x8]
0x75a12bba5: lea r8,[rip+0xd0fd06] # 0x75ae3b8b2
0x75a12bbac: lea rdx,[rip+0xd0fcfe] # 0x75ae3b8b1
0x75a12bbb3: mov rcx,QWORD PTR [rax]
0x75a12bbb6: call 0x75a12b5a8
(rr) i r
rax 0x69006e0075002f 29555345008689199
rbx 0x7fffdc0c5bc0 140736885185472
rcx 0x55bf6ac0c640 94280618133056
rdx 0x75ae3b8a9 31589644457
rsi 0x0 0
Uh-oh, it seems a wide-char string overwrote a pointer, a bad case of memory corruption! Have I miscalculated some bounds somewhere? My first hunch was to see what the corrupting string was so I may be able to pinpoint the source of the corruption. This wasn't really useful, because:
- The string turned out to be the URL of one of Defender's many telemetry services
- The string was cut in half by a NULL byte, indicating that I may be looking at a late state of the original corruption.
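As a quick sanity check of the "wide-char string" reading, the value in RAX can be decoded as four little-endian UTF-16 code units. The helper below is a hypothetical illustration written for this post, not part of loadlibrary.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Interpret a 64-bit register value as four little-endian UTF-16 code units
 * and copy the ASCII ones into out, which must hold at least 5 bytes.
 * Decoding stops at a NUL or non-ASCII unit; returns the characters written.
 */
static size_t decode_utf16_qword(uint64_t value, char *out)
{
    size_t n = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t unit = (uint16_t)(value >> (16 * i));
        if (unit == 0 || unit > 0x7f)
            break;
        out[n++] = (char)unit;
    }
    out[n] = '\0';
    return n;
}
```

Feeding it the crashing RAX value 0x69006e0075002f yields the four characters "/uni", which is consistent with a fragment of a URL path.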
Looking at Ghidra's disassembly I also realized I'm neck-deep in the regex engine of the boost library, with no suspicious Windows API calls in sight (only 1-2 functions are visible in any given stack trace for some reason I haven't looked into). This seemed like a lot of trouble and I spent at least a full day investigating different dead ends I can't recall anymore. Then I remembered: the purpose of loadlibrary is to enable the use of analysis tools, so why not start using one (or two) already?!
Since I suspected heap corruption, my first tool of choice was AddressSanitizer (ASAN): our mock Windows API invokes plain old malloc() in place of HeapAlloc & co. so we can instrument and monitor memory allocations even inside mpengine.dll!
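A mock allocator in this spirit might look like the sketch below. It is a simplified stand-in rather than the project's winapi/Heap.c: the name `mock_heap_alloc` and the plain C parameter types are my own choices, while `HEAP_ZERO_MEMORY` carries its value from the Windows SDK headers.

```c
#include <stdint.h>
#include <stdlib.h>

#define HEAP_ZERO_MEMORY 0x00000008u  /* flag value from the Windows SDK */

/*
 * Sketch of a mocked HeapAlloc: the heap handle is ignored and the request
 * is forwarded to the C allocator, so tools that hook malloc/calloc (ASAN,
 * rr, valgrind) observe allocations made from inside the DLL as well.
 */
static void *mock_heap_alloc(void *heap, uint32_t flags, size_t bytes)
{
    (void)heap;                   /* this sketch has a single "heap" */
    if (flags & HEAP_ZERO_MEMORY)
        return calloc(bytes, 1);  /* caller asked for zeroed memory */
    return malloc(bytes);         /* contents are indeterminate */
}
```

Note that without `HEAP_ZERO_MEMORY` the returned buffer contains whatever the allocator left there, which matters for the rest of this story.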
To my surprise, ASAN didn't catch anything: the crash occurred at exactly the same place without any prior indication of heap corruption. But! The memory layout turned out quite different:
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff335eb96 in ?? ()
(gdb) i r
rax 0xbebebebebebebebe -4702111234474983746
rbx 0x7fffffffcc10 140737488342032
rcx 0x61400002aa40 106927505975872
rdx 0x7ffff406e8a9 140737287481513
rsi 0x0 0
...
(gdb) x/16x $rcx
0x61400002aa40: 0x00000000 0x00000000 0x00000000 0x00000000
0x61400002aa50: 0x00000000 0x00000000 0xbebebebe 0xbebebebe
0x61400002aa60: 0xbebebebe 0xbebebebe 0xbebebebe 0x00000000
0x61400002aa70: 0xbebebebe 0xbebebebe 0xbebebebe 0xbebebebe
I'm showing RCX because it points to the object where the values are copied from. Preceding code looks like this:
undefined __thiscall basic_regex_creator<> * __thiscall
boost::re_detail_500::basic_regex_creator<>::basic_regex_creator<>
(basic_regex_creator<> *this,regex_data<> *param_1)
assume GS_OFFSET = 0xff00000000
undefined <UNASSIGNED> <RETURN>
basic_regex_cr RCX:8 (auto) this
regex_data<> * RDX:8 param_1
undefined8 Stack[0x8]:8 local_res8 XREF[1]: 75a12bb28(W)
75a12bb28 48 89 4c MOV qword ptr [RSP + local_res8],this
24 08
75a12bb2d 53 PUSH RBX
75a12bb2e 48 83 ec 20 SUB RSP,0x20
75a12bb32 48 8b d9 MOV RBX,this ; RBX := RCX (basic_regex_creator)
75a12bb35 48 89 11 MOV qword ptr [this],param_1 ; Save regex_data ptr (RDX) to this
75a12bb38 48 8b 42 18 MOV RAX,qword ptr [param_1 + 0x18] ; RAX := [regex_data + 0x18]
75a12bb3c 48 89 41 08 MOV qword ptr [this + 0x8],RAX ; Save regex_data+0x18 to this
75a12bb40 33 d2 XOR param_1,param_1
75a12bb42 48 89 51 10 MOV qword ptr [this + 0x10],param_1
; ... further object initialization ...
75a12bb6d 48 8b 09 MOV this,qword ptr [this] ; Replace this (RCX) with saved regex_data
75a12bb70 48 8b 81 MOV RAX,qword ptr [this + 0x160]
60 01 00 00
75a12bb77 48 89 81 MOV qword ptr [this + 0x168],RAX
68 01 00 00
75a12bb7e 48 8b 03 MOV RAX,qword ptr [RBX]
75a12bb81 89 50 2c MOV dword ptr [RAX + 0x2c],param_1
75a12bb84 48 8b 43 08 MOV RAX,qword ptr [RBX + 0x8] ; RAX is regex_data+0x18 restored from the original basic_regex_creator object
75a12bb88 4c 8d 05 LEA R8,[u+6]
1b fd d0 00
75a12bb8f 48 8d 15 LEA param_1,[w]
13 fd d0 00
75a12bb96 48 8b 08 MOV this,qword ptr [RAX] ; CRASH when reading regex_data+0x18
The 0xbe bytes indicate uninitialized memory with ASAN instrumentation. This means that we don't overwrite memory, but use it uninitialized (and wide chars are just leftover trash)! Uninitialized memory is even more fun to debug, because you are hunting for something that didn't happen :) While it wouldn't directly solve our problem, it would be nice to at least see when those uninitialized bytes were allocated.
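Spotting the fill pattern in a register dump is easy to automate. The check below assumes ASAN's default `malloc_fill_byte` of 0xbe (the value is configurable through ASAN_OPTIONS, so treat the constant as an assumption about the run); `looks_like_asan_fill` is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the run used ASAN's default malloc fill byte. */
#define ASAN_DEFAULT_MALLOC_FILL 0xbeu

/* True when every byte of the value equals the ASAN malloc fill byte,
 * i.e. the value was most likely read from fresh, never-written heap memory. */
static bool looks_like_asan_fill(uint64_t value)
{
    for (int i = 0; i < 8; i++) {
        if (((value >> (8 * i)) & 0xffu) != ASAN_DEFAULT_MALLOC_FILL)
            return false;
    }
    return true;
}
```

The RAX value from the ASAN run matches the pattern, while the wide-char value from the original crash does not.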
This is where rr comes to save the day! rr is a time-travel debugger for Linux that records a program's execution so it can be replayed deterministically, which allows us to go back in time and investigate any crimes, down to individual memory accesses. This also means that during replay even heap allocations become predictable, so I set a conditional breakpoint for the mocked HeapAlloc() call that breaks only when the resulting buffer is allocated at the same address where the offending regex_data object is observed at the time of crash.
After 12 hits we get our crash again, but this time we can go back to the last relevant(!) HeapAlloc() call and take a look at the backtrace:
(rr) b HeapAlloc if dwBytes==408
...
Program received signal SIGSEGV, Segmentation fault.
0x000000075a12bb96 in ?? ()
(rr) reverse-continue
Continuing.
Breakpoint 1, HeapAlloc (hHeap=0x48454150, dwFlags=0, dwBytes=408) at winapi/Heap.c:35
35 if (dwFlags & HEAP_ZERO_MEMORY) {
(rr) finish
Run till exit from #0 HeapAlloc (hHeap=0x48454150, dwFlags=0, dwBytes=408) at winapi/Heap.c:35
0x000000075a7746cc in ?? ()
Value returned is $1 = (void *) 0x55bf6ac0c640
(rr) bt
#0 0x000000075a7746cc in ?? ()
#1 0x0000000048454150 in ?? ()
#2 0x0000000000000000 in ?? ()
(rr)
Carefully single-stepping from here we end up in a constructor of regex_data (in this particular case we could get here from the static results too, but inheritance can make things tricky, esp. if we don't have symbols, which is frequently the case with mpengine.dll):
regex_data<> * __thiscall boost::re_detail_500::regex_data<>::regex_data<>(regex_data<> *this)
{
regex_traits_wrapper<> *prVar1;
LCID local_res10 [2];
regex_traits_wrapper<> *local_res18;
*(undefined8 *)this = 0;
*(undefined8 *)(this + 8) = 0;
*(undefined8 *)(this + 0x10) = 0;
prVar1 = (regex_traits_wrapper<> *)operator_new(0x10);
local_res18 = prVar1;
local_res10[0] = GetUserDefaultLCID();
object_cache<>::get((ulong *)prVar1,(__uint64)local_res10);
std::shared_ptr<>::shared_ptr<><>((shared_ptr<> *)(this + 0x18),prVar1);
*(undefined8 *)(this + 0x28) = 0;
*(undefined8 *)(this + 0x30) = 0;
*(undefined8 *)(this + 0x38) = 0;
*(undefined8 *)(this + 0x40) = 0;
*(undefined8 *)(this + 0x48) = 0;
*(undefined4 *)(this + 0x50) = 0;
memset(this + 0x54,0,0x100);
*(undefined4 *)(this + 0x154) = 0;
*(undefined8 *)(this + 0x168) = 0;
*(undefined8 *)(this + 0x160) = 0;
*(undefined8 *)(this + 0x158) = 0;
*(undefined4 *)(this + 0x170) = 0;
*(undefined8 *)(this + 0x178) = 0;
*(undefined8 *)(this + 0x180) = 0;
*(undefined8 *)(this + 0x188) = 0;
*(undefined2 *)(this + 400) = 0;
return this;
}
Now that this+0x18 reference looks awfully familiar from basic_regex_creator, and we even see a call to an external API! Here's the mock code:
STATIC DWORD GetUserDefaultLCID()
{
//value of LOCALE_USER_DEFAULT
DebugLog("");
return 0x0400;
}
Do you see? No? That's cool, I didn't see it either. This method is awfully boring, so let's look at some assembly instead:
undefined __thiscall regex_data<>(regex_data<> * this)
assume GS_OFFSET = 0xff00000000
undefined <UNASSIGNED> <RETURN>
regex_data<> * RCX:8 (auto) this
75a12bc2c 48 89 4c MOV qword ptr [RSP + local_res8],this
24 08
75a12bc31 53 PUSH RBX
75a12bc32 55 PUSH RBP
75a12bc33 56 PUSH RSI
75a12bc34 57 PUSH RDI
75a12bc35 48 83 ec 28 SUB RSP,0x28
75a12bc39 48 8b f1 MOV RSI,this
...
75a12bc59 ff 15 11 CALL qword ptr [->KERNEL32.DLL::GetUserDefaultLCID]
51 cf 00
...
75a12bc74 48 8d 4e 18 LEA this,[RSI + 0x18]
75a12bc78 e8 2b 01 CALL std::shared_ptr<>::shared_ptr<><>
00 00
The pointer (to offset 0x18 from this) is passed to shared_ptr via some arithmetic on RSI. This register is preserved during WINAPI calls, as we can see in the prologue of this well-behaved function, but GetUserDefaultLCID() is not defined as WINAPI! It is therefore compiled with the System V convention, where RSI is an argument register that the callee is free to overwrite. Funnily enough, the clobbered value we get after the API call is also a writable memory address, so shared_ptr doesn't crash, just writes to some unrelated place (also not causing any problems), leaving our object field at +0x18 uninitialized.
Adding the WINAPI macro to the declaration solves this problem - just like it fixed the much more obvious bugs in my new APIs a couple of days back...
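For illustration, a corrected mock could be declared along these lines. The sketch uses stand-in names (`SKETCH_WINAPI`, `SKETCH_DWORD`, `mock_GetUserDefaultLCID`) instead of the project's own macros, and it assumes the Windows convention is selected with GCC/Clang's `ms_abi` attribute on x86-64.

```c
#include <stdint.h>

/* Assumption: ms_abi selects the Windows x64 convention on x86-64.
 * On other targets the macro expands to nothing so the sketch still builds. */
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#define SKETCH_WINAPI __attribute__((ms_abi))
#else
#define SKETCH_WINAPI
#endif

typedef uint32_t SKETCH_DWORD;  /* stand-in for the Windows DWORD type */

/*
 * With the attribute in place the compiler treats RSI and RDI as
 * callee-saved inside this function, as the Windows x64 convention requires,
 * and restores them before returning if the body (or anything it calls,
 * such as a logging helper) uses them. Without it they are scratch registers.
 */
static SKETCH_DWORD SKETCH_WINAPI mock_GetUserDefaultLCID(void)
{
    return 0x0400;  /* LOCALE_USER_DEFAULT */
}
```

The return value is identical either way, which is why this class of bug hides so well: only the registers the caller did not expect to change give it away.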
The Conclusion
If I had to give you one takeaway it'd be this: If you find a bug, look for its relatives!
It's quite frustrating to see such an easy-to-spot bug causing such a mess, but at least I got to sharpen my debugging skills and train my muscle memory with rr, a tool that I think doesn't get the appreciation it deserves.
If you are interested in loadlibrary we still have a fresh list of APIs to mock, providing a great first task for new contributors!