by Mitja Kolsek, the 0patch Team
March 2021 Windows Updates included fixes for seven vulnerabilities in Windows DNS Server, two of which were marked by Microsoft as "Exploitation More Likely": CVE-2021-26877 and CVE-2021-26897. They were not known to be exploited and no details were publicly available until security researchers Eoin Carroll and Kevin McGrath published their analysis on the McAfee Labs blog. Their article included enough information for us to reproduce both vulnerabilities and then create micropatches for them.
This article will be about CVE-2021-26897, while CVE-2021-26877 will be covered in a parallel article.
These two vulnerabilities were the first we have ever analyzed with the help of Tetrane REVEN, an incredibly powerful reverse engineering tool that allows you to record a virtual machine and then browse or search through all recorded instructions in all processes and the kernel, or taint any data value forward or back in time, across processes and between user and kernel space (plus much more). I wanted to use this opportunity to show how REVEN helped us perform these analyses, which would otherwise have been done with WinDbg and countless relaunches of the dns.exe process, with all interesting objects landing at different memory addresses every time.
CVE-2021-26897 is a buffer overflow issue, whereby a series of oversized "dynamic update" DNS queries with SIG (signature) records causes writing beyond the buffer boundary when these records are saved to a file. The DNS server periodically saves all received updates to a file (so they don't get lost on restart or crash), and the issue gets triggered by simply waiting for this file write to happen after sending a number of requests, or by stopping the DNS Server service, which was a convenient time saver for us.
Our proof-of-concept (POC) sends ten malformed DNS requests, and upon stopping the DNS Server service, the dns.exe process crashes. Let's use REVEN to see what goes on.
We used a virtual machine with Windows Server 2019 and the DNS Server role, without the March updates to keep it in the vulnerable state. It is important to record as few events (called "transitions") as possible, so "lightening" the machine - stopping unnecessary services, and disabling Windows Defender and Windows Error Reporting - is generally a good idea. We did not stop any services but did disable Windows Defender and Windows Error Reporting, as error reporting gets triggered upon every process crash and, well, executes a lot of instructions. While it's easiest to start and stop recording manually, even one second of extra recording can create tens of millions of unneeded transitions that will just slow down post-processing of the recording. To optimize the start and end of recording, REVEN provides a cool trick they call "ASM stubs", which is a fancy name for calling the int3 CPU instruction while having the ecx register set to some magic value. In other words, you can trigger starting and stopping of recording from within the virtual machine you're recording, which means you can start right before the interesting stuff happens, and stop right after it's done.
In our case, the mere sending of malformed DNS requests does not crash the process, but the stopping of the DNS Server service does. So we used our POC to send the requests, and then launched a batch script that started the recording and stopped the service. Before that, we had "abused" the Postmortem Debugging mechanism to make it launch a small executable that stops recording whenever a process crashes - instead of launching a debugger, which postmortem debugging was designed for.
Our recording generated around 1.25 billion transitions. Yes, that's billion. But it's really no problem because REVEN handles that effortlessly. The only price you pay for a larger recording is the time you have to wait for REVEN to "replay" it, a step that extracts all machine states and transitions, indexes them, extracts everything that was happening to the memory, downloads symbol files and does a bunch of other things to ensure a swift analysis thereafter. Our replay took just over an hour and generated 55 GB of data, upon which the analysis would then actually be done.
Granted, we could have tried to further reduce the recording and possibly succeed, but that doesn't make too much sense - one hour for preparation while you're having lunch is more than acceptable, and fiddling with recording optimization also takes time that can quickly exceed the time saved by a potentially quicker replay.
Proceeding to "Axion", the analysis user interface where the magic stuff happens. Axion has multiple widgets; prominently positioned in the middle is Transitions, displaying a small part of the entire recorded execution conveniently grouped in code blocks with one or more CPU instructions. Other widgets include Backtrace (the call stack for the current transition), CPU (register values before and after the current transition), Memory (at a chosen address, before or after the current transition, along with the entire history of read and write accesses), Search (immensely useful, allows you to find calls to a specific function, or all executions of a specific instruction), Bookmarks, and - my favorite - Taint. REVEN allows you to select any piece of data (e.g., the current value in register r8, or the value at some memory address) and taint it either forward or back in time, to see what this value affects or where it came from, respectively. This is a huge value-add for our analyses - although by no means the only one.
Now let's dive into the analysis. The first thing we need to do is find where dns.exe crashed. Scrolling through a billion+ transitions is obviously out of the question, but we can search for one of the functions that get called when an unhandled exception is thrown. KiUserExceptionDispatch is one such function.
Search finds a single hit, as expected, and here we are looking at the first executed instruction in this user-space function, as it was launched by the kernel's KiExceptionDispatch. (The kernel code is on a grey background because we filtered the view to show only user-space execution.) Now we want to see why this exception was thrown. A simple press of the "%" key transfers us to the other end of a call-ret pair, or in our case, to the other end of the kernel call.
The "%" key landed us on the exact instruction that caused the access violation in function Dns_SecurityKeyToBase64String. It was an attempted write to address r8+1, and the Memory view shows this address to be immediately after a valid memory block, where we see a bunch of A's that were likely written to this buffer in some loop we're probably currently in. We did not use REVEN-IDA integration here but if we had, we would immediately see the code graph for this function in IDA, with the current instruction selected. And we would see that we are, indeed, in a loop that copies the base64-encoded signature value from our DNS request to the buffer that just got overflowed, and uses r8 as the destination pointer that gets increased in every iteration.
(Note that we enabled Page Heap for dns.exe to make it crash immediately on buffer overflow, otherwise the overflow could just corrupt the heap and eventually cause some random malfunction. With Page Heap, every heap buffer is allocated at the very end of a read-write memory block, followed by unallocated memory page - which means any typical buffer overflow will immediately trigger an access violation exception.)
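The effect of Page Heap described above can be illustrated with a simplified model. This is a sketch, not how Windows actually implements it: the page size, the 16-byte alignment, the base address and both function names are assumptions chosen for illustration.

```python
PAGE_SIZE = 0x1000  # assumed page size for this sketch


def page_heap_layout(region_base, alloc_size, alignment=16):
    """Place a buffer so that it ends right before an inaccessible guard page.

    Returns (buffer_start, guard_page_start). Simplified model only.
    """
    # Round the requested size up to the allocation alignment.
    padded = -(-alloc_size // alignment) * alignment
    # Number of read-write pages needed to hold the buffer.
    data_pages = -(-padded // PAGE_SIZE)
    guard_page = region_base + data_pages * PAGE_SIZE
    buffer_start = guard_page - padded
    return buffer_start, guard_page


def write_faults(address, guard_page):
    """In this model, a write at or beyond the guard page faults."""
    return address >= guard_page


# Hypothetical 0x100-byte allocation in a region starting at 0x10000000.
start, guard = page_heap_layout(0x10000000, 0x100)
last_valid = start + 0x100 - 1
print(hex(start), hex(guard))             # 0x10000f00 0x10001000
print(write_faults(last_valid, guard))      # False: last byte of the buffer
print(write_faults(last_valid + 1, guard))  # True: first byte past the buffer
```

In this model, a requested size that is not a multiple of the alignment leaves a few slack bytes between the end of the buffer and the guard page, so a very small overflow would not fault right away.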
Execution of a loop is shown in REVEN as a repetition of loop instructions, over and over again, but you can of course select any of these instructions and see what registers and memory looked like at that exact moment. What we want here, though, is to see where the buffer was allocated.
If we were in WinDbg, we'd have used the !heap command to get the call stack from the moment the overflowed buffer was allocated. In REVEN, however, we can not only find the code that allocated the buffer, but also the values of registers and the content of memory at that precise moment. The simplest way (that I know of) would be to simply taint register r8 to see where its value came from - going back to the past far enough, at some point its value must have been determined by whoever allocated this buffer, or it would not have pointed to the end of this same buffer now.
However, tainting r8 backward produces too long a path that just keeps bouncing in the loop that we're in. While it eventually gets to where we want to be, it slows down the UI. So our first goal is to get to the beginning of our loop's execution. We copied the address of our access-violation-triggering instruction (0x7ff729ca224a) and went to the Search widget, where we searched for all uses of this address.
Results: this exact instruction has been executed 130705 times in our recording; in other words, it would take a significant chunk of one's lifetime to just scroll up to the start of the loop. However, it only took one press of the "Next" button to get to the first iteration of the loop - because we were already positioned on the last iteration.
This got us to the very first execution of our instruction in this recording, and we can see that it wrote 0x41 to memory (to the very same buffer it overflowed 130704 iterations later). The 0x41 to the left of it was written earlier by a similar instruction in the same loop. Now that we're out of the loop, so to speak, let's taint r8 and see where it came from.
We launched tainting for the value of register r8 from the current transition, back to some function we selected high in the call stack; we chose function Zone_WriteBackDirtyZones. Why not taint all the way back to the very first transition? Because we're only interested in who allocated the memory buffer that got overflowed, but the taint would go way further back in time because the address of this allocation was influenced by earlier allocations (that's how the heap works) and that is just irrelevant to us.
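To make the idea of backward tainting with a stop point more concrete, here is a toy model. It is not the REVEN API and says nothing about how REVEN implements tainting; the `Transition` type, the `taint_backward` function and the tiny trace are all hypothetical.

```python
from typing import NamedTuple, Tuple


class Transition(NamedTuple):
    index: int             # position in the recorded trace
    dst: str               # location written by this instruction
    srcs: Tuple[str, ...]  # locations the written value was computed from


def taint_backward(trace, start_index, target, stop_index=0):
    """Walk the trace back in time from start_index and return the indexes
    of the transitions that contributed to the value held in `target`."""
    tainted = {target}
    hits = []
    for t in reversed(trace[stop_index:start_index + 1]):
        if t.dst in tainted:
            # The value came from this transition's sources, so follow those.
            tainted.discard(t.dst)
            tainted.update(t.srcs)
            hits.append(t.index)
    return hits


# Hypothetical trace: an allocation result flows into r8, which a loop
# then keeps incrementing.
trace = [
    Transition(0, "rax", ()),        # allocator returns a pointer
    Transition(1, "rbx", ("rax",)),
    Transition(2, "rcx", ()),        # unrelated
    Transition(3, "r8", ("rbx",)),
    Transition(4, "r8", ("r8",)),    # loop iteration: r8 += 1
    Transition(5, "rdx", ("rcx",)),  # unrelated
    Transition(6, "r8", ("r8",)),    # loop iteration: r8 += 1
]

print(taint_backward(trace, 6, "r8"))                # [6, 4, 3, 1, 0]
print(taint_backward(trace, 6, "r8", stop_index=3))  # [6, 4, 3]
```

Even in this toy trace, every loop iteration adds an entry to the result, which is why starting from the first iteration and limiting how far back the taint goes keeps the list manageable.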
Tainting identified a couple of hundred transitions, and we're interested in those at the very end of the list (i.e., the earliest ones). When one is often looking for memory allocations, some familiar Windows functions catch one's eye - and RtlAllocateHeap is one of them.
RtlAllocateHeap takes three arguments, allocation size being the third one - which in x64 calling convention means register r8. We can see that the value of r8 when this function was called was 0x80010. This means that the actual buffer allocated was of this size, but we still want to see if this is a hard-coded size or perhaps dynamically calculated. So we go further back in time through the taint results.
In the very first transition found with tainting, inside function File_WriteZoneToFile, we see a call was made to a function Mem_Alloc, which is not publicly documented but is clearly used for allocating the memory block we're after. Most importantly, we can see that a hard-coded value 0x80000 was provided to it, which is clearly the size of the buffer. (0x10 was subsequently added to it inside Mem_Alloc, which is nicely seen by walking through the taint.)
Now we know what happened: a fixed-sized memory block was allocated in function File_WriteZoneToFile, whose name implies that the DNS zone we've updated with our malformed requests was going to be written to file. At some point function Dns_SecurityKeyToBase64String was called that overflowed this buffer after base64-encoding the signature from our requests. Let's just see if function Dns_SecurityKeyToBase64String was called more than once, as we know that a single malformed request doesn't produce the crash.
To do that, we used the "Symbol call" search and provided the function name.
The search produced 6 hits, meaning that function Dns_SecurityKeyToBase64String was called 6 times. This indicates that it was our 6th DNS request that finally caused the writing to go beyond the buffer end. Some additional analysis showed that all these calls added their output to the same growing string inside the fixed-size buffer, which was supposed to be finally saved to the zone file.
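The cumulative nature of the overflow can be sketched as follows. The buffer size comes from the analysis above, but the per-record output size of 90,000 characters is a made-up figure chosen so that the sixth append overflows; the real size was not measured here, and the function name is hypothetical.

```python
BUFFER_SIZE = 0x80000  # fixed size passed to Mem_Alloc, per the taint results


def first_overflowing_write(buffer_size, chunk_sizes):
    """Return the 1-based number of the first append that would run past
    the end of the buffer, or None if everything fits."""
    used = 0
    for number, size in enumerate(chunk_sizes, start=1):
        if used + size > buffer_size:
            return number
        used += size
    return None


# Hypothetical: each record contributes 90,000 characters of output.
chunks = [90_000] * 10
print(first_overflowing_write(BUFFER_SIZE, chunks))  # 6
```

No single append is larger than the buffer in this model; the problem only appears because each one continues where the previous one ended.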
Our REVEN analysis was done here. At this point we could have created a micropatch in various ways, making sure to prevent writing past the buffer end, but since we had Microsoft's official patches available, we wanted to see what they had done.
We used IDA and BinDiff to compare dns.exe from February 2021 and March 2021, and found 9 functions modified by the March update.
Technically, we could just search our recording for any execution inside these functions, and hopefully find that just one of them was executed - which would mean that the fix was included there. Actually, we did exactly that and found SigFileWrite to be the only one - but this would be very cumbersome if hundreds of functions were modified, especially if the recording included many of the modified functions that have nothing to do with our bug.
The most reliable method would be to inspect the entire taint list to see which of the modified functions affected the value we were tainting. Taint search is currently not supported by REVEN user interface, but we could undoubtedly use the API to achieve that (note to self: send a feature request to Tetrane). We're not that fluent in REVEN API yet so we took the third route: the call stack.
It is quite likely that one of the modified functions would appear in the call stack of our crash instruction. But in our case we don't see it (see the call stack on one of the screenshots above). We do see, however, that function Dns_SecurityKeyToBase64String seems to have been called by function LdrpDispatchUserCallTarget, which has a suspect name. Let's just click on that in REVEN and see what happened there.
We see that function RR_WriteToFile made a call to LdrpDispatchUserCallTarget, which then made a jump to SigFileWrite function. The latter is executed, but not seen in the call stack because LdrpDispatchUserCallTarget jumped to it instead of calling. Note that function LdrpDispatchUserCallTarget is part of Control Flow Guard as explained in this Trend Micro article. So whenever you see LdrpDispatchUserCallTarget in the call stack, you'll want to look for which function it jumped to.
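Why a jump hides a function from the call stack can be shown with a small model. It assumes that frames are created on call instructions and labeled with the called function, which is a simplification made for this sketch and not a description of how REVEN builds its Backtrace view; the `backtrace` function and the event list are hypothetical.

```python
def backtrace(events):
    """Build a call stack from a list of (kind, function) control transfers.

    A 'call' pushes a frame, a 'ret' pops one, and a 'jmp' changes the
    executing function without creating a frame.
    """
    stack = []
    for kind, function in events:
        if kind == "call":
            stack.append(function)
        elif kind == "ret":
            stack.pop()
        # A 'jmp' leaves the stack unchanged.
    return stack


events = [
    ("call", "RR_WriteToFile"),
    ("call", "LdrpDispatchUserCallTarget"),
    ("jmp", "SigFileWrite"),  # executed, but no frame is created for it
    ("call", "Dns_SecurityKeyToBase64String"),
]

print(backtrace(events))
# ['RR_WriteToFile', 'LdrpDispatchUserCallTarget', 'Dns_SecurityKeyToBase64String']
```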
Now that we have our "patch suspect", let's see how the original (February) and modified (March) versions of SigFileWrite function compare in BinDiff.
We see that four sanity checks were added, perhaps excessively but efficiently:
- If the length of the SIG record is less than 0x12 (the minimum possible), then exit.
- If the length of the SIG record minus 0x12 is less than the length of the signer's name, then exit.
- If the pointer to the end of the string (where output will be appended) is larger than the end of the buffer, then exit.
- The final, most complex-looking check is the compiler's "artistic rendition" of multiplying the signature length by 4/3, which gives the number of characters that base64-encoding will require. If the signature length multiplied by 4/3 is larger than the difference between the end of the buffer and the string end pointer, then exit.
These checks make sure that the buffer will not be overflowed, and will silently prevent DNS update records from being written to the zone file if the end of the buffer has been reached.
Our micropatch does logically the same as Microsoft's, but it also adds an Exploit Blocked alert and log entry in case the buffer would have been overflowed, as this would very likely be the result of an exploitation attempt rather than something that would occur under normal circumstances.
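The logic of the four checks can be sketched roughly as below. This is an illustration and not the actual code from dns.exe or from our micropatch: the function and parameter names are hypothetical, the signature length is assumed to be whatever remains of the record after the 0x12-byte fixed part and the signer's name, and the base64 length is computed with padding rather than as a plain multiplication by 4/3.

```python
SIG_FIXED_PART_LEN = 0x12  # minimum SIG record length mentioned above


def base64_encoded_length(raw_len):
    """Characters needed to base64-encode raw_len bytes, including padding."""
    return 4 * ((raw_len + 2) // 3)


def can_write_sig_record(record_len, signer_name_len, write_pos, buffer_end):
    """Return True only if the base64 output of this record fits in the buffer.

    write_pos and buffer_end are offsets standing in for the two pointers.
    """
    # Check 1: the record must contain at least the fixed part.
    if record_len < SIG_FIXED_PART_LEN:
        return False
    # Check 2: the signer's name must fit in what follows the fixed part.
    if record_len - SIG_FIXED_PART_LEN < signer_name_len:
        return False
    # Check 3: the append position must not already be past the buffer end.
    if write_pos > buffer_end:
        return False
    # Check 4: the encoded signature must fit in the remaining space.
    signature_len = record_len - SIG_FIXED_PART_LEN - signer_name_len
    if base64_encoded_length(signature_len) > buffer_end - write_pos:
        return False
    return True


BUFFER_END = 0x80000
record_len = SIG_FIXED_PART_LEN + 1 + 300  # 1-byte name, 300-byte signature
print(can_write_sig_record(record_len, 1, BUFFER_END - 400, BUFFER_END))  # True
print(can_write_sig_record(record_len, 1, BUFFER_END - 399, BUFFER_END))  # False
```

A 300-byte signature needs 400 base64 characters, so it fits when exactly 400 bytes remain and is rejected when one byte less is available.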
We created this micropatch for the following Windows versions:
- Windows Server 2008 R2 without Extended Security Updates, updated to January 2020
- Windows Server 2008 R2 with year 1 of Extended Security Updates, updated to January 2021
According to our guidelines, this micropatch requires a 0patch PRO license. By the time you're reading this, the micropatch has already been distributed to all licensed online 0patch Agents and also automatically applied except where Enterprise policies prevented that. If you're not a 0patch user and would like to use this micropatch on your computer(s), create an account in 0patch Central, install 0patch Agent and register it to your account with an appropriate number of PRO licenses. Note that no computer restart is needed for installing the agent or applying/un-applying any 0patch micropatch.