by Luka Treiber and Mitja Kolsek of 0patch Team
Those of you following our micropatching initiative already know that micropatching makes it possible to fix vulnerabilities without restarting the computer or even relaunching the patched application that a user might currently be using. In other words, no disruption for users and servers.
Now let's take it a step further. Do you know which component of an IT environment one least wants to restart? You guessed it: a hypervisor. Especially if dozens or hundreds of critical virtual machines are running, which all need to be stopped or suspended, and can't get back online until the hypervisor is patched. And then if the patch turns out to be broken... you get the picture - and it's not a pretty one.
When the next Heartbleed, Shellshock, or a "guest-to-host escape" vulnerability comes out, you can be pretty sure that hypervisors all around the world will get massively patched - and restarted. And lots of people, from hypervisor vendors to CIOs, admins and end users, will go through various levels of unhappiness.
A few weeks ago, Comsecuris published a detailed report on three vulnerabilities in VMware Workstation that allowed a malicious guest to cause memory corruption in the hypervisor (vmware-vmx.exe) running on the host (and we all know what that leads to). Nico and Ralf cleverly patched a guest component of VMware's graphics-related DLL in a guest machine to make it send a malformed data structure to the hypervisor - which then crashed because it lacked the sanitization checks. (Interestingly, the debug version of the hypervisor did have assert statements with these checks, and these turned out to be quite helpful for both vulnerability analysis and patching.)
All three of Comsecuris' vulnerabilities have been patched by VMware, two in Workstation 12.5.5, and one in Workstation 12.5.7. We decided to write a micropatch for the last one to show how a hypervisor can be patched without stopping virtual machines running on top of it.
Reproducing the PoC
We reproduced Comsecuris' "dcl_resource" PoC on a 64-bit Windows 10 machine running on VMware Workstation 12.5.5. (Fun fact: you can run a VMware Workstation inside another VMware Workstation.)
For those of you who might want to play with this PoC: install Python for Windows and Visual C++ 2015 Redistributables, place PoC files in a folder, launch exec_poc.py and wait for the virtual machine to crash. If it doesn't crash (as it didn't for us at first), make sure your virtual machine's Hardware Compatibility is set to "Workstation 12.x" or something else reasonably new, otherwise DirectX 10 will not be supported, and the PoC relies on that.
Once we got it working, the PoC crashed the vmware-vmx.exe running our virtual machine. We attached a debugger, but even though Comsecuris' report was quite detailed, we found it hard to match our crash context to their analysis, as there is a lot of convoluted code there to look at.
So using a hint from Comsecuris' report, we replaced vmware-vmx.exe with vmware-vmx-debug.exe to use the debug version of the hypervisor, and repeated the procedure. The result was this:
An assert message popped up revealing the assert's source code line. It seemed odd that an exploit that apparently hadn't been anticipated in the release version would be stopped by an assert, but hey, it looked really promising. So we searched the disassembly for the shaderTransSM4.c string and the hex equivalent of 1856 (the line number) - 740h. And found a match (lower left orange block):
When scrolling up the code graph we got a view of a whole chain of asserts originating in a case clause named case 88 (upper right orange block):
Next we disassembled the release vmware-vmx.exe (the one that crashed) and found a matching case 88 clause there - but no checks resembling the assert chain which we found in the debug version.
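The search for the assert location in the debug build can be sketched in a few lines of Python. This is only an illustration: we searched the disassembly text for 740h, whereas the sketch below also shows what a raw byte search might look for, under the assumption that the line number is stored as a 32-bit little-endian immediate. The helper names and the sample blob are hypothetical and do not come from vmware-vmx-debug.exe.

```python
import struct

# Values taken from the assert message described in this post.
ASSERT_FILE = b"shaderTransSM4.c"
ASSERT_LINE = 1856


def line_number_patterns(line):
    """Return the hex text and an assumed 32-bit little-endian encoding of a line number."""
    return hex(line), struct.pack("<I", line)


def find_all(blob, needle):
    """Return every offset at which needle occurs in blob."""
    offsets = []
    pos = blob.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = blob.find(needle, pos + 1)
    return offsets


hex_text, imm_bytes = line_number_patterns(ASSERT_LINE)

# Hypothetical stand-in for a binary: two filler bytes, one opcode byte
# followed by the immediate, three padding bytes, then the file name string.
sample_blob = b"\x90\x90" + b"\xba" + imm_bytes + b"pad" + ASSERT_FILE + b"\x00"

line_hits = find_all(sample_blob, imm_bytes)
name_hits = find_all(sample_blob, ASSERT_FILE)
```

On a real binary, a short pattern like this would likely produce many unrelated hits, so pairing it with a reference to the source file name string, as we did in the disassembler, is what narrows the result down.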
We then made a diff with the fixed release 12.5.7 version of vmware-vmx.exe and found the exact same cascade of checks following the case 88 clause. So VMware developers apparently took the assert statements from the debug version and turned them into actual release version checks. The image below shows the patched case 88 code branch on the left, and its vulnerable match on the right.
The last block in the cascade (dark red) routes either to the green default case clause (also present in the vulnerable version on the right) or to a red block on the left which directs execution towards an error handler if rbp+1A8h points to a value of 80h or greater. That dark red block was the one that stopped the PoC. Rereading the original report also revealed a parallel to our conclusion. It said the disclosed vulnerabilities "were fixed with the exact same code as in the debug version".
Writing a Micropatch
With that information in our hands we could create a micropatch. We chose to set our patch location one instruction before the last jmp instruction in the grey box - on mov [r12+1Ch], ecx. We could theoretically inject it after the jmp but 0patch Agent currently does not support patching a relative jmp instruction (it will in the future). In the patch we implemented an equivalent of the buffer overflow check from the dark red block above; we only had to replace rbp+1A8h with an equivalent that worked in our context: r12+50h. If an attempted overflow is detected, our patch code calls up an "Exploit Attempt Blocked" dialog, sets rcx to string "0patch: Exploit Blocked for CVE-2017-4924!", and jumps to the error handler in the original code that writes our custom message string to vmware.log and terminates the processing of the malformed data structure.
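The decision our patch makes can be modeled in Python as follows. This is a simplified sketch of the logic only, not the 0pp source: the real patch is a handful of machine instructions operating on the value at r12+50h. The function name, the list standing in for vmware.log, and the treatment of the value as a non-negative integer are assumptions made for illustration.

```python
# Threshold from the fixed 12.5.7 code: values of 80h or greater are rejected.
LIMIT = 0x80

# The message our patch passes to the original error handler.
BLOCK_MESSAGE = "0patch: Exploit Blocked for CVE-2017-4924!"


def micropatch_check(value, log):
    """Return True if processing may continue, False if the value is rejected.

    `value` stands in for the field read at r12+50h, and `log` is a plain
    list standing in for vmware.log (both are illustrative assumptions).
    """
    if value >= LIMIT:
        # The real patch also shows an "Exploit Attempt Blocked" dialog here
        # before jumping to the original error handler.
        log.append(BLOCK_MESSAGE)
        return False
    return True


log = []
ok_result = micropatch_check(0x7F, log)       # just below the limit
blocked_result = micropatch_check(0x80, log)  # at the limit, so rejected
```

Values below the limit fall through to the original code untouched, which is why the patch does not affect well-formed data coming from the guest.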
So the nice thing about this patch is that it not only fixes the vulnerability without disrupting running virtual machines, but also records an error in VMware Workstation's log for subsequent inspection.
This is the resulting 0pp file, the source code for our micropatch.
See a video of our micropatch in action. As you can see, after we demonstrate the crash, we patch vmware-vmx.exe while the virtual machine is running, and the PoC gets blocked.
We wanted to demonstrate what patching a vulnerability in a hypervisor could look like in the future: instant, without disturbing the running virtual machines, strictly targeted at a particular vulnerability (as opposed to replacing megabytes of code) and also instantly un-patchable in case of a flawed patch. While clearly VMware Workstation is unlikely to host critical machines that are costly to stop or suspend, this same vulnerability also affected ESXi - which is running thousands of just such machines around the world. Unfortunately we can't micropatch ESXi (yet) but if there's interest, there is no reason why that couldn't be done.
You might have noticed that our patch only addresses the exact flaw demonstrated by Comsecuris' PoC, while the debug hypervisor version has a cluster of assert statements, each likely to be triggered on a different invalid value. So an exploit writer could probably walk through these assert statements and compile PoCs for additional cases not addressed by our micropatch. That said, we're not here to create new PoCs, and we only make micropatches for vulnerabilities we can prove (i.e., for which we have a PoC). If VMware were using micropatching, they could have easily implemented all these checks as micropatches (or even a single micropatch, albeit a bit larger than usual).
We're hoping to get the idea of micropatching to all product development groups who know that applying patches can be really costly for their users - and want to do something about it.
Finally, thanks to Nico Golde of Comsecuris for helpful hints on getting their PoC to work, and useful ideas about patching.
If you have 0patch Agent installed (it's still free!), all the magic is already there: this micropatch is already on your computer and is getting automatically applied whenever you launch VMware Workstation 12.5.5. Contact us if you want to have this same patch for some other version of VMware Workstation.