Tuesday, July 1, 2014

I Was Wrong - Proving LZ4 Exploitable With Less Than 4MB

But Not In the Way You Might Think

For the uninitiated, I recently uncovered a vulnerability in LZ4 during triage with the Linux kernel team on a separate but very similar issue in LZO. Ludwig Strigeus uncovered the issue over a year ago and posted it to the LZ4 Google Code issue tracker. Rather than hand-waving over the resulting miscommunication and bug dismissal that followed, let's focus on the facts:
  • The bug was initially labeled a non-issue
  • No fix was provided for over a year
  • The bug was said not to be exploitable with payloads less than 8MB in size
  • The LZ4 implementation requires that blocks smaller than 8MB be used
All of the above information can be found on the LZ4 Google Code project page bug report that Ludwig filed. A conversation between Yann and me follows the discussion between Ludwig and Yann.

In that discussion we can find Yann's assertion that blocks smaller than 8MB are not vulnerable and that blocks smaller than 4MB are the modern requirement for LZ4. Thus, no modern implementations should be affected at all.

This is only true for the exploit case where the attacker generates a 'length' value so large that it causes the 'cpy' pointer to overflow and point to an address prior to the start of the output buffer. This is the attack I focused on, as it is the same attack I had been using successfully against LZO. I documented this attack in this previous blog post.

What I completely missed, because I was focused on pointer negation, is the context in which the code would execute. That is where today's post comes in; it stemmed from a conversation I had with Richard Johnson late last week.

After getting some time to evaluate the issue further, last night I realized something. I was wrong, and so was Yann. A payload of 8MB or more is not required. In fact, you can abuse this algorithm with less than 4MB of data.

Switching Context

Context is an important thing, isn't it? The LZ4 code in the Linux kernel has been deemed invulnerable because each use case requires a payload that adheres to Yann's implementation requirements; the 8MB or 4MB that I cited above. But, there's a catch. This presumes that we need more than that amount of data to do something malicious. 

Let's look again at the kernel code. Follow along at the LXR found here. This time, we're looking at the lz4_uncompress_unknownoutputsize function, which is the supposedly "secure" variant. 

186                 length = (token >> ML_BITS);
187                 if (length == RUN_MASK) {
188                         int s = 255;
189                         while ((ip < iend) && (s == 255)) {
190                                 s = *ip++;
191                                 length += s;
192                         }
193                 }
194                 /* copy literals */
195                 cpy = op + length;
196                 if ((cpy > oend - COPYLENGTH) ||
197                         (ip + length > iend - COPYLENGTH)) {

We see the exact same size accumulation loop at line 189. Okay, so we can generate large sizes. We know that already. Let's take another look at line 195. 

This is what I missed from my brief evaluation during LZO triage. Yes, it's vulnerable, but in a much more subtle fashion than I thought. In a kernel context, 'op' will point to an address in memory that is much more interesting than an address in a user-land application. 

For contrast, let's take a brief look at the Btrfs LZO implementation code. When LZO is being used in Btrfs, a workspace is built in virtual memory. Presumably because of the larger size requirements, vmalloc is used. We can see this code below, or on the LXR here.

 49 static struct list_head *lzo_alloc_workspace(void)
 50 {
 51         struct workspace *workspace;
 53         workspace = kzalloc(sizeof(*workspace), GFP_NOFS);
 54         if (!workspace)
 55                 return ERR_PTR(-ENOMEM);
 57         workspace->mem = vmalloc(LZO1X_MEM_COMPRESS);
 58         workspace->buf = vmalloc(lzo1x_worst_compress(PAGE_CACHE_SIZE));
 59         workspace->cbuf = vmalloc(lzo1x_worst_compress(PAGE_CACHE_SIZE));
 60         if (!workspace->mem || !workspace->buf || !workspace->cbuf)
 61                 goto fail;

Now you may be asking "Why is this important?" and "What does this have to do with LZ4?". Everything. It has everything to do with LZ4.

Let's take an ARM platform as an example, since - as we know - that is an extremely important 32-bit architecture. The Linux kernel LZ4 implementation makes special cases for use on the ARM architecture, as can be seen in this file on LXR.

 28 #if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)             \
 29         || defined(CONFIG_ARM) && __LINUX_ARM_ARCH__ >= 6       \
 30         && defined(ARM_EFFICIENT_UNALIGNED_ACCESS)

So, knowing that LZ4 is targeted as code used on ARM, think again about the vmalloc call. Take Raspberry Pi as a simple example use case on ARM. This is the kernel's memory layout on boot in a current-day Raspberry Pi system, running Linux kernel 3.12.22+ #691 PREEMPT. 

[    0.000000] Memory: 374640K/393216K available (4376K kernel code, 238K rwdata, 1340K rodata, 143K init
, 701K bss, 18576K reserved)
[    0.000000] Virtual kernel memory layout:
[    0.000000]     vector  : 0xffff0000 - 0xffff1000   (   4 kB)
[    0.000000]     fixmap  : 0xfff00000 - 0xfffe0000   ( 896 kB)
[    0.000000]     vmalloc : 0xd8800000 - 0xff000000   ( 616 MB)
[    0.000000]     lowmem  : 0xc0000000 - 0xd8000000   ( 384 MB)
[    0.000000]     modules : 0xbf000000 - 0xc0000000   (  16 MB)
[    0.000000]       .text : 0xc0008000 - 0xc059d54c   (5718 kB)
[    0.000000]       .init : 0xc059e000 - 0xc05c1ee4   ( 144 kB)
[    0.000000]       .data : 0xc05c2000 - 0xc05fd900   ( 239 kB)
[    0.000000]        .bss : 0xc05fd90c - 0xc06ad0f8   ( 702 kB)

This means that the region reserved for vmalloc starts at 0xd8800000. That's pretty high up in memory, isn't it? Now, let's look again at the LZ4 pointer arithmetic.

194                 /* copy literals */
195                 cpy = op + length;
196                 if ((cpy > oend - COPYLENGTH) ||
197                         (ip + length > iend - COPYLENGTH)) {

So, presuming we are on a (very common) ARM platform, the lowest likely address in memory for the output pointer 'op' is 0xd8800000. And that is very generous. Obviously, a lot more memory will be allocated far before a call to LZ4 is ever made, meaning that the address actually used for the copy will be far higher in memory. But, for simplicity, let's focus on the lowest possible address.

For this pointer arithmetic to cause an overflow, we now only need a 'length' value of 0x100000000 - 0xd8800000. Seems like a lot? It's not. Let's do the math.

0x100000000 - 0xd8800000 = 0x27800000, or 632MB

Since we know that 'length' increases by 255 for every 0xff byte we supply, we can divide 632MB by 255. This means it takes

0x27800000 / 255 = 2,598,823 

2,598,823 bytes of the value 0xff, plus one final byte of 0xa7 are required to generate the exact 'length' value of 0x27800000. When this 'length' value is added to an 'op' address of 0xd8800000, what do we get? We get nothing. That's right, the address zero. 

0xd8800000 + 0x27800000 = 0x00000000

That means it only takes approximately 2.47 MB to access page zero on ARM from the vmalloc region.

Now I'm Nothing

Now you might be asking yourself, "But what the hell is at address zero? Nothing, that's what!"

Nope. On many RISC platforms, just like ARM, this is a vector area that handles interrupts, exceptions, and other traps. It's an extremely critical area. More importantly, just above this area is where user-land sits. And, according to this post on older ARM kernels, user-land task structures (meaning the structures that hold running privileges) are contained there.  Edit: Not being a Linux kernel guy (my ARM experience is with Real Time kernels, basebands/GPS executives, and Plan 9) I misunderstood the TASK information below. As Spender points out, this does not mean task structure.

TASK_SIZE PAGE_OFFSET-1 Kernel module space
    Kernel modules inserted via insmod are
    placed here using dynamic mappings.

00001000 TASK_SIZE-1 User space mappings
    Per-thread mappings are placed here via
    the mmap() system call.

00000000 00000fff CPU vector page / null pointer trap
    CPUs which do not support vector remapping
    place their vector page here.  NULL pointer
    dereferences by both the kernel and user
    space are also caught via this mapping.

So what does this mean? This means that an attacker with very little memory (even in accordance with the LZ4 specification) can potentially bypass address verification checks and overwrite critical structures at low addresses in memory.

Edit: While this is technically True (for some constrained cases of True) only older ARM variants keep vectors at ((void*)0), newer ones map vectors to ((void*)0xffff0000), so there isn't necessarily a valid page at all at such low addresses.


Effectively, this post proves that:
  • Exploits can be written against current implementations of LZ4
  • Block sizes less than 8MB (and even less than 4MB) can be malicious
  • Certain platforms are more affected than others (primarily RISC: ARM)
  • Protecting against the 16MB and greater flaw was not sufficient
Edit: It's also important to note that this means user-land applications are more affected by the LZ4 bug than previously thought. Even though modern platforms include PIE/ASLR/NX protections to diminish the impact of such a bug, this is the kind of critical arbitrary-write bug attackers look for when they have a corresponding memory information disclosure (read) that exposes addresses in memory. If the attacker can calculate the correct offset to use to overwrite memory, the attacker could affect user-land applications. Memory pressure would have to come into play here, as well. 

Don A. Bailey
Founder / CEO
Lab Mouse Security
