6. Unrelated Disclaimer
Please, do not use docker
for security purposes.
This applies to docker, lxc,
or any container system
based on namespaces and
cgroups.
Do not think of it as a security container or as a
valid security mechanism.
It's just a chroot on steroids.
7. I'm not a big fan of exploitation talks:
● They require an audience with a deep understanding of the target
● They are boring and hard to follow
● If you get distracted, you lose
This should be a trilogy (at least):
● A talk about kernel basics
● A talk about exploitation
● A talk about memory management
● A talk about developing tools and sanitizers
● A talk about how to find bugs
I'm going to try to make a mix: a high-level
explanation of a real-life bug
What is this all about?
8. We will try to cover the following topics:
● Introduction to basic kernel concepts
● Basic kernel exploitation
● Some kernel mitigations
● Kernel memory management
● Dynamic memory management
● Real world exploit?
All this will be explained at a very high level, as
each topic could be a whole talk by itself.
Tragedy is brewing…
What is this all about?
11. The kernel is a computer program that is the core of a computer's operating system, with
complete control over everything in the system. On most systems, it is one of the first programs
loaded on start-up (after the bootloader). It handles the rest of start-up as well as input/output
requests from software, translating them into data-processing instructions for the central
processing unit. It handles memory and peripherals like keyboards, monitors, printers, and
speakers.
The critical code of the kernel is usually loaded into a separate area of memory, which is
protected from access by application programs or other, less critical parts of the operating
system. The kernel performs its tasks, such as running processes, managing hardware devices
such as the hard disk, and handling interrupts, in this protected kernel space. In contrast,
everything a user does is in user space: writing text in a text editor, running programs in a GUI,
etc. This separation prevents user data and kernel data from interfering with each other and
causing instability and slowness, as well as preventing malfunctioning application programs
from crashing the entire operating system.
The kernel's interface is a low-level abstraction layer. When a process makes requests of the
kernel, it is called a system call. Kernel designs differ in how they manage these system calls
and resources. A monolithic kernel runs all the operating system instructions in the same
address space for speed. A microkernel runs most processes in user space, for modularity.
Source: https://en.wikipedia.org/wiki/Kernel_(operating_system)
What is a kernel
12. Source: N. Murray, N. Horman, Understanding Virtual Memory
What is a kernel
13. Virtual Memory
x86-64 canonical addressing:
● Upper canonical half: 0xFFFF8000 00000000 → 0xFFFFFFFF FFFFFFFF
  (binary 1111111111111111100000000000000000000000000000000000000000000000 → all ones)
● Non-canonical hole in between
● Lower canonical half: 0x00000000 00000000 → 0x00007FFF FFFFFFFF
  (binary all zeros → 0000000000000000011111111111111111111111111111111111111111111111)
The low 47 bits are used for addressing; the next bit is the most
significant implemented bit, and it is sign-extended through the 16
unimplemented upper bits.
14. Virtual Memory
Virtual memory map with 4-level page tables:
● 47-bit address space + 1 sign bit
● 2^47 = 128 TB per canonical half
● 256 TB of mappable memory
Virtual memory map with 5-level page tables:
● 56-bit address space + 1 sign bit
● 2^56 = 64 PB per canonical half
● 128 PB of mappable memory
16. Virtual Memory
● Upper canonical half (0xFFFF8000 00000000 → 0xFFFFFFFF FFFFFFFF):
  kernel-space virtual memory, shared between all processes
● Non-canonical hole in between
● Lower canonical half (0x00000000 00000000 → 0x00007FFF FFFFFFFF):
  user-space virtual memory, different per mm, bounded by TASK_SIZE
18. Virtual Memory
The same split, seen from several processes (Process 234, Process 312,
Process 1453, Process 457): the kernel-space half above
0xFFFF8000 00000000 is shared between all of them, while each process
has its own user-space half below TASK_SIZE (up to 0x00007FFF FFFFFFFF).
21. NULL Pointer Dereference
Let's suppose we have the following code running
in the kernel as a miscdevice at "/proc/vulnerable".
The diagram places the structure's first field at
address 0x0000000 and the second at 0x0000000+4.
22. NULL Pointer Dereference
This userspace program triggers the vulnerable
code in the kernel using an ioctl.
It should trigger a NULL pointer dereference and
produce a kernel panic / system crash.
24. NULL Pointer Dereference

struct my_struct {
    int counter;
    int (*get_counter)(void);
};

struct my_struct *tmp;
tmp->get_counter();

void escalate_privs(void)
{
    /* Do hacker things */
    return;
}

Memory layout: User Space from 0x0000000000000000 up to TASK_SIZE,
then Kernel Space up to 0xFFFFFFFFFFFFFFFF.
32. SMAP/SMEP
SMAP
● Supervisor Mode Access Prevention
● Cannot access/dereference any pages that are userspace (U=1) while
  the CPU is running in privileged mode (CPL=0)
● Uses the AC flag in the EFLAGS register
● Two new instructions: CLAC/STAC
● 21st bit in the CR4 register
SMEP
● Supervisor Mode Execution Prevention
● Cannot execute code from any pages that are userspace (U=1) while
  the CPU is running in privileged mode (CPL=0)
● 20th bit in the CR4 register
● Older/more common than SMAP
These are implemented at the hardware level. It's the CPU itself
that enforces the protection (as with the NX bit).
33. SMAP/SMEP

struct my_struct {
    int counter;
    int (*get_counter)(void);
};

struct my_struct *tmp;
tmp->get_counter();

void escalate_privs(void)
{
    /* Do hacker things */
    return;
}

Memory layout: same as before, but now SMEP and SMAP guard the
user-space half below TASK_SIZE.
34. KPTI (KAISER)
Kernel Page Table Isolation was designed to try to stop
leaks caused by side-channel attacks such as Meltdown or
Spectre.
● Two sets of page tables are maintained: one used while running
  in privileged mode and another one for unprivileged mode.
● Userspace PTEs hide most of the kernel mappings, reducing them
  to a minimum: kernel code for handling syscalls, interrupts, etc.
● Kernel PTEs mark the whole user-space area as non-executable.
  This turns out to be something like a software implementation
  of SMEP.
● Impact on performance ranges from 5% to 30%, mostly due to TLB
  flushing. It can be improved on CPUs supporting PCID.
Source: https://es.wikipedia.org/wiki/Aislamiento_de_tablas_de_páginas_del_núcleo
35. KPTI (KAISER)
Without KPTI, a single page table maps both kernel space and user
space. With KPTI there are two sets: the kernel PTEs map both halves,
while the userspace PTEs map user space plus only a minimal slice of
kernel space.
36. KPTI (KAISER)

struct my_struct {
    int counter;
    int (*get_counter)(void);
};

struct my_struct *tmp;
tmp->get_counter();

void escalate_privs(void)
{
    /* Do hacker things */
    return;
}

Memory layout: same as before, but with KPTI the kernel page tables
map the whole user-space half as NX (non-executable).
38. KMEM CACHES
kmalloc() is the normal method of allocating memory for objects
smaller than page size in the kernel.
These objects will be stored in different caches based on their size,
or on the cache explicitly requested.
39. KMEM CACHES
object object object object object object object object
One full SLAB in the kmalloc-192 cache contains 20 chunks; a full
SLAB can span one or more pages, and one chunk is 192 bytes in size.
40. Allocators
The Linux kernel has implemented different allocators over the
years.
Nowadays there are 3 different allocators available; you must
choose one of them at compilation time:
● SLAB: the original one
● SLUB: smaller memory footprint and less locking
● SLOB: aimed at embedded systems
Each allocator has its own way to track free and allocated
objects, local per-CPU caches, etc.
49. Our bug
I won't be giving any details on the bug itself, so you will
have to believe me…
It's a race condition, and the code displayed is just an
example…
50. Our exploit
Our base bug is a race condition which causes corruption
of a circular doubly linked list.
An outline of the exploitation looks like this:
1. Use the race condition to add the same item twice to the list
2. Free the item, which removes it from the list just once
3. The linked list still contains one of the entries, but the item is freed
4. We now have a use-after-free
5. Repeat steps 1 to 4 to produce a second UAF
6. Use both UAFs in the list to achieve a write-what-where
primitive
So somehow we turned a race condition into a write-what-where.
59. Removing the 2nd item
list.next
list.prev
next
prev
list_head
vuln_subsys (1)
60. Abusing the race condition
What would happen if we call add_vuln_subsys()
at the same time from two different processes
with the same vuln_subsys *entry ?
61. Abusing the race condition
list.next
list.prev
next
prev
list_head
vuln_subsys (1)
Adding the same entry twice. First add:
62. Abusing the race condition
Adding the same entry twice. Second add:
list.next
list.prev
next
prev
list_head
vuln_subsys (1)
63. Abusing the race condition
After the double add, we free the vuln_subsys
*entry, which will remove the entry from the list
and free the object.
64. Abusing the race condition
list.next
list.prev
next
prev
list_head
vuln_subsys (1)
0xdead000000200200
65. Abusing the race condition
If you remember the free_vuln_subsys() function,
it does not just remove the item from the list;
it also frees the object.
So we now have a use-after-free.
67. Abusing the race condition
list.next
list.prev
next
prev
list_head
vuln_subsys (1)
0xdead000000200200
The chunk
is now free
68. Abusing the race condition
Usually from here you would try to allocate over the
recently freed memory chunk with data under your control,
then call some code path which makes use of that freed
object and try to take control of the system.
A common way to take control is overwriting a function
pointer in the structure/object, if there is one.
In this case there was just a work queue, which we could
redirect to a fake work queue under our control. Sadly that
queue is only triggered after some special events, so we
need to find another way…
69. UAF / Heap Spraying
In order to exploit bugs related to dynamic memory
management, such as use-after-frees, you usually want to
find a mechanism that meets the following criteria:
● Be able to allocate and deallocate memory at will.
● Control the allocation size, to be able to hit different caches.
● You will often find that spraying is required, so having
  high limits is desirable.
● It won't add any data before or after your allocation.
I tend to use two mechanisms for this.
70. UAF / Heap Spraying
System V message queues – msgsnd()
We can create a new message queue, then send messages
into the queue to allocate and consume messages from
the queue to deallocate.
● The first 48 bytes of the allocation will be used for the
  msg_msg structure, and we do not fully control that struct.
● There are limits on the max number of message queues
  which can be created, and on the max number of messages
  queued per queue.
● There are no restrictions on NULL bytes or anything else.
● We can use mtype to mark messages.
71. UAF / Heap Spraying
Key Management – add_key() / keyctl()
We can create new keys to allocate, and delete keys to
deallocate. We can use either the key name or the key data
to store data; I normally use the key name.
● No data is added before our buffer, but a NULL byte is
  appended at the end.
● Very limited by the key quota: around 200 keys or 20000 bytes.
● The maximum allocation is around 4093 bytes.
● Key names cannot be equal (we need to change at least 1 byte).
● NULL bytes are not allowed.
73. Abusing the UAF
It looks like there is no obvious way to quickly
exploit this… (you have to believe me)
So… what if we add another entry into the
doubly linked list?
75. Abusing the UAF
By adding a second entry to the linked list,
we overwrite the first entry's list.prev field.
In this case the list.prev field is at offset 176.
The data written will be the address of the
previous object in the list plus 168 bytes,
which is the offset where the list_head field
lives within the object.
So now we have a new exploit primitive: an
8-byte write of uncontrolled data at offset 176.
76. Abusing the UAF
Now that we know we can write 8 bytes of uncontrolled
data at offset 176, we have to search for the following:
1. A structure which will be allocated in the kmalloc-192
cache
2. With a pointer (or something similar) at offset 176,
which we will be overwriting
3. Which can be allocated and deallocated at will
Finding such a case is not easy: the size-192 cache covers
objects larger than 128 and up to 192 bytes, and writing at
offset 176 leaves us just with objects/structures which are
184 to 192 bytes in size.
77. Abusing the UAF
Several ideas for finding candidates:
● Search for structures between 184 and 192 bytes in
  size which contain a pointer at offset 176 and are
  part of the kernel core.
● Allocate a 176-byte string and overwrite the null
  terminator with our primitive. (more crazy shit)
● Exercise the kernel and examine the kmalloc-192
  cache. Developing some scripts is required…
80. Abusing the UAF
We ended up developing some scripts to do the
following tasks:
● Exercise the kernel as much as we can
● Extract the stack traces of all the allocations
  in the kmalloc-192 cache
● Sum all the allocations and remove duplicates
Finally, we manually review all the unique stack
traces and look for useful ones…
83. Abusing the UAF
After a lot of research, we found a suitable
candidate. It involves using IPC System V
shared memory segments.
This meets all the criteria previously defined:
● A structure which will be allocated in the kmalloc-192
  cache
● With a pointer (or something similar) at offset 176,
  which we will be overwriting
● Which can be allocated and deallocated at will
90. Abusing the UAF again
shm_cprid
shm_lprid
*mlock_user
next
prev
list_head
vuln_subsys (1)
shmid_kernel
list.next
list.prev
vuln_subsys (2)
0xdead000000200200
91. Abusing the UAF again
shm_cprid
shm_lprid
*mlock_user
next
prev
list_head
vuln_subsys (1)
shmid_kernel
__count
processes
files
...
inotify_devs
vuln_subsys (2)
fake mlock_user
alloc_msgsnd((char *)&fake_mlock_user, 140);
92. Abusing the race condition
If we take a look at the mlock_user field, it points to a
struct user_struct, which we will store in the second UAF
object. It is 104 bytes in length.
But mlock_user will point to offset 168 inside a 192-byte
chunk. If we do the math: 168 + 104 = 272. So when
reading this structure we will read past the 192-byte
chunk, into the next chunk…
95. Don't worry, there are more problems...
Allocating over the freed chunks might look easy at first,
but it is not. Memory allocation is not a deterministic
process; there are many other tasks in the kernel, processes
in the system, interrupts, etc. which will be competing
against you to allocate, changing the layout of the
kmem cache.
There are per-CPU local object caches for fast/lockless
allocation and deallocation paths, so binding to a CPU is
required.
Your success can vary depending on memory
fragmentation, memory pressure, system load, etc.
Your task will be to turn something non-deterministic and
unreliable by nature into something more or less reliable…
96. Don’t worry, there are more problems...
Our fake mlock_user structure has been allocated here.
When reading past the first 24 bytes of the structure, we
will start reading into this next chunk.
97. Don’t worry, there are more problems...
On one side, the original structure declaration; on the
other, the structure declaration aligned to offset 168 for
the kmalloc-192 cache, so the new structure starts at
offset 168.
98. kmalloc-192 slab
One full SLAB in the kmalloc-192 cache contains 20 chunks;
one chunk is 192 bytes in size.
The goal of our heap massage is to control a whole slab with
our custom data, so that whenever the read overflow happens,
the data being read will be under our control. Or at least it
will probably be under our control…
99. kmalloc-192 partial slabs
ALLOCATED ALLOCATED ALLOCATED ALLOCATED FREE ALLOCATED ALLOCATED ALLOCATED
ALLOCATED FREE ALLOCATED ALLOCATED ALLOCATED ALLOCATED FREE ALLOCATED
ALLOCATED ALLOCATED FREE FREE ALLOCATED ALLOCATED ALLOCATED ALLOCATED
ALLOCATED ALLOCATED FREE FREE FREE FREE ALLOCATED ALLOCATED
Partial slabs
First we will fill all those gaps in order to force the
creation of a new whole slab, which is the one we will be
trying to take control of.
We can read /proc/slabinfo, if available, to find out how
many partial slabs and objects exist.
We use System V message queues — msgsnd() — to fill the gaps.
100. kmalloc-192 filled slabs
Next we will fill two new slabs containing our "unaligned"
fake mlock_struct structures' data.
To do the allocations we can use the key management
facility. Using this method we cannot use NULL bytes,
and we are limited by the key quota.
ALLOCATED ALLOCATED ALLOCATED ALLOCATED FILL ALLOCATED ALLOCATED ALLOCATED
ALLOCATED FILL ALLOCATED ALLOCATED ALLOCATED ALLOCATED FILL ALLOCATED
ALLOCATED ALLOCATED FILL FILL ALLOCATED ALLOCATED ALLOCATED ALLOCATED
ALLOCATED ALLOCATED FILL FILL FILL FILL ALLOCATED ALLOCATED
Full slabs
101. kmalloc-192 filled slabs
ALLOCATED ALLOCATED ALLOCATED ALLOCATED FILL ALLOCATED ALLOCATED ALLOCATED
ALLOCATED FILL ALLOCATED ALLOCATED ALLOCATED ALLOCATED FILL ALLOCATED
ALLOCATED ALLOCATED FILL FILL ALLOCATED ALLOCATED ALLOCATED ALLOCATED
ALLOCATED ALLOCATED FILL FILL FILL FILL ALLOCATED ALLOCATED
Full slabs
alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key
Allocate two full slabs (40 items) with fake
user_struct using the `alloc_key_name()` method
alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key
New slabs
102. kmalloc-192 create gaps
ALLOCATED ALLOCATED ALLOCATED ALLOCATED FILL ALLOCATED ALLOCATED ALLOCATED
ALLOCATED FILL ALLOCATED ALLOCATED ALLOCATED ALLOCATED FILL ALLOCATED
ALLOCATED ALLOCATED FILL FILL ALLOCATED ALLOCATED ALLOCATED ALLOCATED
ALLOCATED ALLOCATED FILL FILL FILL FILL ALLOCATED ALLOCATED
Full slabs
alloc_key alloc_key HOLE alloc_key HOLE alloc_key HOLE alloc_key
Create some gaps in the slabs by freeing keys
alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key
New slabs
103. kmalloc-192 create gaps
ALLOCATED ALLOCATED ALLOCATED ALLOCATED FILL ALLOCATED ALLOCATED ALLOCATED
ALLOCATED FILL ALLOCATED ALLOCATED ALLOCATED ALLOCATED FILL ALLOCATED
ALLOCATED ALLOCATED FILL FILL ALLOCATED ALLOCATED ALLOCATED ALLOCATED
ALLOCATED ALLOCATED FILL FILL FILL FILL ALLOCATED ALLOCATED
Full slabs
alloc_key alloc_key vuln_subsys alloc_key HOLE alloc_key HOLE alloc_key
Allocate a new vuln_subsys (2)
alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key
New slabs
104. kmalloc-192 create gaps
ALLOCATED ALLOCATED ALLOCATED ALLOCATED FILL ALLOCATED ALLOCATED ALLOCATED
ALLOCATED FILL ALLOCATED ALLOCATED ALLOCATED ALLOCATED FILL ALLOCATED
ALLOCATED ALLOCATED FILL FILL ALLOCATED ALLOCATED ALLOCATED ALLOCATED
ALLOCATED ALLOCATED FILL FILL FILL FILL ALLOCATED ALLOCATED
Full slabs
alloc_key alloc_key vuln_subsys alloc_key HOLE alloc_key HOLE alloc_key
Trigger the race and free the vuln_subsys (2) to
produce a UAF
alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key
New slabs
105. kmalloc-192 create gaps
ALLOCATED ALLOCATED ALLOCATED ALLOCATED FILL ALLOCATED ALLOCATED ALLOCATED
ALLOCATED FILL ALLOCATED ALLOCATED ALLOCATED ALLOCATED FILL ALLOCATED
ALLOCATED ALLOCATED FILL FILL ALLOCATED ALLOCATED ALLOCATED ALLOCATED
ALLOCATED ALLOCATED FILL FILL FILL FILL ALLOCATED ALLOCATED
Full slabs
alloc_key alloc_key vuln_subsys/msgsnd alloc_key HOLE alloc_key HOLE alloc_key
Allocate another fake mlock_user over the UAF
using the msgsnd() method
alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key
New slabs
106. kmalloc-192 create gaps
The first chunk is allocated using alloc_msgbuf and contains
a null byte. The second chunk contains the overflowed data
and is allocated using alloc_key_name. The structure will
start being read from offset 167, and the read will continue
from offset 0 on the next chunk.
107. write-what-where
So we have a shmid_kernel structure pointing to a
user_struct structure fully controlled by us. What can be
done with it?
Let's take another look at the user_struct structure:
108. write-what-where
This piece of code can be triggered when destroying the shared memory
segment, but we have to meet several conditions:
● The shared memory segment must have been locked; otherwise the
  user_struct structure is ignored.
● The __count field of the user_struct structure must be set to 1. This
  way the free_user() function will be called when the reference counter
  is decreased to 0.
109. privilege escalation
Now that we have a proper write-what-where, we can write 8 bytes of
data anywhere in memory.
There are many things we can overwrite in order to escalate privileges.
Among the most common is overwriting a file_operations structure:
there are tons of them, they are easily located in memory as they are
defined as globals, and they contain tons of function pointers.
I personally do not use this method. But let's suppose I do…
110. privilege escalation
Our target will be the file operations for "/dev/ptmx". We will
overwrite the llseek function pointer, as it points to no_llseek in the
ptmx_fops structure, and nobody should ever call it…
111. privilege escalation
If there were no protections in place, we could simply point
tty_fops.llseek to a userspace address with our code. But it is not
1999 any more…
So what we are going to do is point it to a ROP gadget which will pivot
the stack. Let's take a look at the llseek() function prototype:

loff_t no_llseek(struct file *file, loff_t offset, int origin)

When calling llseek() from userspace we control the offset and origin
arguments, which arrive in rsi and rdx respectively. So gadgets such
as the following should do the job:

mov rsp, rsi; ret;
xchg rsi, rsp; ret;
112. privilege escalation
Triggering the stack pivot should be as easy as running this code from
userspace:
The kernel stack will then point to 0xdeadbeefdeadbeef, where we have
our full ROP chain to disable SMAP/SMEP or to mark a kernel page from
the direct mapping range as executable.
Sample ROP chain:
113. privilege escalation
Although our fake stack and the escalation code are part of our exploit
code running in userspace, we reference them using kernel-space addresses.
There are ways to find where our userspace memory is mapped into the
direct mapping of kernel space.
This direct mapping is marked as non-executable; that's why we use the
ROP chain to mark the page containing our code as executable prior to
jumping to it.
115. Cleanup
To finish up the exploitation it is necessary to clean up any
mess, to keep the kernel stable; otherwise all your work will
be worthless…
● Fix any kmem caches we might have corrupted
● Restore any overwritten pointers
● Restore the stack so the kernel properly returns to
  userspace
● Fix any other mess you may have made…