6. Unrelated Disclaimer
Please, do not use docker
for security purposes.
This applies to docker, lxc,
or any container system
based on namespaces and
cgroups.
Do not think of it as a security container or as a
valid security mechanism.
It's just a chroot on steroids.
7. I'm not a big fan of exploitation talks:
● They require an audience with a deep understanding of the target
● They are boring and hard to follow
● If you get distracted, you lose
This should be a trilogy (at least):
● A talk about kernel basics
● A talk about exploitation
● A talk about memory management
● A talk about developing tools and sanitizers
● A talk about how to find bugs
I'm going to try to make a mix: a high-level
explanation of a real-life bug
What is this all about?
8. We will try to cover the following topics:
● Introduction to basic kernel concepts
● Basic kernel exploitation
● Some kernel mitigations
● Kernel memory management
● Dynamic memory management
● Real world exploit?
All this will be explained at a very high level, as
each topic could be a whole talk by itself.
Tragedy is brewing…
What is this all about?
11. The kernel is a computer program that is the core of a computer's operating system, with
complete control over everything in the system. On most systems, it is one of the first programs
loaded on start-up (after the bootloader). It handles the rest of start-up as well as input/output
requests from software, translating them into data-processing instructions for the central
processing unit. It handles memory and peripherals like keyboards, monitors, printers, and
speakers.
The critical code of the kernel is usually loaded into a separate area of memory, which is
protected from access by application programs or other, less critical parts of the operating
system. The kernel performs its tasks, such as running processes, managing hardware devices
such as the hard disk, and handling interrupts, in this protected kernel space. In contrast,
everything a user does is in user space: writing text in a text editor, running programs in a GUI,
etc. This separation prevents user data and kernel data from interfering with each other and
causing instability and slowness, as well as preventing malfunctioning application programs
from crashing the entire operating system.
The kernel's interface is a low-level abstraction layer. When a process makes requests of the
kernel, it is called a system call. Kernel designs differ in how they manage these system calls
and resources. A monolithic kernel runs all the operating system instructions in the same
address space for speed. A microkernel runs most processes in user space, for modularity.
Source: https://en.wikipedia.org/wiki/Kernel_(operating_system)
What is a kernel
12. Source: N. Murray, N. Horman, Understanding Virtual Memory
What is a kernel
13. Virtual Memory
x86-64 canonical addressing:
● Upper canonical half: 0xFFFF8000 00000000 → 0xFFFFFFFF FFFFFFFF
  (binary 1111111111111111100000000000000000000000000000000000000000000000 → all ones)
● Non-canonical hole in between
● Lower canonical half: 0x00000000 00000000 → 0x00007FFF FFFFFFFF
  (binary all zeros → 0000000000000000011111111111111111111111111111111111111111111111)
The low 47 bits are used for addressing; the next bit is the most
significant implemented bit, and it is sign-extended through the 16
unimplemented upper bits.
14. Virtual Memory
Virtual memory map with 4-level page tables:
● 47-bit address space + 1 sign bit
● 2^47 = 128 TB per canonical half
● 256 TB of mappable memory
Virtual memory map with 5-level page tables:
● 56-bit address space + 1 sign bit
● 2^56 = 64 PB per canonical half
● 128 PB of mappable memory
16. Virtual Memory
● Upper canonical half (0xFFFF8000 00000000 → 0xFFFFFFFF FFFFFFFF):
  kernel-space virtual memory, shared between all processes
● Non-canonical hole in between
● Lower canonical half (0x00000000 00000000 → 0x00007FFF FFFFFFFF):
  user-space virtual memory, different per mm, bounded by TASK_SIZE
18. Virtual Memory
The same split, seen from several processes (Process 234, Process 312,
Process 1453, Process 457): the kernel-space half above
0xFFFF8000 00000000 is shared between all of them, while each process
has its own user-space half below TASK_SIZE (up to 0x00007FFF FFFFFFFF).
21. NULL Pointer Dereference
Let's suppose we have the following code running
in the kernel as a miscdevice at "/proc/vulnerable".
The diagram places the structure's first field at
address 0x0000000 and the second at 0x0000000+4.
22. NULL Pointer Dereference
This userspace program triggers the vulnerable
code in the kernel using an ioctl.
It should trigger a NULL pointer dereference and
produce a kernel panic / system crash.
24. NULL Pointer Dereference

struct my_struct {
    int counter;
    int (*get_counter)(void);
};

struct my_struct *tmp;
tmp->get_counter();

void escalate_privs(void)
{
    /* Do hacker things */
    return;
}

Memory layout: User Space from 0x0000000000000000 up to TASK_SIZE,
then Kernel Space up to 0xFFFFFFFFFFFFFFFF.
32. SMAP/SMEP
SMAP
● Supervisor Mode Access Prevention
● Cannot access/dereference any pages that are userspace (U=1) while
  the CPU is running in privileged mode (CPL=0)
● Uses the AC flag in the EFLAGS register
● Two new instructions: CLAC/STAC
● 21st bit in the CR4 register
SMEP
● Supervisor Mode Execution Prevention
● Cannot execute code from any pages that are userspace (U=1) while
  the CPU is running in privileged mode (CPL=0)
● 20th bit in the CR4 register
● Older/more common than SMAP
These are implemented at the hardware level. It's the CPU itself
that enforces the protection (as with the NX bit).
33. SMAP/SMEP

struct my_struct {
    int counter;
    int (*get_counter)(void);
};

struct my_struct *tmp;
tmp->get_counter();

void escalate_privs(void)
{
    /* Do hacker things */
    return;
}

Memory layout: same as before, but now SMEP and SMAP guard the
user-space half below TASK_SIZE.
34. KPTI (KAISER)
Kernel Page Table Isolation was designed to try to stop
leaks caused by side-channel attacks such as Meltdown or
Spectre.
● Two sets of page tables are maintained: one used while running
  in privileged mode and another one for unprivileged mode.
● Userspace PTEs hide most of the kernel mappings, reducing them
  to a minimum: kernel code for handling syscalls, interrupts, etc.
● Kernel PTEs mark the whole user-space area as non-executable.
  This turns out to be something like a software implementation
  of SMEP.
● Impact on performance ranges from 5% to 30%, mostly due to TLB
  flushing. It can be improved on CPUs supporting PCID.
Source: https://es.wikipedia.org/wiki/Aislamiento_de_tablas_de_páginas_del_núcleo
35. KPTI (KAISER)
Without KPTI, a single page table maps both kernel space and user
space. With KPTI there are two sets: the kernel PTEs map both halves,
while the userspace PTEs map user space plus only a minimal slice of
kernel space.
36. KPTI (KAISER)

struct my_struct {
    int counter;
    int (*get_counter)(void);
};

struct my_struct *tmp;
tmp->get_counter();

void escalate_privs(void)
{
    /* Do hacker things */
    return;
}

Memory layout: same as before, but with KPTI the kernel page tables
map the whole user-space half as NX (non-executable).
38. KMEM CACHES
kmalloc() is the normal method of allocating memory for objects
smaller than page size in the kernel.
These objects will be stored in different caches based on their size,
or on the cache explicitly requested.
39. KMEM CACHES
object object object object object object object object
One full SLAB in the kmalloc-192 cache contains 20 chunks; a full
SLAB can span one or more pages, and one chunk is 192 bytes in size.
40. Allocators
The Linux kernel has implemented different allocators over the
years.
Nowadays there are 3 different allocators available; you must
choose one of them at compilation time:
● SLAB: the original one
● SLUB: smaller memory footprint and less locking
● SLOB: aimed at embedded systems
Each allocator has its own way to track free and allocated
objects, local per-CPU caches, etc.
49. Our bug
I won't be giving any details on the bug itself, so you will
have to believe me…
It's a race condition, and the code displayed is just an
example…
50. Our exploit
Our base bug is a race condition which causes corruption
of a circular doubly linked list.
An outline of the exploitation looks like this:
1. Use the race condition to add the same item twice to the list
2. Free the item, which removes it from the list just once
3. The linked list still contains one of the entries, but the item is freed
4. We now have a use-after-free
5. Repeat steps 1 to 4 to produce a second UAF
6. Use both UAFs in the list to achieve a write-what-where
primitive
So somehow we turned a race condition into a write-what-where.
59. Removing the 2nd item
list.next
list.prev
next
prev
list_head
vuln_subsys (1)
60. Abusing the race condition
What would happen if we call add_vuln_subsys()
at the same time from two different processes
with the same vuln_subsys *entry ?
61. Abusing the race condition
list.next
list.prev
next
prev
list_head
vuln_subsys (1)
Adding the same entry twice. First add:
62. Abusing the race condition
Adding the same entry twice. Second add:
list.next
list.prev
next
prev
list_head
vuln_subsys (1)
63. Abusing the race condition
After the double add, we free the vuln_subsys
*entry, which will remove the entry from the list
and free the object.
64. Abusing the race condition
list.next
list.prev
next
prev
list_head
vuln_subsys (1)
0xdead000000200200
65. Abusing the race condition
If you remember the free_vuln_subsys() function,
it does not just remove the item from the list;
it also frees the object.
So we now have a use-after-free.
67. Abusing the race condition
list.next
list.prev
next
prev
list_head
vuln_subsys (1)
0xdead000000200200
The chunk
is now free
68. Abusing the race condition
Usually from here you would try to allocate over the
recently freed memory chunk with data under your control,
then call some code path which makes use of that freed
object and try to take control of the system.
A common way to take control is overwriting a function
pointer in the structure/object, if there is one.
In this case there was just a work queue, which we could
redirect to a fake work queue under our control. Sadly that
queue is only triggered after some special events, so we
need to find another way…
69. UAF / Heap Spraying
In order to exploit bugs related to dynamic memory
management, such as use-after-frees, you usually want to
find a mechanism that meets the following criteria:
● Be able to allocate and deallocate memory at will.
● Control the allocation size, to be able to hit different caches.
● You will often find that spraying is required, so having
  high limits is desirable.
● It won't add any data before or after your allocation.
I tend to use two mechanisms for this.
70. UAF / Heap Spraying
System V message queues – msgsnd()
We can create a new message queue, then send messages
into the queue to allocate and consume messages from
the queue to deallocate.
● The first 48 bytes of the allocation will be used for the
  msg_msg structure, and we do not fully control that struct.
● There are limits on the max number of message queues
  which can be created, and on the max number of messages
  queued per queue.
● There are no restrictions on NULL bytes or anything else.
● We can use mtype to mark messages.
71. UAF / Heap Spraying
Key Management – add_key() / keyctl()
We can create new keys to allocate, and delete keys to
deallocate. We can use either the key name or the key data
to store data; I normally use the key name.
● No data is added before our buffer, but a NULL byte is
  appended at the end.
● Very limited by the key quota: around 200 keys or 20000 bytes.
● The maximum allocation is around 4093 bytes.
● Key names cannot be equal (we need to change at least 1 byte).
● NULL bytes are not allowed.
73. Abusing the UAF
It looks like there is no obvious way to quickly
exploit this… (you have to believe me)
So… what if we add another entry into the
doubly linked list?
75. Abusing the UAF
By adding a second entry to the linked list,
we overwrite the first entry's list.prev field.
In this case the list.prev field is at offset 176.
The data written will be the address of the
previous object in the list plus 168 bytes,
which is the offset where the list_head field
lives within the object.
So now we have a new exploit primitive: an
8-byte write of uncontrolled data at offset 176.
76. Abusing the UAF
Now that we know we can write 8 bytes of uncontrolled
data at offset 176, we have to search for the following:
1. A structure which will be allocated in the kmalloc-192
cache
2. With a pointer (or something similar) at offset 176,
which we will be overwriting
3. Which can be allocated and deallocated at will
Finding such a case is not easy: the size-192 cache covers
objects larger than 128 and up to 192 bytes, and writing at
offset 176 leaves us just with objects/structures which are
184 to 192 bytes in size.
77. Abusing the UAF
Several ideas for finding candidates:
● Search for structures between 184 and 192 bytes in
  size which contain a pointer at offset 176 and are
  part of the kernel core.
● Allocate a 176-byte string and overwrite the null
  terminator with our primitive. (more crazy shit)
● Exercise the kernel and examine the kmalloc-192
  cache. Developing some scripts is required…
80. Abusing the UAF
We ended up developing some scripts to do the
following tasks:
● Exercise the kernel as much as we can
● Extract the stack traces of all the allocations
  in the kmalloc-192 cache
● Sum all the allocations and remove duplicates
Finally, we manually review all the unique stack
traces and look for useful ones…
83. Abusing the UAF
After a lot of research, we found a suitable
candidate. It involves using IPC System V
shared memory segments.
This meets all the criteria previously defined:
● A structure which will be allocated in the kmalloc-192
  cache
● With a pointer (or something similar) at offset 176,
  which we will be overwriting
● Which can be allocated and deallocated at will
90. Abusing the UAF again
shm_cprid
shm_lprid
*mlock_user
next
prev
list_head
vuln_subsys (1)
shmid_kernel
list.next
list.prev
vuln_subsys (2)
0xdead000000200200
91. Abusing the UAF again
shm_cprid
shm_lprid
*mlock_user
next
prev
list_head
vuln_subsys (1)
shmid_kernel
__count
processes
files
...
inotify_devs
vuln_subsys (2)
fake mlock_user
alloc_msgsnd((char *)&fake_mlock_user, 140);
92. Abusing the race condition
If we take a look at the mlock_user field, it points to a
struct user_struct, which we will store in the second UAF
object. It is 104 bytes in length.
But mlock_user will point to offset 168 inside a 192-byte
chunk. If we do the math: 168 + 104 = 272. So when
reading this structure we will read past the 192-byte
chunk, into the next chunk…
95. Don't worry, there are more problems...
Allocating over the freed chunks might look easy at first,
but it is not. Memory allocation is not a deterministic
process; there are many other tasks in the kernel, processes
in the system, interrupts, etc. which will be competing
against you to allocate, changing the layout of the
kmem cache.
There are per-CPU local object caches for fast/lockless
allocation and deallocation paths, so binding to a CPU is
required.
Your success can vary depending on memory
fragmentation, memory pressure, system load, etc.
Your task will be to turn something non-deterministic and
unreliable by nature into something more or less reliable…
96. Don’t worry, there are more problems...
Our fake mlock_user structure has been allocated here.
When reading past the first 24 bytes of the structure, we
will start reading into this next chunk.
97. Don’t worry, there are more problems...
On one side, the original structure declaration; on the
other, the structure declaration aligned to offset 168 for
the kmalloc-192 cache, so the new structure starts at
offset 168.
98. kmalloc-192 slab
One full SLAB in the kmalloc-192 cache contains 20 chunks;
one chunk is 192 bytes in size.
The goal of our heap massage is to control a whole slab with
our custom data, so that whenever the read overflow happens,
the data being read will be under our control. Or at least it
will probably be under our control…
99. kmalloc-192 partial slabs
ALLOCATED ALLOCATED ALLOCATED ALLOCATED FREE ALLOCATED ALLOCATED ALLOCATED
ALLOCATED FREE ALLOCATED ALLOCATED ALLOCATED ALLOCATED FREE ALLOCATED
ALLOCATED ALLOCATED FREE FREE ALLOCATED ALLOCATED ALLOCATED ALLOCATED
ALLOCATED ALLOCATED FREE FREE FREE FREE ALLOCATED ALLOCATED
Partial slabs
First we will fill all those gaps in order to force the
creation of a new whole slab, which is the one we will be
trying to take control of.
We can read /proc/slabinfo, if available, to find out how
many partial slabs and objects exist.
We use System V message queues — msgsnd() — to fill the gaps.
100. kmalloc-192 filled slabs
Next we will fill two new slabs containing our "unaligned"
fake mlock_struct structures' data.
To do the allocations we can use the key management
facility. Using this method we cannot use NULL bytes,
and we are limited by the key quota.
ALLOCATED ALLOCATED ALLOCATED ALLOCATED FILL ALLOCATED ALLOCATED ALLOCATED
ALLOCATED FILL ALLOCATED ALLOCATED ALLOCATED ALLOCATED FILL ALLOCATED
ALLOCATED ALLOCATED FILL FILL ALLOCATED ALLOCATED ALLOCATED ALLOCATED
ALLOCATED ALLOCATED FILL FILL FILL FILL ALLOCATED ALLOCATED
Full slabs
101. kmalloc-192 filled slabs
ALLOCATED ALLOCATED ALLOCATED ALLOCATED FILL ALLOCATED ALLOCATED ALLOCATED
ALLOCATED FILL ALLOCATED ALLOCATED ALLOCATED ALLOCATED FILL ALLOCATED
ALLOCATED ALLOCATED FILL FILL ALLOCATED ALLOCATED ALLOCATED ALLOCATED
ALLOCATED ALLOCATED FILL FILL FILL FILL ALLOCATED ALLOCATED
Full slabs
alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key
Allocate two full slabs (40 items) with fake
user_struct using the `alloc_key_name()` method
alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key
New slabs
102. kmalloc-192 create gaps
ALLOCATED ALLOCATED ALLOCATED ALLOCATED FILL ALLOCATED ALLOCATED ALLOCATED
ALLOCATED FILL ALLOCATED ALLOCATED ALLOCATED ALLOCATED FILL ALLOCATED
ALLOCATED ALLOCATED FILL FILL ALLOCATED ALLOCATED ALLOCATED ALLOCATED
ALLOCATED ALLOCATED FILL FILL FILL FILL ALLOCATED ALLOCATED
Full slabs
alloc_key alloc_key HOLE alloc_key HOLE alloc_key HOLE alloc_key
Create some gaps in the slabs by freeing keys
alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key
New slabs
103. kmalloc-192 create gaps
ALLOCATED ALLOCATED ALLOCATED ALLOCATED FILL ALLOCATED ALLOCATED ALLOCATED
ALLOCATED FILL ALLOCATED ALLOCATED ALLOCATED ALLOCATED FILL ALLOCATED
ALLOCATED ALLOCATED FILL FILL ALLOCATED ALLOCATED ALLOCATED ALLOCATED
ALLOCATED ALLOCATED FILL FILL FILL FILL ALLOCATED ALLOCATED
Full slabs
alloc_key alloc_key vuln_subsys alloc_key HOLE alloc_key HOLE alloc_key
Allocate a new vuln_subsys (2)
alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key
New slabs
104. kmalloc-192 create gaps
ALLOCATED ALLOCATED ALLOCATED ALLOCATED FILL ALLOCATED ALLOCATED ALLOCATED
ALLOCATED FILL ALLOCATED ALLOCATED ALLOCATED ALLOCATED FILL ALLOCATED
ALLOCATED ALLOCATED FILL FILL ALLOCATED ALLOCATED ALLOCATED ALLOCATED
ALLOCATED ALLOCATED FILL FILL FILL FILL ALLOCATED ALLOCATED
Full slabs
alloc_key alloc_key vuln_subsys alloc_key HOLE alloc_key HOLE alloc_key
Trigger the race and free the vuln_subsys (2) to
produce a UAF
alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key
New slabs
105. kmalloc-192 create gaps
ALLOCATED ALLOCATED ALLOCATED ALLOCATED FILL ALLOCATED ALLOCATED ALLOCATED
ALLOCATED FILL ALLOCATED ALLOCATED ALLOCATED ALLOCATED FILL ALLOCATED
ALLOCATED ALLOCATED FILL FILL ALLOCATED ALLOCATED ALLOCATED ALLOCATED
ALLOCATED ALLOCATED FILL FILL FILL FILL ALLOCATED ALLOCATED
Full slabs
alloc_key alloc_key vuln_subsys/msgsnd alloc_key HOLE alloc_key HOLE alloc_key
Allocate another fake mlock_user over the UAF
using the msgsnd() method
alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key alloc_key
New slabs
106. kmalloc-192 create gaps
The first chunk is allocated using alloc_msgbuf and contains
a null byte. The second chunk contains the overflowed data
and is allocated using alloc_key_name. The structure will
start being read from offset 167, and the read will continue
from offset 0 on the next chunk.
107. write-what-where
So we have a shmid_kernel structure pointing to a
user_struct structure fully controlled by us. What can be
done with it?
Let's take another look at the user_struct structure:
108. write-what-where
This piece of code can be triggered when destroying the shared memory
segment, but we have to meet several conditions:
● The shared memory segment must have been locked; otherwise the
  user_struct structure is ignored.
● The __count field of the user_struct structure must be set to 1. This
  way the free_user() function will be called when the reference counter
  is decreased to 0.
109. privilege escalation
Now that we have a proper write-what-where, we can write 8 bytes of
data anywhere in memory.
There are many things we can overwrite in order to escalate privileges.
Among the most common is overwriting a file_operations structure:
there are tons of them, they are easily located in memory as they are
defined as globals, and they contain tons of function pointers.
I personally do not use this method. But let's suppose I do…
110. privilege escalation
Our target will be the file operations for "/dev/ptmx". We will
overwrite the llseek function pointer, as it points to no_llseek in the
ptmx_fops structure, and nobody should ever call it…
111. privilege escalation
If there were no protections in place, we could simply point
tty_fops.llseek to a userspace address with our code. But it is not
1999 any more…
So what we are going to do is point it to a ROP gadget which will pivot
the stack. Let's take a look at the llseek() function prototype:

loff_t no_llseek(struct file *file, loff_t offset, int origin)

When calling llseek() from userspace we control the offset and origin
arguments, which arrive in rsi and rdx respectively. So gadgets such
as the following should do the job:

mov rsp, rsi; ret;
xchg rsi, rsp; ret;
112. privilege escalation
Triggering the stack pivot should be as easy as running this code from
userspace:
The kernel stack will then point to 0xdeadbeefdeadbeef, where we have
our full ROP chain to disable SMAP/SMEP or to mark a kernel page from
the direct mapping range as executable.
Sample ROP chain:
113. privilege escalation
Although our fake stack and the escalation code are part of our exploit
code running in userspace, we reference them using kernel-space addresses.
There are ways to find where our userspace memory is mapped into the
direct mapping of kernel space.
This direct mapping is marked as non-executable; that's why we use the
ROP chain to mark the page containing our code as executable prior to
jumping to it.
115. Cleanup
To finish up the exploitation it is necessary to clean up any
mess, to keep the kernel stable; otherwise all your work will
be worthless…
● Fix any kmem caches we might have corrupted
● Restore any overwritten pointers
● Restore the stack so the kernel properly returns to
  userspace
● Fix any other mess you may have made…