SlideShare ist ein Scribd-Unternehmen logo
1 von 122
Downloaden Sie, um offline zu lesen
BPF Internals
Brendan Gregg
USENIX
Jun, 2021
Tracing Examples
(eBPF)
BPF Internals (Brendan Gregg)
(it’s actually quite easy)
BPF Internals (Brendan Gregg)
Intro & Tracing
By Example
1. Dynamic tracing and per-event output
2. Static tracing and map summaries
For BPF internals by reference, see the References slide
Learning objectives
●
Gain a working knowledge of some BPF internals
●
Evaluate ideas for BPF suitability
Agenda
Slides: http://www.brendangregg.com/Slides/LISA2021_BPF_Internals.pdf
Video: https://www.usenix.org/conference/lisa21/presentation/gregg-bpf
Slides are online
BPF Internals (Brendan Gregg)
BPF Intro
BPF Internals (Brendan Gregg)
BPF 1992: Berkeley Packet Filter
A runtime for efficient
packet filters
Also a narrow and arcane
kernel technology that few
knew existed
# tcpdump -d host 127.0.0.1 and port 80
(000) ldh [12]
(001) jeq #0x800 jt 2 jf 18
(002) ld [26]
(003) jeq #0x7f000001 jt 6 jf 4
(004) ld [30]
(005) jeq #0x7f000001 jt 6 jf 18
(006) ldb [23]
(007) jeq #0x84 jt 10 jf 8
(008) jeq #0x6 jt 10 jf 9
(009) jeq #0x11 jt 10 jf 18
(010) ldh [20]
(011) jset #0x1fff jt 18 jf 12
(012) ldxb 4*([14]&0xf)
(013) ldh [x + 14]
(014) jeq #0x50 jt 17 jf 15
(015) ldh [x + 16]
(016) jeq #0x50 jt 17 jf 18
(017) ret #262144
(018) ret #0
BPF Internals (Brendan Gregg)
eBPF 2013+
Extended BPF (eBPF) modernized BPF
Maintainers/creators: Alexei Starovoitov & Daniel Borkmann
Old BPF is now “Classic BPF,” and eBPF is usually just “BPF”
Classic BPF Extended BPF
Word size 32-bit 64-bit
Registers 2 10+1
Storage 16 slots 512 byte stack + infinite map storage
Events packets many event sources
BPF Internals (Brendan Gregg)
BPF Internals (Brendan Gregg)
BPF 2021
BPF is now a technology name (like LLVM)
●
Some still call it eBPF
A generic in-kernel execution environment
●
User-defined programs
●
Limited & secure kernel access
●
A new type of software
BPF Internals (Brendan Gregg)
A New Type of Software
Execution
model
User
defined
Compil-
ation
Security Failure
mode
Resource
access
User task yes any user based abort syscall,
fault
Kernel task no static none panic direct
BPF event yes JIT,
CO-RE
verified,
JIT
error
message
restricted
helpers
BPF Internals (Brendan Gregg)
BPF program state model
Loaded
Enabled
event fires
program ended
Off-CPU On-CPU
BPF
attach
Kernel
helpers
Spinning
spin lock
Sleeping
preempt
(restricted state)
page fault
BPF Internals (Brendan Gregg)
BPF Tracing
BPF Internals (Brendan Gregg)
BPF tracing (observability) tools
BPF Internals (Brendan Gregg)
I want to run some tools
●
bcc, bpftrace /usr/bin/*
I want to hack up some new tools
●
bpftrace bash, awk
I want to spend weeks developing a BPF product
●
bcc libbpf C, bcc Python (maybe), gobpf, libbbpf-rs C, C++
Unix analogies
Recommended BPF tracing front-ends
Requires LLVM; becoming
obsolete / special-use only
New, lightweight,
CO-RE & BTF based
BPF Internals (Brendan Gregg)
BPF Internals
(developing BPF was hard;
understanding it is easy)
BPF Internals (Brendan Gregg)
BPF tracing/observability high-level
From: BPF Performance Tools, Figure 2-1
BPF Internals (Brendan Gregg)
AST: Abstract Syntax Tree
LLVM: A compiler
IR: Intermediate Representation
JIT: Just-in-time compilation
kprobes: Kernel dynamic instrumentation
uprobes: User-level dynamic instrumentation
tracepoints: Kernel static instrumentation
Terminology
BPF Internals (Brendan Gregg)
1. Dynamic tracing and per-event output
BPF Internals (Brendan Gregg)
bpftrace -e 'kprobe:do_nanosleep {
printf("PID %d sleeping...n", pid);
}'
1. Dynamic tracing and per-event output
BPF Internals (Brendan Gregg)
# bpftrace -e 'kprobe:do_nanosleep {
printf("PID %d sleeping...n", pid);
}'
Attaching 1 probe...
PID 10287 sleeping...
PID 10297 sleeping...
PID 10287 sleeping…
PID 10297 sleeping...
PID 10287 sleeping...
PID 2218 sleeping...
PID 10297 sleeping...
[...]
Example output
BPF Internals (Brendan Gregg)
bpftrace -e 'kprobe:do_nanosleep {
printf("PID %d sleeping...n", pid);
}'
Objective
We have this:
We want:
- BPF bytecode
- Kernel events mapped to the bytecode
- User space printing events
This is learning internals by example, including bpftrace internals.
For a complete internals see the References slide at end.
BPF Internals (Brendan Gregg)
bpftrace mid-level internals
BPF Internals (Brendan Gregg)
BPF Internals (Brendan Gregg)
Program transformations
AST
bpftrace
program
LLVM IR BPF
machine
code
BPF Internals (Brendan Gregg)
bpftrace mid-level internals 1/13
BPF Internals (Brendan Gregg)
BPF Internals (Brendan Gregg)
kprobe:do_nanosleep {
printf("PID %d sleeping...n", pid);
}
bpftrace program
BPF Internals (Brendan Gregg)
bpftrace mid-level internals 2/13
BPF Internals (Brendan Gregg)
BPF Internals (Brendan Gregg)
kprobe:do_nanosleep {
printf("PID %d sleeping...n", pid);
}
Converting to AST
call(printf)
probe(kprobe:do_nanosleep)
builtin(pid)
string(“PID %d sleeping...n”)
BPF Internals (Brendan Gregg)
Parsing bpftrace
lex yacc
bpftrace
program
text
AST
src/lexer.l
(regular
expressions)
src/parser.yy
(grammer)
(very easy)
BPF Internals (Brendan Gregg)
ident [_a-zA-Z][_a-zA-Z0-9]*
map @{ident}|@
var ${ident}
int [0-9]+|0[xX][0-9a-fA-F]+
cint :{int}:
hex (x|X)[0-9a-fA-F]{1,2}
[...]
builtin arg[0-9]|args|cgroup|comm|cpid|cpu|ctx|curtask|elapsed|func|
gid|nsecs|pid|probe|rand|retval|sarg[0-9]|tid|uid|username
call avg|cat|cgroupid|clear|count|delete|exit|hist|join|kaddr|ksym|
lhist|max|min|ntop|override_return|print|printf|reg|signal|stats|str|
strncmp|sum|sym|system|time|uaddr|usym|zero
pid
Lexer
bpftrace src/lexer.l
Regular expressions
BPF Internals (Brendan Gregg)
builtin(pid)
Yacc
%token <std::string> BUILTIN "builtin"
%token <std::string> CALL "call"
[...]
expr : int { $$ = $1; }
| STRING { $$ = new ast::String($1, @$); }
| BUILTIN { $$ = new ast::Builtin($1, @$); }
| CALL_BUILTIN { $$ = new ast::Builtin($1, @$); }
| IDENT { $$ = new ast::Identifier($1, @$); }
| STACK_MODE { $$ = new ast::StackMode($1, @$); }
| ternary { $$ = $1; }
| param { $$ = $1; }
| map_or_var { $$ = $1; }
| call { $$ = $1; }
[...]
call : CALL "(" ")" { $$ = new ast::Call($1, @$); }
| CALL "(" vargs ")" { $$ = new ast::Call($1, $3, @$); }
bpftrace src/parser.yy
Grammar rules
BPF Internals (Brendan Gregg)
ident [_a-zA-Z][_a-zA-Z0-9]*
map @{ident}|@
var ${ident}
int [0-9]+|0[xX][0-9a-fA-F]+
cint :{int}:
hex (x|X)[0-9a-fA-F]{1,2}
[...]
builtin arg[0-9]|args|cgroup|comm|cpid|cpu|ctx|curtask|elapsed|func|
gid|nsecs|pid|probe|rand|retval|sarg[0-9]|tid|uid|username
call avg|cat|cgroupid|clear|count|delete|exit|hist|join|kaddr|ksym|
lhist|max|min|ntop|override_return|print|printf|reg|signal|stats|str|
strncmp|sum|sym|system|time|uaddr|usym|zero
printf("PID %d sleeping...n", pid);
Lexer (2)
bpftrace src/lexer.l
BPF Internals (Brendan Gregg)
call(printf(...));
Yacc (2)
%token <std::string> BUILTIN "builtin"
%token <std::string> CALL "call"
[...]
expr : int { $$ = $1; }
| STRING { $$ = new ast::String($1, @$); }
| BUILTIN { $$ = new ast::Builtin($1, @$); }
| CALL_BUILTIN { $$ = new ast::Builtin($1, @$); }
| IDENT { $$ = new ast::Identifier($1, @$); }
| STACK_MODE { $$ = new ast::StackMode($1, @$); }
| ternary { $$ = $1; }
| param { $$ = $1; }
| map_or_var { $$ = $1; }
| call { $$ = $1; }
[...]
call : CALL "(" ")" { $$ = new ast::Call($1, @$); }
| CALL "(" vargs ")" { $$ = new ast::Call($1, $3, @$); }
bpftrace src/parser.yy
BPF Internals (Brendan Gregg)
" { yy_push_state(STR, yyscanner); buffer.clear(); }
<STR>{
" { yy_pop_state(yyscanner); return Parser::make_STRING(buffer, loc); }
[^n"]+ buffer += yytext;
n buffer += 'n';
t buffer += 't';
r buffer += 'r';
" buffer += '"';
 buffer += '';
{oct} {
long value = strtol(yytext+1, NULL, 8);
if (value > UCHAR_MAX)
driver.error(loc, std::string("octal escape sequence out of range '") +
yytext + "'");
buffer += value;
}
{hex} buffer += strtol(yytext+2, NULL, 16);
n driver.error(loc, "unterminated string"); yy_pop_state(yyscanner);
loc.lines(1); loc.step();
<<EOF>> driver.error(loc, "unterminated string"); yy_pop_state(yyscanner);
. { driver.error(loc, std::string("invalid escape character '") +
yytext + "'"); }
. driver.error(loc, "invalid character"); yy_pop_state(yyscanner);
}
"PID %d sleeping...n"
Lexer (3)
bpftrace src/lexer.l
BPF Internals (Brendan Gregg)
string("PID %d sleeping...n")
Yacc (3)
%token <std::string> STRING "string"
[...]
expr : int { $$ = $1; }
| STRING { $$ = new ast::String($1, @$); }
| BUILTIN { $$ = new ast::Builtin($1, @$); }
| CALL_BUILTIN { $$ = new ast::Builtin($1, @$); }
| IDENT { $$ = new ast::Identifier($1, @$); }
| STACK_MODE { $$ = new ast::StackMode($1, @$); }
| ternary { $$ = $1; }
| param { $$ = $1; }
| map_or_var { $$ = $1; }
| call { $$ = $1; }
[...]
bpftrace src/parser.yy
BPF Internals (Brendan Gregg)
ident [_a-zA-Z][_a-zA-Z0-9]*
map @{ident}|@
[...]
kprobe:do_nanosleep
Lexer & Yacc (4)
bpftrace src/lexer.l
attach_point : ident { $$ = new ast::AttachPoint($1, @$); }
| ident ":" wildcard { $$ = new ast::AttachPoint($1, $3, @$); }
| ident ":" wildcard PLUS INT { $$ = new ast::AttachPoint($1, $3, $5, @$); }
| ident PATH STRING { $$ = new ast::AttachPoint($1, $2.substr(1, $2.si...
[...]
wildcard : wildcard ident { $$ = $1 + $2; }
| wildcard MUL { $$ = $1 + "*"; }
| wildcard LBRACKET { $$ = $1 + "["; }
| wildcard RBRACKET { $$ = $1 + "]"; }
| wildcard DOT { $$ = $1 + "."; }
| { $$ = ""; }
;
bpftrace src/parser.yy
BPF Internals (Brendan Gregg)
ident [_a-zA-Z][_a-zA-Z0-9]*
map @{ident}|@
[...]
kprobe:do_nanosleep
Lexer & Yacc (4)
attach_point : ident { $$ = new ast::AttachPoint($1, @$); }
| ident ":" wildcard { $$ = new ast::AttachPoint($1, $3, @$); }
| ident ":" wildcard PLUS INT { $$ = new ast::AttachPoint($1, $3, $5, @$); }
| ident PATH STRING { $$ = new ast::AttachPoint($1, $2.substr(1, $2.si...
[...]
wildcard : wildcard ident { $$ = $1 + $2; }
| wildcard MUL { $$ = $1 + "*"; }
| wildcard LBRACKET { $$ = $1 + "["; }
| wildcard RBRACKET { $$ = $1 + "]"; }
| wildcard DOT { $$ = $1 + "."; }
| { $$ = ""; }
;
1. 2.
3.
4.
5.
6.
bpftrace src/lexer.l
bpftrace src/parser.yy
BPF Internals (Brendan Gregg)
{ statements; ... }
Yacc (5)
probe : attach_points pred block { $$ = new ast::Probe($1, $2, $3); }
attach_points : attach_points "," attach_point { $$ = $1; $1->push_back($3); }
| attach_point { $$ = new ast::AttachPointList; $$->push_back($1); }
[...]
block : "{" stmts "}" { $$ = $2; }
;
semicolon_ended_stmt: stmt ";" { $$ = $1; }
;
stmts : semicolon_ended_stmt stmts { $$ = $2; $2->insert($2->begin(), $1); }
| block_stmt stmts { $$ = $2; $2->insert($2->begin(), $1); }
| stmt { $$ = new ast::StatementList; $$->push_back($1); }
| { $$ = new ast::StatementList; }
;
bpftrace src/parser.yy
Plus more grammar for program structure...
BPF Internals (Brendan Gregg)
# bpftrace -d -e 'kprobe:do_nanosleep {
printf("PID %d sleeping...n", pid);
}'
AST
-------------------
Program
kprobe:do_nanosleep
call: printf
string: PID %d sleeping...n
builtin: pid
[...]
Now you have AST nodes!
BPF Internals (Brendan Gregg)
bpftrace mid-level internals 3/13
BPF Internals (Brendan Gregg)
BPF Internals (Brendan Gregg)
Tracepoint & Clang struct parsers
kprobe:do_nanosleep {
printf("PID %d sleeping...n", pid);
}
Not needed for this example
(no struct member dereferencing)
BPF Internals (Brendan Gregg)
bpftrace mid-level internals 4/13
BPF Internals (Brendan Gregg)
BPF Internals (Brendan Gregg)
# bpftrace -e 'kprobe:do_nanosleep {
printf("PID %d sleeping...n", pidd); }'
stdin:2:36-38: ERROR: Unknown identifier: 'pidd'
printf("PID %d sleeping...n", pidd);
Semantic analyzer
void SemanticAnalyser::visit(Identifier &identifier)
{
if (bpftrace_.enums_.count(identifier.ident) != 0) {
identifier.type = SizedType(Type::integer, 8);
}
else {
identifier.type = SizedType(Type::none, 0);
error("Unknown identifier: '" + identifier.ident + "'", identifier.loc);
}
}
bpftrace src/ast/semantic_analyser.cpp
Catches many program errors; E.g.:
>2000 lines of code
BPF Internals (Brendan Gregg)
bpftrace mid-level internals 5/13
BPF Internals (Brendan Gregg)
BPF Internals (Brendan Gregg)
AST LLVM IR
→
void CodegenLLVM::visit(Builtin &builtin)
{
[…]
else if (builtin.ident == "pid" || builtin.ident == "tid")
{
Value *pidtgid = b_.CreateGetPidTgid();
if (builtin.ident == "pid")
{
expr_ = b_.CreateLShr(pidtgid, 32);
[…]
bpftrace src/ast/codegen_llvm.cpp
BPF logical shift right instruction
BPF Internals (Brendan Gregg)
bpftrace src/ast/irbuilderbpf.cpp
CallInst *IRBuilderBPF::CreateGetPidTgid()
{
// u64 bpf_get_current_pid_tgid(void)
// Return: current->tgid << 32 | current->pid
FunctionType *getpidtgid_func_type = FunctionType::get(getInt64Ty(),
false);
PointerType *getpidtgid_func_ptr_type =
PointerType::get(getpidtgid_func_type, 0);
Constant *getpidtgid_func = ConstantExpr::getCast(
Instruction::IntToPtr,
getInt64(libbpf::BPF_FUNC_get_current_pid_tgid),
getpidtgid_func_ptr_type);
return CreateCall(getpidtgid_func, {}, "get_pid_tgid");
}
AST LLVM IR (2)
→
BPF helper call number
BPF Internals (Brendan Gregg)
#define __BPF_FUNC_MAPPER(FN) 
FN(unspec), 
FN(map_lookup_elem), 
FN(map_update_elem), 
FN(map_delete_elem), 
FN(probe_read), 
FN(ktime_get_ns), 
FN(trace_printk), 
FN(get_prandom_u32), 
FN(get_smp_processor_id), 
FN(skb_store_bytes), 
FN(l3_csum_replace), 
FN(l4_csum_replace), 
FN(tail_call), 
FN(clone_redirect), 
FN(get_current_pid_tgid), 
FN(get_current_uid_gid), 
[...]
Linux include/uapi/linux/bpf.h
BPF helper calls
#14
BPF Internals (Brendan Gregg)
# bpftrace -d -e 'kprobe:do_nanosleep {
printf("PID %d sleeping...n", pid); }'
[…]
define i64 @"kprobe:do_nanosleep"(i8*) local_unnamed_addr section
"s_kprobe:do_nanosleep_1" {
entry:
%printf_args = alloca %printf_t, align 8
%1 = bitcast %printf_t* %printf_args to i8*
call void @llvm.lifetime.start.p0i8(i64 -1, i8* nonnull %1)
%2 = getelementptr inbounds %printf_t, %printf_t* %printf_args, i64 0, i32 0
store i64 0, i64* %2, align 8
%get_pid_tgid = tail call i64 inttoptr (i64 14 to i64 ()*)()
%3 = lshr i64 %get_pid_tgid, 32
%4 = getelementptr inbounds %printf_t, %printf_t* %printf_args, i64 0, i32 1
store i64 %3, i64* %4, align 8
%pseudo = tail call i64 @llvm.bpf.pseudo(i64 1, i64 1)
%get_cpu_id = tail call i64 inttoptr (i64 8 to i64 ()*)()
%perf_event_output = call i64 inttoptr (i64 25 to i64 (i8*, i64, i64, %printf_t*,
i64)*)(i8* %0, i64 %pseudo, i64 %get_cpu_id, %printf_t* nonnull %printf_args, i64 16)
call void @llvm.lifetime.end.p0i8(i64 -1, i8* nonnull %1)
ret i64 0
}
Now you have LLVM IR!
BPF Internals (Brendan Gregg)
# bpftrace -d -e 'kprobe:do_nanosleep {
printf("PID %d sleeping...n", pid); }'
[…]
define i64 @"kprobe:do_nanosleep"(i8*) local_unnamed_addr section
"s_kprobe:do_nanosleep_1" {
entry:
%printf_args = alloca %printf_t, align 8
%1 = bitcast %printf_t* %printf_args to i8*
call void @llvm.lifetime.start.p0i8(i64 -1, i8* nonnull %1)
%2 = getelementptr inbounds %printf_t, %printf_t* %printf_args, i64 0, i32 0
store i64 0, i64* %2, align 8
%get_pid_tgid = tail call i64 inttoptr (i64 14 to i64 ()*)()
%3 = lshr i64 %get_pid_tgid, 32
%4 = getelementptr inbounds %printf_t, %printf_t* %printf_args, i64 0, i32 1
store i64 %3, i64* %4, align 8
%pseudo = tail call i64 @llvm.bpf.pseudo(i64 1, i64 1)
%get_cpu_id = tail call i64 inttoptr (i64 8 to i64 ()*)()
%perf_event_output = call i64 inttoptr (i64 25 to i64 (i8*, i64, i64, %printf_t*,
i64)*)(i8* %0, i64 %pseudo, i64 %get_cpu_id, %printf_t* nonnull %printf_args, i64 16)
call void @llvm.lifetime.end.p0i8(i64 -1, i8* nonnull %1)
ret i64 0
}
Now you have LLVM IR! (2)
BPF Internals (Brendan Gregg)
# bpftrace -d -e 'kprobe:do_nanosleep {
printf("PID %d sleeping...n", pid); }'
[…]
define i64 @"kprobe:do_nanosleep"(i8*) local_unnamed_addr section
"s_kprobe:do_nanosleep_1" {
entry:
%printf_args = alloca %printf_t, align 8
%1 = bitcast %printf_t* %printf_args to i8*
call void @llvm.lifetime.start.p0i8(i64 -1, i8* nonnull %1)
%2 = getelementptr inbounds %printf_t, %printf_t* %printf_args, i64 0, i32 0
store i64 0, i64* %2, align 8
[...]
Now you have LLVM IR! (3)
This is all generated from LLVM IR calls in bpftrace src/ast/*
Lots of CreateAllocaBPF(), CreateGEP(), etc. (this gets verbose)
BPF Internals (Brendan Gregg)
bpftrace mid-level internals 6/13
BPF Internals (Brendan Gregg)
BPF Internals (Brendan Gregg)
Extended BPF instruction (bytecode) format
opcode
dest
reg
src
reg
signed
offset
signed
immediate constant
8-bit 4-bit 4-bit 16-bit 32-bit
← 64-bit →
opcode src
inst.
class
4-bit 1-bit 3-bit
E.g., for
ALU & JMP
classes:
BPF Internals (Brendan Gregg)
Extended BPF instruction (bytecode) format (2)
opcode
dest
reg
src
reg
signed
offset
signed
immediate constant
8-bit 4-bit 4-bit 16-bit 32-bit
opcode src
inst.
class
4-bit 1-bit 3-bit
E.g., for
ALU & JMP
classes:
E.g., call get_current_pid_tgid
14
BPF_CALL BPF_JMP
BPF Internals (Brendan Gregg)
Extended BPF instruction (bytecode) format (3)
opcode
dest
reg
src
reg
signed
offset
signed
immediate constant
8-bit 4-bit 4-bit 16-bit 32-bit
opcode src
inst.
class
4-bit 1-bit 3-bit
E.g., for
ALU & JMP
classes:
E.g., call get_current_pid_tgid
14
BPF_CALL BPF_JMP
#define BPF_JMP 0x05
#define BPF_CALL 0x80
Linux include/uapi/linux/bpf.h
Linux include/uapi/linux/bpf_common.h
BPF Internals (Brendan Gregg)
Extended BPF instruction (bytecode) format (4)
opcode
dest
reg
src
reg
signed
offset
signed
immediate constant
8-bit 4-bit 4-bit 16-bit 32-bit
opcode src
inst.
class
4-bit 1-bit 3-bit
E.g., for
ALU & JMP
classes:
E.g., call get_current_pid_tgid
0xe0 0x00 0x00 0x00
BPF_CALL BPF_JMP
#define BPF_JMP 0x05
#define BPF_CALL 0x80
Linux include/uapi/linux/bpf.h
Linux include/uapi/linux/bpf_common.h
0x85 0x0 0x0 0x00 0x00
BPF Internals (Brendan Gregg)
Extended BPF instruction (bytecode) format (5)
E.g., call get_current_pid_tgid
(hex) 85 00 00 e0 00 00 00
As per the BPF specification (currently Linux headers)
BPF Internals (Brendan Gregg)
LLVM/Clang has a BPF target
LLVM
IR
--target bpf
BPF
bytecode
Future: bpftrace may include its own lightweight
bpftrace compiler (BC) as an option
(pros: no dependencies; cons: less optimal code)
BPF Internals (Brendan Gregg)
LLVM/Clang has a BPF target (2)
LLVM
LLVM
IR
BPF
bytecode
LLVM
BPF target
Linux include/uapi/linux/bpf_common.h
Linux include/uapi/linux/bpf.h
Linux include/uapi/linux/filter.h
BPF
specification
(#defines)
BPF Internals (Brendan Gregg)
LLVM llvm/lib/Target/BPF/BPFInstrInfo.td
class CALL<string OpcodeStr>
: TYPE_ALU_JMP<BPF_CALL.Value, BPF_K.Value,
(outs),
(ins calltarget:$BrDst),
!strconcat(OpcodeStr, " $BrDst"),
[]> {
bits<32> BrDst;
let Inst{31-0} = BrDst;
let BPFClass = BPF_JMP;
}
Plus more llvm boilerplate & BPF headers shown earlier
E.g., tail call i64 inttoptr (i64 14 to i64 ()*)()
14
85 00 00 e0 00 00 00
LLVM IR BPF
→
BPF Internals (Brendan Gregg)
bf 16 00 00 00 00 00 00
b7 01 00 00 00 00 00 00
7b 1a f0 ff 00 00 00 00
85 00 00 00 0e 00 00 00
77 00 00 00 20 00 00 00
7b 0a f8 ff 00 00 00 00
18 17 00 00 30 00 00 00 00 00 00 00 00 00 00 00
85 00 00 00 08 00 00 00
bf a4 00 00 00 00 00 00
07 04 00 00 f0 ff ff ff
bf 61 00 00 00 00 00 00
bf 72 00 00 00 00 00 00
bf 03 00 00 00 00 00 00
b7 05 00 00 10 00 00 00
85 00 00 00 19 00 00 00
b7 00 00 00 00 00 00 00
95 00 00 00 00 00 00 00
Now you have BPF bytecode!
BPF Internals (Brendan Gregg)
bf 16 00 00 00 00 00 00
b7 01 00 00 00 00 00 00
7b 1a f0 ff 00 00 00 00
85 00 00 00 0e 00 00 00
77 00 00 00 20 00 00 00
7b 0a f8 ff 00 00 00 00
18 17 00 00 30 00 00 00 00 00 00 00 00 00 00 00
85 00 00 00 08 00 00 00
bf a4 00 00 00 00 00 00
07 04 00 00 f0 ff ff ff
bf 61 00 00 00 00 00 00
bf 72 00 00 00 00 00 00
bf 03 00 00 00 00 00 00
b7 05 00 00 10 00 00 00
85 00 00 00 19 00 00 00
b7 00 00 00 00 00 00 00
95 00 00 00 00 00 00 00
Now you have BPF bytecode! (2)
14 (BPF_FUNC_get_current_pid_tgid)
0x05 (BPF_JMP) | 0x80 (BPF_CALL)
BPF Internals (Brendan Gregg)
bpftrace mid-level internals 7/13
BPF Internals (Brendan Gregg)
BPF Internals (Brendan Gregg)
# strace -fe bpf bpftrace -e 'kprobe:do_nanosleep {
printf("PID %d sleeping...n", pid);
}'
[...]
bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_KPROBE, insn_cnt=18, insns=0x7fdde5305000,
license="GPL", log_level=0, log_size=0, log_buf=NULL, kern_version=KERNEL_VERSION(5, 8,
18), prog_flags=0, prog_name="do_nanosleep", prog_ifindex=0,
expected_attach_type=BPF_CGROUP_INET_INGRESS, prog_btf_fd=0, func_info_rec_size=0,
func_info=NULL, func_info_cnt=0, line_info_rec_size=0, line_info=NULL, line_info_cnt=0,
attach_btf_id=0, attach_prog_fd=0}, 120) = 14
Sending BPF bytecode to the kernel
Success! Passed the verifier...
BPF Internals (Brendan Gregg)
bpftrace mid-level internals 8/13
BPF Internals (Brendan Gregg)
BPF Internals (Brendan Gregg)
BPF mid-level internals
From: BPF Performance Tools, Figure 2-3
BPF Internals (Brendan Gregg)
85 00 00 00 12 34 56 78
Verifying BPF instructions
If ...
static int do_check(struct bpf_verifier_env *env)
[...]
} else if (class == BPF_JMP || class == BPF_JMP32) {
u8 opcode = BPF_OP(insn->code);
env->jmps_processed++;
if (opcode == BPF_CALL) {
[...]
err = check_helper_call(env, insn->imm, env->insn_idx);
[...]
static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn_idx)
{
const struct bpf_func_proto *fn = NULL;
struct bpf_reg_state *regs;
struct bpf_call_arg_meta meta;
bool changes_data;
int i, err;
/* find function prototype */
if (func_id < 0 || func_id >= __BPF_FUNC_MAX_ID) {
verbose(env, "invalid func %s#%dn", func_id_name(func_id),
func_id);
return -EINVAL;
Linux kernel/bpf/verifier.c
>9000 lines of code
Imagine we call a bogus function...
BPF Internals (Brendan Gregg)
>9000 lines of code
>260 error returns
Checks every instruction
Checks every code path
Rewrites some bytecode
BPF verifier
check_helper_mem_access
check_func_arg
check_map_func_compatibility
check_func_proto
check_func_call
check_reference_leak
check_helper_call
check_alu_op
check_cond_jmp_op
check_ld_imm
check_ld_abs
check_return_code
check_cfg
check_btf_func
check_btf_line
check_btf_info
check_map_prealloc
check_map_prog_compatibility
check_struct_ops_btf_id
check_attach_modify_return
check_attach_btf_id
Verifier functions:
check_subprogs
check_reg_arg
check_stack_write
check_stack_read
check_stack_access
check_map_access_type
check_mem_region_access
check_map_access
check_packet_access
check_ctx_access
check_flow_keys_access
check_sock_access
check_pkt_ptr_alignment
check_generic_ptr_alignment
check_ptr_alignment
check_max_stack_depth
check_tp_buffer_access
check_ptr_to_btf_access
check_mem_access
check_xadd
check_stack_boundary
BPF Internals (Brendan Gregg)
●
Memory access
– Direct access extremely restricted
– Can only read initialized memory
– Other kernel memory must pass through
the bpf_probe_read() helper and its checks
●
Arguments are the correct type
●
Register usage allowed
– E.g., no frame pointer writes
●
No write overflows
●
No addr leaks
●
Etc.
Verifying Instructions
STACK
CTX
SOCKET
MAP_VALUE
LOAD
STORE
Memory
bpf_probe_read
random addr
BPF Internals (Brendan Gregg)
Verifying Code Paths
●
All instruction must lead to exit
●
No unreachable instructions
●
No backwards branches (loops)
except BPF bounded loops
biolatency as GraphViz dot:
BPF Internals (Brendan Gregg)
Verifying Code Paths
●
All instruction must lead to exit
●
No unreachable instructions
●
No backwards branches (loops)
except BPF bounded loops
biolatency as GraphViz dot:
BPF Internals (Brendan Gregg)
bf 16 00 00 00 00 00 00
b7 01 00 00 00 00 00 00
7b 1a f0 ff 00 00 00 00
85 00 00 00 0e 00 00 00
77 00 00 00 20 00 00 00
7b 0a f8 ff 00 00 00 00
18 17 00 00 30 00 00 00 00 00 00 00 00 00 00 00
85 00 00 00 08 00 00 00
bf a4 00 00 00 00 00 00
07 04 00 00 f0 ff ff ff
bf 61 00 00 00 00 00 00
bf 72 00 00 00 00 00 00
bf 03 00 00 00 00 00 00
b7 05 00 00 10 00 00 00
85 00 00 00 19 00 00 00
b7 00 00 00 00 00 00 00
95 00 00 00 00 00 00 00
Pre-verifier BPF bytecode
BPF Internals (Brendan Gregg)
bf 16 00 00 00 00 00 00
b7 01 00 00 00 00 00 00
7b 1a f0 ff 00 00 00 00
85 00 00 00 d0 81 01 00
77 00 00 00 20 00 00 00
7b 0a f8 ff 00 00 00 00
18 17 00 00 18 00 00 00 00 00 00 00 00 00 00 00
85 00 00 00 f0 80 01 00
bf a4 00 00 00 00 00 00
07 04 00 00 f0 ff ff ff
bf 61 00 00 00 00 00 00
bf 72 00 00 00 00 00 00
bf 03 00 00 00 00 00 00
b7 05 00 00 10 00 00 00
85 00 00 00 30 2c ff ff
b7 00 00 00 00 00 00 00
95 00 00 00 00 00 00 00
Post-verifier BPF bytecode
BPF Internals (Brendan Gregg)
bf 16 00 00 00 00 00 00
b7 01 00 00 00 00 00 00
7b 1a f0 ff 00 00 00 00
85 00 00 00 d0 81 01 00
77 00 00 00 20 00 00 00
7b 0a f8 ff 00 00 00 00
18 17 00 00 18 00 00 00 00 00 00 00 00 00 00 00
85 00 00 00 f0 80 01 00
bf a4 00 00 00 00 00 00
07 04 00 00 f0 ff ff ff
bf 61 00 00 00 00 00 00
bf 72 00 00 00 00 00 00
bf 03 00 00 00 00 00 00
b7 05 00 00 10 00 00 00
85 00 00 00 30 2c ff ff
b7 00 00 00 00 00 00 00
95 00 00 00 00 00 00 00
Post-verifier BPF bytecode (2)
E.g., call get_current_pid_tgid
helper index value has become an instruction
offset addresses from __bpf_call_base
BPF Internals (Brendan Gregg)
# bpftool prog show
[…]
70: kprobe name do_nanosleep tag 8dc93a3b6a21ef3b gpl
loaded_at 2021-05-02T00:44:26+0000 uid 0
xlated 144B jited 96B memlock 4096B map_ids 24
# bpftool prog dump xlated id 70 opcodes
0: (bf) r6 = r1
bf 16 00 00 00 00 00 00
1: (b7) r1 = 0
b7 01 00 00 00 00 00 00
2: (7b) *(u64 *)(r10 -16) = r1
7b 1a f0 ff 00 00 00 00
3: (85) call bpf_get_current_pid_tgid#98768
85 00 00 00 d0 81 01 00
4: (77) r0 >>= 32
77 00 00 00 20 00 00 00
5: (7b) *(u64 *)(r10 -8) = r0
7b 0a f8 ff 00 00 00 00
6: (18) r7 = map[id:24]
18 17 00 00 18 00 00 00 00 00 00 00 00 00 00 00
[...]
BPF bytecode with human words
Using bpftool
on a running
instance of the
program
BPF Internals (Brendan Gregg)
# bpftool prog show
[…]
70: kprobe name do_nanosleep tag 8dc93a3b6a21ef3b gpl
loaded_at 2021-05-02T00:44:26+0000 uid 0
xlated 144B jited 96B memlock 4096B map_ids 24
# bpftool prog dump xlated id 70 opcodes
0: (bf) r6 = r1
bf 16 00 00 00 00 00 00
1: (b7) r1 = 0
b7 01 00 00 00 00 00 00
2: (7b) *(u64 *)(r10 -16) = r1
7b 1a f0 ff 00 00 00 00
3: (85) call bpf_get_current_pid_tgid#98768
85 00 00 00 d0 81 01 00
4: (77) r0 >>= 32
77 00 00 00 20 00 00 00
5: (7b) *(u64 *)(r10 -8) = r0
7b 0a f8 ff 00 00 00 00
6: (18) r7 = map[id:24]
18 17 00 00 18 00 00 00 00 00 00 00 00 00 00 00
[...]
BPF bytecode with human words (2)
Using bpftool
on a running
instance of the
program
BPF Internals (Brendan Gregg)
# bpftool prog dump xlated id 70
0: (bf) r6 = r1
1: (b7) r1 = 0
2: (7b) *(u64 *)(r10 -16) = r1
3: (85) call bpf_get_current_pid_tgid#98768
4: (77) r0 >>= 32
5: (7b) *(u64 *)(r10 -8) = r0
6: (18) r7 = map[id:24]
8: (85) call bpf_get_smp_processor_id#98544
9: (bf) r4 = r10
10: (07) r4 += -16
11: (bf) r1 = r6
12: (bf) r2 = r7
13: (bf) r3 = r0
14: (b7) r5 = 16
15: (85) call bpf_perf_event_output#-54224
16: (b7) r0 = 0
17: (95) exit
BPF bytecode, opcodes only
just the opcode 8 bits
BPF Internals (Brendan Gregg)
bpftrace mid-level internals 9/13
BPF Internals (Brendan Gregg)
BPF Internals (Brendan Gregg)
BPF bytecode native machine code
→
x86 machine code
BPF
bytecode
arch/x86/net/bpf_jit_comp.c
arm machine code
sparc machine code
...
arch/arm64/net/bpf_jit_comp.c
arch/sparc/net/bpf_jit_*
BPF Internals (Brendan Gregg)
BPF bytecode x86 machine code
→
If ...
static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
int oldproglen, struct jit_context *ctx)
{
[…]
for (i = 1; i <= insn_cnt; i++, insn++) {
[…]
switch (insn->code) {
[…]
case BPF_JMP | BPF_CALL:
func = (u8 *) __bpf_call_base + imm32;
if (!imm32 || emit_call(&prog, func, image + addrs[i - 1]))
return -EINVAL;
break;
[…]
static int emit_call(u8 **pprog, void *func, void *ip)
{
return emit_patch(pprog, func, ip, 0xE8);
}
Linux arch/x86/net/bpf_jit_comp.c
BPF Internals (Brendan Gregg)
BPF bytecode x86 machine code
→
If ...
static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
int oldproglen, struct jit_context *ctx)
{
[…]
for (i = 1; i <= insn_cnt; i++, insn++) {
[…]
switch (insn->code) {
[…]
case BPF_JMP | BPF_CALL:
func = (u8 *) __bpf_call_base + imm32;
if (!imm32 || emit_call(&prog, func, image + addrs[i - 1]))
return -EINVAL;
break;
[…]
static int emit_call(u8 **pprog, void *func, void *ip)
{
return emit_patch(pprog, func, ip, 0xE8);
}
Linux arch/x86/net/bpf_jit_comp.c
E.g., call get_current_pid_tgid
0xe8 is x86 CALL
BPF Internals (Brendan Gregg)
# bpftool prog dump jited id 80 opcodes | grep -v :
55
48 89 e5
48 81 ec 10 00 00 00
53
41 55
41 56
41 57
6a 00
48 89 fb
31 ff
48 89 7d f0
e8 a0 8b 44 c2
48 c1 e8 20
48 89 45 f8
49 bd 00 b6 7b b8 3a 9d ff ff
e8 a9 8a 44 c2
48 89 e9
48 83 c1 f0
48 89 df
[...]
Now you have x86 machine code!
31 instructions
BPF Internals (Brendan Gregg)
# bpftool prog dump jited id 80 opcodes | grep -v :
55
48 89 e5
48 81 ec 10 00 00 00
53
41 55
41 56
41 57
6a 00
48 89 fb
31 ff
48 89 7d f0
e8 a0 8b 44 c2
48 c1 e8 20
48 89 45 f8
49 bd 00 b6 7b b8 3a 9d ff ff
e8 a9 8a 44 c2
48 89 e9
48 83 c1 f0
48 89 df
[...]
Now you have x86 machine code!
31 instructions
CALL get_current_pid_tgid
BPF Internals (Brendan Gregg)
# bpftool prog dump jited id 71 opcodes | grep -v :
fd 7b bf a9
fd 03 00 91
f3 53 bf a9
f5 5b bf a9
f9 6b bf a9
f9 03 00 91
1a 00 80 d2
ff 43 00 d1
13 00 00 91
00 00 80 d2
ea 01 80 92
20 6b 2a f8
ea 48 9e 92
0a 05 a2 f2
0a 00 d0 f2
40 01 3f d6
07 00 00 91
e7 fc 60 d3
ea 00 80 92
[...]
… or you have ARM machine code!
48 instructions
BPF Internals (Brendan Gregg)
# bpftool prog dump jited id 80
0xffffffffc0192b2e:
0: push %rbp
1: mov %rsp,%rbp
4: sub $0x10,%rsp
b: push %rbx
c: push %r13
e: push %r14
10: push %r15
12: pushq $0x0
14: mov %rdi,%rbx
17: xor %edi,%edi
19: mov %rdi,-0x10(%rbp)
1d: callq 0xffffffffc2448bc2
22: shr $0x20,%rax
26: mov %rax,-0x8(%rbp)
2a: movabs $0xffff9d3ab87bb600,%r13
34: callq 0xffffffffc2448ae2
39: mov %rbp,%rcx
3c: add $0xfffffffffffffff0,%rcx
[...]
x86 instruction disassembly
BPF Internals (Brendan Gregg)
# bpftool prog dump jited id 80
0xffffffffc0192b2e:
0: push %rbp
1: mov %rsp,%rbp
4: sub $0x10,%rsp
b: push %rbx
c: push %r13
e: push %r14
10: push %r15
12: pushq $0x0
14: mov %rdi,%rbx
17: xor %edi,%edi
19: mov %rdi,-0x10(%rbp)
1d: callq 0xffffffffc2448bc2
22: shr $0x20,%rax
26: mov %rax,-0x8(%rbp)
2a: movabs $0xffff9d3ab87bb600,%r13
34: callq 0xffffffffc2448ae2
39: mov %rbp,%rcx
3c: add $0xfffffffffffffff0,%rcx
[...]
x86 instruction disassembly (2)
BPF prologue
BPF program
get_current_pid_tgid
BPF Internals (Brendan Gregg)
Plus you have BPF helper code
BPF_CALL_0(bpf_get_current_pid_tgid)
{
struct task_struct *task = current;
if (unlikely(!task))
return -EINVAL;
return (u64) task->tgid << 32 | task->pid;
}
[...]
Linux kernel/bpf/helpers.c
BPF Internals (Brendan Gregg)
bpftrace mid-level internals 10/13
BPF Internals (Brendan Gregg)
BPF Internals (Brendan Gregg)
# strace -fe perf_event_open,bpf,ioctl bpftrace -e 'kprobe:do_nanosleep {
printf("PID %d sleeping...n", pid); }'
[...]
bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_KPROBE, insn_cnt=18, insns=0x7f6e826cf000,
license="GPL", log_level=0, log_size=0, log_buf=NULL, kern_version=KERNEL_VERSION(5, 8,
18), prog_flags=0, prog_name="do_nanosleep", prog_ifindex=0,
expected_attach_type=BPF_CGROUP_INET_INGRESS, prog_btf_fd=0, func_info_rec_size=0,
func_info=NULL, func_info_cnt=0, line_info_rec_size=0, line_info=NULL, line_info_cnt=0,
attach_btf_id=0, attach_prog_fd=0}, 120) = 14
perf_event_open({type=0x6 /* PERF_TYPE_??? */, size=PERF_ATTR_SIZE_VER5, config=0, ...},
-1, 0, -1, PERF_FLAG_FD_CLOEXEC) = 13
ioctl(13, PERF_EVENT_IOC_SET_BPF, 14) = 0
ioctl(13, PERF_EVENT_IOC_ENABLE, 0) = 0
Attaching BPF to a kprobe
BPF Internals (Brendan Gregg)
# strace -fe perf_event_open,bpf,ioctl bpftrace -e 'kprobe:do_nanosleep {
printf("PID %d sleeping...n", pid); }'
[...]
bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_KPROBE, insn_cnt=18, insns=0x7f6e826cf000,
license="GPL", log_level=0, log_size=0, log_buf=NULL, kern_version=KERNEL_VERSION(5, 8,
18), prog_flags=0, prog_name="do_nanosleep", prog_ifindex=0,
expected_attach_type=BPF_CGROUP_INET_INGRESS, prog_btf_fd=0, func_info_rec_size=0,
func_info=NULL, func_info_cnt=0, line_info_rec_size=0, line_info=NULL, line_info_cnt=0,
attach_btf_id=0, attach_prog_fd=0}, 120) = 14
perf_event_open({type=0x6 /* PERF_TYPE_??? */, size=PERF_ATTR_SIZE_VER5, config=0, ...},
-1, 0, -1, PERF_FLAG_FD_CLOEXEC) = 13
ioctl(13, PERF_EVENT_IOC_SET_BPF, 14) = 0
ioctl(13, PERF_EVENT_IOC_ENABLE, 0) = 0
Attaching BPF to a kprobe (2)
BPF program
(#14)
kprobe
(#13)
ioctl()
creates the kprobe
strace is lacking some translation
bpf() perf_event_open()
BPF Internals (Brendan Gregg)
bpftrace mid-level internals 11/13
BPF Internals (Brendan Gregg)
BPF Internals (Brendan Gregg)
kprobes
If ...
static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode)
{
struct restart_block *restart;
do {
set_current_state(TASK_INTERRUPTIBLE);
hrtimer_sleeper_start_expires(t, mode);
if (likely(t->task))
[...]
Linux kernel/time/hrtimer.c
How do we instrument this?
(it’s actually quite easy)
BPF Internals (Brendan Gregg)
(gdb) disas/r do_nanosleep
Dump of assembler code for function do_nanosleep:
0xffffffff81b7d810 <+0>: e8 1b 00 4f ff callq 0xffffffff8106d830 <__fentry__>
0xffffffff81b7d815 <+5>: 55 push %rbp
0xffffffff81b7d816 <+6>: 89 f1 mov %esi,%ecx
0xffffffff81b7d818 <+8>: 48 89 e5 mov %rsp,%rbp
0xffffffff81b7d81b <+11>: 41 55 push %r13
0xffffffff81b7d81d <+13>: 41 54 push %r12
0xffffffff81b7d81f <+15>: 53 push %rbx
[...]
Instrumenting live kernel functions
BPF Internals (Brendan Gregg)
(gdb) disas/r do_nanosleep
Dump of assembler code for function do_nanosleep:
0xffffffff81b7d810 <+0>: e8 1b 00 4f ff callq 0xffffffff8106d830 <__fentry__>
0xffffffff81b7d815 <+5>: 55 push %rbp
0xffffffff81b7d816 <+6>: 89 f1 mov %esi,%ecx
0xffffffff81b7d818 <+8>: 48 89 e5 mov %rsp,%rbp
0xffffffff81b7d81b <+11>: 41 55 push %r13
0xffffffff81b7d81d <+13>: 41 54 push %r12
0xffffffff81b7d81f <+15>: 53 push %rbx
[...]
Instrumenting live kernel functions (2)
A) Ftrace is already there. Kprobes can add a handler.
B) Or a breakpoint written (e.g., int3).
C) Or a jmp is written.
this is usually
nop’d out
May need to stop_machine() to ensure other
cores don’t execute changing instruction text
BPF Internals (Brendan Gregg)
bpftrace mid-level internals 12/13
BPF Internals (Brendan Gregg)
BPF Internals (Brendan Gregg)
Perf output buffers
# strace -fe perf_event_open bpftrace -e 'kprobe:do_nanosleep {
printf("PID %d sleeping...n", pid); }'
strace: Process 3229968 attached
[pid 3229968] +++ exited with 0 +++
perf_event_open({type=PERF_TYPE_SOFTWARE, ..., config=PERF_COUNT_SW_BPF_OUTPUT, ...}, -1, 0, -1,
PERF_FLAG_FD_CLOEXEC) = 5
perf_event_open({type=PERF_TYPE_SOFTWARE, ..., config=PERF_COUNT_SW_BPF_OUTPUT, ...}, -1, 1, -1,
PERF_FLAG_FD_CLOEXEC) = 6
perf_event_open({type=PERF_TYPE_SOFTWARE, ..., config=PERF_COUNT_SW_BPF_OUTPUT, ...}, -1, 2, -1,
PERF_FLAG_FD_CLOEXEC) = 7
[...]
BPF Internals (Brendan Gregg)
Perf output buffers
# strace -fe perf_event_open bpftrace -e 'kprobe:do_nanosleep {
printf("PID %d sleeping...n", pid); }'
strace: Process 3229968 attached
[pid 3229968] +++ exited with 0 +++
perf_event_open({type=PERF_TYPE_SOFTWARE, ..., config=PERF_COUNT_SW_BPF_OUTPUT, ...}, -1, 0, -1,
PERF_FLAG_FD_CLOEXEC) = 5
perf_event_open({type=PERF_TYPE_SOFTWARE, ..., config=PERF_COUNT_SW_BPF_OUTPUT, ...}, -1, 1, -1,
PERF_FLAG_FD_CLOEXEC) = 6
perf_event_open({type=PERF_TYPE_SOFTWARE, ..., config=PERF_COUNT_SW_BPF_OUTPUT, ...}, -1, 2, -1,
PERF_FLAG_FD_CLOEXEC) = 7
[...]
CPU ID
output buffer 5 CPU 0
output buffer 6 CPU 1
output buffer 7 CPU 2
...
bpftrace
fd 5
fd 6
fd 7
bpftrace waits for events using epoll_wait(2)
BPF Internals (Brendan Gregg)
bpftrace mid-level internals 13/13
BPF Internals (Brendan Gregg)
BPF Internals (Brendan Gregg)
bpftrace printf & async actions
enum class AsyncAction
{
// clang-format off
printf = 0, // printf reserves 0-9999 for printf_ids
syscall = 10000, // system reserves 10000-19999 for printf_ids
cat = 20000, // cat reserves 20000-29999 for printf_ids
exit = 30000,
print,
clear,
[...]
bpftrace src/types.h
printf_id
64-bit
arguments
bpftrace perf output message format:
E.g., printf_id 0: "PID %d sleeping...n"
High numbers are used
for other async actions:
BPF Internals (Brendan Gregg)
bpftrace printf & async actions (2)
void perf_event_printer(void *cb_cookie, void *data, int size)
{
[…]
auto printf_id = *reinterpret_cast<uint64_t *>(arg_data);
[…]
// async actions
if (printf_id == asyncactionint(AsyncAction::exit))
{
bpftrace->request_finalize();
return;
}
[…]
// printf
auto fmt = std::get<0>(bpftrace->printf_args_[printf_id]);
auto args = std::get<1>(bpftrace->printf_args_[printf_id]);
auto arg_values = bpftrace->get_arg_values(args, arg_data);
bpftrace->out_->message(MessageType::printf, format(fmt, arg_values), false);
bpftrace src/bpftrace.cpp
message() just prints it out
BPF Internals (Brendan Gregg)
# bpftrace -e 'kprobe:do_nanosleep {
printf("PID %d sleeping...n", pid);
}'
Attaching 1 probe...
PID 10287 sleeping...
PID 10297 sleeping...
PID 10287 sleeping…
PID 10297 sleeping...
PID 10287 sleeping...
PID 2218 sleeping...
PID 10297 sleeping...
[...]
Final output
BPF Internals (Brendan Gregg)
2. Static tracing and map summaries
BPF Internals (Brendan Gregg)
2. Static tracing and map summaries
bpftrace -e 'tracepoint:block:block_rq_issue {
@[comm] = count();
}'
BPF Internals (Brendan Gregg)
# bpftrace -e 'tracepoint:block:block_rq_issue {
@[comm] = count();
}'
Attaching 1 probe…
^C
@[kworker/2:2H]: 131
@[chrome]: 135
@[kworker/7:1H]: 185
@[Xorg]: 245
@[tar]: 1204
@[dmcrypt_write/2]: 1993
Example output
BPF Internals (Brendan Gregg)
bpftrace mid-level internals 1/4
BPF Internals (Brendan Gregg)
BPF Internals (Brendan Gregg)
map @{ident}|@
[...]
Lexer & Yacc
bpftrace src/lexer.l
%token <std::string> MAP "map"
[...]
stmt : expr { $$ = new ast::ExprStatement($1, @1); }
| compound_assignment { $$ = $1; }
| jump_stmt { $$ = $1; }
| map "=" expr { $$ = new ast::AssignMapStatement($1, $3, false, @2); }
[...]
map : MAP { $$ = new ast::Map($1, @$); }
| MAP "[" vargs "]" { $$ = new ast::Map($1, $3, @$); }
bpftrace src/parser.yy
BPF Internals (Brendan Gregg)
bpftrace mid-level internals 2/4
BPF Internals (Brendan Gregg)
BPF Internals (Brendan Gregg)
BPF maps
●
Custom data storage
●
Can be a key/value store (hash)
●
Also used for histogram
summaries (BPF code calculates
a bucket index as the key) tar 1204
chrome 135
Xorg 245
keys values
@
BPF Internals (Brendan Gregg)
BPF map operations
User-space BPF-space (kernel)
bpf(2) syscall
BPF_MAP_CREATE
BPF_MAP_LOOKUP_ELEM
BPF_MAP_UPDATE_ELEM
BPF_MAP_DELETE_ELEM
BPF_MAP_GET_NEXT_KEY
[...]
BPF helpers
bpf_map_lookup_elem()
bpf_map_update_elem()
bpf_map_delete_elem()
[...]
BPF Internals (Brendan Gregg)
Creating BPF maps
# strace -febpf bpftrace -e 'block:block_rq_issue {
@[comm] = count(); }'
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_ARRAY, key_size=4, value_size=4, max_entries=1,
map_flags=0, inner_map_fd=0, map_name="", map_ifindex=0, btf_fd=0, btf_key_type_id=0,
btf_value_type_id=0}, 120) = 3
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_PERCPU_HASH, key_size=16, value_size=8, max_entries=4096,
map_flags=0, inner_map_fd=0, map_name="@", map_ifindex=0, btf_fd=0, btf_key_type_id=0,
btf_value_type_id=0}, 120) = -1 EINVAL (Invalid argument)
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_PERCPU_HASH, key_size=16, value_size=8, max_entries=4096,
map_flags=0, inner_map_fd=0, map_name="", map_ifindex=0, btf_fd=0, btf_key_type_id=0,
btf_value_type_id=0}, 120) = 3
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_PERF_EVENT_ARRAY, key_size=4, value_size=4,
max_entries=8, map_flags=0, inner_map_fd=0, map_name="printf", map_ifindex=0, btf_fd=0,
btf_key_type_id=0, btf_value_type_id=0}, 120) = 4
[...]
BPF Internals (Brendan Gregg)
Creating BPF maps
# strace -febpf bpftrace -e 'block:block_rq_issue {
@[comm] = count(); }'
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_ARRAY, key_size=4, value_size=4, max_entries=1,
map_flags=0, inner_map_fd=0, map_name="", map_ifindex=0, btf_fd=0, btf_key_type_id=0,
btf_value_type_id=0}, 120) = 3
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_PERCPU_HASH, key_size=16, value_size=8, max_entries=4096,
map_flags=0, inner_map_fd=0, map_name="@", map_ifindex=0, btf_fd=0, btf_key_type_id=0,
btf_value_type_id=0}, 120) = -1 EINVAL (Invalid argument)
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_PERCPU_HASH, key_size=16, value_size=8, max_entries=4096,
map_flags=0, inner_map_fd=0, map_name="", map_ifindex=0, btf_fd=0, btf_key_type_id=0,
btf_value_type_id=0}, 120) = 3
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_PERF_EVENT_ARRAY, key_size=4, value_size=4,
max_entries=8, map_flags=0, inner_map_fd=0, map_name="printf", map_ifindex=0, btf_fd=0,
btf_key_type_id=0, btf_value_type_id=0}, 120) = 4
[...]
BPF Internals (Brendan Gregg)
bpftrace mid-level internals 3/4
BPF Internals (Brendan Gregg)
BPF Internals (Brendan Gregg)
Tracepoint defined
If ...
DECLARE_EVENT_CLASS(block_rq,
TP_PROTO(struct request_queue *q, struct request *rq),
TP_ARGS(q, rq),
TP_STRUCT__entry(
__field( dev_t, dev )
__field( sector_t, sector )
__field( unsigned int, nr_sector )
__field( unsigned int, bytes )
__array( char, rwbs, RWBS_LEN )
__array( char, comm, TASK_COMM_LEN )
__dynamic_array( char, cmd, 1 )
),
TP_fast_assign(
__entry->dev = rq->rq_disk ? disk_devt(rq->rq_disk) : 0;
__entry->sector = blk_rq_trace_sector(rq);
[...]
DEFINE_EVENT(block_rq, block_rq_issue,
TP_PROTO(struct request_queue *q, struct request *rq),
TP_ARGS(q, rq)
);
Linux include/trace/events/block.h
BPF Internals (Brendan Gregg)
Tracepoints in code
If ...
void blk_mq_start_request(struct request *rq)
{
struct request_queue *q = rq->q;
trace_block_rq_issue(q, rq);
if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
rq->io_start_time_ns = ktime_get_ns();
rq->stats_sectors = blk_rq_sectors(rq);
[...]
Linux block/block-mq.c
This is a (best effort) stable interface
Use tracepoints instead of kprobes when possible!
BPF Internals (Brendan Gregg)
(gdb) disas/r blk_mq_start_request
Dump of assembler code for function blk_mq_start_request:
0xffffffff815118e0 <+0>: e8 4b bf b5 ff callq 0xffffffff8106d830 <__fentry__>
0xffffffff815118e5 <+5>: 55 push %rbp
0xffffffff815118e6 <+6>: 48 89 e5 mov %rsp,%rbp
0xffffffff815118e9 <+9>: 41 55 push %r13
0xffffffff815118eb <+11>: 41 54 push %r12
0xffffffff815118ed <+13>: 49 89 fc mov %rdi,%r12
0xffffffff815118f0 <+16>: 53 push %rbx
0xffffffff815118f1 <+17>: 4c 8b 2f mov (%rdi),%r13
0xffffffff815118f4 <+20>: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
0xffffffff815118f9 <+25>: 49 8b 45 60 mov 0x60(%r13),%rax
[...]
Instrumenting tracepoints (2)
How do we include the tracepoint without adding overhead?
(this is actually quite easy)
BPF Internals (Brendan Gregg)
(gdb) disas/r blk_mq_start_request
Dump of assembler code for function blk_mq_start_request:
0xffffffff815118e0 <+0>: e8 4b bf b5 ff callq 0xffffffff8106d830 <__fentry__>
0xffffffff815118e5 <+5>: 55 push %rbp
0xffffffff815118e6 <+6>: 48 89 e5 mov %rsp,%rbp
0xffffffff815118e9 <+9>: 41 55 push %r13
0xffffffff815118eb <+11>: 41 54 push %r12
0xffffffff815118ed <+13>: 49 89 fc mov %rdi,%r12
0xffffffff815118f0 <+16>: 53 push %rbx
0xffffffff815118f1 <+17>: 4c 8b 2f mov (%rdi),%r13
0xffffffff815118f4 <+20>: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
0xffffffff815118f9 <+25>: 49 8b 45 60 mov 0x60(%r13),%rax
[...]
Instrumenting tracepoints (2)
This 5-byte nop is a placeholder. Does nothing quickly.
When the tracepoint is enabled, the nop becomes a jmp to the tracepoint
trampoline.
How do we include the tracepoint without adding overhead?
BPF Internals (Brendan Gregg)
bpftrace mid-level internals 4/4
BPF Internals (Brendan Gregg)
BPF Internals (Brendan Gregg)
User-space map iteration
If ...
int BPFtrace::print_map(IMap &map, uint32_t top, uint32_t div)
[…]
while (bpf_get_next_key(map.mapfd_, old_key.data(), key.data()) == 0)
{
int value_size = map.type_.GetSize();
value_size *= nvalues;
auto value = std::vector<uint8_t>(value_size);
int err = bpf_lookup_elem(map.mapfd_, key.data(), value.data());
if (err == -1)
{
// key was removed by the eBPF program during bpf_get_next_key() and bpf_lookup_elem(),
// let's skip this key
continue;
}
else if (err)
{
LOG(ERROR) << "failed to look up elem: " << err;
return -1;
}
values_by_key.push_back({key, value});
old_key = key;
}
bpftrace src/bpftrace.cpp
libbcc/
libbpf
bpf(2)
syscall
kernel
BPF Internals (Brendan Gregg)
Reading entire BPF maps
# strace -febpf bpftrace -e 'block:block_rq_issue {
@[comm] = count(); }'
[...]
bpf(BPF_MAP_LOOKUP_ELEM, {map_fd=3, key=0x557e422133e0, value=0x557e4221eab0, flags=BPF_ANY}, 120) = -
1 ENOENT (No such file or directory)
bpf(BPF_MAP_GET_NEXT_KEY, {map_fd=3, key=0x557e422133e0, next_key=0x557e4224d3f0}, 120) = 0
bpf(BPF_MAP_LOOKUP_ELEM, {map_fd=3, key=0x557e4224d3f0, value=0x557e4221eab0, flags=BPF_ANY}, 120) = 0
bpf(BPF_MAP_GET_NEXT_KEY, {map_fd=3, key=0x557e422133e0, next_key=0x557e4224d3f0}, 120) = 0
bpf(BPF_MAP_LOOKUP_ELEM, {map_fd=3, key=0x557e4224d3f0, value=0x557e4221eab0, flags=BPF_ANY}, 120) = 0
bpf(BPF_MAP_GET_NEXT_KEY, {map_fd=3, key=0x557e422133e0, next_key=0x557e4224d3f0}, 120) = 0
bpf(BPF_MAP_LOOKUP_ELEM, {map_fd=3, key=0x557e4224d3f0, value=0x557e4221eab0, flags=BPF_ANY}, 120) = 0
bpf(BPF_MAP_GET_NEXT_KEY, {map_fd=3, key=0x557e422133e0, next_key=0x557e4224d3f0}, 120) = 0
bpf(BPF_MAP_LOOKUP_ELEM, {map_fd=3, key=0x557e4224d3f0, value=0x557e4221eab0, flags=BPF_ANY}, 120) = 0
bpf(BPF_MAP_GET_NEXT_KEY, {map_fd=3, key=0x557e422133e0, next_key=0x557e4224d3f0}, 120) = 0
bpf(BPF_MAP_LOOKUP_ELEM, {map_fd=3, key=0x557e4224d3f0, value=0x557e4221eab0, flags=BPF_ANY}, 120) = 0
bpf(BPF_MAP_GET_NEXT_KEY, {map_fd=3, key=0x557e422133e0, next_key=0x557e4224d3f0}, 120) = 0
bpf(BPF_MAP_LOOKUP_ELEM, {map_fd=3, key=0x557e4224d3f0, value=0x557e4221eab0, flags=BPF_ANY}, 120) = 0
[...]
This is an infrequent activity (this program only does this once)
BPF Internals (Brendan Gregg)
# bpftrace -e 'tracepoint:block:block_rq_issue {
@[comm] = count();
}'
Attaching 1 probe…
^C
@[kworker/2:2H]: 131
@[chrome]: 135
@[kworker/7:1H]: 185
@[Xorg]: 245
@[tar]: 1204
@[dmcrypt_write/2]: 1993
Final output
BPF Internals (Brendan Gregg)
Stack walking
BTF
CO-RE
Tests
raw_tracepoints & fentry
Networking & XDP
Security & Cgroups
Discussion of other internals
BPF Internals (Brendan Gregg)
BPF tracing/observability high-level recap
From: BPF Performance Tools, Figure 2-1
BPF Internals (Brendan Gregg)
BPF mid-level internals recap
From: BPF Performance Tools, Figure 2-3
BPF Internals (Brendan Gregg)
PSA
CONFIG_DEBUG_INFO_BTF=y
E.g., Ubuntu 20.10, Fedora 30, and RHEL 8.2 have it
BPF Internals (Brendan Gregg)
References
This is also where I recommend you go to learn more:
●
https://events.static.linuxfound.org/sites/events/files/slides/
bpf_collabsummit_2015feb20.pdf
●
Linux include/uapi/linux/bpf_common.h
●
Linux include/uapi/linux/bpf.h
●
Linux include/uapi/linux/filter.h
●
https://docs.cilium.io/en/v1.9/bpf/#bpf-guide
●
BPF Performance Tools, Addison-Wesley 2020
●
https://ebpf.io/what-is-ebpf
●
http://www.brendangregg.com/ebpf.html
●
https://github.com/iovisor/bcc
●
https://github.com/iovisor/bpftrace
BPF Internals (Brendan Gregg)
Thanks
BPF: Alexei Starovoitov (Facebook), Daniel Borkmann (Isovalent), David S. Miller (Red
Hat), Jakub Kicinski (Facebook), Yonghong Song (Facebook), Martin KaFai Lau
(Facebook), John Fastabend (Isovalent), Quentin Monnet (Isovalent), Jesper Dangaard
Brouer (Red Hat), Andrey Ignatov (Facebook), and Stanislav Fomichev (Google), Linus
Torvalds, and many more in the BPF community
LLVM BPF: Alexei Starovoitov, Chandler Carruth (Google), Yonghong Song, and more
bpftrace: Alastair Robertson (Yellowbrick Data), Dan Xu (Facebook), Bas Smit, Mary
Marchini (Netflix), Masanori Misono, Jiri Olsa, Viktor Malík, Dale Hamel, Willian Gaspar,
Augusto Mecking Caringi, and many more in the bpftrace community
USENIX
Jun, 2021
https://ebpf.io

Weitere ähnliche Inhalte

Was ist angesagt?

Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedBrendan Gregg
 
Linux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF SuperpowersLinux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF SuperpowersBrendan Gregg
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareBrendan Gregg
 
BPF / XDP 8월 세미나 KossLab
BPF / XDP 8월 세미나 KossLabBPF / XDP 8월 세미나 KossLab
BPF / XDP 8월 세미나 KossLabTaeung Song
 
Linux Networking Explained
Linux Networking ExplainedLinux Networking Explained
Linux Networking ExplainedThomas Graf
 
eBPF Perf Tools 2019
eBPF Perf Tools 2019eBPF Perf Tools 2019
eBPF Perf Tools 2019Brendan Gregg
 
Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021
Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021
Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021Valeriy Kravchuk
 
Building Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCCBuilding Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCCKernel TLV
 
Introduction to eBPF and XDP
Introduction to eBPF and XDPIntroduction to eBPF and XDP
Introduction to eBPF and XDPlcplcp1
 
Fun with Network Interfaces
Fun with Network InterfacesFun with Network Interfaces
Fun with Network InterfacesKernel TLV
 
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDP
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDPDockerCon 2017 - Cilium - Network and Application Security with BPF and XDP
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDPThomas Graf
 
Container Performance Analysis
Container Performance AnalysisContainer Performance Analysis
Container Performance AnalysisBrendan Gregg
 
Linux Initialization Process (1)
Linux Initialization Process (1)Linux Initialization Process (1)
Linux Initialization Process (1)shimosawa
 
Linux BPF Superpowers
Linux BPF SuperpowersLinux BPF Superpowers
Linux BPF SuperpowersBrendan Gregg
 
Kernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven Rostedt
Kernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven RostedtKernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven Rostedt
Kernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven RostedtAnne Nicolas
 
Velocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPFVelocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPFBrendan Gregg
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceBrendan Gregg
 

Was ist angesagt? (20)

Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting Started
 
Linux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF SuperpowersLinux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF Superpowers
 
eBPF Basics
eBPF BasicseBPF Basics
eBPF Basics
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of Software
 
BPF / XDP 8월 세미나 KossLab
BPF / XDP 8월 세미나 KossLabBPF / XDP 8월 세미나 KossLab
BPF / XDP 8월 세미나 KossLab
 
Linux Networking Explained
Linux Networking ExplainedLinux Networking Explained
Linux Networking Explained
 
eBPF/XDP
eBPF/XDP eBPF/XDP
eBPF/XDP
 
eBPF Perf Tools 2019
eBPF Perf Tools 2019eBPF Perf Tools 2019
eBPF Perf Tools 2019
 
Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021
Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021
Tracing MariaDB server with bpftrace - MariaDB Server Fest 2021
 
Building Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCCBuilding Network Functions with eBPF & BCC
Building Network Functions with eBPF & BCC
 
Introduction to eBPF and XDP
Introduction to eBPF and XDPIntroduction to eBPF and XDP
Introduction to eBPF and XDP
 
Fun with Network Interfaces
Fun with Network InterfacesFun with Network Interfaces
Fun with Network Interfaces
 
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDP
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDPDockerCon 2017 - Cilium - Network and Application Security with BPF and XDP
DockerCon 2017 - Cilium - Network and Application Security with BPF and XDP
 
Container Performance Analysis
Container Performance AnalysisContainer Performance Analysis
Container Performance Analysis
 
Linux Initialization Process (1)
Linux Initialization Process (1)Linux Initialization Process (1)
Linux Initialization Process (1)
 
eBPF maps 101
eBPF maps 101eBPF maps 101
eBPF maps 101
 
Linux BPF Superpowers
Linux BPF SuperpowersLinux BPF Superpowers
Linux BPF Superpowers
 
Kernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven Rostedt
Kernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven RostedtKernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven Rostedt
Kernel Recipes 2017 - Understanding the Linux kernel via ftrace - Steven Rostedt
 
Velocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPFVelocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPF
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems Performance
 

Ähnlich wie BPF Internals (eBPF)

eBPF Tooling and Debugging Infrastructure
eBPF Tooling and Debugging InfrastructureeBPF Tooling and Debugging Infrastructure
eBPF Tooling and Debugging InfrastructureNetronome
 
Debugging node in prod
Debugging node in prodDebugging node in prod
Debugging node in prodYunong Xiao
 
Open Source Backup Conference 2014: Hacking workshop, by Maik Aussendorf and ...
Open Source Backup Conference 2014: Hacking workshop, by Maik Aussendorf and ...Open Source Backup Conference 2014: Hacking workshop, by Maik Aussendorf and ...
Open Source Backup Conference 2014: Hacking workshop, by Maik Aussendorf and ...NETWAYS
 
Efficient System Monitoring in Cloud Native Environments
Efficient System Monitoring in Cloud Native EnvironmentsEfficient System Monitoring in Cloud Native Environments
Efficient System Monitoring in Cloud Native EnvironmentsGergely Szabó
 
Thread介紹
Thread介紹Thread介紹
Thread介紹Jack Chen
 
Linux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloudLinux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloudAndrea Righi
 
Debugging Hung Python Processes With GDB
Debugging Hung Python Processes With GDBDebugging Hung Python Processes With GDB
Debugging Hung Python Processes With GDBbmbouter
 
Cassandra and materialized views
Cassandra and materialized viewsCassandra and materialized views
Cassandra and materialized viewsGrzegorz Duda
 
2012 coscup - Build your PHP application on Heroku
2012 coscup - Build your PHP application on Heroku2012 coscup - Build your PHP application on Heroku
2012 coscup - Build your PHP application on Herokuronnywang_tw
 
Perl on-embedded-devices
Perl on-embedded-devicesPerl on-embedded-devices
Perl on-embedded-devicesJens Rehsack
 
Quick tour of PHP from inside
Quick tour of PHP from insideQuick tour of PHP from inside
Quick tour of PHP from insidejulien pauli
 
Xdebug - Derick Rethans - Barcelona PHP Conference 2008
Xdebug - Derick Rethans - Barcelona PHP Conference 2008Xdebug - Derick Rethans - Barcelona PHP Conference 2008
Xdebug - Derick Rethans - Barcelona PHP Conference 2008phpbarcelona
 
Getting Started with PL/Proxy
Getting Started with PL/ProxyGetting Started with PL/Proxy
Getting Started with PL/ProxyPeter Eisentraut
 
Getting started with TDD - Confoo 2014
Getting started with TDD - Confoo 2014Getting started with TDD - Confoo 2014
Getting started with TDD - Confoo 2014Eric Hogue
 
Flutter vs Java Graphical User Interface Frameworks.pptx
Flutter vs Java Graphical User Interface Frameworks.pptx Flutter vs Java Graphical User Interface Frameworks.pptx
Flutter vs Java Graphical User Interface Frameworks.pptx Toma Velev
 
Static Optimization of PHP bytecode (PHPSC 2017)
Static Optimization of PHP bytecode (PHPSC 2017)Static Optimization of PHP bytecode (PHPSC 2017)
Static Optimization of PHP bytecode (PHPSC 2017)Nikita Popov
 
KubeCon EU 2016: Kubernetes and the Potential for Higher Level Interfaces
KubeCon EU 2016: Kubernetes and the Potential for Higher Level InterfacesKubeCon EU 2016: Kubernetes and the Potential for Higher Level Interfaces
KubeCon EU 2016: Kubernetes and the Potential for Higher Level InterfacesKubeAcademy
 
Operation Oriented Web Applications / Yokohama pm7
Operation Oriented Web Applications / Yokohama pm7Operation Oriented Web Applications / Yokohama pm7
Operation Oriented Web Applications / Yokohama pm7Masahiro Nagano
 

Ähnlich wie BPF Internals (eBPF) (20)

eBPF Tooling and Debugging Infrastructure
eBPF Tooling and Debugging InfrastructureeBPF Tooling and Debugging Infrastructure
eBPF Tooling and Debugging Infrastructure
 
Debugging node in prod
Debugging node in prodDebugging node in prod
Debugging node in prod
 
Open Source Backup Conference 2014: Hacking workshop, by Maik Aussendorf and ...
Open Source Backup Conference 2014: Hacking workshop, by Maik Aussendorf and ...Open Source Backup Conference 2014: Hacking workshop, by Maik Aussendorf and ...
Open Source Backup Conference 2014: Hacking workshop, by Maik Aussendorf and ...
 
Efficient System Monitoring in Cloud Native Environments
Efficient System Monitoring in Cloud Native EnvironmentsEfficient System Monitoring in Cloud Native Environments
Efficient System Monitoring in Cloud Native Environments
 
Thread介紹
Thread介紹Thread介紹
Thread介紹
 
Linux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloudLinux kernel tracing superpowers in the cloud
Linux kernel tracing superpowers in the cloud
 
The Rust Borrow Checker
The Rust Borrow CheckerThe Rust Borrow Checker
The Rust Borrow Checker
 
Debugging Hung Python Processes With GDB
Debugging Hung Python Processes With GDBDebugging Hung Python Processes With GDB
Debugging Hung Python Processes With GDB
 
Cassandra and materialized views
Cassandra and materialized viewsCassandra and materialized views
Cassandra and materialized views
 
2012 coscup - Build your PHP application on Heroku
2012 coscup - Build your PHP application on Heroku2012 coscup - Build your PHP application on Heroku
2012 coscup - Build your PHP application on Heroku
 
PHP7 Presentation
PHP7 PresentationPHP7 Presentation
PHP7 Presentation
 
Perl on-embedded-devices
Perl on-embedded-devicesPerl on-embedded-devices
Perl on-embedded-devices
 
Quick tour of PHP from inside
Quick tour of PHP from insideQuick tour of PHP from inside
Quick tour of PHP from inside
 
Xdebug - Derick Rethans - Barcelona PHP Conference 2008
Xdebug - Derick Rethans - Barcelona PHP Conference 2008Xdebug - Derick Rethans - Barcelona PHP Conference 2008
Xdebug - Derick Rethans - Barcelona PHP Conference 2008
 
Getting Started with PL/Proxy
Getting Started with PL/ProxyGetting Started with PL/Proxy
Getting Started with PL/Proxy
 
Getting started with TDD - Confoo 2014
Getting started with TDD - Confoo 2014Getting started with TDD - Confoo 2014
Getting started with TDD - Confoo 2014
 
Flutter vs Java Graphical User Interface Frameworks.pptx
Flutter vs Java Graphical User Interface Frameworks.pptx Flutter vs Java Graphical User Interface Frameworks.pptx
Flutter vs Java Graphical User Interface Frameworks.pptx
 
Static Optimization of PHP bytecode (PHPSC 2017)
Static Optimization of PHP bytecode (PHPSC 2017)Static Optimization of PHP bytecode (PHPSC 2017)
Static Optimization of PHP bytecode (PHPSC 2017)
 
KubeCon EU 2016: Kubernetes and the Potential for Higher Level Interfaces
KubeCon EU 2016: Kubernetes and the Potential for Higher Level InterfacesKubeCon EU 2016: Kubernetes and the Potential for Higher Level Interfaces
KubeCon EU 2016: Kubernetes and the Potential for Higher Level Interfaces
 
Operation Oriented Web Applications / Yokohama pm7
Operation Oriented Web Applications / Yokohama pm7Operation Oriented Web Applications / Yokohama pm7
Operation Oriented Web Applications / Yokohama pm7
 

Mehr von Brendan Gregg

YOW2021 Computing Performance
YOW2021 Computing PerformanceYOW2021 Computing Performance
YOW2021 Computing PerformanceBrendan Gregg
 
IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingBrendan Gregg
 
Systems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedSystems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedBrendan Gregg
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Brendan Gregg
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceBrendan Gregg
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at NetflixBrendan Gregg
 
LPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsLPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsBrendan Gregg
 
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityBrendan Gregg
 
YOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflixYOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflixBrendan Gregg
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixBrendan Gregg
 
NetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityNetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityBrendan Gregg
 
ATO Linux Performance 2018
ATO Linux Performance 2018ATO Linux Performance 2018
ATO Linux Performance 2018Brendan Gregg
 
Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)Brendan Gregg
 
How Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for PerformanceHow Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for PerformanceBrendan Gregg
 
LISA17 Container Performance Analysis
LISA17 Container Performance AnalysisLISA17 Container Performance Analysis
LISA17 Container Performance AnalysisBrendan Gregg
 
Kernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at NetflixKernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at NetflixBrendan Gregg
 
Kernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPFKernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPFBrendan Gregg
 
EuroBSDcon 2017 System Performance Analysis Methodologies
EuroBSDcon 2017 System Performance Analysis MethodologiesEuroBSDcon 2017 System Performance Analysis Methodologies
EuroBSDcon 2017 System Performance Analysis MethodologiesBrendan Gregg
 

Mehr von Brendan Gregg (20)

YOW2021 Computing Performance
YOW2021 Computing PerformanceYOW2021 Computing Performance
YOW2021 Computing Performance
 
IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor Benchmarking
 
Systems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedSystems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting Started
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems Performance
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflix
 
LPC2019 BPF Tracing Tools
LPC2019 BPF Tracing ToolsLPC2019 BPF Tracing Tools
LPC2019 BPF Tracing Tools
 
LSFMM 2019 BPF Observability
LSFMM 2019 BPF ObservabilityLSFMM 2019 BPF Observability
LSFMM 2019 BPF Observability
 
YOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflixYOW2018 CTO Summit: Working at netflix
YOW2018 CTO Summit: Working at netflix
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at Netflix
 
BPF Tools 2017
BPF Tools 2017BPF Tools 2017
BPF Tools 2017
 
NetConf 2018 BPF Observability
NetConf 2018 BPF ObservabilityNetConf 2018 BPF Observability
NetConf 2018 BPF Observability
 
FlameScope 2018
FlameScope 2018FlameScope 2018
FlameScope 2018
 
ATO Linux Performance 2018
ATO Linux Performance 2018ATO Linux Performance 2018
ATO Linux Performance 2018
 
Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)Linux Performance 2018 (PerconaLive keynote)
Linux Performance 2018 (PerconaLive keynote)
 
How Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for PerformanceHow Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for Performance
 
LISA17 Container Performance Analysis
LISA17 Container Performance AnalysisLISA17 Container Performance Analysis
LISA17 Container Performance Analysis
 
Kernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at NetflixKernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at Netflix
 
Kernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPFKernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPF
 
EuroBSDcon 2017 System Performance Analysis Methodologies
EuroBSDcon 2017 System Performance Analysis MethodologiesEuroBSDcon 2017 System Performance Analysis Methodologies
EuroBSDcon 2017 System Performance Analysis Methodologies
 

Kürzlich hochgeladen

Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 

Kürzlich hochgeladen (20)

Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 

BPF Internals (eBPF)

  • 1. BPF Internals Brendan Gregg USENIX Jun, 2021 Tracing Examples (eBPF)
  • 2. BPF Internals (Brendan Gregg) (it’s actually quite easy)
  • 3. BPF Internals (Brendan Gregg) Intro & Tracing By Example 1. Dynamic tracing and per-event output 2. Static tracing and map summaries For BPF internals by reference, see the References slide Learning objectives ● Gain a working knowledge of some BPF internals ● Evaluate ideas for BPF suitability Agenda Slides: http://www.brendangregg.com/Slides/LISA2021_BPF_Internals.pdf Video: https://www.usenix.org/conference/lisa21/presentation/gregg-bpf Slides are online
  • 4. BPF Internals (Brendan Gregg) BPF Intro
  • 5. BPF Internals (Brendan Gregg) BPF 1992: Berkeley Packet Filter A runtime for efficient packet filters Also a narrow and arcane kernel technology that few knew existed # tcpdump -d host 127.0.0.1 and port 80 (000) ldh [12] (001) jeq #0x800 jt 2 jf 18 (002) ld [26] (003) jeq #0x7f000001 jt 6 jf 4 (004) ld [30] (005) jeq #0x7f000001 jt 6 jf 18 (006) ldb [23] (007) jeq #0x84 jt 10 jf 8 (008) jeq #0x6 jt 10 jf 9 (009) jeq #0x11 jt 10 jf 18 (010) ldh [20] (011) jset #0x1fff jt 18 jf 12 (012) ldxb 4*([14]&0xf) (013) ldh [x + 14] (014) jeq #0x50 jt 17 jf 15 (015) ldh [x + 16] (016) jeq #0x50 jt 17 jf 18 (017) ret #262144 (018) ret #0
  • 6. BPF Internals (Brendan Gregg) eBPF 2013+ Extended BPF (eBPF) modernized BPF Maintainers/creators: Alexei Starovoitov & Daniel Borkmann Old BPF is now “Classic BPF,” and eBPF is usually just “BPF” Classic BPF Extended BPF Word size 32-bit 64-bit Registers 2 10+1 Storage 16 slots 512 byte stack + infinite map storage Events packets many event sources BPF Internals (Brendan Gregg)
  • 7. BPF Internals (Brendan Gregg) BPF 2021 BPF is now a technology name (like LLVM) ● Some still call it eBPF A generic in-kernel execution environment ● User-defined programs ● Limited & secure kernel access ● A new type of software
  • 8. BPF Internals (Brendan Gregg) A New Type of Software Execution model User defined Compil- ation Security Failure mode Resource access User task yes any user based abort syscall, fault Kernel task no static none panic direct BPF event yes JIT, CO-RE verified, JIT error message restricted helpers
  • 9. BPF Internals (Brendan Gregg) BPF program state model Loaded Enabled event fires program ended Off-CPU On-CPU BPF attach Kernel helpers Spinning spin lock Sleeping preempt (restricted state) page fault
  • 10. BPF Internals (Brendan Gregg) BPF Tracing
  • 11. BPF Internals (Brendan Gregg) BPF tracing (observability) tools
  • 12. BPF Internals (Brendan Gregg) I want to run some tools ● bcc, bpftrace /usr/bin/* I want to hack up some new tools ● bpftrace bash, awk I want to spend weeks developing a BPF product ● bcc libbpf C, bcc Python (maybe), gobpf, libbbpf-rs C, C++ Unix analogies Recommended BPF tracing front-ends Requires LLVM; becoming obsolete / special-use only New, lightweight, CO-RE & BTF based
  • 13. BPF Internals (Brendan Gregg) BPF Internals (developing BPF was hard; understanding it is easy)
  • 14. BPF Internals (Brendan Gregg) BPF tracing/observability high-level From: BPF Performance Tools, Figure 2-1
  • 15. BPF Internals (Brendan Gregg) AST: Abstract Syntax Tree LLVM: A compiler IR: Intermediate Representation JIT: Just-in-time compilation kprobes: Kernel dynamic instrumentation uprobes: User-level dynamic instrumentation tracepoints: Kernel static instrumentation Terminology
  • 16. BPF Internals (Brendan Gregg) 1. Dynamic tracing and per-event output
  • 17. BPF Internals (Brendan Gregg) bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...n", pid); }' 1. Dynamic tracing and per-event output
  • 18. BPF Internals (Brendan Gregg) # bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...n", pid); }' Attaching 1 probe... PID 10287 sleeping... PID 10297 sleeping... PID 10287 sleeping… PID 10297 sleeping... PID 10287 sleeping... PID 2218 sleeping... PID 10297 sleeping... [...] Example output
  • 19. BPF Internals (Brendan Gregg) bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...n", pid); }' Objective We have this: We want: - BPF bytecode - Kernel events mapped to the bytecode - User space printing events This is learning internals by example, including bpftrace internals. For a complete internals see the References slide at end.
  • 20. BPF Internals (Brendan Gregg) bpftrace mid-level internals BPF Internals (Brendan Gregg)
  • 21. BPF Internals (Brendan Gregg) Program transformations AST bpftrace program LLVM IR BPF machine code
  • 22. BPF Internals (Brendan Gregg) bpftrace mid-level internals 1/13 BPF Internals (Brendan Gregg)
  • 23. BPF Internals (Brendan Gregg) kprobe:do_nanosleep { printf("PID %d sleeping...n", pid); } bpftrace program
  • 24. BPF Internals (Brendan Gregg) bpftrace mid-level internals 2/13 BPF Internals (Brendan Gregg)
  • 25. BPF Internals (Brendan Gregg) kprobe:do_nanosleep { printf("PID %d sleeping...n", pid); } Converting to AST call(printf) probe(kprobe:do_nanosleep) builtin(pid) string(“PID %d sleeping...n”)
  • 26. BPF Internals (Brendan Gregg) Parsing bpftrace lex yacc bpftrace program text AST src/lexer.l (regular expressions) src/parser.yy (grammer) (very easy)
  • 27. BPF Internals (Brendan Gregg) ident [_a-zA-Z][_a-zA-Z0-9]* map @{ident}|@ var ${ident} int [0-9]+|0[xX][0-9a-fA-F]+ cint :{int}: hex (x|X)[0-9a-fA-F]{1,2} [...] builtin arg[0-9]|args|cgroup|comm|cpid|cpu|ctx|curtask|elapsed|func| gid|nsecs|pid|probe|rand|retval|sarg[0-9]|tid|uid|username call avg|cat|cgroupid|clear|count|delete|exit|hist|join|kaddr|ksym| lhist|max|min|ntop|override_return|print|printf|reg|signal|stats|str| strncmp|sum|sym|system|time|uaddr|usym|zero pid Lexer bpftrace src/lexer.l Regular expressions
  • 28. BPF Internals (Brendan Gregg) builtin(pid) Yacc %token <std::string> BUILTIN "builtin" %token <std::string> CALL "call" [...] expr : int { $$ = $1; } | STRING { $$ = new ast::String($1, @$); } | BUILTIN { $$ = new ast::Builtin($1, @$); } | CALL_BUILTIN { $$ = new ast::Builtin($1, @$); } | IDENT { $$ = new ast::Identifier($1, @$); } | STACK_MODE { $$ = new ast::StackMode($1, @$); } | ternary { $$ = $1; } | param { $$ = $1; } | map_or_var { $$ = $1; } | call { $$ = $1; } [...] call : CALL "(" ")" { $$ = new ast::Call($1, @$); } | CALL "(" vargs ")" { $$ = new ast::Call($1, $3, @$); } bpftrace src/parser.yy Grammar rules
  • 29. BPF Internals (Brendan Gregg) ident [_a-zA-Z][_a-zA-Z0-9]* map @{ident}|@ var ${ident} int [0-9]+|0[xX][0-9a-fA-F]+ cint :{int}: hex (x|X)[0-9a-fA-F]{1,2} [...] builtin arg[0-9]|args|cgroup|comm|cpid|cpu|ctx|curtask|elapsed|func| gid|nsecs|pid|probe|rand|retval|sarg[0-9]|tid|uid|username call avg|cat|cgroupid|clear|count|delete|exit|hist|join|kaddr|ksym| lhist|max|min|ntop|override_return|print|printf|reg|signal|stats|str| strncmp|sum|sym|system|time|uaddr|usym|zero printf("PID %d sleeping...n", pid); Lexer (2) bpftrace src/lexer.l
  • 30. BPF Internals (Brendan Gregg) call(printf(...)); Yacc (2) %token <std::string> BUILTIN "builtin" %token <std::string> CALL "call" [...] expr : int { $$ = $1; } | STRING { $$ = new ast::String($1, @$); } | BUILTIN { $$ = new ast::Builtin($1, @$); } | CALL_BUILTIN { $$ = new ast::Builtin($1, @$); } | IDENT { $$ = new ast::Identifier($1, @$); } | STACK_MODE { $$ = new ast::StackMode($1, @$); } | ternary { $$ = $1; } | param { $$ = $1; } | map_or_var { $$ = $1; } | call { $$ = $1; } [...] call : CALL "(" ")" { $$ = new ast::Call($1, @$); } | CALL "(" vargs ")" { $$ = new ast::Call($1, $3, @$); } bpftrace src/parser.yy
  • 31. BPF Internals (Brendan Gregg) " { yy_push_state(STR, yyscanner); buffer.clear(); } <STR>{ " { yy_pop_state(yyscanner); return Parser::make_STRING(buffer, loc); } [^n"]+ buffer += yytext; n buffer += 'n'; t buffer += 't'; r buffer += 'r'; " buffer += '"'; buffer += ''; {oct} { long value = strtol(yytext+1, NULL, 8); if (value > UCHAR_MAX) driver.error(loc, std::string("octal escape sequence out of range '") + yytext + "'"); buffer += value; } {hex} buffer += strtol(yytext+2, NULL, 16); n driver.error(loc, "unterminated string"); yy_pop_state(yyscanner); loc.lines(1); loc.step(); <<EOF>> driver.error(loc, "unterminated string"); yy_pop_state(yyscanner); . { driver.error(loc, std::string("invalid escape character '") + yytext + "'"); } . driver.error(loc, "invalid character"); yy_pop_state(yyscanner); } "PID %d sleeping...n" Lexer (3) bpftrace src/lexer.l
  • 32. BPF Internals (Brendan Gregg) string("PID %d sleeping...n") Yacc (3) %token <std::string> STRING "string" [...] expr : int { $$ = $1; } | STRING { $$ = new ast::String($1, @$); } | BUILTIN { $$ = new ast::Builtin($1, @$); } | CALL_BUILTIN { $$ = new ast::Builtin($1, @$); } | IDENT { $$ = new ast::Identifier($1, @$); } | STACK_MODE { $$ = new ast::StackMode($1, @$); } | ternary { $$ = $1; } | param { $$ = $1; } | map_or_var { $$ = $1; } | call { $$ = $1; } [...] bpftrace src/parser.yy
  • 33. BPF Internals (Brendan Gregg) ident [_a-zA-Z][_a-zA-Z0-9]* map @{ident}|@ [...] kprobe:do_nanosleep Lexer & Yacc (4) bpftrace src/lexer.l attach_point : ident { $$ = new ast::AttachPoint($1, @$); } | ident ":" wildcard { $$ = new ast::AttachPoint($1, $3, @$); } | ident ":" wildcard PLUS INT { $$ = new ast::AttachPoint($1, $3, $5, @$); } | ident PATH STRING { $$ = new ast::AttachPoint($1, $2.substr(1, $2.si... [...] wildcard : wildcard ident { $$ = $1 + $2; } | wildcard MUL { $$ = $1 + "*"; } | wildcard LBRACKET { $$ = $1 + "["; } | wildcard RBRACKET { $$ = $1 + "]"; } | wildcard DOT { $$ = $1 + "."; } | { $$ = ""; } ; bpftrace src/parser.yy
  • 34. BPF Internals (Brendan Gregg) ident [_a-zA-Z][_a-zA-Z0-9]* map @{ident}|@ [...] kprobe:do_nanosleep Lexer & Yacc (4) attach_point : ident { $$ = new ast::AttachPoint($1, @$); } | ident ":" wildcard { $$ = new ast::AttachPoint($1, $3, @$); } | ident ":" wildcard PLUS INT { $$ = new ast::AttachPoint($1, $3, $5, @$); } | ident PATH STRING { $$ = new ast::AttachPoint($1, $2.substr(1, $2.si... [...] wildcard : wildcard ident { $$ = $1 + $2; } | wildcard MUL { $$ = $1 + "*"; } | wildcard LBRACKET { $$ = $1 + "["; } | wildcard RBRACKET { $$ = $1 + "]"; } | wildcard DOT { $$ = $1 + "."; } | { $$ = ""; } ; 1. 2. 3. 4. 5. 6. bpftrace src/lexer.l bpftrace src/parser.yy
  • 35. BPF Internals (Brendan Gregg) { statements; ... } Yacc (5) probe : attach_points pred block { $$ = new ast::Probe($1, $2, $3); } attach_points : attach_points "," attach_point { $$ = $1; $1->push_back($3); } | attach_point { $$ = new ast::AttachPointList; $$->push_back($1); } [...] block : "{" stmts "}" { $$ = $2; } ; semicolon_ended_stmt: stmt ";" { $$ = $1; } ; stmts : semicolon_ended_stmt stmts { $$ = $2; $2->insert($2->begin(), $1); } | block_stmt stmts { $$ = $2; $2->insert($2->begin(), $1); } | stmt { $$ = new ast::StatementList; $$->push_back($1); } | { $$ = new ast::StatementList; } ; bpftrace src/parser.yy Plus more grammar for program structure...
  • 36. BPF Internals (Brendan Gregg) # bpftrace -d -e 'kprobe:do_nanosleep { printf("PID %d sleeping...n", pid); }' AST ------------------- Program kprobe:do_nanosleep call: printf string: PID %d sleeping...n builtin: pid [...] Now you have AST nodes!
  • 37. BPF Internals (Brendan Gregg) bpftrace mid-level internals 3/13 BPF Internals (Brendan Gregg)
  • 38. BPF Internals (Brendan Gregg) Tracepoint & Clang struct parsers kprobe:do_nanosleep { printf("PID %d sleeping...n", pid); } Not needed for this example (no struct member dereferencing)
  • 39. BPF Internals (Brendan Gregg) bpftrace mid-level internals 4/13 BPF Internals (Brendan Gregg)
  • 40. BPF Internals (Brendan Gregg) # bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...n", pidd); }' stdin:2:36-38: ERROR: Unknown identifier: 'pidd' printf("PID %d sleeping...n", pidd); Semantic analyzer void SemanticAnalyser::visit(Identifier &identifier) { if (bpftrace_.enums_.count(identifier.ident) != 0) { identifier.type = SizedType(Type::integer, 8); } else { identifier.type = SizedType(Type::none, 0); error("Unknown identifier: '" + identifier.ident + "'", identifier.loc); } } bpftrace src/ast/semantic_analyser.cpp Catches many program errors; E.g.: >2000 lines of code
  • 41. BPF Internals (Brendan Gregg) bpftrace mid-level internals 5/13 BPF Internals (Brendan Gregg)
  • 42. BPF Internals (Brendan Gregg) AST LLVM IR → void CodegenLLVM::visit(Builtin &builtin) { […] else if (builtin.ident == "pid" || builtin.ident == "tid") { Value *pidtgid = b_.CreateGetPidTgid(); if (builtin.ident == "pid") { expr_ = b_.CreateLShr(pidtgid, 32); […] bpftrace src/ast/codegen_llvm.cpp BPF logical shift right instruction
  • 43. BPF Internals (Brendan Gregg) bpftrace src/ast/irbuilderbpf.cpp CallInst *IRBuilderBPF::CreateGetPidTgid() { // u64 bpf_get_current_pid_tgid(void) // Return: current->tgid << 32 | current->pid FunctionType *getpidtgid_func_type = FunctionType::get(getInt64Ty(), false); PointerType *getpidtgid_func_ptr_type = PointerType::get(getpidtgid_func_type, 0); Constant *getpidtgid_func = ConstantExpr::getCast( Instruction::IntToPtr, getInt64(libbpf::BPF_FUNC_get_current_pid_tgid), getpidtgid_func_ptr_type); return CreateCall(getpidtgid_func, {}, "get_pid_tgid"); } AST LLVM IR (2) → BPF helper call number
  • 44. BPF Internals (Brendan Gregg) #define __BPF_FUNC_MAPPER(FN) FN(unspec), FN(map_lookup_elem), FN(map_update_elem), FN(map_delete_elem), FN(probe_read), FN(ktime_get_ns), FN(trace_printk), FN(get_prandom_u32), FN(get_smp_processor_id), FN(skb_store_bytes), FN(l3_csum_replace), FN(l4_csum_replace), FN(tail_call), FN(clone_redirect), FN(get_current_pid_tgid), FN(get_current_uid_gid), [...] Linux include/uapi/linux/bpf.h BPF helper calls #14
  • 45. BPF Internals (Brendan Gregg) # bpftrace -d -e 'kprobe:do_nanosleep { printf("PID %d sleeping...n", pid); }' […] define i64 @"kprobe:do_nanosleep"(i8*) local_unnamed_addr section "s_kprobe:do_nanosleep_1" { entry: %printf_args = alloca %printf_t, align 8 %1 = bitcast %printf_t* %printf_args to i8* call void @llvm.lifetime.start.p0i8(i64 -1, i8* nonnull %1) %2 = getelementptr inbounds %printf_t, %printf_t* %printf_args, i64 0, i32 0 store i64 0, i64* %2, align 8 %get_pid_tgid = tail call i64 inttoptr (i64 14 to i64 ()*)() %3 = lshr i64 %get_pid_tgid, 32 %4 = getelementptr inbounds %printf_t, %printf_t* %printf_args, i64 0, i32 1 store i64 %3, i64* %4, align 8 %pseudo = tail call i64 @llvm.bpf.pseudo(i64 1, i64 1) %get_cpu_id = tail call i64 inttoptr (i64 8 to i64 ()*)() %perf_event_output = call i64 inttoptr (i64 25 to i64 (i8*, i64, i64, %printf_t*, i64)*)(i8* %0, i64 %pseudo, i64 %get_cpu_id, %printf_t* nonnull %printf_args, i64 16) call void @llvm.lifetime.end.p0i8(i64 -1, i8* nonnull %1) ret i64 0 } Now you have LLVM IR!
  • 46. BPF Internals (Brendan Gregg) # bpftrace -d -e 'kprobe:do_nanosleep { printf("PID %d sleeping...n", pid); }' […] define i64 @"kprobe:do_nanosleep"(i8*) local_unnamed_addr section "s_kprobe:do_nanosleep_1" { entry: %printf_args = alloca %printf_t, align 8 %1 = bitcast %printf_t* %printf_args to i8* call void @llvm.lifetime.start.p0i8(i64 -1, i8* nonnull %1) %2 = getelementptr inbounds %printf_t, %printf_t* %printf_args, i64 0, i32 0 store i64 0, i64* %2, align 8 %get_pid_tgid = tail call i64 inttoptr (i64 14 to i64 ()*)() %3 = lshr i64 %get_pid_tgid, 32 %4 = getelementptr inbounds %printf_t, %printf_t* %printf_args, i64 0, i32 1 store i64 %3, i64* %4, align 8 %pseudo = tail call i64 @llvm.bpf.pseudo(i64 1, i64 1) %get_cpu_id = tail call i64 inttoptr (i64 8 to i64 ()*)() %perf_event_output = call i64 inttoptr (i64 25 to i64 (i8*, i64, i64, %printf_t*, i64)*)(i8* %0, i64 %pseudo, i64 %get_cpu_id, %printf_t* nonnull %printf_args, i64 16) call void @llvm.lifetime.end.p0i8(i64 -1, i8* nonnull %1) ret i64 0 } Now you have LLVM IR! (2)
  • 47. BPF Internals (Brendan Gregg) # bpftrace -d -e 'kprobe:do_nanosleep { printf("PID %d sleeping...n", pid); }' […] define i64 @"kprobe:do_nanosleep"(i8*) local_unnamed_addr section "s_kprobe:do_nanosleep_1" { entry: %printf_args = alloca %printf_t, align 8 %1 = bitcast %printf_t* %printf_args to i8* call void @llvm.lifetime.start.p0i8(i64 -1, i8* nonnull %1) %2 = getelementptr inbounds %printf_t, %printf_t* %printf_args, i64 0, i32 0 store i64 0, i64* %2, align 8 [...] Now you have LLVM IR! (3) This is all generated from LLVM IR calls in bpftrace src/ast/* Lots of CreateAllocaBPF(), CreateGEP(), etc. (this gets verbose)
  • 48. BPF Internals (Brendan Gregg) bpftrace mid-level internals 6/13 BPF Internals (Brendan Gregg)
  • 49. BPF Internals (Brendan Gregg) Extended BPF instruction (bytecode) format opcode dest reg src reg signed offset signed immediate constant 8-bit 4-bit 4-bit 16-bit 32-bit ← 64-bit → opcode src inst. class 4-bit 1-bit 3-bit E.g., for ALU & JMP classes:
  • 50. BPF Internals (Brendan Gregg) Extended BPF instruction (bytecode) format (2) opcode dest reg src reg signed offset signed immediate constant 8-bit 4-bit 4-bit 16-bit 32-bit opcode src inst. class 4-bit 1-bit 3-bit E.g., for ALU & JMP classes: E.g., call get_current_pid_tgid 14 BPF_CALL BPF_JMP
  • 51. BPF Internals (Brendan Gregg) Extended BPF instruction (bytecode) format (3) opcode dest reg src reg signed offset signed immediate constant 8-bit 4-bit 4-bit 16-bit 32-bit opcode src inst. class 4-bit 1-bit 3-bit E.g., for ALU & JMP classes: E.g., call get_current_pid_tgid 14 BPF_CALL BPF_JMP #define BPF_JMP 0x05 #define BPF_CALL 0x80 Linux include/uapi/linux/bpf.h Linux include/uapi/linux/bpf_common.h
  • 52. BPF Internals (Brendan Gregg) Extended BPF instruction (bytecode) format (4) opcode dest reg src reg signed offset signed immediate constant 8-bit 4-bit 4-bit 16-bit 32-bit opcode src inst. class 4-bit 1-bit 3-bit E.g., for ALU & JMP classes: E.g., call get_current_pid_tgid 0xe0 0x00 0x00 0x00 BPF_CALL BPF_JMP #define BPF_JMP 0x05 #define BPF_CALL 0x80 Linux include/uapi/linux/bpf.h Linux include/uapi/linux/bpf_common.h 0x85 0x0 0x0 0x00 0x00
  • 53. BPF Internals (Brendan Gregg) Extended BPF instruction (bytecode) format (5) E.g., call get_current_pid_tgid (hex) 85 00 00 e0 00 00 00 As per the BPF specification (currently Linux headers)
  • 54. BPF Internals (Brendan Gregg) LLVM/Clang has a BPF target LLVM IR --target bpf BPF bytecode Future: bpftrace may include its own lightweight bpftrace compiler (BC) as an option (pros: no dependencies; cons: less optimal code)
  • 55. BPF Internals (Brendan Gregg) LLVM/Clang has a BPF target (2) LLVM LLVM IR BPF bytecode LLVM BPF target Linux include/uapi/linux/bpf_common.h Linux include/uapi/linux/bpf.h Linux include/uapi/linux/filter.h BPF specification (#defines)
  • 56. BPF Internals (Brendan Gregg) LLVM llvm/lib/Target/BPF/BPFInstrInfo.td class CALL<string OpcodeStr> : TYPE_ALU_JMP<BPF_CALL.Value, BPF_K.Value, (outs), (ins calltarget:$BrDst), !strconcat(OpcodeStr, " $BrDst"), []> { bits<32> BrDst; let Inst{31-0} = BrDst; let BPFClass = BPF_JMP; } Plus more llvm boilerplate & BPF headers shown earlier E.g., tail call i64 inttoptr (i64 14 to i64 ()*)() 14 85 00 00 e0 00 00 00 LLVM IR BPF →
  • 57. BPF Internals (Brendan Gregg) bf 16 00 00 00 00 00 00 b7 01 00 00 00 00 00 00 7b 1a f0 ff 00 00 00 00 85 00 00 00 0e 00 00 00 77 00 00 00 20 00 00 00 7b 0a f8 ff 00 00 00 00 18 17 00 00 30 00 00 00 00 00 00 00 00 00 00 00 85 00 00 00 08 00 00 00 bf a4 00 00 00 00 00 00 07 04 00 00 f0 ff ff ff bf 61 00 00 00 00 00 00 bf 72 00 00 00 00 00 00 bf 03 00 00 00 00 00 00 b7 05 00 00 10 00 00 00 85 00 00 00 19 00 00 00 b7 00 00 00 00 00 00 00 95 00 00 00 00 00 00 00 Now you have BPF bytecode!
  • 58. BPF Internals (Brendan Gregg) bf 16 00 00 00 00 00 00 b7 01 00 00 00 00 00 00 7b 1a f0 ff 00 00 00 00 85 00 00 00 0e 00 00 00 77 00 00 00 20 00 00 00 7b 0a f8 ff 00 00 00 00 18 17 00 00 30 00 00 00 00 00 00 00 00 00 00 00 85 00 00 00 08 00 00 00 bf a4 00 00 00 00 00 00 07 04 00 00 f0 ff ff ff bf 61 00 00 00 00 00 00 bf 72 00 00 00 00 00 00 bf 03 00 00 00 00 00 00 b7 05 00 00 10 00 00 00 85 00 00 00 19 00 00 00 b7 00 00 00 00 00 00 00 95 00 00 00 00 00 00 00 Now you have BPF bytecode! (2) 14 (BPF_FUNC_get_current_pid_tgid) 0x05 (BPF_JMP) | 0x80 (BPF_CALL)
  • 59. BPF Internals (Brendan Gregg) bpftrace mid-level internals 7/13 BPF Internals (Brendan Gregg)
  • 60. BPF Internals (Brendan Gregg) # strace -fe bpf bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...n", pid); }' [...] bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_KPROBE, insn_cnt=18, insns=0x7fdde5305000, license="GPL", log_level=0, log_size=0, log_buf=NULL, kern_version=KERNEL_VERSION(5, 8, 18), prog_flags=0, prog_name="do_nanosleep", prog_ifindex=0, expected_attach_type=BPF_CGROUP_INET_INGRESS, prog_btf_fd=0, func_info_rec_size=0, func_info=NULL, func_info_cnt=0, line_info_rec_size=0, line_info=NULL, line_info_cnt=0, attach_btf_id=0, attach_prog_fd=0}, 120) = 14 Sending BPF bytecode to the kernel Success! Passed the verifier...
  • 61. BPF Internals (Brendan Gregg) bpftrace mid-level internals 8/13 BPF Internals (Brendan Gregg)
  • 62. BPF Internals (Brendan Gregg) BPF mid-level internals From: BPF Performance Tools, Figure 2-3
  • 63. BPF Internals (Brendan Gregg) 85 00 00 00 12 34 56 78 Verifying BPF instructions If ... static int do_check(struct bpf_verifier_env *env) [...] } else if (class == BPF_JMP || class == BPF_JMP32) { u8 opcode = BPF_OP(insn->code); env->jmps_processed++; if (opcode == BPF_CALL) { [...] err = check_helper_call(env, insn->imm, env->insn_idx); [...] static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn_idx) { const struct bpf_func_proto *fn = NULL; struct bpf_reg_state *regs; struct bpf_call_arg_meta meta; bool changes_data; int i, err; /* find function prototype */ if (func_id < 0 || func_id >= __BPF_FUNC_MAX_ID) { verbose(env, "invalid func %s#%dn", func_id_name(func_id), func_id); return -EINVAL; Linux kernel/bpf/verifier.c >9000 lines of code Imagine we call a bogus function...
  • 64. BPF Internals (Brendan Gregg) >9000 lines of code >260 error returns Checks every instruction Checks every code path Rewrites some bytecode BPF verifier check_helper_mem_access check_func_arg check_map_func_compatibility check_func_proto check_func_call check_reference_leak check_helper_call check_alu_op check_cond_jmp_op check_ld_imm check_ld_abs check_return_code check_cfg check_btf_func check_btf_line check_btf_info check_map_prealloc check_map_prog_compatibility check_struct_ops_btf_id check_attach_modify_return check_attach_btf_id Verifier functions: check_subprogs check_reg_arg check_stack_write check_stack_read check_stack_access check_map_access_type check_mem_region_access check_map_access check_packet_access check_ctx_access check_flow_keys_access check_sock_access check_pkt_ptr_alignment check_generic_ptr_alignment check_ptr_alignment check_max_stack_depth check_tp_buffer_access check_ptr_to_btf_access check_mem_access check_xadd check_stack_boundary
  • 65. BPF Internals (Brendan Gregg) ● Memory access – Direct access extremely restricted – Can only read initialized memory – Other kernel memory must pass through the bpf_probe_read() helper and its checks ● Arguments are the correct type ● Register usage allowed – E.g., no frame pointer writes ● No write overflows ● No addr leaks ● Etc. Verifying Instructions STACK CTX SOCKET MAP_VALUE LOAD STORE Memory bpf_probe_read random addr
  • 66. BPF Internals (Brendan Gregg) Verifying Code Paths ● All instruction must lead to exit ● No unreachable instructions ● No backwards branches (loops) except BPF bounded loops biolatency as GraphViz dot:
  • 67. BPF Internals (Brendan Gregg) Verifying Code Paths ● All instruction must lead to exit ● No unreachable instructions ● No backwards branches (loops) except BPF bounded loops biolatency as GraphViz dot:
  • 68. BPF Internals (Brendan Gregg) bf 16 00 00 00 00 00 00 b7 01 00 00 00 00 00 00 7b 1a f0 ff 00 00 00 00 85 00 00 00 0e 00 00 00 77 00 00 00 20 00 00 00 7b 0a f8 ff 00 00 00 00 18 17 00 00 30 00 00 00 00 00 00 00 00 00 00 00 85 00 00 00 08 00 00 00 bf a4 00 00 00 00 00 00 07 04 00 00 f0 ff ff ff bf 61 00 00 00 00 00 00 bf 72 00 00 00 00 00 00 bf 03 00 00 00 00 00 00 b7 05 00 00 10 00 00 00 85 00 00 00 19 00 00 00 b7 00 00 00 00 00 00 00 95 00 00 00 00 00 00 00 Pre-verifier BPF bytecode
  • 69. BPF Internals (Brendan Gregg) bf 16 00 00 00 00 00 00 b7 01 00 00 00 00 00 00 7b 1a f0 ff 00 00 00 00 85 00 00 00 d0 81 01 00 77 00 00 00 20 00 00 00 7b 0a f8 ff 00 00 00 00 18 17 00 00 18 00 00 00 00 00 00 00 00 00 00 00 85 00 00 00 f0 80 01 00 bf a4 00 00 00 00 00 00 07 04 00 00 f0 ff ff ff bf 61 00 00 00 00 00 00 bf 72 00 00 00 00 00 00 bf 03 00 00 00 00 00 00 b7 05 00 00 10 00 00 00 85 00 00 00 30 2c ff ff b7 00 00 00 00 00 00 00 95 00 00 00 00 00 00 00 Post-verifier BPF bytecode
  • 70. BPF Internals (Brendan Gregg) bf 16 00 00 00 00 00 00 b7 01 00 00 00 00 00 00 7b 1a f0 ff 00 00 00 00 85 00 00 00 d0 81 01 00 77 00 00 00 20 00 00 00 7b 0a f8 ff 00 00 00 00 18 17 00 00 18 00 00 00 00 00 00 00 00 00 00 00 85 00 00 00 f0 80 01 00 bf a4 00 00 00 00 00 00 07 04 00 00 f0 ff ff ff bf 61 00 00 00 00 00 00 bf 72 00 00 00 00 00 00 bf 03 00 00 00 00 00 00 b7 05 00 00 10 00 00 00 85 00 00 00 30 2c ff ff b7 00 00 00 00 00 00 00 95 00 00 00 00 00 00 00 Post-verifier BPF bytecode (2) E.g., call get_current_pid_tgid helper index value has become an instruction offset addresses from __bpf_call_base
  • 71. BPF Internals (Brendan Gregg) # bpftool prog show […] 70: kprobe name do_nanosleep tag 8dc93a3b6a21ef3b gpl loaded_at 2021-05-02T00:44:26+0000 uid 0 xlated 144B jited 96B memlock 4096B map_ids 24 # bpftool prog dump xlated id 70 opcodes 0: (bf) r6 = r1 bf 16 00 00 00 00 00 00 1: (b7) r1 = 0 b7 01 00 00 00 00 00 00 2: (7b) *(u64 *)(r10 -16) = r1 7b 1a f0 ff 00 00 00 00 3: (85) call bpf_get_current_pid_tgid#98768 85 00 00 00 d0 81 01 00 4: (77) r0 >>= 32 77 00 00 00 20 00 00 00 5: (7b) *(u64 *)(r10 -8) = r0 7b 0a f8 ff 00 00 00 00 6: (18) r7 = map[id:24] 18 17 00 00 18 00 00 00 00 00 00 00 00 00 00 00 [...] BPF bytecode with human words Using bpftool on a running instance of the program
  • 72. BPF Internals (Brendan Gregg) # bpftool prog show […] 70: kprobe name do_nanosleep tag 8dc93a3b6a21ef3b gpl loaded_at 2021-05-02T00:44:26+0000 uid 0 xlated 144B jited 96B memlock 4096B map_ids 24 # bpftool prog dump xlated id 70 opcodes 0: (bf) r6 = r1 bf 16 00 00 00 00 00 00 1: (b7) r1 = 0 b7 01 00 00 00 00 00 00 2: (7b) *(u64 *)(r10 -16) = r1 7b 1a f0 ff 00 00 00 00 3: (85) call bpf_get_current_pid_tgid#98768 85 00 00 00 d0 81 01 00 4: (77) r0 >>= 32 77 00 00 00 20 00 00 00 5: (7b) *(u64 *)(r10 -8) = r0 7b 0a f8 ff 00 00 00 00 6: (18) r7 = map[id:24] 18 17 00 00 18 00 00 00 00 00 00 00 00 00 00 00 [...] BPF bytecode with human words (2) Using bpftool on a running instance of the program
  • 73. BPF Internals (Brendan Gregg) # bpftool prog dump xlated id 70 0: (bf) r6 = r1 1: (b7) r1 = 0 2: (7b) *(u64 *)(r10 -16) = r1 3: (85) call bpf_get_current_pid_tgid#98768 4: (77) r0 >>= 32 5: (7b) *(u64 *)(r10 -8) = r0 6: (18) r7 = map[id:24] 8: (85) call bpf_get_smp_processor_id#98544 9: (bf) r4 = r10 10: (07) r4 += -16 11: (bf) r1 = r6 12: (bf) r2 = r7 13: (bf) r3 = r0 14: (b7) r5 = 16 15: (85) call bpf_perf_event_output#-54224 16: (b7) r0 = 0 17: (95) exit BPF bytecode, opcodes only just the opcode 8 bits
  • 74. BPF Internals (Brendan Gregg) bpftrace mid-level internals 9/13 BPF Internals (Brendan Gregg)
  • 75. BPF Internals (Brendan Gregg) BPF bytecode native machine code → x86 machine code BPF bytecode arch/x86/net/bpf_jit_comp.c arm machine code sparc machine code ... arch/arm64/net/bpf_jit_comp.c arch/sparc/net/bpf_jit_*
  • 76. BPF Internals (Brendan Gregg) BPF bytecode x86 machine code → If ... static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, int oldproglen, struct jit_context *ctx) { […] for (i = 1; i <= insn_cnt; i++, insn++) { […] switch (insn->code) { […] case BPF_JMP | BPF_CALL: func = (u8 *) __bpf_call_base + imm32; if (!imm32 || emit_call(&prog, func, image + addrs[i - 1])) return -EINVAL; break; […] static int emit_call(u8 **pprog, void *func, void *ip) { return emit_patch(pprog, func, ip, 0xE8); } Linux arch/x86/net/bpf_jit_comp.c
  • 77. BPF Internals (Brendan Gregg) BPF bytecode x86 machine code → If ... static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, int oldproglen, struct jit_context *ctx) { […] for (i = 1; i <= insn_cnt; i++, insn++) { […] switch (insn->code) { […] case BPF_JMP | BPF_CALL: func = (u8 *) __bpf_call_base + imm32; if (!imm32 || emit_call(&prog, func, image + addrs[i - 1])) return -EINVAL; break; […] static int emit_call(u8 **pprog, void *func, void *ip) { return emit_patch(pprog, func, ip, 0xE8); } Linux arch/x86/net/bpf_jit_comp.c E.g., call get_current_pid_tgid 0xe8 is x86 CALL
  • 78. BPF Internals (Brendan Gregg) # bpftool prog dump jited id 80 opcodes | grep -v : 55 48 89 e5 48 81 ec 10 00 00 00 53 41 55 41 56 41 57 6a 00 48 89 fb 31 ff 48 89 7d f0 e8 a0 8b 44 c2 48 c1 e8 20 48 89 45 f8 49 bd 00 b6 7b b8 3a 9d ff ff e8 a9 8a 44 c2 48 89 e9 48 83 c1 f0 48 89 df [...] Now you have x86 machine code! 31 instructions
  • 79. BPF Internals (Brendan Gregg) # bpftool prog dump jited id 80 opcodes | grep -v : 55 48 89 e5 48 81 ec 10 00 00 00 53 41 55 41 56 41 57 6a 00 48 89 fb 31 ff 48 89 7d f0 e8 a0 8b 44 c2 48 c1 e8 20 48 89 45 f8 49 bd 00 b6 7b b8 3a 9d ff ff e8 a9 8a 44 c2 48 89 e9 48 83 c1 f0 48 89 df [...] Now you have x86 machine code! 31 instructions CALL get_current_pid_tgid
  • 80. BPF Internals (Brendan Gregg) # bpftool prog dump jited id 71 opcodes | grep -v : fd 7b bf a9 fd 03 00 91 f3 53 bf a9 f5 5b bf a9 f9 6b bf a9 f9 03 00 91 1a 00 80 d2 ff 43 00 d1 13 00 00 91 00 00 80 d2 ea 01 80 92 20 6b 2a f8 ea 48 9e 92 0a 05 a2 f2 0a 00 d0 f2 40 01 3f d6 07 00 00 91 e7 fc 60 d3 ea 00 80 92 [...] … or you have ARM machine code! 48 instructions
  • 81. BPF Internals (Brendan Gregg) # bpftool prog dump jited id 80 0xffffffffc0192b2e: 0: push %rbp 1: mov %rsp,%rbp 4: sub $0x10,%rsp b: push %rbx c: push %r13 e: push %r14 10: push %r15 12: pushq $0x0 14: mov %rdi,%rbx 17: xor %edi,%edi 19: mov %rdi,-0x10(%rbp) 1d: callq 0xffffffffc2448bc2 22: shr $0x20,%rax 26: mov %rax,-0x8(%rbp) 2a: movabs $0xffff9d3ab87bb600,%r13 34: callq 0xffffffffc2448ae2 39: mov %rbp,%rcx 3c: add $0xfffffffffffffff0,%rcx [...] x86 instruction disassembly
  • 82. BPF Internals (Brendan Gregg) # bpftool prog dump jited id 80 0xffffffffc0192b2e: 0: push %rbp 1: mov %rsp,%rbp 4: sub $0x10,%rsp b: push %rbx c: push %r13 e: push %r14 10: push %r15 12: pushq $0x0 14: mov %rdi,%rbx 17: xor %edi,%edi 19: mov %rdi,-0x10(%rbp) 1d: callq 0xffffffffc2448bc2 22: shr $0x20,%rax 26: mov %rax,-0x8(%rbp) 2a: movabs $0xffff9d3ab87bb600,%r13 34: callq 0xffffffffc2448ae2 39: mov %rbp,%rcx 3c: add $0xfffffffffffffff0,%rcx [...] x86 instruction disassembly (2) BPF prologue BPF program get_current_pid_tgid
  • 83. BPF Internals (Brendan Gregg) Plus you have BPF helper code BPF_CALL_0(bpf_get_current_pid_tgid) { struct task_struct *task = current; if (unlikely(!task)) return -EINVAL; return (u64) task->tgid << 32 | task->pid; } [...] Linux kernel/bpf/helpers.c
  • 84. BPF Internals (Brendan Gregg) bpftrace mid-level internals 10/13 BPF Internals (Brendan Gregg)
  • 85. BPF Internals (Brendan Gregg) # strace -fe perf_event_open,bpf,ioctl bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...n", pid); }' [...] bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_KPROBE, insn_cnt=18, insns=0x7f6e826cf000, license="GPL", log_level=0, log_size=0, log_buf=NULL, kern_version=KERNEL_VERSION(5, 8, 18), prog_flags=0, prog_name="do_nanosleep", prog_ifindex=0, expected_attach_type=BPF_CGROUP_INET_INGRESS, prog_btf_fd=0, func_info_rec_size=0, func_info=NULL, func_info_cnt=0, line_info_rec_size=0, line_info=NULL, line_info_cnt=0, attach_btf_id=0, attach_prog_fd=0}, 120) = 14 perf_event_open({type=0x6 /* PERF_TYPE_??? */, size=PERF_ATTR_SIZE_VER5, config=0, ...}, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = 13 ioctl(13, PERF_EVENT_IOC_SET_BPF, 14) = 0 ioctl(13, PERF_EVENT_IOC_ENABLE, 0) = 0 Attaching BPF to a kprobe
  • 86. BPF Internals (Brendan Gregg) # strace -fe perf_event_open,bpf,ioctl bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...n", pid); }' [...] bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_KPROBE, insn_cnt=18, insns=0x7f6e826cf000, license="GPL", log_level=0, log_size=0, log_buf=NULL, kern_version=KERNEL_VERSION(5, 8, 18), prog_flags=0, prog_name="do_nanosleep", prog_ifindex=0, expected_attach_type=BPF_CGROUP_INET_INGRESS, prog_btf_fd=0, func_info_rec_size=0, func_info=NULL, func_info_cnt=0, line_info_rec_size=0, line_info=NULL, line_info_cnt=0, attach_btf_id=0, attach_prog_fd=0}, 120) = 14 perf_event_open({type=0x6 /* PERF_TYPE_??? */, size=PERF_ATTR_SIZE_VER5, config=0, ...}, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = 13 ioctl(13, PERF_EVENT_IOC_SET_BPF, 14) = 0 ioctl(13, PERF_EVENT_IOC_ENABLE, 0) = 0 Attaching BPF to a kprobe (2) BPF program (#14) kprobe (#13) ioctl() creates the kprobe strace is lacking some translation bpf() perf_event_open()
  • 87. BPF Internals (Brendan Gregg) bpftrace mid-level internals 11/13 BPF Internals (Brendan Gregg)
  • 88. BPF Internals (Brendan Gregg) kprobes If ... static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode) { struct restart_block *restart; do { set_current_state(TASK_INTERRUPTIBLE); hrtimer_sleeper_start_expires(t, mode); if (likely(t->task)) [...] Linux kernel/time/hrtimer.c How do we instrument this? (it’s actually quite easy)
  • 89. BPF Internals (Brendan Gregg) (gdb) disas/r do_nanosleep Dump of assembler code for function do_nanosleep: 0xffffffff81b7d810 <+0>: e8 1b 00 4f ff callq 0xffffffff8106d830 <__fentry__> 0xffffffff81b7d815 <+5>: 55 push %rbp 0xffffffff81b7d816 <+6>: 89 f1 mov %esi,%ecx 0xffffffff81b7d818 <+8>: 48 89 e5 mov %rsp,%rbp 0xffffffff81b7d81b <+11>: 41 55 push %r13 0xffffffff81b7d81d <+13>: 41 54 push %r12 0xffffffff81b7d81f <+15>: 53 push %rbx [...] Instrumenting live kernel functions
  • 90. BPF Internals (Brendan Gregg) (gdb) disas/r do_nanosleep Dump of assembler code for function do_nanosleep: 0xffffffff81b7d810 <+0>: e8 1b 00 4f ff callq 0xffffffff8106d830 <__fentry__> 0xffffffff81b7d815 <+5>: 55 push %rbp 0xffffffff81b7d816 <+6>: 89 f1 mov %esi,%ecx 0xffffffff81b7d818 <+8>: 48 89 e5 mov %rsp,%rbp 0xffffffff81b7d81b <+11>: 41 55 push %r13 0xffffffff81b7d81d <+13>: 41 54 push %r12 0xffffffff81b7d81f <+15>: 53 push %rbx [...] Instrumenting live kernel functions (2) A) Ftrace is already there. Kprobes can add a handler. B) Or a breakpoint written (e.g., int3). C) Or a jmp is written. this is usually nop’d out May need to stop_machine() to ensure other cores don’t execute changing instruction text
  • 91. BPF Internals (Brendan Gregg) bpftrace mid-level internals 12/13 BPF Internals (Brendan Gregg)
  • 92. BPF Internals (Brendan Gregg) Perf output buffers # strace -fe perf_event_open bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...n", pid); }' strace: Process 3229968 attached [pid 3229968] +++ exited with 0 +++ perf_event_open({type=PERF_TYPE_SOFTWARE, ..., config=PERF_COUNT_SW_BPF_OUTPUT, ...}, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = 5 perf_event_open({type=PERF_TYPE_SOFTWARE, ..., config=PERF_COUNT_SW_BPF_OUTPUT, ...}, -1, 1, -1, PERF_FLAG_FD_CLOEXEC) = 6 perf_event_open({type=PERF_TYPE_SOFTWARE, ..., config=PERF_COUNT_SW_BPF_OUTPUT, ...}, -1, 2, -1, PERF_FLAG_FD_CLOEXEC) = 7 [...]
  • 93. BPF Internals (Brendan Gregg) Perf output buffers # strace -fe perf_event_open bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...n", pid); }' strace: Process 3229968 attached [pid 3229968] +++ exited with 0 +++ perf_event_open({type=PERF_TYPE_SOFTWARE, ..., config=PERF_COUNT_SW_BPF_OUTPUT, ...}, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = 5 perf_event_open({type=PERF_TYPE_SOFTWARE, ..., config=PERF_COUNT_SW_BPF_OUTPUT, ...}, -1, 1, -1, PERF_FLAG_FD_CLOEXEC) = 6 perf_event_open({type=PERF_TYPE_SOFTWARE, ..., config=PERF_COUNT_SW_BPF_OUTPUT, ...}, -1, 2, -1, PERF_FLAG_FD_CLOEXEC) = 7 [...] CPU ID output buffer 5 CPU 0 output buffer 6 CPU 1 output buffer 7 CPU 2 ... bpftrace fd 5 fd 6 fd 7 bpftrace waits for events using epoll_wait(2)
  • 94. BPF Internals (Brendan Gregg) bpftrace mid-level internals 13/13 BPF Internals (Brendan Gregg)
  • 95. BPF Internals (Brendan Gregg) bpftrace printf & async actions enum class AsyncAction { // clang-format off printf = 0, // printf reserves 0-9999 for printf_ids syscall = 10000, // system reserves 10000-19999 for printf_ids cat = 20000, // cat reserves 20000-29999 for printf_ids exit = 30000, print, clear, [...] bpftrace src/types.h printf_id 64-bit arguments bpftrace perf output message format: E.g., printf_id 0: "PID %d sleeping...n" High numbers are used for other async actions:
  • 96. BPF Internals (Brendan Gregg) bpftrace printf & async actions (2) void perf_event_printer(void *cb_cookie, void *data, int size) { […] auto printf_id = *reinterpret_cast<uint64_t *>(arg_data); […] // async actions if (printf_id == asyncactionint(AsyncAction::exit)) { bpftrace->request_finalize(); return; } […] // printf auto fmt = std::get<0>(bpftrace->printf_args_[printf_id]); auto args = std::get<1>(bpftrace->printf_args_[printf_id]); auto arg_values = bpftrace->get_arg_values(args, arg_data); bpftrace->out_->message(MessageType::printf, format(fmt, arg_values), false); bpftrace src/bpftrace.cpp message() just prints it out
  • 97. BPF Internals (Brendan Gregg) # bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...n", pid); }' Attaching 1 probe... PID 10287 sleeping... PID 10297 sleeping... PID 10287 sleeping… PID 10297 sleeping... PID 10287 sleeping... PID 2218 sleeping... PID 10297 sleeping... [...] Final output
  • 98. BPF Internals (Brendan Gregg) 2. Static tracing and map summaries
  • 99. BPF Internals (Brendan Gregg) 2. Static tracing and map summaries bpftrace -e 'tracepoint:block:block_rq_issue { @[comm] = count(); }'
  • 100. BPF Internals (Brendan Gregg) # bpftrace -e 'tracepoint:block:block_rq_issue { @[comm] = count(); }' Attaching 1 probe… ^C @[kworker/2:2H]: 131 @[chrome]: 135 @[kworker/7:1H]: 185 @[Xorg]: 245 @[tar]: 1204 @[dmcrypt_write/2]: 1993 Example output
  • 101. BPF Internals (Brendan Gregg) bpftrace mid-level internals 1/4 BPF Internals (Brendan Gregg)
  • 102. BPF Internals (Brendan Gregg) map @{ident}|@ [...] Lexer & Yacc bpftrace src/lexer.l %token <std::string> MAP "map" [...] stmt : expr { $$ = new ast::ExprStatement($1, @1); } | compound_assignment { $$ = $1; } | jump_stmt { $$ = $1; } | map "=" expr { $$ = new ast::AssignMapStatement($1, $3, false, @2); } [...] map : MAP { $$ = new ast::Map($1, @$); } | MAP "[" vargs "]" { $$ = new ast::Map($1, $3, @$); } bpftrace src/parser.yy
  • 103. BPF Internals (Brendan Gregg) bpftrace mid-level internals 2/4 BPF Internals (Brendan Gregg)
  • 104. BPF Internals (Brendan Gregg) BPF maps ● Custom data storage ● Can be a key/value store (hash) ● Also used for histogram summaries (BPF code calculates a bucket index as the key) tar 1204 chrome 135 Xorg 245 keys values @
  • 105. BPF Internals (Brendan Gregg) BPF map operations User-space BPF-space (kernel) bpf(2) syscall BPF_MAP_CREATE BPF_MAP_LOOKUP_ELEM BPF_MAP_UPDATE_ELEM BPF_MAP_DELETE_ELEM BPF_MAP_GET_NEXT_KEY [...] BPF helpers bpf_map_lookup_elem() bpf_map_update_elem() bpf_map_delete_elem() [...]
  • 106. BPF Internals (Brendan Gregg) Creating BPF maps # strace -febpf bpftrace -e 'block:block_rq_issue { @[comm] = count(); }' bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_ARRAY, key_size=4, value_size=4, max_entries=1, map_flags=0, inner_map_fd=0, map_name="", map_ifindex=0, btf_fd=0, btf_key_type_id=0, btf_value_type_id=0}, 120) = 3 bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_PERCPU_HASH, key_size=16, value_size=8, max_entries=4096, map_flags=0, inner_map_fd=0, map_name="@", map_ifindex=0, btf_fd=0, btf_key_type_id=0, btf_value_type_id=0}, 120) = -1 EINVAL (Invalid argument) bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_PERCPU_HASH, key_size=16, value_size=8, max_entries=4096, map_flags=0, inner_map_fd=0, map_name="", map_ifindex=0, btf_fd=0, btf_key_type_id=0, btf_value_type_id=0}, 120) = 3 bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_PERF_EVENT_ARRAY, key_size=4, value_size=4, max_entries=8, map_flags=0, inner_map_fd=0, map_name="printf", map_ifindex=0, btf_fd=0, btf_key_type_id=0, btf_value_type_id=0}, 120) = 4 [...]
  • 107. BPF Internals (Brendan Gregg) Creating BPF maps # strace -febpf bpftrace -e 'block:block_rq_issue { @[comm] = count(); }' bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_ARRAY, key_size=4, value_size=4, max_entries=1, map_flags=0, inner_map_fd=0, map_name="", map_ifindex=0, btf_fd=0, btf_key_type_id=0, btf_value_type_id=0}, 120) = 3 bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_PERCPU_HASH, key_size=16, value_size=8, max_entries=4096, map_flags=0, inner_map_fd=0, map_name="@", map_ifindex=0, btf_fd=0, btf_key_type_id=0, btf_value_type_id=0}, 120) = -1 EINVAL (Invalid argument) bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_PERCPU_HASH, key_size=16, value_size=8, max_entries=4096, map_flags=0, inner_map_fd=0, map_name="", map_ifindex=0, btf_fd=0, btf_key_type_id=0, btf_value_type_id=0}, 120) = 3 bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_PERF_EVENT_ARRAY, key_size=4, value_size=4, max_entries=8, map_flags=0, inner_map_fd=0, map_name="printf", map_ifindex=0, btf_fd=0, btf_key_type_id=0, btf_value_type_id=0}, 120) = 4 [...]
  • 108. BPF Internals (Brendan Gregg) bpftrace mid-level internals 3/4 BPF Internals (Brendan Gregg)
  • 109. BPF Internals (Brendan Gregg) Tracepoint defined If ... DECLARE_EVENT_CLASS(block_rq, TP_PROTO(struct request_queue *q, struct request *rq), TP_ARGS(q, rq), TP_STRUCT__entry( __field( dev_t, dev ) __field( sector_t, sector ) __field( unsigned int, nr_sector ) __field( unsigned int, bytes ) __array( char, rwbs, RWBS_LEN ) __array( char, comm, TASK_COMM_LEN ) __dynamic_array( char, cmd, 1 ) ), TP_fast_assign( __entry->dev = rq->rq_disk ? disk_devt(rq->rq_disk) : 0; __entry->sector = blk_rq_trace_sector(rq); [...] DEFINE_EVENT(block_rq, block_rq_issue, TP_PROTO(struct request_queue *q, struct request *rq), TP_ARGS(q, rq) ); Linux include/trace/events/block.h
  • 110. BPF Internals (Brendan Gregg) Tracepoints in code If ... void blk_mq_start_request(struct request *rq) { struct request_queue *q = rq->q; trace_block_rq_issue(q, rq); if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) { rq->io_start_time_ns = ktime_get_ns(); rq->stats_sectors = blk_rq_sectors(rq); [...] Linux block/block-mq.c This is a (best effort) stable interface Use tracepoints instead of kprobes when possible!
  • 111. BPF Internals (Brendan Gregg) (gdb) disas/r blk_mq_start_request Dump of assembler code for function blk_mq_start_request: 0xffffffff815118e0 <+0>: e8 4b bf b5 ff callq 0xffffffff8106d830 <__fentry__> 0xffffffff815118e5 <+5>: 55 push %rbp 0xffffffff815118e6 <+6>: 48 89 e5 mov %rsp,%rbp 0xffffffff815118e9 <+9>: 41 55 push %r13 0xffffffff815118eb <+11>: 41 54 push %r12 0xffffffff815118ed <+13>: 49 89 fc mov %rdi,%r12 0xffffffff815118f0 <+16>: 53 push %rbx 0xffffffff815118f1 <+17>: 4c 8b 2f mov (%rdi),%r13 0xffffffff815118f4 <+20>: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) 0xffffffff815118f9 <+25>: 49 8b 45 60 mov 0x60(%r13),%rax [...] Instrumenting tracepoints (2) How do we include the tracepoint without adding overhead? (this is actually quite easy)
  • 112. BPF Internals (Brendan Gregg) (gdb) disas/r blk_mq_start_request Dump of assembler code for function blk_mq_start_request: 0xffffffff815118e0 <+0>: e8 4b bf b5 ff callq 0xffffffff8106d830 <__fentry__> 0xffffffff815118e5 <+5>: 55 push %rbp 0xffffffff815118e6 <+6>: 48 89 e5 mov %rsp,%rbp 0xffffffff815118e9 <+9>: 41 55 push %r13 0xffffffff815118eb <+11>: 41 54 push %r12 0xffffffff815118ed <+13>: 49 89 fc mov %rdi,%r12 0xffffffff815118f0 <+16>: 53 push %rbx 0xffffffff815118f1 <+17>: 4c 8b 2f mov (%rdi),%r13 0xffffffff815118f4 <+20>: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) 0xffffffff815118f9 <+25>: 49 8b 45 60 mov 0x60(%r13),%rax [...] Instrumenting tracepoints (2) This 5-byte nop is a placeholder. Does nothing quickly. When the tracepoint is enabled, the nop becomes a jmp to the tracepoint trampoline. How do we include the tracepoint without adding overhead?
  • 113. BPF Internals (Brendan Gregg) bpftrace mid-level internals 4/4 BPF Internals (Brendan Gregg)
  • 114. BPF Internals (Brendan Gregg) User-space map iteration If ... int BPFtrace::print_map(IMap &map, uint32_t top, uint32_t div) […] while (bpf_get_next_key(map.mapfd_, old_key.data(), key.data()) == 0) { int value_size = map.type_.GetSize(); value_size *= nvalues; auto value = std::vector<uint8_t>(value_size); int err = bpf_lookup_elem(map.mapfd_, key.data(), value.data()); if (err == -1) { // key was removed by the eBPF program during bpf_get_next_key() and bpf_lookup_elem(), // let's skip this key continue; } else if (err) { LOG(ERROR) << "failed to look up elem: " << err; return -1; } values_by_key.push_back({key, value}); old_key = key; } bpftrace src/bpftrace.cpp libbcc/ libbpf bpf(2) syscall kernel
  • 115. BPF Internals (Brendan Gregg) Reading entire BPF maps # strace -febpf bpftrace -e 'block:block_rq_issue { @[comm] = count(); }' [...] bpf(BPF_MAP_LOOKUP_ELEM, {map_fd=3, key=0x557e422133e0, value=0x557e4221eab0, flags=BPF_ANY}, 120) = - 1 ENOENT (No such file or directory) bpf(BPF_MAP_GET_NEXT_KEY, {map_fd=3, key=0x557e422133e0, next_key=0x557e4224d3f0}, 120) = 0 bpf(BPF_MAP_LOOKUP_ELEM, {map_fd=3, key=0x557e4224d3f0, value=0x557e4221eab0, flags=BPF_ANY}, 120) = 0 bpf(BPF_MAP_GET_NEXT_KEY, {map_fd=3, key=0x557e422133e0, next_key=0x557e4224d3f0}, 120) = 0 bpf(BPF_MAP_LOOKUP_ELEM, {map_fd=3, key=0x557e4224d3f0, value=0x557e4221eab0, flags=BPF_ANY}, 120) = 0 bpf(BPF_MAP_GET_NEXT_KEY, {map_fd=3, key=0x557e422133e0, next_key=0x557e4224d3f0}, 120) = 0 bpf(BPF_MAP_LOOKUP_ELEM, {map_fd=3, key=0x557e4224d3f0, value=0x557e4221eab0, flags=BPF_ANY}, 120) = 0 bpf(BPF_MAP_GET_NEXT_KEY, {map_fd=3, key=0x557e422133e0, next_key=0x557e4224d3f0}, 120) = 0 bpf(BPF_MAP_LOOKUP_ELEM, {map_fd=3, key=0x557e4224d3f0, value=0x557e4221eab0, flags=BPF_ANY}, 120) = 0 bpf(BPF_MAP_GET_NEXT_KEY, {map_fd=3, key=0x557e422133e0, next_key=0x557e4224d3f0}, 120) = 0 bpf(BPF_MAP_LOOKUP_ELEM, {map_fd=3, key=0x557e4224d3f0, value=0x557e4221eab0, flags=BPF_ANY}, 120) = 0 bpf(BPF_MAP_GET_NEXT_KEY, {map_fd=3, key=0x557e422133e0, next_key=0x557e4224d3f0}, 120) = 0 bpf(BPF_MAP_LOOKUP_ELEM, {map_fd=3, key=0x557e4224d3f0, value=0x557e4221eab0, flags=BPF_ANY}, 120) = 0 [...] This is an infrequent activity (this program only does this once)
  • 116. BPF Internals (Brendan Gregg) # bpftrace -e 'tracepoint:block:block_rq_issue { @[comm] = count(); }' Attaching 1 probe… ^C @[kworker/2:2H]: 131 @[chrome]: 135 @[kworker/7:1H]: 185 @[Xorg]: 245 @[tar]: 1204 @[dmcrypt_write/2]: 1993 Final output
  • 117. BPF Internals (Brendan Gregg) Stack walking BTF CO-RE Tests raw_tracepoints & fentry Networking & XDP Security & Cgroups Discussion of other internals
  • 118. BPF Internals (Brendan Gregg) BPF tracing/observability high-level recap From: BPF Performance Tools, Figure 2-1
  • 119. BPF Internals (Brendan Gregg) BPF mid-level internals recap From: BPF Performance Tools, Figure 2-3
  • 120. BPF Internals (Brendan Gregg) PSA CONFIG_DEBUG_INFO_BTF=y E.g., Ubuntu 20.10, Fedora 30, and RHEL 8.2 have it
  • 121. BPF Internals (Brendan Gregg) References This is also where I recommend you go to learn more: ● https://events.static.linuxfound.org/sites/events/files/slides/ bpf_collabsummit_2015feb20.pdf ● Linux include/uapi/linux/bpf_common.h ● Linux include/uapi/linux/bpf.h ● Linux include/uapi/linux/filter.h ● https://docs.cilium.io/en/v1.9/bpf/#bpf-guide ● BPF Performance Tools, Addison-Wesley 2020 ● https://ebpf.io/what-is-ebpf ● http://www.brendangregg.com/ebpf.html ● https://github.com/iovisor/bcc ● https://github.com/iovisor/bpftrace
  • 122. BPF Internals (Brendan Gregg) Thanks BPF: Alexei Starovoitov (Facebook), Daniel Borkmann (Isovalent), David S. Miller (Red Hat), Jakub Kicinski (Facebook), Yonghong Song (Facebook), Martin KaFai Lau (Facebook), John Fastabend (Isovalent), Quentin Monnet (Isovalent), Jesper Dangaard Brouer (Red Hat), Andrey Ignatov (Facebook), and Stanislav Fomichev (Google), Linus Torvalds, and many more in the BPF community LLVM BPF: Alexei Starovoitov, Chandler Carruth (Google), Yonghong Song, and more bpftrace: Alastair Robertson (Yellowbrick Data), Dan Xu (Facebook), Bas Smit, Mary Marchini (Netflix), Masanori Misono, Jiri Olsa, Viktor Malík, Dale Hamel, Willian Gaspar, Augusto Mecking Caringi, and many more in the bpftrace community USENIX Jun, 2021 https://ebpf.io