2. History of sendfile(2) Before sendfile(2)
Miserable life w/o sendfile(2)
while ((cnt = read(filefd, buf, (u_int)blksize))
write(netfd, buf, cnt) == cnt)
byte_count += cnt;
send_data() в src/libexec/ftpd/ftpd.c,
FreeBSD 1.0, 1993
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 2 / 23
3. History of sendfile(2) sendfile(2) introduced
sendfile(2) introduced
int
sendfile(int fd, int s, off_t offset, size_t nbytes, .. );
1997: HP-UX 11.00
1998: FreeBSD 3.0 and Linux 2.2
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 3 / 23
4. History of sendfile(2) sendfile(2) in FreeBSD
sendfile(2) in FreeBSD
First implementation - mapping userland cycle to the
kernel:
read(filefd) → VOP_READ(vnode)
write(netfd) → sosend(socket)
blksize → PAGE_SIZE
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 4 / 23
5. History of sendfile(2) sendfile(2) in FreeBSD
sendfile(2) in FreeBSD
First implementation - mapping userland cycle to the
kernel:
read(filefd) → VOP_READ(vnode)
write(netfd) → sosend(socket)
blksize → PAGE_SIZE
Further optimisations:
2004: SF_NODISKIO flag
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 4 / 23
6. History of sendfile(2) sendfile(2) in FreeBSD
sendfile(2) in FreeBSD
First implementation - mapping userland cycle to the
kernel:
read(filefd) → VOP_READ(vnode)
write(netfd) → sosend(socket)
blksize → PAGE_SIZE
Further optimisations:
2004: SF_NODISKIO flag
2006: inner cycle, working on sbspace() bytes
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 4 / 23
7. History of sendfile(2) sendfile(2) in FreeBSD
sendfile(2) in FreeBSD
First implementation - mapping userland cycle to the
kernel:
read(filefd) → VOP_READ(vnode)
write(netfd) → sosend(socket)
blksize → PAGE_SIZE
Further optimisations:
2004: SF_NODISKIO flag
2006: inner cycle, working on sbspace() bytes
2013: sending a shared memory descriptor data
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 4 / 23
8. What’s not right with sendfile(2) blocking on I/O
Problem #1: blocking on I/O
Algorithm of a modern HTTP-server:
1 Take yet another descriptor from kevent(2)
2 Do write(2)/read(2)/sendfile(2) on it
3 Go to 1
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 5 / 23
9. What’s not right with sendfile(2) blocking on I/O
Problem #1: blocking on I/O
Algorithm of a modern HTTP-server:
1 Take yet another descriptor from kevent(2)
2 Do write(2)/read(2)/sendfile(2) on it
3 Go to 1
Bottleneck: any syscall time.
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 5 / 23
10. What’s not right with sendfile(2) blocking on I/O
Attempts to solve problem #1
Separate I/O contexts: processes, threads
Apache
nginx 2
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 6 / 23
11. What’s not right with sendfile(2) blocking on I/O
Attempts to solve problem #1
Separate I/O contexts: processes, threads
Apache
nginx 2
SF_NODISKIO + aio_read(2)
nginx
Varnish
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 6 / 23
12. What’s not right with sendfile(2) blocking on I/O
More attempts . . .
aio_mlock(2) instead of aio_read(2)
aio_sendfile(2) ???
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 7 / 23
13. What’s not right with sendfile(2) control over
Problem #2: control over VM
VOP_READ() leaves pages in VM cache
VOP_READ() [for UFS] does readahead
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 8 / 23
14. What’s not right with sendfile(2) control over
Problem #2: control over VM
VOP_READ() leaves pages in VM cache
VOP_READ() [for UFS] does readahead
Not easy to prevent it doing that!
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 8 / 23
15. New sendfile(2) implementation above pager
waht if VOP_GETPAGES()?
VOP_READ() → VOP_GETPAGES()
Pros:
sendfile() already works on pages
implementations for vnode and shmem converge
control over VM is now easier task
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 9 / 23
16. New sendfile(2) implementation above pager
waht if VOP_GETPAGES()?
VOP_READ() → VOP_GETPAGES()
Pros:
sendfile() already works on pages
implementations for vnode and shmem converge
control over VM is now easier task
Cons
Losing readahead heuristics
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 9 / 23
17. New sendfile(2) implementation above pager
waht if VOP_GETPAGES()?
VOP_READ() → VOP_GETPAGES()
Pros:
sendfile() already works on pages
implementations for vnode and shmem converge
control over VM is now easier task
Cons
Losing readahead heuristics
But no one used them!
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 9 / 23
18. New sendfile(2) VOP_GETPAGES_ASYNC()
VOP_GETPAGES_ASYNC()
int
VOP_GETPAGES(struct vnode *vp, vm_page_t *ma,
int count, int reqpage);
1 Initialize buf(9)
2 buf->b_iodone = bdone;
3 bstrategy(buf);
4 bwait(buf); /* sleeps until I/O completes */
5 return;
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 10 / 23
19. New sendfile(2) VOP_GETPAGES_ASYNC()
VOP_GETPAGES_ASYNC()
int
VOP_GETPAGES_ASYNC(struct vnode *vp,
vm_page_t *ma, int count, int reqpage,
vop_getpages_iodone_t *iodone, void *arg);
1 Initialize buf(9)
2 buf->b_iodone = vnode_pager_async_iodone;
3 bstrategy(buf);
4 return;
vnode_pager_async_iodone calls iodone() .
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 10 / 23
20. New sendfile(2) non-blocking sendfile(2)
naive non-blocking sendfile(2)
In kern_sendfile():
1 nios++;
2 VOP_GETPAGES_ASYNC(sendfile_iodone);
In sendfile_iodone():
1 nios--;
2 if (nios) return;
3 sosend();
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 11 / 23
21. New sendfile(2) non-blocking sendfile(2)
the problem of naive implementation
sendfile(filefd, sockfd, ..);
write(sockfd, ..);
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 12 / 23
22. New sendfile(2) “not ready” data in socket buffers
socket buffer
mbuf mbuf mbuf mbuf mbuf mbuf
struct sockbuf
struct mbuf *sb_mb
struct mbuf *sb_mbtail
u_int sb_cc
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 13 / 23
23. New sendfile(2) “not ready” data in socket buffers
socket buffer with “not ready” data
mbuf mbuf mbuf mbuf mbuf mbuf
page page
struct sockbuf
struct mbuf *sb_mb
struct mbuf *sb_fnrdy
struct mbuf *sb_mbtail
u_int sb_acc
u_int sb_ccc
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 14 / 23
24. New sendfile(2) final implementation
non-blocking sendfile(2)
In kern_sendfile():
1 nios++;
2 VOP_GETPAGES_ASYNC(sendfile_iodone);
3 sosend(NOT_READY);
In sendfile_iodone():
1 nios--;
2 if (nios) return;
3 soready();
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 15 / 23
25. New sendfile(2) comparison with old sendfile(2)
traffic
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 16 / 23
26. New sendfile(2) comparison with old sendfile(2)
CPU idle
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 17 / 23
27. New sendfile(2) comparison with old sendfile(2)
profiling sendfile(2) in head
aio_daemon 13.64%
sys_sendfile 7.40%
t4_intr 5.66%
xpt_done 1.04%
pagedaemon 4.16%
scheduler 5.28%
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 18 / 23
28. New sendfile(2) comparison with old sendfile(2)
profiling new sendfile(2)
sys_sendfile 16.9%
t4_intr 8.17%
xpt_done 9.91%
pagedaemon 6.54%
scheduler 3.58%
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 19 / 23
29. New sendfile(2) comparison with old sendfile(2)
profiling new sendfile(2)
sys_sendfile 16.9% (vm_page_grab 9.24% !!)
t4_intr 8.17% (tcp_output() 2.07% !!)
xpt_done 9.91% (m_freem() 3.11% !!)
pagedaemon 6.54%
scheduler 3.58%
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 19 / 23
30. New sendfile(2) comparison with old sendfile(2)
what did change?
New code always sends full socket buffer
Which is good for TCP (as protocol)
Which hurts VM, mbuf allocator,
and unexpectedly TCP stack
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 20 / 23
31. New sendfile(2) comparison with old sendfile(2)
what did change?
New code always sends full socket buffer
Which is good for TCP (as protocol)
Which hurts VM, mbuf allocator,
and unexpectedly TCP stack
Will fix that!
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 20 / 23
32. New sendfile(2) comparison with old sendfile(2)
old sendfile(2) @ Netflix
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 21 / 23
33. New sendfile(2) comparison with old sendfile(2)
new sendfile(2) @ Netflix
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 21 / 23
34. New sendfile(2) plans and problems
TODO list
Problems:
VM & I/O overcommit
ZFS
SCTP
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 22 / 23
35. New sendfile(2) plans and problems
TODO list
Problems:
VM & I/O overcommit
ZFS
SCTP
Future plans:
sendfile(2) doing TLS
Gleb Smirnoff glebius@FreeBSD.org New sendfile(2) 20 February 2015 22 / 23