Network file systems offer a powerful, transparent interface
for accessing remote data. Unfortunately, in current network file systems like NFS, clients fetch data from a central file server, inherently limiting the system's ability to scale to many clients. While recent distributed (peer-to-peer) systems have managed to eliminate this scalability
bottleneck, they are often exceedingly complex and provide
non-standard models for administration and accountability. We present Shark, a novel system that retains the best of both worlds -- the scalability of distributed systems
with the simplicity of central servers.
Shark is a distributed file system designed for large-scale,
wide-area deployment, while also providing a drop-in replacement for local-area file systems. Shark introduces a novel cooperative-caching mechanism, in which
mutually distrustful clients can exploit each other's file
caches to reduce load on an origin file server. Using a distributed index, Shark clients find nearby copies of data, even when files originate from different servers. Performance results show that Shark can greatly reduce server
load and improve client latency for read-heavy workloads, both in the wide and local areas, while still remaining
competitive for single clients in the local area. Thus, Shark enables modestly-provisioned file servers to scale to hundreds of read-mostly clients while retaining traditional usability, consistency, security, and accountability.
Shark: Scaling File Servers via Cooperative Caching
1. Shark: Scaling File Servers via Cooperative Caching
Siddhartha Annapureddy, Michael J. Freedman, David Mazières
New York University
Networked Systems Design and Implementation, 2005
2. Motivation
Scenario – Want to test a new application on a hundred nodes
[Diagram: a hundred workstations]
Problem – Need to push software to all nodes and collect results
3. Current Approaches
Distribute from a central server
[Diagram: central server pushing data to clients across networks A and B]
rsync, unison, scp
– Server's network uplink saturates – operation takes a long time to finish
– Wastes bandwidth along bottleneck links
4. Current Approaches
File distribution mechanisms
BitTorrent, Bullet
+ Scales by offloading burden from the server
– Clients may download from halfway across the world
5. Inherent Problems with Copying Files
Users have to decide a priori what to ship
Ship too much – Wastes bandwidth, takes a long time
Ship too little – Hassle to work in a poor environment
Idle environments consume disk space
Users are loath to clean up ⇒ Redistribution
Need low cost solution for refetching files
Manual management of software versions
Illusion of having development environment
Programs fetched transparently on demand
6. Networked File Systems
+ Know how to deploy these systems
+ Know how to administer such systems
+ Simple accountability mechanisms
E.g., NFS, AFS, SFS
9. Peer-to-Peer Filesystems
P2P Filesystems (Pond, Ivy)
+ Scalability
– Non-standard administrative models
– New filesystem semantics
– Haven’t been widely deployed yet
Goal – Best of Both Worlds
Convenience of central servers
Scalability of peer-to-peer
10. Let’s Design this New Filesystem
[Diagram: client and server]
Vanilla SFS
Central server model
11. Scalability – Client-side Caching
[Diagram: client with a local cache, plus server]
Large client cache à la AFS
Scales by avoiding redundant data transfers
Leases to ensure NFS-style cache consistency
With multiple clients, must address bandwidth concerns
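A minimal sketch of the lease check above, assuming a fixed lease duration and hypothetical fetch_attrs/fetch_data helpers (not Shark's actual interfaces): within the lease window the cached copy is served directly; after it expires, the client revalidates attributes with the server before trusting its cache.

```python
import time

LEASE_SECS = 300  # assumed lease duration; the paper's value may differ

class CacheEntry:
    def __init__(self, data, attrs, fetched_at):
        self.data, self.attrs, self.fetched_at = data, attrs, fetched_at

def cached_read(cache, fh, fetch_attrs, fetch_data):
    entry = cache.get(fh)
    now = time.time()
    if entry and now - entry.fetched_at < LEASE_SECS:
        return entry.data                      # lease valid: no server trip
    attrs = fetch_attrs(fh)                    # revalidate with the server
    if entry and attrs == entry.attrs:
        entry.fetched_at = now                 # unchanged: renew the lease
        return entry.data
    data = fetch_data(fh)                      # stale or absent: refetch
    cache[fh] = CacheEntry(data, attrs, now)
    return data
```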
13. Scalability – Cooperative Caching
Clients fetch data from each other and offload burden from the server
[Diagram: clients exchanging chunks with each other and the origin server]
Shark clients maintain a distributed index
Fetch a file from multiple other clients in parallel
LBFS-style chunks – Variable-sized blocks (see the sketch below)
Chunks are better – Exploit commonalities across files
Chunks preserved across file versions, concatenations
Locality-awareness – Preferentially fetch chunks from nearby clients
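A minimal sketch of content-defined chunking in the LBFS style. This is not Shark's implementation: a simple polynomial rolling hash stands in for LBFS's Rabin fingerprints, and the window and size parameters are illustrative assumptions.

```python
# Illustrative parameters; LBFS uses Rabin fingerprints over a 48-byte
# window with roughly 8 KB expected chunks.
WINDOW = 48
BOUNDARY_MASK = (1 << 13) - 1           # boundary when low 13 bits all set
MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024
BASE, MOD = 257, (1 << 61) - 1
POW_W = pow(BASE, WINDOW - 1, MOD)      # used to evict the oldest byte

def chunk_file(data: bytes) -> list[bytes]:
    """Split data at content-defined boundaries, so an insertion early in
    a file does not shift the chunk boundaries that come after it."""
    chunks, start, h = [], 0, 0
    for i in range(len(data)):
        if i - start >= WINDOW:          # rolling window is full
            h = (h - data[i - WINDOW] * POW_W) % MOD
        h = (h * BASE + data[i]) % MOD
        size = i - start + 1
        hit = size >= MIN_CHUNK and (h & BOUNDARY_MASK) == BOUNDARY_MASK
        if hit or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```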
17. Cross-filesystem Sharing
Global cooperative cache regardless of origin servers
[Diagram: two groups of clients, one per server A and B, forming a single cooperative cache]
Two groups of clients accessing servers A and B
Client groups share a large amount of software
Such clients automatically form a global cache
19. Application
Linux distribution with LiveCDs
[Diagram: Knoppix and Slax LiveCD clients and mirrors sharing one cooperative cache]
LiveCD – Run an entire OS without using hard disk
But all your programs must fit on a CD-ROM
Download dynamically from a server instead – but this has scalability problems
Knoppix and Slax users form global cache – Relieve servers
20. Cooperative Caching – Roadmap
[Diagram: client fetches metadata from the server, looks up peers in the overlay, then downloads chunks from them]
Fetch metadata from the server
Look up clients caching needed chunks in overlay
Connect to multiple clients and download chunks in parallel
Check integrity of fetched chunks
Clients are mutually distrustful
25. File Metadata
[Diagram: client sends GET_TOKEN(fh, offset, count) to the server; the returned metadata is a list of chunk tokens]
Possession of a token implies server permission to read
Tokens are a shared secret between authorized clients
Tokens can be used to check integrity of fetched data
Given a chunk B...
Chunk token TB = H(B)
H is a collision-resistant hash function
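A small sketch of chunk tokens, assuming SHA-256 as the collision-resistant hash H, and a hypothetical read_chunk helper on the server side:

```python
import hashlib

def chunk_token(chunk: bytes) -> bytes:
    """T_B = H(B): the token doubles as a read capability and as an
    integrity check. SHA-256 stands in for the paper's hash H."""
    return hashlib.sha256(chunk).digest()

def verify_chunk(chunk: bytes, token: bytes) -> bool:
    """Receiver-side check: accept a chunk only if H(chunk) == T_B."""
    return chunk_token(chunk) == token

def get_token(read_chunk, fh, offset: int, count: int) -> bytes:
    """Server side of GET_TOKEN(fh, offset, count); read_chunk is a
    hypothetical helper mapping the byte range to a chunk's bytes."""
    return chunk_token(read_chunk(fh, offset, count))
```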
29. Discovering Clients Caching Chunks
[Diagram: client queries the overlay to discover clients caching a chunk]
For every chunk B, there's an indexing key IB
IB is used to index clients caching B
Cannot set IB = TB, as TB is a secret
IB = MAC(TB, "Indexing Key")
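The derivation in one line, with HMAC-SHA-256 standing in for the paper's MAC:

```python
import hashlib, hmac

def indexing_key(token: bytes) -> bytes:
    """I_B = MAC(T_B, "Indexing Key"): safe to publish in the index,
    since knowing I_B does not reveal the secret T_B."""
    return hmac.new(token, b"Indexing Key", hashlib.sha256).digest()
```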
30. Locality Awareness
[Diagram: clients grouped into networks A and B]
Overlay organized as clusters based on latency
Indexing infrastructure preferentially returns sources in the same cluster as the client
Hence, chunks usually transferred from nearby clients
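A toy sketch of cluster-preferring source selection; the Source type and the fan-out of 4 are assumptions, and in Shark the Coral-based indexing layer itself returns nearby sources first.

```python
from dataclasses import dataclass

@dataclass
class Source:
    addr: str
    cluster: str        # latency-based cluster id from the overlay

def pick_sources(candidates: list[Source], my_cluster: str,
                 want: int = 4) -> list[Source]:
    """Prefer sources in the client's own cluster; fall back to the rest."""
    near = [s for s in candidates if s.cluster == my_cluster]
    far = [s for s in candidates if s.cluster != my_cluster]
    return (near + far)[:want]
```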
31. Final Steps
Download Chunk
Security issues discussed later
Register as a source
Client now becomes a source for the downloaded chunk
Client registers in distributed index – PUT(IB, Addr)
Chunk Reconciliation
Reuse connection to download more chunks
Exchange mutually needed chunks w/o indexing overhead
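A sketch of the registration step, with an in-memory stub standing in for the Coral-based distributed index (the put/get interface is an assumption); verify_chunk and indexing_key are the helpers from the sketches above.

```python
from collections import defaultdict

class LocalIndexStub:
    """In-memory stand-in for Shark's distributed index."""
    def __init__(self):
        self.sources = defaultdict(list)

    def put(self, key: bytes, addr: str) -> None:
        self.sources[key].append(addr)       # register addr as a source

    def get(self, key: bytes, limit: int = 8) -> list[str]:
        return self.sources[key][:limit]

def register_after_download(index, token: bytes, chunk: bytes,
                            my_addr: str) -> None:
    """Once a fetched chunk verifies against its token, the client
    announces itself as a source: PUT(I_B, Addr)."""
    if verify_chunk(chunk, token):           # from the earlier sketch
        index.put(indexing_key(token), my_addr)
```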
32. Security Issues – Client Authentication
[Diagram: server authenticates clients traditionally; peer transfers use an authenticator derived from the token]
Traditionally, server authenticated read requests using uids
Challenge – How does a client know when to send chunks?
Chunk token allows a client to identify authorized clients
34. Security Issues – Client Communication
A client should be able to check the integrity of a downloaded chunk
A client should not send chunks to unauthorized clients
An eavesdropper should not be able to obtain chunk contents
36. Security Protocol
[Diagram: receiver and source exchange nonces RC and RP; receiver sends IB, AuthC; source replies with EK(B)]
RC, RP – Random nonces to ensure freshness
AuthC – Authenticator to prove the receiver has the token
AuthC = MAC(TB, "Auth C", C, P, RC, RP)
K – Key to encrypt chunk contents
K = MAC(TB, "Encryption", C, P, RC, RP)
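A sketch of the two derived secrets, following the slide's definitions; HMAC-SHA-256 stands in for the paper's MAC, and the encoding of the MAC inputs is an assumption.

```python
import hashlib, hmac, os

def _mac(token: bytes, *fields: bytes) -> bytes:
    # Field concatenation with a separator is an assumed encoding.
    return hmac.new(token, b"|".join(fields), hashlib.sha256).digest()

def handshake_secrets(token: bytes, client_id: bytes, peer_id: bytes,
                      r_c: bytes, r_p: bytes) -> tuple[bytes, bytes]:
    """Derive AuthC (proves the receiver knows T_B) and K (encrypts the
    chunk), both bound to the endpoints C, P and the fresh nonces."""
    auth_c = _mac(token, b"Auth C", client_id, peer_id, r_c, r_p)
    k = _mac(token, b"Encryption", client_id, peer_id, r_c, r_p)
    return auth_c, k

# Example: each side contributes a fresh 16-byte nonce.
r_c, r_p = os.urandom(16), os.urandom(16)
auth_c, k = handshake_secrets(b"example-T_B", b"C", b"P", r_c, r_p)
```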
39. Security Properties
[Protocol diagram as above]
Client can check integrity of a downloaded chunk
Client checks that H(downloaded chunk) = TB
Source should not send chunks to unauthorized clients
Malicious clients cannot send correct AuthC
Eavesdropper shouldn’t get chunk contents
All communication encrypted with K
40. Security Properties
[Protocol diagram as above]
Privacy limitations for world-readable files
An eavesdropper can track clients' lookups
An eavesdropper can hash known data to learn exactly what a client downloads
For private files, solution described in paper
Sacrifices cross-FS sharing for better privacy
Forward Secrecy not guaranteed
41. Implementation
[Diagram: on each client, applications make syscalls that reach sharkcd over NFSv3; sharkcd talks to corald for indexing and to sharksd, which fronts the NFSv3 server]
sharkcd – Incorporates source-receiver client functionality
sharksd – Incorporates chunking mechanisms
corald – A node in the indexing infrastructure
42. Evaluation
How does Shark compare with SFS? With NFS?
How scalable is the server?
How fair is Shark across clients?
Which fetch order is better – random or sequential?
What are the benefits of set reconciliation?
43. Emulab – 100 Nodes on LAN
[CDF: percentage of clients completing a 40 MB read over time – Shark (rand, negotiation), NFS, SFS]
Shark – 88s
SFS – 775s (Shark ≈ 9x better), NFS – 350s (Shark ≈ 4x better)
SFS less fair because of TCP backoffs
44. PlanetLab – 185 Nodes
[CDF: percentage of clients completing a 40 MB read over time on PlanetLab – Shark vs SFS]
Shark ≈ 7 min – 95th percentile
SFS ≈ 39 min – 95th percentile (Shark ≈ 5x better)
NFS – Triggered kernel panics on the server
45. Data pushed by Server
[Bar chart: data pushed by the server, normalized – SFS 7400 MB vs Shark 954 MB]
Shark vs SFS – 23 copies vs 185 copies (8x better)
47. Fetching Chunks – Order Matters
In what order should we fetch chunks of a file?
Natural choices – Random or Sequential
Intuitively, when many clients start simultaneously
Random
All clients fetch independent chunks
More chunks become available in the cooperative cache
Sequential
Better disk I/O scheduling on the server
Only the client that is furthest ahead adds new chunks to the cache
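A toy illustration of the two orders being compared; the helper is illustrative, not Shark's scheduler.

```python
import random

def fetch_order(num_chunks: int, randomize: bool = True) -> list[int]:
    """Order in which a client requests a file's chunks. Shuffling makes
    simultaneous clients fetch disjoint chunks, so the cooperative cache
    fills with distinct pieces faster."""
    order = list(range(num_chunks))
    if randomize:
        random.shuffle(order)
    return order
```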
48. Emulab – 100 Nodes on LAN
[CDF: 40 MB read on 100 Emulab LAN nodes – Shark random vs sequential fetch order]
Random – 133s
Sequential – 203s
Random wins! – 35% better
49. Emulab – 100 Nodes on LAN
[CDF: 40 MB read on 100 Emulab LAN nodes – Shark random, with and without chunk negotiation]
Random + Reconciliation – 88s
Random – 133s
Reconciliation crucial – 34% improvement
50. Conclusions
Networked filesystems offer a convenient interface
Current networked filesystems like NFS are not scalable
Forces users to resort to inconvenient tools like rsync
Shark offers a filesystem that scales to hundreds of clients
Locality-aware cooperative cache
Supports cross-FS sharing enabling novel applications