Scaling Data Servers via Cooperative Caching

                  Siddhartha Annapureddy

                         New York University


                           Committee:
    David Mazières, Lakshminarayanan Subramanian, Jinyang Li,
                    Dennis Shasha, Zvi Kedem




Motivation




                   New functionality over the Internet

      Large number of users consume many (large) files
             YouTube, Azureus etc. for movies – DVDs no more
             Data dissemination to large distributed farms



Current approaches


                                      Server




                    Client             Client          Client



      Scalability ↓ – Server’s uplink saturates
          Operation takes a long time to finish



Current approaches (contd.)



      Server farms replace central server
          Scalability problem persists
          Notoriously expensive to maintain




      Content Distribution Networks
      (CDNs)
          Still quite expensive


      Peer-to-peer (P2P) systems
          A new development

Peer-to-peer (P2P) systems


      Low-end central server
      Large client population

      File usually divided into chunks/blocks
      Use client resources to disseminate file
          Bandwidth, disk space etc.
      Clients organized into a connected overlay
          Each client knows only a few other clients


         How to design P2P systems to scale data servers?




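A rough sketch of the local state such a client keeps (illustrative only; the class and method names below are not from the thesis): each peer holds a subset of the file's chunks, knows only a few neighbours, and falls back to the central server when no neighbour has a chunk.

    # Minimal sketch of a P2P client's local state (illustrative, not Shark's API).
    class Peer:
        def __init__(self, peer_id, neighbours, server):
            self.peer_id = peer_id
            self.neighbours = list(neighbours)   # each client knows only a few others
            self.server = server                 # low-end central server
            self.chunks = {}                     # chunk_id -> chunk data

        def fetch(self, chunk_id):
            # Prefer neighbours; fall back to the central server only if needed.
            for n in self.neighbours:
                if chunk_id in n.chunks:
                    self.chunks[chunk_id] = n.chunks[chunk_id]
                    return
            self.chunks[chunk_id] = self.server.read_chunk(chunk_id)  # hypothetical call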
P2P scalability – Design principles



   In order to achieve scalability, need to avoid hotspots

       Keep bandwidth load on the server to a minimum
            Server gives data to only a few users
                 Leverage participants to distribute to other users
            Maintenance overhead with users to be kept low

       Keep bandwidth load on users to a minimum
            Each node interacts with only a small subset of users




P2P systems – Themes


      Usability – Exploiting P2P for a range of systems
          File systems, user-level apps etc.
      Security issues in such wide-area P2P systems
          Providing data integrity, privacy, authentication

      Cross file(system) sharing – Use other groups
          Make use of all available sources
      Topology management – Arrange nodes efficiently
          Underlying network (locality properties)
          Node capacities (bandwidth, disk space etc.)
          Application-specific (certain nodes together work well)
      Scheduling dissemination of chunks efficiently
          Nodes only have partial information



Shark – Motivation

    Scenario – Want to test a new application on a hundred nodes

                    [Diagram: a hundred workstations spread across the network]


    Problem – Need to push software to all nodes and collect results

Current approaches


                    Distribute from a central server
         Server                                             Client

                    Network A                Network B

         Client                                             Client
                         rsync, unison, scp

    – Server’s network uplink saturates
    – Wastes bandwidth along bottleneck links


Current approaches

                      File distribution mechanisms




                         BitTorrent, Bullet

    + Scales by offloading burden from the server
    – Client downloads from half-way across the world
Inherent problems with copying files


   Users have to decide a priori what to ship
       Ship too much – Waste bandwidth, takes long time
       Ship too little – Hassle to work in a poor environment
   Idle environments consume disk space
        Users are loath to clean up ⇒ Redistribution
       Need low cost solution for refetching files
   Manual management of software versions
       Point /usr/local to a central server

              Illusion of having development environment
              Programs fetched transparently on demand



Networked file systems




    + Know how to deploy these systems
    + Know how to administer such systems
    + Simple accountability mechanisms
    E.g., NFS, AFS, SFS, etc.




Problem: Scalability

    [Graph: time to finish a 40 MB read (sec) vs. number of unique nodes, for SFS and Shark]

        SFS – Very slow ≈ 775s
        Shark – Much better !!! ≈ 88s (9x better)
Let’s design this new filesystem




                 Client                                     Server

   Vanilla SFS
       Central server model




Scalability – Client-side caching




                 Client
                                                            Server
                 Cache

   Large client cache à la AFS
       Whole file caching
       Scales by avoiding redundant data transfers
       Leases to ensure NFS-style cache consistency

Scalability – Cooperative caching



                          Client

                                                  Client
           Client                                              Server


                                         Client
                    Client




       With multiple clients, must address bandwidth concerns




Scalability – Cooperative caching

    Clients fetch data from each other and offload burden from server

                           Client

                                                   Client
            Client                                              Server


                                          Client
                     Client




       Shark clients maintain distributed index
       Fetch a file from multiple other clients in parallel
       Read-heavy workloads – Writes still go to server


P2P systems – Themes




      Usability – Exploiting P2P for a range of systems
      Cross file(system) sharing – Use other groups
      Security issues in such wide-area P2P systems
      Topology management – Arrange nodes efficiently
      Scheduling dissemination of chunks efficiently




Cross file(system) sharing – Application

                    Linux distribution with LiveCDs

        [Diagram: machines booting Slax and Knoppix LiveCDs via Sharkcd, plus a Knoppix mirror]



       LiveCD – Run an entire OS without using hard disk
       But all your programs must fit on a CD-ROM
       Download dynamically from server but scalability problems
       Knoppix and Slax users form global cache – Relieve servers
Cross file(system) sharing

              Global cooperative cache regardless of origin servers

                       Slax                                                  Knoppix




                   Client                                                        Client

                                           Client          Client
     Client                                                                                     Client


                                  Client                            Client
              Client                                                                   Client




         Two groups of clients accessing Slax/Knoppix servers
         Client groups share a large amount of software
         Such clients automatically form a global cache
         BitTorrent supports only per-file groups
Content-based chunking



                          Client

                                                  Client
           Client                                                Server


                                         Client
                     Client




      Single overlay for all filesystems and content-based chunks
      Chunks – Variable-sized blocks based on content
            Chunks are better – Exploit commonalities across files
            LBFS-style chunks preserved across versions, concatenations
      [Muthitacharoen et al. A Low-Bandwidth Network File System. SOSP ’01]
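As a rough sketch of what LBFS-style content-based chunking looks like (LBFS uses Rabin fingerprints over a sliding window; an ordinary hash stands in for the fingerprint here, and all constants are illustrative):

    import hashlib

    WINDOW = 48                                  # bytes of context examined at each position
    MASK = 0x1FFF                                # boundary when the low 13 bits are zero (~8 KB average)
    MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024

    def chunk_boundaries(data):
        boundaries, start = [], 0
        for i in range(WINDOW, len(data)):
            if i - start < MIN_CHUNK:
                continue
            fp = int.from_bytes(hashlib.sha1(data[i - WINDOW:i]).digest()[:4], 'big')
            if (fp & MASK) == 0 or i - start >= MAX_CHUNK:
                boundaries.append(i)             # cut here: boundary depends only on content
                start = i
        boundaries.append(len(data))
        return boundaries

Because boundaries depend only on the bytes near them, an insertion early in a file shifts at most the neighbouring chunks, so identical regions across files or versions still map to identical chunks.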
P2P systems – Themes




      Usability – Exploiting P2P for a range of systems
      Cross file(system) sharing – Use other groups
      Security issues in such wide-area P2P systems
      Topology management – Arrange nodes efficiently
      Scheduling dissemination of chunks efficiently




Security issues – Client authentication

                  Client


               Authenticator Derived                            Traditional
     Client        From Token
                                             Client            Authentication
                                                                                Server


              Client
                                 Client




       Traditionally, server authenticated read requests using uids
       Challenge – Interaction between distrustful clients



Security issues – Client communication




      Client should be able to check integrity of downloaded chunk
      Client should not send chunks to other unauthorized clients
      An eavesdropper shouldn’t be able to obtain chunk contents

      Accomplish this without a complex PKI
      Keep the number of rounds to a minimum




File metadata



                                   GET_TOKEN
                                   (fh,offset,count)
                  Client            Metadata:
                                                             Server
                                   Chunk tokens


        Possession of token implies server permissions to read
        Tokens are a shared secret between authorized clients
        Tokens can be used to check integrity of fetched data
   Given a chunk B...
       Chunk token TB = H(B)
       H is a collision resistant hash function

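A minimal sketch of the token computation described above, with SHA-256 standing in for the collision-resistant hash H:

    import hashlib

    def chunk_token(chunk: bytes) -> bytes:
        # T_B = H(B): doubles as a read capability and an integrity check
        return hashlib.sha256(chunk).digest()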
Security protocol

                                           RC
                                           RP
                Receiver                                     Source
                 Client                IB, AuthC             Client
                                           EK(B)


       RC , RP – Random nonces to ensure freshness

       AuthC – Authenticator to prove receiver has token
           AuthC = MAC (TB , “Auth C”, C, P, RC , RP )

       K – Key to encrypt chunk contents
           K = MAC (TB , “Encryption”, C, P, RC , RP )


Security properties

                                           RC
                                           RP
                Receiver                                     Source
                 Client                IB, AuthC             Client
                                           EK(B)

   Client can check integrity of downloaded chunk
        Client checks H(Downloaded chunk) =? TB
   Source should not send chunks to unauthorized clients
       Malicious clients cannot send correct AuthC
   Eavesdropper shouldn’t get chunk contents
       All communication encrypted with K


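A sketch of how the handshake values above can be derived, with MAC instantiated as HMAC-SHA256 (the identifiers C, P, RC, RP follow the slides; this illustrates the construction, not Shark's exact wire format):

    import hmac, hashlib, os

    def mac(token, *fields):
        h = hmac.new(token, digestmod=hashlib.sha256)
        for f in fields:
            h.update(f)
        return h.digest()

    T_B = hashlib.sha256(b"chunk contents B").digest()      # chunk token T_B = H(B)
    C, P = b"receiver-client-id", b"source-client-id"
    R_C, R_P = os.urandom(16), os.urandom(16)                # fresh nonces

    Auth_C = mac(T_B, b"Auth C", C, P, R_C, R_P)             # proves the receiver holds T_B
    K      = mac(T_B, b"Encryption", C, P, R_C, R_P)         # symmetric key for E_K(B)

After decrypting, the receiver verifies the download by checking that H(chunk) equals T_B.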
P2P systems – Themes




      Usability – Exploiting P2P for a range of systems
      Cross file(system) sharing – Use other groups
      Security issues in such wide-area P2P systems
      Topology management – Arrange nodes efficiently
      Scheduling dissemination of chunks efficiently




Topology management



                 Client


                                                             Fetch Metadata
    Client                                 Client                             Server


             Client
                               Client



     Fetch metadata from the server
     Look up clients caching needed chunks in overlay
     Overlay is common to all file systems
     Overlay organized like a DHT – Get and Put
Overlay – Get operation



                    Client



     Client        Discover Clients            Client            Server


               Client
                                   Client



      For every chunk B, there’s indexing key IB
      IB used to index clients caching B
      Cannot set IB = TB , as TB is a secret
              IB = MAC (TB , “Indexing Key”)
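A small sketch of the indexing-key derivation, again with HMAC-SHA256 standing in for the MAC; publishing IB in the overlay reveals nothing about the secret TB:

    import hmac, hashlib

    def indexing_key(chunk_token: bytes) -> bytes:
        # I_B = MAC(T_B, "Indexing Key")
        return hmac.new(chunk_token, b"Indexing Key", hashlib.sha256).digest()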
Overlay – Put operation




   Register as a source
       Client now becomes a source for the downloaded chunk
       Client registers in distributed index – PUT(IB , Addr)
   Chunk Reconciliation
       Reuse TCP connection to download more chunks
       Exchange mutually needed chunks w/o indexing overhead




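How the distributed index is used can be sketched as below; the overlay is abstracted as a simple in-memory table here, whereas the real system uses a Coral-like DHT:

    class DistributedIndex:
        def __init__(self):
            self.table = {}                          # I_B -> list of client addresses

        def put(self, key, addr):
            self.table.setdefault(key, []).append(addr)

        def get(self, key):
            return self.table.get(key, [])

    index = DistributedIndex()
    index.put(b"I_B", "10.0.0.5:4000")               # register as a source after downloading B
    sources = index.get(b"I_B")                      # later readers discover that source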
Overlay – Locality awareness



                 Client                                             Client

                             Network A              Network B

                 Client                                             Client




        Overlay organized as clusters based on latency
        Preferentially returns sources in same cluster as the client
        Hence, chunks usually transferred from nearby clients
   [Freedman et al. Democratizing content publication with Coral. NSDI ’04]
Topology management – Shark vs. BitTorrent



 BitTorrent
     Neighbours “fixed”
     Makes cross file sharing difficult

 Shark
     Chunk-based overlay
     Neighbours based on data
     Enables cross file sharing

     [Diagram: a central server with clients connected in a chunk-based overlay]




Evaluation




      How does Shark compare with SFS? With NFS?
      How scalable is the server?
      How fair is Shark across clients?
      Which fetch order is better – Random or Sequential?




Emulab – 100 Nodes on LAN

    [Graph: percentage of clients completing a 40 MB read over time, for Shark (rand, negotiation), NFS, and SFS]

        Shark – 88s
        SFS – 775s (≈ 9x better), NFS – 350s (≈ 4x better)
        SFS less fair because of TCP backoffs
PlanetLab – 185 Nodes

    [Graph: percentage of clients completing a 40 MB read over time, for Shark and SFS]

        Shark ≈ 7 min – 95th percentile
        SFS ≈ 39 min – 95th percentile (5x better)
        NFS – Triggered kernel panics in server
Data pushed by server

    [Bar chart: normalized bandwidth pushed by the server – SFS (7400) vs. Shark (954)]

        Shark vs SFS – 23 copies vs 185 copies (8x better)
            Overlay not responsive enough
            Torn connections, timeouts force going to server
Data served by Clients

    [Scatter plot: bandwidth transmitted (MB) by each unique Shark proxy on PlanetLab, 40 MB read]

        Maximum contribution ≈ 3.5 copies
        Median contribution ≈ 0.75 copies
P2P systems – Themes




      Usability – Exploiting P2P for a range of systems
      Cross file(system) sharing – Use other groups
      Security issues in such wide-area P2P systems
      Topology management – Arrange nodes efficiently
      Scheduling dissemination of chunks efficiently




Fetching Chunks – Order Matters



      Scheduling problem – What order to fetch chunks of file?
      Natural choices – Random or Sequential
    Intuitively, when many clients start simultaneously
       Random
           All clients fetch independent chunks
           More chunks become available in the cooperative cache
      Sequential
           Better disk I/O scheduling on the server
            Only the client that has downloaded the most chunks adds new chunks to the cache




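The two policies can be sketched as follows (a simplified model: `have` is the set of blocks a node already holds, `available` the blocks its neighbours currently offer):

    import random

    def next_block_random(have, available):
        choices = [b for b in available if b not in have]
        return random.choice(choices) if choices else None

    def next_block_sequential(have, available, n_blocks):
        # Fetch strictly in playback/file order.
        for b in range(n_blocks):
            if b not in have and b in available:
                return b
        return None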
Emulab – 100 Nodes on LAN

    [Graph: percentage of clients completing a 40 MB read over time, Shark with random vs. sequential fetch order]

        Random – 133s
        Sequential – 203s
        Random wins !!! – 35% better
Performance characteristics of workloads


        For Shark, what matters is throughput
            Operation should finish as soon as possible
            [Plot: utility as a function of throughput]

        Another interesting class of applications
            Video streaming – RedCarpet
            [Plot: utility as a function of throughput]

        Study the scheduling problem for Video-on-Demand
            Press a button, wait a little while, and watch playback
Challenges for VoD


      Near VoD
           Small setup time to play videos
      High sustainable goodput
           Largest slope of a line osculating the block arrival curve
           Highest video encoding rate the system can support


     Goal: Do block scheduling to achieve low setup time and high
                              goodput



System design – Outline


      Components based on BitTorrent
          Central server (tracker + seed)
          Users interested in the data
      Central server
          Maintains a list of nodes

      Each node finds neighbours
          Contacts the server for this
          Another option – Use gossip
          protocols
      Exchange data with neighbours
          Connections are bi-directional




Simulator – Outline


      All nodes have unit bandwidth capacity
          500 nodes in most of our experiments
      Each node has 4-6 neighbours
      Simulator is round-based
          Block exchanges done at each round
          System moves as a whole to the next round
      File divided into 250 blocks
          Segment is a set of consecutive blocks
          Segment size = 10

      Maximum possible goodput – 1 block/round




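A skeleton of such a round-based simulator is sketched below (parameters as on the slide; connections are modelled as directed and upload limits are ignored, so this is only a rough approximation of the simulator used in the thesis):

    import random

    N_NODES, N_BLOCKS, SEG_SIZE = 500, 250, 10

    def simulate(pick_block, rounds=500, degree=5):
        have = [set() for _ in range(N_NODES)]
        neigh = [random.sample([j for j in range(N_NODES) if j != i], degree)
                 for i in range(N_NODES)]
        have[0] = set(range(N_BLOCKS))               # node 0 plays the server/seed
        for _ in range(rounds):
            for i in range(1, N_NODES):              # unit capacity: one block per round
                avail = set().union(*(have[j] for j in neigh[i])) - have[i]
                b = pick_block(have[i], avail)
                if b is not None:
                    have[i].add(b)
        return have

A policy such as `next_block_random` from the earlier sketch can be passed in as `pick_block`.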
Why 95th percentile?

    [Graph: for each goodput x, the number of nodes y whose goodput is at least x; a line marks the 95th percentile]

        95th percentile is a good indicator of goodput
            Most nodes have at least that goodput
Feasibility of VoD – What does block/round mean?


      0.6 blocks/round with 35 rounds of setup time

      Two independent parameters
          1 round ≈ 10 seconds
          1 block/round ≈ 512 kbps

      Consider 35 rounds a good setup time
          35 rounds ≈ 6 min

      Goodput = 0.6 blocks/round
          Max encoding rate = 0.6 ∗ 512 > 300 kbps

      Size of video = 250/0.6 ≈ 70 min



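The arithmetic behind these numbers, spelled out (values as on the slide):

    round_secs   = 10            # 1 round ~ 10 seconds
    block_kbps   = 512           # 1 block/round ~ 512 kbps
    goodput      = 0.6           # blocks/round
    setup_rounds = 35
    n_blocks     = 250

    max_rate_kbps = goodput * block_kbps                     # 307.2 kbps > 300 kbps
    setup_minutes = setup_rounds * round_secs / 60           # ~ 6 minutes
    video_minutes = (n_blocks / goodput) * round_secs / 60   # ~ 69 minutes of video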
Naïve scheduling policies




      Segment-Random policy
          Divide file into segments
          Fetch blocks in random order inside same segment
          Download segments in order




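A sketch of the Segment-Random choice, following the description above (locate the earliest segment that still has a missing block, then take a random fetchable block from it):

    import random

    SEG_SIZE = 10

    def next_block_segment_random(have, available, n_blocks):
        for seg_start in range(0, n_blocks, SEG_SIZE):
            segment = range(seg_start, min(seg_start + SEG_SIZE, n_blocks))
            missing = [b for b in segment if b not in have]
            if not missing:
                continue                              # segment done, move to the next one
            fetchable = [b for b in missing if b in available]
            return random.choice(fetchable) if fetchable else None
        return None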
Naïve approaches – Random

    [Graph: goodput (blocks/round) vs. setup time (in rounds) for the Random policy]

        Each node fetches a block at random
        Throughput – High, as nodes fetch disjoint blocks
            More opportunities for block exchanges
        Goodput – Low, as nodes do not get blocks in order
            0 blocks/round
Naïve approaches – Sequential

    [Graph: goodput vs. setup time for Sequential and Random – about 0.2 blocks/round]

        Each node fetches blocks in order of playback
        Throughput – Low, as there are fewer opportunities for exchange
            Need to increase this
Naïve approaches – Segment-Random policy

    [Graph: goodput vs. setup time for Segment-Random, Sequential, and Random – about 0.4 blocks/round]

        Segment-Random – Fetch a random block from the segment the earliest block falls in
        Increases the number of blocks propagating in the system
NetCoding – How does it help?


                                                    A has blocks 1 and 2

                                                    B gets 1 or 2 with
                                                    equal prob. from A
                                                    C gets 1 in parallel

                                                    If B downloaded 1,
                                                    link B-C becomes
                                                    useless


      Network coding routinely sends 1 ⊕ 2




Network coding – Mechanics

                                                      Coding over blocks
                                                      B1 , B2 , . . . , Bn

                                                      Choose coefficients
                                                      c1 , c2 , . . . , cn from finite
                                                      field

                                                      Generate encoded block:
                                                      E1 = Σ(i=1..n) ci × Bi

         Without n coded blocks, decoding cannot be done
               Setup time at least the time to fetch a segment


   [Gkantsidis et al. Network Coding for Large Scale Content Distribution. InfoCom ’05]



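A toy sketch of the encoding step within one segment. Real systems draw the coefficients from a larger finite field such as GF(2^8); GF(2) is used here so the arithmetic reduces to XOR and the example stays short:

    import random

    def encode(blocks):
        # blocks: equal-length byte strings B_1..B_n from one segment
        coeffs = [random.randint(0, 1) for _ in blocks]           # c_1..c_n in GF(2)
        out = bytearray(len(blocks[0]))
        for c, b in zip(coeffs, blocks):
            if c:
                out = bytearray(x ^ y for x, y in zip(out, b))    # add c_i * B_i
        return coeffs, bytes(out)

    # A receiver needs n coded blocks with linearly independent coefficient
    # vectors, then solves the linear system to recover B_1..B_n.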
Benefits of network coding

    [Graph: goodput vs. setup time for Segment-Netcoding, Segment-Random, Sequential, and Random]

        Goodput
            ≈ 0.6 blocks/round with netcoding
            0.4 blocks/round with seg-rand
Segment scheduling



      Netcoding good for fetching blocks in a segment, but cannot
      be used across segments
          Because decoding requires all encoded blocks

      Problem: How to schedule fetching of segments?
          Until now, sequential segment scheduling
          Results in rare segments when nodes arrive at different times

      Segment scheduling algorithm to avoid rare segments
          Evaluation with C# prototype




Segment scheduling

                                             A has 75% of the file
                                             Flash crowd of 20 Bs join
                                             with empty caches

                                             Server shared between A
                                             and Bs
                                                     Throughput to A
                                                     decreases, as server is
                                                     busy serving Bs

                                             Initial segments served
                                             repeatedly
                                                     Throughput of Bs low
                                                     because the same
                                                     segments are fetched
                                                     from server

Segment scheduling – Problem




      8 segments in the video; A has 75% = 6 segments, Bs have
      none
      Green is popularity 2, Red is popularity 1
          Popularity = # full copies in the system
Segment scheduling – Problem




      Instead of serving A...




Segment scheduling – Problem




      The server gives segments to Bs




Segment scheduling – Problem

    [Graph: # data blocks downloaded over time with the naive policy, for A and for B1..n]

        A’s goodput plummets
        Get the best of both worlds – Improve A and Bs
            Idea: Each node serves only the rarest segments in the system
Segment scheduling – Algorithm




      When dst node connects to the src node...
          Here: dst is B, src is server



Segment scheduling – Algorithm




      src node sorts segments in order of popularity
          Segments 7, 8 least popular at 1
          Segments 1-6 equally popular at 2


Segment scheduling – Algorithm




      src considers segments in sorted order one-by-one and serves dst a segment that is
          Either completely available at src, or
          The first segment required by dst

Segment scheduling – Algorithm




      Server injects blocks from segment needed by A into system
          Avoids wasting bandwidth in serving initial segments multiple
          times


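The selection rule described over the last few slides can be sketched as a small function (names are illustrative):

    def choose_segment(src_complete, dst_first_needed, popularity):
        # src_complete:     set of segment ids fully available at the source
        # dst_first_needed: earliest segment the destination still needs
        # popularity:       segment id -> number of full copies in the system
        for seg in sorted(popularity, key=popularity.get):        # rarest first
            if seg in src_complete or seg == dst_first_needed:
                return seg
        return None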
Segment scheduling – Popularities




      How does src figure out popularities?
          Centrally available at the server
               Our implementation uses this technique
          Each node maintains popularities for segments
               Could use a gossiping protocol for aggregation




Segment scheduling

    [Graph: # data blocks downloaded over time for A and B1..n, naive policy vs. the new segment scheduling]

        Note that goodput of both A and Bs improves
Conclusions



      Problem: Designing scalable data-intensive applications
      Enabled low-end servers to scale to large numbers of users
          Shark: Scales file server for read-heavy workloads
          RedCarpet: Scales video server for near-Video-on-Demand
      Exploited P2P for file systems, solved security issues
      Techniques to improve scalability
          Cross file(system) sharing
          Topology management – Data-based, locality awareness
          Scheduling of chunks – Network coding





Weitere ähnliche Inhalte

Was ist angesagt?

9/28/11 Slides - Introduction to DuraCloud, Slides
9/28/11 Slides - Introduction to DuraCloud, Slides9/28/11 Slides - Introduction to DuraCloud, Slides
9/28/11 Slides - Introduction to DuraCloud, SlidesDuraSpace
 
Life without the Novell Client
Life without the Novell ClientLife without the Novell Client
Life without the Novell ClientNovell
 
Sh pro40 datasheet_se-sbs_2010_05
Sh pro40 datasheet_se-sbs_2010_05Sh pro40 datasheet_se-sbs_2010_05
Sh pro40 datasheet_se-sbs_2010_05hyperlyte3
 

Was ist angesagt? (6)

9/28/11 Slides - Introduction to DuraCloud, Slides
9/28/11 Slides - Introduction to DuraCloud, Slides9/28/11 Slides - Introduction to DuraCloud, Slides
9/28/11 Slides - Introduction to DuraCloud, Slides
 
1556 a 04
1556 a 041556 a 04
1556 a 04
 
Life without the Novell Client
Life without the Novell ClientLife without the Novell Client
Life without the Novell Client
 
1556 a 03
1556 a 031556 a 03
1556 a 03
 
Sh pro40 datasheet_se-sbs_2010_05
Sh pro40 datasheet_se-sbs_2010_05Sh pro40 datasheet_se-sbs_2010_05
Sh pro40 datasheet_se-sbs_2010_05
 
P5 cloud economics_v1
P5 cloud economics_v1P5 cloud economics_v1
P5 cloud economics_v1
 


Scaling Data Servers via Cooperative Caching

  • 1. Scaling Data Servers via Cooperative Caching Siddhartha Annapureddy New York University Committee: David Mazières, Lakshminarayanan Subramanian, Jinyang Li, Dennis Shasha, Zvi Kedem Siddhartha Annapureddy Thesis Defense
  • 2. Motivation New functionality over the Internet Large number of users consume many (large) files YouTube, Azureus etc. for movies – DVDs no more Data dissemination to large distributed farms Siddhartha Annapureddy Thesis Defense
  • 3. Current approaches Server Client Client Client Siddhartha Annapureddy Thesis Defense
  • 4. Current approaches Server Client Client Client Scalability ↓ – Server’s uplink saturates Operation takes long time to finish Siddhartha Annapureddy Thesis Defense
  • 5. Current approaches (contd.) Server farms replace central server Scalability problem persists Notoriously expensive to maintain Content Distribution Networks (CDNs) Still quite expensive Peer-to-peer (P2P) systems A new development Siddhartha Annapureddy Thesis Defense
  • 6. Peer-to-peer (P2P) systems Low-end central server Large client population File usually divided into chunks/blocks Use client resources to disseminate file Bandwidth, disk space etc. Clients organized into a connected overlay Each client knows only a few other clients How to design P2P systems to scale data servers? Siddhartha Annapureddy Thesis Defense
  • 7. P2P scalability – Design principles In order to achieve scalability, need to avoid hotspots Keep bw load on server to a minimum Server gives data only to few users Leverage participants to distribute to other users Maintenance overhead with users to be kept low Keep bw load on users to a minimum Each node interacts with only a small subset of users Siddhartha Annapureddy Thesis Defense
  • 8. P2P systems – Themes Usability – Exploiting P2P for a range of systems File systems, user-level apps etc. Security issues in such wide-area P2P systems Providing data integrity, privacy, authentication Cross file(system) sharing – Use other groups Make use of all available sources Topology management – Arrange nodes efficiently Underlying network (locality properties) Node capacities (bandwidth, disk space etc.) Application-specific (certain nodes together work well) Scheduling dissemination of chunks efficiently Nodes only have partial information Siddhartha Annapureddy Thesis Defense
  • 9. P2P systems – Themes Usability – Exploiting P2P for a range of systems File systems, user-level apps etc. Security issues in such wide-area P2P systems Providing data integrity, privacy, authentication Cross file(system) sharing – Use other groups Make use of all available sources Topology management – Arrange nodes efficiently Underlying network (locality properties) Node capacities (bandwidth, disk space etc.) Application-specific (certain nodes together work well) Scheduling dissemination of chunks efficiently Nodes only have partial information Siddhartha Annapureddy Thesis Defense
  • 10. P2P systems – Themes Usability – Exploiting P2P for a range of systems File systems, user-level apps etc. Security issues in such wide-area P2P systems Providing data integrity, privacy, authentication Cross file(system) sharing – Use other groups Make use of all available sources Topology management – Arrange nodes efficiently Underlying network (locality properties) Node capacities (bandwidth, disk space etc.) Application-specific (certain nodes together work well) Scheduling dissemination of chunks efficiently Nodes only have partial information Siddhartha Annapureddy Thesis Defense
  • 11. P2P systems – Themes Usability – Exploiting P2P for a range of systems File systems, user-level apps etc. Security issues in such wide-area P2P systems Providing data integrity, privacy, authentication Cross file(system) sharing – Use other groups Make use of all available sources Topology management – Arrange nodes efficiently Underlying network (locality properties) Node capacities (bandwidth, disk space etc.) Application-specific (certain nodes together work well) Scheduling dissemination of chunks efficiently Nodes only have partial information Siddhartha Annapureddy Thesis Defense
  • 12. Shark – Motivation Scenario – Want to test a new application on a hundred nodes [Diagram: many workstations] Problem – Need to push software to all nodes and collect results Siddhartha Annapureddy Thesis Defense
  • 13. Current approaches Distribute from a central server Server Client Network A Network B Client Client rsync, unison, scp – Server’s network uplink saturates – Wastes bandwidth along bottleneck links Siddhartha Annapureddy Thesis Defense
  • 14. Current approaches File distribution mechanisms BitTorrent, Bullet + Scales by offloading burden on server – Client downloads from half-way across the world Siddhartha Annapureddy Thesis Defense
  • 15. Inherent problems with copying files Users have to decide a priori what to ship Ship too much – Waste bandwidth, takes long time Ship too little – Hassle to work in a poor environment Idle environments consume disk space Users are loath to cleanup ⇒ Redistribution Need low cost solution for refetching files Manual management of software versions Point /usr/local to a central server Illusion of having development environment Programs fetched transparently on demand Siddhartha Annapureddy Thesis Defense
  • 16. Networked file systems + Know how to deploy these systems + Know how to administer such systems + Simple accountability mechanisms Eg: NFS, AFS, SFS etc. Siddhartha Annapureddy Thesis Defense
  • 17. Problem: Scalability [Plot: time to finish a 40 MB read (sec) vs. number of unique nodes; SFS] Very slow ≈ 775s Much better !!! ≈ 88s (9x better) Siddhartha Annapureddy Thesis Defense
  • 18. Problem: Scalability [Plot: time to finish a 40 MB read (sec) vs. number of unique nodes; SFS and Shark] Very slow ≈ 775s Much better !!! ≈ 88s (9x better) Siddhartha Annapureddy Thesis Defense
  • 19. Let’s design this new filesystem Client Server Vanilla SFS Central server model Siddhartha Annapureddy Thesis Defense
  • 20. Scalability – Client-side caching Client Server Cache Large client cache à la AFS Whole file caching Scales by avoiding redundant data transfers Leases to ensure NFS-style cache consistency Siddhartha Annapureddy Thesis Defense
  • 21. Scalability – Cooperative caching Client Client Client Server Client Client With multiple clients, must address bandwidth concerns Siddhartha Annapureddy Thesis Defense
  • 22. Scalability – Cooperative caching Clients fetch data from each other and offload burden from server Client Client Client Server Client Client Shark clients maintain distributed index Siddhartha Annapureddy Thesis Defense
  • 23. Scalability – Cooperative caching Clients fetch data from each other and offload burden from server Client Client Client Server Client Client Shark clients maintain distributed index Fetch a file from multiple other clients in parallel Siddhartha Annapureddy Thesis Defense
  • 24. Scalability – Cooperative caching Clients fetch data from each other and offload burden from server Client Client Client Server Client Client Shark clients maintain distributed index Fetch a file from multiple other clients in parallel Read-heavy workloads – Writes still go to server Siddhartha Annapureddy Thesis Defense
  • 25. P2P systems – Themes Usability – Exploiting P2P for a range of systems Cross file(system) sharing – Use other groups Security issues in such wide-area P2P systems Topology management – Arrange nodes efficiently Scheduling dissemination of chunks efficiently Siddhartha Annapureddy Thesis Defense
  • 26. Cross file(system) sharing – Application Linux distribution with LiveCDs Knoppix Slax Sharkcd Knoppix Slax Sharkcd Slax Sharkcd Knoppix Knoppix Mirror Sharkcd LiveCD – Run an entire OS without using hard disk But all your programs must fit on a CD-ROM Download dynamically from server but scalability problems Knoppix and Slax users form global cache – Relieve servers Siddhartha Annapureddy Thesis Defense
  • 27. Cross file(system) sharing Global cooperative cache regardless of origin servers Slax Knoppix Client Client Client Client Client Client Client Client Client Client Two groups of clients accessing Slax/Knoppix servers Client groups share a large amount of software Such clients automatically form a global cache Siddhartha Annapureddy Thesis Defense
  • 28. Cross file(system) sharing Global cooperative cache regardless of origin servers Slax Knoppix Client Client Client Client Client Client Client Client Client Client Two groups of clients accessing Slax/Knoppix servers Client groups share a large amount of software Such clients automatically form a global cache Siddhartha Annapureddy Thesis Defense
  • 29. Cross file(system) sharing Global cooperative cache regardless of origin servers Slax Knoppix Client Client Client Client Client Client Client Client Client Client Two groups of clients accessing Slax/Knoppix servers Client groups share a large amount of software BitTorrent supports only per-file groups Siddhartha Annapureddy Thesis Defense
  • 30. Content-based chunking Client Client Client Server Client Client Single overlay for all filesystems and content-based chunks Chunks – Variable-sized blocks based on content Chunks are better – Exploit commonalities across files LBFS-style chunks preserved across versions, concatenations [Muthitacharoen et al. A Low-Bandwidth Network File System. SOSP ’01] Siddhartha Annapureddy Thesis Defense
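To make the chunking idea above concrete, here is a minimal content-defined chunker in Python. It is only a sketch: it uses a simple polynomial rolling hash rather than the Rabin fingerprints of LBFS, and the window size, boundary mask, and chunk-size limits below are illustrative assumptions, not Shark's parameters.

    WINDOW = 48                      # bytes in the rolling window (assumed)
    AVG_MASK = 0x1FFF                # boundary when low 13 bits are zero: ~8 KB average chunks (assumed)
    MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024
    BASE, MOD = 257, (1 << 31) - 1

    def chunk_boundaries(data: bytes):
        """Yield (start, end) offsets of content-defined chunks."""
        start, h, power = 0, 0, pow(BASE, WINDOW - 1, MOD)
        for i, b in enumerate(data):
            if i - start >= WINDOW:
                # slide the window: drop the contribution of the byte falling out
                h = (h - data[i - WINDOW] * power) % MOD
            h = (h * BASE + b) % MOD
            size = i - start + 1
            at_boundary = (h & AVG_MASK) == 0 and size >= MIN_CHUNK
            if at_boundary or size >= MAX_CHUNK:
                yield (start, i + 1)
                start, h = i + 1, 0
        if start < len(data):
            yield (start, len(data))

    def chunks(data: bytes):
        return [data[s:e] for s, e in chunk_boundaries(data)]

Because boundaries depend only on local content, an insertion near the start of a file shifts only the chunks around the edit, so different versions and concatenations still share most chunks, as the slide claims.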
  • 31. P2P systems – Themes Usability – Exploiting P2P for a range of systems Cross file(system) sharing – Use other groups Security issues in such wide-area P2P systems Topology management – Arrange nodes efficiently Scheduling dissemination of chunks efficiently Siddhartha Annapureddy Thesis Defense
  • 32. Security issues – Client authentication Client Traditional Client Client Authentication Server Client Client Traditionally, server authenticated read requests using uids Siddhartha Annapureddy Thesis Defense
  • 33. Security issues – Client authentication Client Authenticator Derived Traditional Client From Token Client Authentication Server Client Client Traditionally, server authenticated read requests using uids Challenge – Interaction between distrustful clients Siddhartha Annapureddy Thesis Defense
  • 34. Security issues – Client communication Client should be able to check integrity of downloaded chunk Client should not send chunks to other unauthorized clients An eavesdropper shouldn’t be able to obtain chunk contents Accomplish this without a complex PKI Keep the number of rounds to a minimum Siddhartha Annapureddy Thesis Defense
  • 35. File metadata GET_TOKEN (fh,offset,count) Client Metadata: Server Chunk tokens Possession of token implies server permissions to read Tokens are a shared secret between authorized clients Tokens can be used to check integrity of fetched data Given a chunk B... Chunk token TB = H(B) H is a collision resistant hash function Siddhartha Annapureddy Thesis Defense
  • 36. File metadata GET_TOKEN (fh,offset,count) Client Metadata: Server Chunk tokens Possession of token implies server permissions to read Tokens are a shared secret between authorized clients Tokens can be used to check integrity of fetched data Given a chunk B... Chunk token TB = H(B) H is a collision resistant hash function Siddhartha Annapureddy Thesis Defense
  • 37. File metadata GET_TOKEN (fh,offset,count) Client Metadata: Server Chunk tokens Possession of token implies server permissions to read Tokens are a shared secret between authorized clients Tokens can be used to check integrity of fetched data Given a chunk B... Chunk token TB = H(B) H is a collision resistant hash function Siddhartha Annapureddy Thesis Defense
  • 38. File metadata GET_TOKEN (fh,offset,count) Client Metadata: Server Chunk tokens Possession of token implies server permissions to read Tokens are a shared secret between authorized clients Tokens can be used to check integrity of fetched data Given a chunk B... Chunk token TB = H(B) H is a collision resistant hash function Siddhartha Annapureddy Thesis Defense
  • 39. File metadata GET_TOKEN (fh,offset,count) Client Metadata: Server Chunk tokens Possession of token implies server permissions to read Tokens are a shared secret between authorized clients Tokens can be used to check integrity of fetched data Given a chunk B... Chunk token TB = H(B) H is a collision resistant hash function Siddhartha Annapureddy Thesis Defense
  • 40. Security protocol Receiver Source Client Client RC , RP – Random nonces to ensure freshness AuthC – Authenticator to prove receiver has token AuthC = MAC (TB , “Auth C”, C, P, RC , RP ) K – Key to encrypt chunk contents K = MAC (TB , “Encryption”, C, P, RC , RP ) Siddhartha Annapureddy Thesis Defense
  • 41. Security protocol RC RP Receiver Source Client Client RC , RP – Random nonces to ensure freshness AuthC – Authenticator to prove receiver has token AuthC = MAC (TB , “Auth C”, C, P, RC , RP ) K – Key to encrypt chunk contents K = MAC (TB , “Encryption”, C, P, RC , RP ) Siddhartha Annapureddy Thesis Defense
  • 42. Security protocol RC RP Receiver Source Client IB, AuthC Client EK(B) RC , RP – Random nonces to ensure freshness AuthC – Authenticator to prove receiver has token AuthC = MAC (TB , “Auth C”, C, P, RC , RP ) K – Key to encrypt chunk contents K = MAC (TB , “Encryption”, C, P, RC , RP ) Siddhartha Annapureddy Thesis Defense
  • 43. Security protocol RC RP Receiver Source Client IB, AuthC Client EK(B) RC , RP – Random nonces to ensure freshness AuthC – Authenticator to prove receiver has token AuthC = MAC (TB , “Auth C”, C, P, RC , RP ) K – Key to encrypt chunk contents K = MAC (TB , “Encryption”, C, P, RC , RP ) Siddhartha Annapureddy Thesis Defense
  • 44. Security properties RC RP Receiver Source Client IB, AuthC Client EK(B) Client can check integrity of downloaded chunk ? Client checks H(Downloaded chunk) = TB Source should not send chunks to unauthorized clients Malicious clients cannot send correct AuthC Eavesdropper shouldn’t get chunk contents All communication encrypted with K Siddhartha Annapureddy Thesis Defense
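The token, authenticator, and key formulas of slides 35-44 can be sketched with standard-library primitives. Assumptions for this sketch: SHA-256 stands in for the collision-resistant hash H, HMAC-SHA-256 for the MAC, the length-prefixed field encoding is just one unambiguous choice, and the actual encryption of the chunk under K is elided.

    import hashlib, hmac, os

    def H(data: bytes) -> bytes:
        """Collision-resistant hash H (SHA-256 assumed for the sketch)."""
        return hashlib.sha256(data).digest()

    def MAC(key: bytes, *fields: bytes) -> bytes:
        """MAC keyed with the chunk token (HMAC-SHA-256 assumed)."""
        m = hmac.new(key, digestmod=hashlib.sha256)
        for f in fields:
            m.update(len(f).to_bytes(4, "big") + f)   # length-prefix each field to avoid ambiguity
        return m.digest()

    # Chunk token: possession of T_B implies server permission to read chunk B,
    # and lets the receiver verify integrity of whatever it downloads.
    chunk = b"...chunk contents B..."
    T_B = H(chunk)

    # Handshake between receiver C and source P (slides 40-44).
    C, P = b"client-C-id", b"client-P-id"
    R_C, R_P = os.urandom(16), os.urandom(16)          # fresh random nonces

    Auth_C = MAC(T_B, b"Auth C", C, P, R_C, R_P)       # proves the receiver holds T_B
    K      = MAC(T_B, b"Encryption", C, P, R_C, R_P)   # key used to encrypt the chunk

    # The source checks Auth_C before replying, then sends E_K(B);
    # after decrypting, the receiver verifies H(B) == T_B.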
  • 45. P2P systems – Themes Usability – Exploiting P2P for a range of systems Cross file(system) sharing – Use other groups Security issues in such wide-area P2P systems Topology management – Arrange nodes efficiently Scheduling dissemination of chunks efficiently Siddhartha Annapureddy Thesis Defense
  • 46. Topology management Client Fetch Metadata Client Client Server Client Client Fetch metadata from the server Look up clients caching needed chunks in overlay Overlay is common to all file systems Overlay organized like a DHT – Get and Put Siddhartha Annapureddy Thesis Defense
  • 47. Topology management Client Client Lookup Clients Client Server Client Client Fetch metadata from the server Look up clients caching needed chunks in overlay Overlay is common to all file systems Overlay organized like a DHT – Get and Put Siddhartha Annapureddy Thesis Defense
  • 48. Topology management Client Client Lookup Clients Client Server Client Client Fetch metadata from the server Look up clients caching needed chunks in overlay Overlay is common to all file systems Overlay organized like a DHT – Get and Put Siddhartha Annapureddy Thesis Defense
  • 49. Topology management Client Client Lookup Clients Client Server Client Client Fetch metadata from the server Look up clients caching needed chunks in overlay Overlay is common to all file systems Overlay organized like a DHT – Get and Put Siddhartha Annapureddy Thesis Defense
  • 50. Overlay – Get operation Client Client Discover Clients Client Server Client Client For every chunk B, there’s indexing key IB IB used to index clients caching B Cannot set IB = TB , as TB is a secret IB = MAC (TB , “Indexing Key”) Siddhartha Annapureddy Thesis Defense
  • 51. Overlay – Put operation Register as a source Client now becomes a source for the downloaded chunk Client registers in distributed index – PUT(IB , Addr) Chunk Reconciliation Reuse TCP connection to download more chunks Exchange mutually needed chunks w/o indexing overhead Siddhartha Annapureddy Thesis Defense
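The indexing key and the Get/Put flow of slides 50-51 might be wired together as below. This reuses H and MAC from the previous sketch; dht, server.get_token, server.read, and download_with_handshake are hypothetical stand-ins for the Coral-style distributed index, the GET_TOKEN RPC, the fallback read, and the two-round handshake above, not real Shark interfaces.

    def indexing_key(T_B: bytes) -> bytes:
        # I_B must not reveal the secret token T_B, so derive it with the MAC
        return MAC(T_B, b"Indexing Key")

    def fetch_chunk(dht, server, fh, offset, count, my_addr):
        """Hypothetical client-side flow: index lookup, download, then register."""
        T_B = server.get_token(fh, offset, count)      # GET_TOKEN RPC (assumed interface)
        I_B = indexing_key(T_B)

        for src in dht.get(I_B):                       # clients already caching this chunk
            chunk = download_with_handshake(src, T_B)  # protocol from slides 40-44
            if chunk is not None and H(chunk) == T_B:  # integrity check against the token
                break
        else:
            chunk = server.read(fh, offset, count)     # fall back to the origin server

        dht.put(I_B, my_addr)                          # become a source for this chunk
        return chunk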
  • 52. Overlay – Locality awareness Client Client Network A Network B Client Client Overlay organized as clusters based on latency Preferentially returns sources in same cluster as the client Hence, chunks usually transferred from nearby clients [Freedman et al. Democratizing content publication with Coral. NSDI ’04] Siddhartha Annapureddy Thesis Defense
  • 53. Topology management – Shark vs. BitTorrent BitTorrent Server Neighbours “fixed” Makes cross file sharing difficult Client Shark Client Chunk-based overlay Neighbours based on data Client Client Enables cross file sharing Siddhartha Annapureddy Thesis Defense
  • 54. Evaluation How does Shark compare with SFS? With NFS? How scalable is the server? How fair is Shark across clients? Which order is better? Random or Sequential Siddhartha Annapureddy Thesis Defense
  • 55. Emulab – 100 Nodes on LAN [Plot: percentage of nodes completed within time vs. time since initialization (sec), 40 MB read; curves for Shark (rand, negotiation), NFS, SFS] Shark – 88s; SFS – 775s (≈ 9x better), NFS – 350s (≈ 4x better) SFS less fair because of TCP backoffs Siddhartha Annapureddy Thesis Defense
  • 56. PlanetLab – 185 Nodes [Plot: percentage of nodes completed within time vs. time since initialization (sec), 40 MB read; curves for Shark and SFS] Shark ≈ 7 min – 95th percentile; SFS ≈ 39 min – 95th percentile (5x better) NFS – Triggered kernel panics in server Siddhartha Annapureddy Thesis Defense
  • 57. Data pushed by server [Bar chart: normalized bandwidth pushed by the server – SFS 7400 (1.0 normalized) vs. Shark 954] Shark vs SFS – 23 copies vs 185 copies (8x better) Overlay not responsive enough Torn connections, timeouts force going to server Siddhartha Annapureddy Thesis Defense
  • 58. Data served by Clients [Plot: bandwidth transmitted (MB) by each unique Shark proxy; PlanetLab hosts, Shark, 40 MB read] Maximum contribution ≈ 3.5 copies Median contribution ≈ 0.75 copies Siddhartha Annapureddy Thesis Defense
  • 59. P2P systems – Themes Usability – Exploiting P2P for a range of systems Cross file(system) sharing – Use other groups Security issues in such wide-area P2P systems Topology management – Arrange nodes efficiently Scheduling dissemination of chunks efficiently Siddhartha Annapureddy Thesis Defense
  • 60. Fetching Chunks – Order Matters Scheduling problem – What order to fetch chunks of file? Natural choices – Random or Sequential Intuitively, when many clients start simultaneously Random All clients fetch independent chunks More chunks become available in the cooperative cache Sequential Better disk I/O scheduling on the server Client that downloads most chunks alone adds to cache Siddhartha Annapureddy Thesis Defense
  • 61. Emulab – 100 Nodes on LAN [Plot: percentage of nodes completed within time vs. time since initialization (sec), 40 MB read; curves for Shark (rand) and Shark (seq)] Random – 133s Sequential – 203s Random Wins !!! – 35% better Siddhartha Annapureddy Thesis Defense
  • 62. Performance characteristics of workloads For Shark, what matters is throughput Utility → Operation should finish as soon as possible Throughput → Another interesting class of applications Utility → Video streaming – RedCarpet Throughput → Study the scheduling problem for Video-on-Demand Press a button, wait a little while, and watch playback Siddhartha Annapureddy Thesis Defense
  • 63. Challenges for VoD Near VoD Small setup time to play videos High sustainable goodput Largest slope osculating block arrival curve Highest video encoding rate system can support Goal: Do block scheduling to achieve low setup time and high goodput Siddhartha Annapureddy Thesis Defense
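Given a node's block-arrival curve, the "largest osculating slope" definition above translates into a small computation; the sketch below assumes discrete rounds and that arrival[t] counts blocks playable in order by round t.

    def sustainable_goodput(arrival, setup):
        """Largest playback rate r (blocks/round) such that the playback line
        r * (t - setup) never overtakes the arrival curve once playback starts."""
        rates = [arrival[t] / (t - setup) for t in range(len(arrival)) if t > setup]
        return min(rates) if rates else 0.0

    # Example: a node receiving 0.5 playable blocks per round on average
    arrival = [0.5 * t for t in range(100)]
    print(sustainable_goodput(arrival, setup=10))   # ≈ 0.56; tends to 0.5 for longer videos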
  • 64. System design – Outline Components based on BitTorrent Central server (tracker + seed) Users interested in the data Central server Maintains a list of nodes Each node finds neighbours Contacts the server for this Another option – Use gossip protocols Exchange data with neighbours Connections are bi-directional Siddhartha Annapureddy Thesis Defense
  • 65. Simulator – Outline All nodes have unit bandwidth capacity 500 nodes in most of our experiments Each node has 4-6 neighbours Simulator is round-based Block exchanges done at each round System moves as a whole to the next round File divided into 250 blocks Segment is a set of consecutive blocks Segment size = 10 Maximum possible goodput – 1 block/round Siddhartha Annapureddy Thesis Defense
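A rough Python skeleton of this round-based simulator follows; it is a simplified reconstruction (node 0 acts as the seed, each node pulls at most one block per round, and each source uploads at most one block per round), not the actual simulator code.

    import random

    N_NODES, N_BLOCKS, SEG_SIZE = 500, 250, 10

    class Node:
        def __init__(self, seed=False):
            self.blocks = set(range(N_BLOCKS)) if seed else set()
            self.neighbours = []

    def make_overlay(n_nodes=N_NODES):
        nodes = [Node(seed=(i == 0)) for i in range(n_nodes)]     # node 0 is the server/seed
        for node in nodes:
            for peer in random.sample(nodes, random.randint(4, 6)):
                if peer is not node and peer not in node.neighbours:
                    node.neighbours.append(peer)                  # connections are bi-directional
                    peer.neighbours.append(node)
        return nodes

    def run(nodes, rounds, pick_block):
        for _ in range(rounds):
            busy = set()                                          # unit upload capacity per round
            for node in random.sample(nodes, len(nodes)):         # randomize turn order
                if len(node.blocks) == N_BLOCKS:
                    continue
                for src in [p for p in node.neighbours if p not in busy]:
                    blk = pick_block(node, src)                   # scheduling policy picks the block
                    if blk is not None:
                        node.blocks.add(blk)                      # unit download capacity
                        busy.add(src)
                        break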
  • 66. Why 95th percentile? [Plot: number of nodes (y-axis) vs. goodput in blocks/round (x-axis); red line at top marks the 95th percentile] A point (x, y) means y nodes have a goodput of at least x The 95th percentile is a good indicator of goodput – most nodes have at least that goodput Siddhartha Annapureddy Thesis Defense
  • 67. Feasibility of VoD – What does block/round mean? 0.6 blocks/round with 35 rounds of setup time Two independent parameters 1 round ≈ 10 seconds 1 block/round ≈ 512 kbps Consider 35 rounds a good setup time 35 rounds ≈ 6 min Goodput = 0.6 blocks/round Max encoding rate = 0.6 ∗ 512 > 300 kbps Size of video = 250/0.6 ≈ 70 min Siddhartha Annapureddy Thesis Defense
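The arithmetic on this slide can be reproduced directly:

    ROUND_SEC, BLOCK_KBPS, N_BLOCKS = 10, 512, 250   # 1 round ≈ 10 s, 1 block/round ≈ 512 kbps
    goodput = 0.6                                     # blocks/round, from the experiments

    print(35 * ROUND_SEC / 60)                  # setup time: 35 rounds ≈ 5.8 min
    print(goodput * BLOCK_KBPS)                 # max encoding rate ≈ 307 kbps (> 300 kbps)
    print(N_BLOCKS / goodput * ROUND_SEC / 60)  # supported video length ≈ 69 min (≈ 70 min)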
  • 68. Naïve scheduling policies Segment-Random policy Divide file into segments Fetch blocks in random order inside same segment Download segments in order Siddhartha Annapureddy Thesis Defense
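A sketch of the Segment-Random picker, written to plug into the simulator skeleton above as its pick_block policy; it assumes the same block and segment numbering (250 blocks, segments of 10) as slide 65.

    def pick_segment_random(node, src, seg_size=10, n_blocks=250):
        """Fetch a random needed block from the segment containing the
        earliest block the node is still missing (Segment-Random policy)."""
        missing = [b for b in range(n_blocks) if b not in node.blocks]
        if not missing:
            return None
        seg = missing[0] // seg_size                       # active segment
        lo, hi = seg * seg_size, (seg + 1) * seg_size
        candidates = [b for b in range(lo, hi)
                      if b not in node.blocks and b in src.blocks]
        return random.choice(candidates) if candidates else None

    # Usage with the skeleton above: run(make_overlay(), rounds=500, pick_block=pick_segment_random)
    # Sequential would instead return the first candidate, and pure Random would
    # draw from all blocks the source has and the node lacks.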
  • 69. Naïve approaches – Random [Plot: goodput (blocks/round) vs. setup time (in rounds)] Each node fetches a block at random Throughput – High as nodes fetch disjoint blocks More opportunities for block exchanges Goodput – Low as nodes do not get blocks in order 0 blocks/round Siddhartha Annapureddy Thesis Defense
  • 70. Naïve approaches – Sequential [Plot: goodput (blocks/round) vs. setup time (in rounds); curves for Sequential and Random] 0.2 blocks/round Each node fetches blocks in order of playback Throughput – Low as fewer opportunities for exchange Need to increase this Siddhartha Annapureddy Thesis Defense
  • 71. Naïve approaches – Segment-Random policy [Plot: goodput (blocks/round) vs. setup time (in rounds); curves for Segment-Random, Sequential, Random] 0.4 blocks/round Segment-Random Fetch random block from the segment the earliest block falls in Increases the number of blocks propagating in the system Siddhartha Annapureddy Thesis Defense
  • 72. NetCoding – How it helps? A has blocks 1 and 2 B gets 1 or 2 with equal prob. from A C gets 1 in parallel If B downloaded 1, link B-C becomes useless Network coding routinely sends 1 ⊕ 2 Siddhartha Annapureddy Thesis Defense
  • 73. Network coding – Mechanics Coding over blocks B1, B2, . . . , Bn Choose coefficients c1, c2, . . . , cn from a finite field Generate encoded block: E1 = Σi=1..n ci × Bi Without n coded blocks, decoding cannot be done Setup time at least the time to fetch a segment [Gkantsidis et al. Network Coding for Large Scale Content Distribution. InfoCom ’05] Siddhartha Annapureddy Thesis Defense
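A minimal sketch of producing one encoded block E = Σ ci × Bi. The slide only requires a finite field; GF(2^8) with the AES reduction polynomial is chosen here for illustration, and decoding (Gaussian elimination over the coefficient vectors) is omitted.

    import os

    def gf_mul(a, b):
        """Multiply two bytes in GF(2^8) with reduction polynomial x^8 + x^4 + x^3 + x + 1."""
        p = 0
        for _ in range(8):
            if b & 1:
                p ^= a
            carry = a & 0x80
            a = (a << 1) & 0xFF
            if carry:
                a ^= 0x1B
            b >>= 1
        return p

    def encode(blocks, coeffs):
        """E = sum_i c_i * B_i, computed byte-wise; addition in GF(2^8) is XOR."""
        out = bytearray(len(blocks[0]))
        for c, blk in zip(coeffs, blocks):
            for j, byte in enumerate(blk):
                out[j] ^= gf_mul(c, byte)
        return bytes(out)

    # A node forwards (coeffs, encode(segment_blocks, coeffs)); with n linearly
    # independent coefficient vectors, a receiver can decode the whole segment.
    segment_blocks = [os.urandom(16) for _ in range(10)]   # toy segment of 10 blocks
    coeffs = list(os.urandom(10))                           # random coefficients
    encoded = encode(segment_blocks, coeffs)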
  • 74. Benefits of network coding [Plot: goodput (blocks/round) vs. setup time (in rounds); curves for Segment-Netcoding, Segment-Random, Sequential, Random] Goodput ≈ 0.6 blocks/round with netcoding 0.4 blocks/round with seg-rand Siddhartha Annapureddy Thesis Defense
  • 75. Segment scheduling Netcoding good for fetching blocks in a segment, but cannot be used across segments Because decoding requires all encoded blocks Problem: How to schedule fetching of segments? Until now, sequential segment scheduling Results in rare segments when nodes arrive at different times Segment scheduling algorithm to avoid rare segments Evaluation with C# prototype Siddhartha Annapureddy Thesis Defense
  • 76. Segment scheduling A has 75% of the file Flash crowd of 20 Bs join with empty caches Server shared between A and Bs Throughput to A decreases, as server is busy serving Bs Initial segments served repeatedly Throughput of Bs low because the same segments are fetched from server Siddhartha Annapureddy Thesis Defense
  • 77. Segment scheduling – Problem 8 segments in the video; A has 75% = 6 segments, Bs have none Green is popularity 2, Red is popularity 1 Popularity = # full copies in the system Siddhartha Annapureddy Thesis Defense
  • 78. Segment scheduling – Problem Instead of serving A... Siddhartha Annapureddy Thesis Defense
  • 79. Segment scheduling – Problem The server gives segments to Bs Siddhartha Annapureddy Thesis Defense
  • 80. Segment scheduling – Problem [Plot: # data blocks downloaded vs. time (in sec); naive policy, curves for A and B1..n] A’s goodput plummets Get the best of both worlds – Improve A and Bs Idea: Each node serves only rarest segments in system Siddhartha Annapureddy Thesis Defense
  • 81. Segment scheduling – Algorithm When dst node connects to the src node... Here: dst is B, src is server Siddhartha Annapureddy Thesis Defense
  • 82. Segment scheduling – Algorithm src node sorts segments in order of popularity Segments 7, 8 least popular at 1 Segments 1-6 equally popular at 2 Siddhartha Annapureddy Thesis Defense
  • 83. Segment scheduling – Algorithm src considers segments in sorted order one-by-one and serves dst Either completely available at src, or First segment required by dst Siddhartha Annapureddy Thesis Defense
  • 84. Segment scheduling – Algorithm Server injects blocks from segment needed by A into system Avoids wasting bandwidth in serving initial segments multiple times Siddhartha Annapureddy Thesis Defense
  • 85. Segment scheduling – Popularities How does src figure out popularities? Centrally available at the server Our implementation uses this technique Each node maintains popularities for segments Could use a gossiping protocol for aggregation Siddhartha Annapureddy Thesis Defense
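Putting slides 81-85 together, the source-side segment choice might look like the following sketch; segment popularity (number of full copies in the system) is assumed to arrive from the server or a gossip protocol, as discussed on the previous slide.

    def has_full_segment(node, seg, seg_size=10):
        return all(seg * seg_size + k in node.blocks for k in range(seg_size))

    def first_needed_segment(dst, n_segments, seg_size=10):
        for seg in range(n_segments):
            if not has_full_segment(dst, seg, seg_size):
                return seg
        return None

    def choose_segment(src, dst, popularity, seg_size=10):
        """src picks which segment to serve dst: walk segments from rarest to most
        popular and take the first one that is either completely available at src
        or is the first segment dst still needs."""
        n_segments = len(popularity)
        needed = first_needed_segment(dst, n_segments, seg_size)
        for seg in sorted(range(n_segments), key=lambda s: popularity[s]):
            if has_full_segment(src, seg, seg_size) or seg == needed:
                return seg
        return needed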
  • 86. Segment scheduling [Plot: # data blocks downloaded vs. time (in sec); curves for A and B1..n under the naive policy and under segment scheduling] Note that goodput of both A and Bs improves Siddhartha Annapureddy Thesis Defense
  • 87. Conclusions Problem: Designing scalable data-intensive applications Enabled low-end servers to scale to large numbers of users Shark: Scales file server for read-heavy workloads RedCarpet: Scales video server for near-Video-on-Demand Exploited P2P for file systems, solved security issues Techniques to improve scalability Cross file(system) sharing Topology management – Data-based, locality awareness Scheduling of chunks – Network coding Siddhartha Annapureddy Thesis Defense