Thousands of Threads and Blocking I/O
   The old way to write Java servers is new again
                     (and way better)




                    Paul Tyma
              paul.tyma@gmail.com


             paultyma.blogspot.com
Who am I

    ●   Day job: Senior Engineer, Google
    ●   Founder, Preemptive Solutions, Inc.
        –   DashO, Dotfuscator
    ●   Founder/CEO ManyBrain, Inc.
        –   Mailinator
        –   empirical max rate seen ~1250 emails/sec
        –   extrapolates to 108 million emails/day
    ●   Ph.D., Computer Engineering, Syracuse
        University
                              ©2008 Paul Tyma
About This Talk

    ●   Comparison of server models
        –   IO – synchronous
        –   NIO – asynchronous
    ●   Empirical results
    ●   Debunk myths of multithreading
        –   (and of NIO, at times)
    ●   Every server is different
        –   compare 2 multithreaded server applications

An argument over server
design

    ●   Synchronous I/O, single connection per
        thread model
        –   high concurrency
        –   many threads (thousands)
    ●   Asynchronous I/O, single thread per
        server!
        –   in practice, more threads often come into play
        –   Application handles context switching between
            clients

Evolution

    ●   We started with simple threading
        modeled servers
        –   one thread per connection
        –   synchronization issues came to light quickly
        –   Turned out synchronization was hard
             ●   Pretty much everyone got it wrong
             ●   Resulting in nice, intermittent, untraceable,
                 unreproducible, occasional server crashes
             ●   Turns out whether we admit it or not, “occasional”
                 is workable

Evolution

    ●   The bigger problem was that scaling was limited
    ●   After a few hundred threads, things
        started to break down
        –   i.e., a few hundred clients
    ●   In comes Java NIO
        –   a pre-existing model elsewhere (C++, Linux,
            etc.)
    ●   Asynchronous I/O
        –   I/O becomes “event based”
Evolution - nio

    ●   The server goes about its business and is
        “notified” when some I/O event is ready
        to be processed
    ●   We must keep track of where each client
        is within an I/O transaction
        –   telephone: “For client XYZ, I just picked up
            the phone and said 'Hello'. I'm now waiting
            for a response.”
        –   In other words, we must explicitly save the
            state of each client
Evolution - nio

    ●   In theory, one thread for the entire server
        –   no synchronization
        –   no given task can monopolize anything
    ●   Rarely, if ever, works that way in practice
        –   small pools of threads handle several stages
        –   multiple back-end communication threads
        –   worker threads
        –   DB threads

Evolution

    ●   SEDA
        –   Matt Welsh's Ph.D. thesis
        –   http://www.eecs.harvard.edu/~mdw/proj/seda
        –   Gold standard in server design
             ●   dynamic thread pool sizing at different tiers
             ●   somewhat of an evolutionary algorithm – try
                 another thread, see what happens
              ●   Built on Java NBIO (another NIO library)



Evolution

    ●   Asynchronous I/O
        –   Limited by CPU, bandwidth, File descriptors
            (not threads)
    ●   Common knowledge that NIO >> IO
        –   java.nio
        –   java.io
    ●   Somewhere along the line, someone got “scalable”
        and “fast” mixed up – and it stuck


NIO vs. IO
    (asynchronous vs. synchronous io)




An attempt to examine
both paradigms

    ●   If you were writing a server today, which
        would you choose?
        –   why?




Reasons I've heard to choose NIO
(note: all of these are up for debate)




    ●   Asynchronous I/O is faster
    ●   Thread context switching is slow
    ●   Threads take up too much memory
    ●   Synchronization among threads will kill
        you
    ●   Thread-per-connection does not scale



Reasons to choose thread-per-
connection IO
(again, maybe, maybe not, we'll see)



    ●   Synchronous I/O is faster
    ●   Coding is much simpler
        –   You code as if you only have one client at a
            time
    ●   Make better use of multi-core machines




All the reasons
(for reference)



 ●   nio: Asynchronous I/O is faster
 ●   io: Synchronous I/O is faster
 ●   nio: Thread context switching is slow
 ●   io: Coding is much simpler
 ●   nio: Threads take up too much memory
 ●   io: Make better use of multi-cores
 ●   nio: Synchronization among threads will
     kill you
 ●   nio: Thread-per-connection does not scale
NIO vs. IO
which is faster


    ●   Forget threading for a moment
             ●   single sender, single receiver
    ●   For a tangential purpose I was
        benchmarking NIO and IO
        –   simple “blast data” benchmark
    ●   I could only get NIO to transfer data at
        about 75% of IO's speed
        –   asynchronous NIO, that is (blocking NIO was just as fast)
        –   I blamed myself, because we all know asynchronous
            is faster than synchronous, right?
NIO vs. IO

    ●   Started doing some emailing and
        googling, looking for benchmarks and
        experiential reports
         –   Yes, everyone knew NIO was faster
         –   No, no one had actually tested it personally
    ●   Started formalizing my benchmark, then found:
    ●   http://www.theserverside.com/discussions/thread.tss?thread_id=2
    ●   http://www.realityinteractive.com/rgrzywinski/archives/000096.ht



Excerpts

    ●   Blocking model was consistently 25-35% faster than using
        NIO selectors. Lot of techniques suggested by EmberIO folks
        were employed - using multiple selectors, doing multiple (2)
        reads if the first read returned EAGAIN equivalent in Java. Yet
        we couldn't beat the plain thread per connection model with
        Linux NPTL.
    ●   To work around not so performant/scalable poll()
        implementation on Linux's we tried using epoll with
        Blackwidow JVM on a 2.6.5 kernel. while epoll improved the
        over scalability, the performance still remained 25% below
        the vanilla thread per connection model. With epoll we
        needed lot fewer threads to get to the best performance
        mark that we could get out of NIO.
    Rahul Bhargava, CTO Rascal Systems
Asynchronous
(simplified)


    1) Make a system call to the selector
    2) If there's nothing to do, goto 1
    3) Loop through the ready events:
       a) if it's an OP_ACCEPT, make a system call to accept
          the connection, and save the key
       b) if it's an OP_READ, find the key, then make a system
          call to read the data
       c) if there are more events, goto 3
    4) goto 1
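The loop above maps onto Java's Selector API. A minimal sketch, doing a single pass with selectNow() so it terminates; a real server would block in select() and loop forever, and the port and buffer size here are illustrative:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class NioEchoSketch {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(0));   // ephemeral port, for the sketch
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        // One pass of the event loop from the slide; a real server loops forever.
        int ready = selector.selectNow();        // 1) system call to the selector
        for (SelectionKey key : selector.selectedKeys()) {
            if (key.isAcceptable()) {            // 3a) OP_ACCEPT: accept, save the key
                SocketChannel client = ((ServerSocketChannel) key.channel()).accept();
                client.configureBlocking(false);
                client.register(selector, SelectionKey.OP_READ);
            } else if (key.isReadable()) {       // 3b) OP_READ: find the key, read
                ((SocketChannel) key.channel()).read(ByteBuffer.allocate(4096));
            }
        }
        selector.selectedKeys().clear();
        System.out.println("ready=" + ready);    // no client has connected, so 0
        server.close();
        selector.close();
    }
}
```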
Synchronous
(simplified)




    1) Make a system call to accept a connection
       (the thread blocks there until we have one)
    2) Make a system call to read data
       (the thread blocks there until it gets some)
    3) goto 2
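The blocking flow can be shown end to end. A runnable sketch, with a throwaway client in the same process and a made-up one-line echo protocol so the example completes:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class BlockingEchoSketch {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(0);   // ephemeral port

        Thread handler = new Thread(() -> {
            try (Socket client = server.accept()) {  // 1) blocks until a connection arrives
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(client.getInputStream()));
                PrintWriter out = new PrintWriter(client.getOutputStream(), true);
                String line = in.readLine();         // 2) blocks until data arrives
                out.println("echo: " + line);        // reply; a real server loops to step 2
            } catch (IOException ignored) { }
        });
        handler.start();

        // Throwaway client so the sketch exercises the server and exits.
        try (Socket socket = new Socket("localhost", server.getLocalPort())) {
            PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream()));
            out.println("hello");
            System.out.println(in.readLine());
        }
        handler.join();
        server.close();
    }
}
```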




Straight Throughput
*do not compare charts against each other




[Chart: left panel – Linux 2.6 throughput in MB/s (y-axis 0–1000), nio server vs. io server. Right panel – Windows XP throughput in MB/s (y-axis 0–80), nio server vs. io server.]
All the reasons

●    nio: Asynchronous I/O is faster
●    io: Synchronous I/O is faster
●    nio: Thread context switching is slow
●    io: Coding is much simpler
●    nio: Threads take up too much memory
●    io: Make better use of multi-cores
●    nio: Synchronization among threads will
     kill you
 ●   nio: Thread-per-connection does not scale
Multithreading




Multithreading

    ●   Hard
          –   You might be saying “nah, it's not bad”
    ●   Let me rephrase
         –   Hard, because everyone thinks it isn't
    ●   http://today.java.net/pub/a/today/2007/06/28/extending-reentrantreadwritelock.html
    ●   http://www.ddj.com/java/199902669?pgno=3

    ●   Much like generics, however, you can't avoid it
         –   good news is that the rewards are significant
         –   Inherently takes advantage of multi-core systems

Multithreading

    ●   Linux 2.4 and before
        –   Threading was quite abysmal
        –   a few hundred threads max
    ●   Windows actually quite good
        –   up to 16,000 threads (a JVM limitation)



In walks NPTL

    ●   Threading library standard in Linux 2.6
        –   in Linux 2.4 by option
    ●   Idle thread cost is near zero
    ●   Context switching is much, much faster
    ●   Possible to run many (many) threads
    ●   http://en.wikipedia.org/wiki/Native_POSIX_Thread_Library




Thread context-switching is expensive

    ●   Lower lines are faster – JDK 1.6, Core Duo
    ●   The blue line represents up to 1000 threads competing for the CPUs
    ●   Notice the behavior between 1 and 2 threads
    ●   Whether 1 or 1000 threads are fighting for the CPU, context
        switching doesn't cost much at all
[Chart: “HashMap Gets – 0% Writes – Core Duo”. x-axis: number of threads (1–1000); y-axis: milliseconds. Series: HashMap, ConcHashMap, Cliff, Tricky.]
Synchronization is
Expensive
     ●   With only one thread, uncontended synchronization is cheap
     ●   Note how even just 2 threads magnify the sync cost
[Chart: “HashMap Gets – 0% Writes – Core Duo”, smaller is faster. x-axis: number of threads (1–1000); y-axis: milliseconds. Series: HashMap, SyncHashMap, Hashtable, ConcHashMap, Cliff, Tricky.]
4 core Opteron
    ●   Synchronization magnifies method-call overhead
    ●   more cores, more “sink” in the line
    ●   non-blocking data structures do very well – we don't always need
        explicit synchronization

[Chart: “HashMap Gets – 0% writes”, smaller is faster (4-core Opteron). x-axis: number of threads (1–1000); y-axis: ms to completion. Series: HashMap, SyncHashMap, Hashtable, ConcHashMap, Cliff, Tricky.]
How about a lot more cores?
guess: how many cores will be standard in 3 years?



[Chart: “Azul 768-core – 0% writes”. x-axis: number of threads (1–1000). Series: HashMap, SyncHashMap, Hashtable, ConcHashMap, Cliff, Tricky.]
Threading Summary

    ●   Uncontended Synchronization is cheap
        –   and can often be free
    ●   Contended synchronization gets more expensive
    ●   Non-blocking data structures scale well
    ●   Multithreaded programming style is quite
        viable even on single core



All the reasons

    ●   io: Synchronous I/O is faster
    ●   nio: Thread context switching is slow
    ●   io: Coding is much simpler
    ●   nio: Threads take up too much memory
    ●   io: Make better use of multi-cores
    ●   nio: Synchronization among threads will
        kill you
    ●   nio: Thread-per-connection does not scale
Many possible NIO server
configurations

    ●   True single thread
        –   Does selects
        –   reads
        –   writes
        –   and backend data retrieval/writing
             ●   from/to db
             ●   from/to disk
             ●   from/to cache



Many possible NIO server
configurations

    ●   True single thread
        –   This doesn't take advantage of multicores
        –   Usually just one selector
    ●   Usually a thread pool does the backend work
    ●   NIO must lock to hand writes back to the network thread
    ●   Interview question:
        –   What's harder, synchronizing 2 threads or
            synchronizing 1000 threads?

All the reasons

    ●   io: Synchronous I/O is faster
    ●   io: Coding is much simpler
    ●   nio: Threads take up too much memory


    CHOOSE
    ●   io: Make better use of multi-cores
    ●   nio: Synchronization among threads will
        kill you
The story of Rob von Behren
(as remembered by me from a lunch with Rob)


    ●   Set out to write a high-performance asynchronous
        server system
    ●   Found that when switching between clients, the code
        for saving and restoring values/state was difficult
    ●   Took a step back and wrote a finely-tuned, organized
        system for saving and restoring state between clients
    ●   When he was done, he sat back and realized he had
        written the foundation for a threading package




Synchronous I/O
state is kept in control flow

  // exceptions and such left out
  DataInputStream inputStream =
      new DataInputStream(socket.getInputStream());
  OutputStream outputStream = socket.getOutputStream();

  String command = null;
  do {
    command = inputStream.readUTF();
  } while (!command.equals("HELO") && sendError());

  do {
    command = inputStream.readUTF();
  } while (!command.startsWith("MAIL FROM:") && sendError());
  handleMailFrom(command);

  do {
    command = inputStream.readUTF();
  } while (!command.startsWith("RCPT TO:") && sendError());
  handleRcptTo(command);

  do {
    command = inputStream.readUTF();
  } while (!command.equals("DATA") && sendError());
  handleData();
Server in Action




Our 2 server designs
(synchronous multithreaded)


    ●   One thread per connection
        –   All code is written as if your server can only handle one
            connection
        –   And all data structures can be manipulated by many other
            threads
        –   Synchronization can be tricky
    ●   Reads are blocking
        –   Thus when waiting for a read, a thread is completely idle
        –   Writing is blocking too, but is not typically significant

    ●   Hey, what about scaling?
Synchronous
Multithreaded

    ●   Mailinator server:
        –   Quad opteron, 2GB of ram, 100Mbps
            ethernet, linux 2.6, java 1.6
        –   Runs both HTTP and SMTP servers
    ●   Quiz:
        –   How many threads can a computer run?


        –   How many threads should a computer run?

Synchronous
Multithreaded

    ●   Mailinator server:
        –   Quad opteron, 2GB of ram, 100Mbps
            ethernet, linux 2.6, java 1.6
    ●   How many threads can a computer run?
        –   Each thread in Java x32 requires 48k of stack
            space
             ●   a Linux Java limitation, not the OS (Windows: 2k?)
        –   option -Xss48k (512k by default)
        –   ~41,666 stacks
        –   not counting room for the OS, Java, other data,
            etc.
Synchronous
Multithreaded

    ●   Mailinator server:
        –   Quad opteron, 2GB of ram, 100Mbps
            ethernet, linux 2.6, java 1.6
    ●   How many threads should a computer run?
        –   Similar to asking, how much should you eat
            for dinner?
             ●   usual answer: until you've had enough
             ●   usual answer: until you're full
             ●   fair answer: until you're sick
             ●   odd answer: until you're dead
Synchronous
Multithreaded

    ●   Mailinator server:
        –   Quad opteron, 2GB of ram, 100Mbps ethernet,
            linux 2.6, java 1.6
    ●   How many threads should a computer
        run?
        –   “Just enough to max out your CPU(s)”
             ●   replace “CPU” with any other resource
        –   How many threads should you run for 100% cpu bound tasks?
             ●   maybe #ofCores+1 or so
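That sizing heuristic can be sketched with the JDK's core-count query; the +1 fudge factor is the slide's rule of thumb, not a hard requirement:

```java
public class ThreadSizing {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // Rule of thumb from the slide, for purely CPU-bound work.
        int cpuBoundThreads = cores + 1;
        // I/O-bound work tolerates far more threads, since most sit blocked.
        System.out.println("cpu-bound pool size: " + cpuBoundThreads);
    }
}
```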



Synchronous
Multithreaded

    ●   Note that servers must pick a saturation point
    ●   That's either CPU, network, memory, or something else
    ●   Typically, serving a lot of users well is better than
        serving a lot more very poorly (or not at all)
    ●   You often need “push back”, so that overload propagates
        all the way back to the front




[Diagram: Acceptor Threads –put→ Task Queue –fetch→ Worker Threads]

     ●   How big should the queue be?
     ●   Should it be unbounded?
     ●   How many worker threads?
     ●   What happens if it fills?
Design Decisions




ThreadPools

    ●   Executors.newCachedThreadPool = evil, die die die
    ●   Tasks go to a SynchronousQueue
        –   i.e., a direct hand-off from the task-giver to a
            thread
    ●   Unused threads eventually die (default: 60 sec)
    ●   New threads are created if all existing ones are busy
        –   but only up to MAX_INT threads (snicker)
    ●   Scenario: the CPU is pegged with work, so threads aren't
        finishing fast; more tasks arrive; more threads are
        created to handle the new work
        –   when the CPU is pegged, more threads are not what you need
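One way to avoid that scenario is a bounded pool that pushes back instead of spawning. A sketch, with illustrative pool and queue sizes; CallerRunsPolicy makes the submitting thread do the work when everything is full, which is one form of push-back:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedPoolSketch {
    public static void main(String[] args) throws Exception {
        // Unlike newCachedThreadPool (unbounded threads, SynchronousQueue),
        // this caps both the thread count and the queue length.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 8,                                   // core / max threads (illustrative)
                60, TimeUnit.SECONDS,                   // idle threads die after 60 sec
                new ArrayBlockingQueue<>(100),          // bounded task queue
                new ThreadPoolExecutor.CallerRunsPolicy()); // overflow runs in the caller

        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < 1000; i++) {
            pool.execute(done::incrementAndGet);        // trivial stand-in for real work
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("tasks completed: " + done.get());
    }
}
```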
Blocking Data Structures

    ●   BlockingQueue
        –   canonical “hand-off” structure
        –   embedded within Executors
        –   Rarely want LinkedBlockingQueue
             ●   i.e., more commonly use ArrayBlockingQueue
    ●   removals can block
    ●   insertions can wake up sleeping threads
    ●   In IO we can hand the worker threads the
        socket itself
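The hand-off pattern can be sketched with an ArrayBlockingQueue; the string payload here is a stand-in for the accepted socket:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class HandOffSketch {
    public static void main(String[] args) throws InterruptedException {
        // Bounded queue: acceptor threads put(), worker threads take().
        BlockingQueue<String> tasks = new ArrayBlockingQueue<>(16);

        Thread worker = new Thread(() -> {
            try {
                // take() blocks until an insertion wakes this thread up.
                System.out.println("worker got: " + tasks.take());
            } catch (InterruptedException ignored) { }
        });
        worker.start();

        tasks.put("client-socket-42");   // blocks only if the queue is full
        worker.join();
    }
}
```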
NonBlocking Data Structures

    ●   ConcurrentLinkedQueue
        –   a concurrent linked list based on CAS
        –   the elegance is downright fun
        –   No data corruption or blocking happens,
            regardless of the number of threads adding,
            deleting, or iterating
        –   Note that iteration is “fuzzy” (weakly consistent)
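A small demonstration of that guarantee: several threads hammer one ConcurrentLinkedQueue with no locks, and nothing is lost or corrupted (the thread and element counts are arbitrary):

```java
import java.util.concurrent.ConcurrentLinkedQueue;

public class CasQueueSketch {
    public static void main(String[] args) throws InterruptedException {
        ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<>();

        // Four threads add concurrently; CAS retries replace locks entirely.
        Thread[] adders = new Thread[4];
        for (int t = 0; t < adders.length; t++) {
            adders[t] = new Thread(() -> {
                for (int i = 0; i < 1000; i++) queue.offer(i);
            });
            adders[t].start();
        }
        for (Thread t : adders) t.join();
        System.out.println("size: " + queue.size());  // all 4000 elements survive
    }
}
```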




NonBlocking Data Structures

    ●   ConcurrentHashMap
        –   Not quite as concurrent
        –   non-blocking reads
        –   stripe-locked writes
        –   Can increase parallelism via the constructor's
            concurrency-level argument
    ●   Cliff Click's NonBlockingHashMap
        –   Fully non-blocking
        –   Does surprisingly well with many writes
        –   Also has NonBlockingHashMapLong (thanks Cliff!)
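A sketch of the striped-write constructor on the pre-Java-8 ConcurrentHashMap; the capacity, load factor, and concurrency level here are illustrative (NonBlockingHashMap itself lives in Cliff Click's third-party high-scale-lib, not the JDK):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StripedMapSketch {
    public static void main(String[] args) {
        // Pre-1.8 ConcurrentHashMap striped its write locks; the third
        // constructor argument hinted how many writer threads to expect.
        Map<String, Integer> hits = new ConcurrentHashMap<>(1024, 0.75f, 64);

        hits.put("paul.tyma", 1);
        // Reads never block, even while other threads write.
        System.out.println("hits: " + hits.get("paul.tyma"));
    }
}
```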

BufferedStreams

    ●   BufferedOutputStream
        –   simply creates an 8k buffer
        –   requires flushing
        –   can immensely improve performance if you
            keep sending small sets of bytes
        –   The native call to send a single byte wraps it
            in an array and then sends that
        –   Think of buffered streams as adding a
            memory copy between your code and the
            send
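The buffering effect is easy to see with a ByteArrayOutputStream standing in for the socket's stream (the byte counts are illustrative):

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class BufferedWriteSketch {
    public static void main(String[] args) throws IOException {
        // Stand-in for a socket's output stream, so the sketch runs anywhere.
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        BufferedOutputStream out = new BufferedOutputStream(wire, 8192);

        // 100 tiny writes land in the 8k buffer, not in 100 underlying writes.
        for (int i = 0; i < 100; i++) {
            out.write('x');
        }
        System.out.println("sent before flush: " + wire.size());
        out.flush();                     // nothing hits the wire until this
        System.out.println("sent after flush: " + wire.size());
    }
}
```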
BufferedStreams

    ●   BufferedOutputStream
        –   If your sends are broken up, use a buffered
            stream
        –   if you already package your whole message
            into a byte array, don't buffer
             ●   i.e., you already did the buffering
        –   The default buffer in BufferedOutputStream is
            8k – what if your message is bigger?



Bytes

    ●   Java programmers seem to have a fascination with
        Strings
        –   fair enough, they are a nice human abstraction
    ●   Can you keep your server's data solely in bytes?
    ●   Object allocation is cheap but a few million objects are
        probably measurable
    ●   String manipulation is cumbersome internally
    ●   If you can stay with byte arrays, it won't hurt
    ●   Careful with autoboxing too


Papers
(indirect references)


    ●   “Why Events Are a Bad Idea (for High-
        Concurrency Servers)”, Rob von Behren
    ●   Dan Kegel's C10K problem
        –   somewhat dated now but still great depth




Designing Servers




A static Web Server

    ●   HTTP is stateless
        –   very nice, one request, one response
        –   Many clients; many short connections
        –   Interestingly, for normal GET operations (Ajax
            too) we only block as they call in. After that,
            they are blocked waiting for our reply
        –   Threads must cooperate on the cache




Thread per request

[Diagram: Client → fixed set of Worker threads → cache / disk]

     ●   Create a fixed number of threads in the beginning
          –   (not in a thread pool)
     ●   All threads block at accept(), then go process the client
     ●   Watch socket timeouts
     ●   Catch all exceptions
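The fixed-thread design above can be sketched directly: all workers block in accept() on one shared ServerSocket. The worker count and timeout are illustrative, and the sketch closes the socket immediately so it exits:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class PreforkedThreadsSketch {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(0);  // ephemeral port, for the sketch
        int workers = 3;                            // fixed count, chosen up front

        for (int i = 0; i < workers; i++) {
            new Thread(() -> {
                while (true) {
                    // All workers block here; the kernel wakes one per connection.
                    try (Socket client = server.accept()) {
                        client.setSoTimeout(30_000);   // watch socket timeouts
                        // ... read the request, serve from cache or disk ...
                    } catch (IOException e) {
                        return;                        // catch all exceptions
                    }
                }
            }).start();
        }

        System.out.println("workers blocked at accept: " + workers);
        server.close();                 // makes accept() throw, so the workers exit
    }
}
```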
Thread per request

[Diagram: Client → Acceptor → create new Thread → Worker threads → cache / disk]

     ●   Thread creation cost is historically expensive
     ●   Load tempered by keeping track of the number of threads alive
          –   not terribly precise
Thread per request

[Diagram: Client → Acceptor → Queue → Worker threads → cache / disk]

     ●   Executors provide many knobs for thread tuning
     ●   Handle dead threads
     ●   Queue size can indicate load (to a degree)
     ●   Acceptor threads can help out easily
A static Web Server

    ●   Given HTTP is request-reply
        –   The longer transactions take, the more
            threads you need (because more are waiting
            on the backend)
        –   For a smartly cached static webserver on a 2-core
            machine, 15-20 threads saturated the system
        –   Add a database backend or other complex
            processing per transaction, and the number of
            threads increases linearly


SMTP Server
    a much more chatty protocol




An SMTP server

    ●   Very conversational protocol
    ●   Server spends a lot of time waiting for
        the next command (like many
        milliseconds)
    ●   Many threads are simply asleep
    ●   A few hundred threads: very common
    ●   A few thousand: not uncommon
        –   (the JVM allows ~32k max)

An SMTP server

    ●   All threads must cooperate on storage of
        messages
        –   (keep in mind the concurrent data
            structures!)
    ●   Possible to saturate the bandwidth after a
        few thousand threads
        –   or the disk




Coding

    ●   Multithreaded server coding is more
        intuitive
        –   You simply follow the flow of what's going to
            happen to one client
    ●   Non-blocking data structures are very fast
    ●   Immutable data is fast
    ●   Sticking with bytes-in, bytes-out is nice
        –   might need more utility methods

Questions?





Thousands of Threads and Blocking I/O

  • 1. Thousands of Threads and Blocking I/O The old way to write Java Servers is New again (and way better) Paul Tyma paul.tyma@gmail.com paultyma.blogspot.com
  • 2. Who am I ● Day job: Senior Engineer, Google ● Founder, Preemptive Solutions, Inc. – Dasho, Dotfuscator ● Founder/CEO ManyBrain, Inc. – Mailinator – empirical max rate seen ~1250 emails/sec – extrapolates to 108 million emails/day ● Ph.D., Computer Engineering, Syracuse University   2 C2008 Paul Tyma
  • 3. About This Talk ● Comparison of server models – IO – synchronous – NIO – asynchronous ● Empirical results ● Debunk myths of multithreading – (and nio at times) ● Every server is different – compare 2 multithreaded server applications   3 C2008 Paul Tyma
  • 4. An argument over server design ● Synchronous I/O, single connection per thread model – high concurrency – many threads (thousands) ● Asynchronous I/O, single thread per server! – in practice, more threads often come into play – Application handles context switching between clients   4 C2008 Paul Tyma
  • 5. Evolution ● We started with simple threading modeled servers – one thread per connection – synchronization issues came to light quickly – Turned out synchronization was hard ● Pretty much, everyone got it wrong ● Resulting in nice, intermittent, untraceable, unreproducible, occasional server crashes ● Turns out whether we admit it or not, “occasional” is workable   5 C2008 Paul Tyma
  • 6. Evolution ● The bigger problem was that scaling was limited ● After a few hundred threads things started to break down – i.e., few hundred clients ● In comes java NIO – a pre-existing model elsewhere (C++, linux, etc.) ● Asynchronous I/O – I/O becomes “event based”   6 C2008 Paul Tyma
  • 7. Evolution - nio ● The server goes about its business and is “notified” when some I/O event is ready to be processed ● We must keep track of where each client is within a i/o transaction – telephone: “For client XYZ, I just picked up the phone and said 'Hello'. I'm now waiting for a response.” – In other words, we must explicitly save the state of each client   7 C2008 Paul Tyma
  • 8. Evolution - nio ● In theory, one thread for the entire server – no synchronization – no given task can monopolize anything ● Rarely, if ever works that way in practice – small pools of threads handle several stages – multiple back-end communication threads – worker threads – DB threads   8 C2008 Paul Tyma
  • 9. Evolution ● SEDA – Matt Welsh's Ph.D. thesis – http://www.eecs.harvard.edu/~mdw/proj/seda – Gold standard in server design ● dynamic thread pool sizing at different tiers ● somewhat of an evolutionary algorithm – try another thread, see what happens ● Build on Java NBIO (other NIO lib)   9 C2008 Paul Tyma
  • 10. Evolution ● Asynchronous I/O – Limited by CPU, bandwidth, File descriptors (not threads) ● Common knowledge that NIO >> IO – java.nio – java.io ● Somewhere along the line, someone got “scalable” and “fast” mixed up – and it stuck   10 C2008 Paul Tyma
  • 11. NIO vs. IO (asynchronous vs. synchronous io)   11 C2008 Paul Tyma
  • 12. An attempt to examine both paradigms ● If you were writing a server today, which would you choose? – why?   12 C2008 Paul Tyma
  • 13. Reasons I've heard to choose NIO (note: all of these are up to debate) ● Asynchronous I/O is faster ● Thread context switching is slow ● Threads take up too much memory ● Synchronization among threads will kill you ● Thread-per-connection does not scale   13 C2008 Paul Tyma
  • 14. Reasons to choose thread-per- connection IO (again, maybe, maybe not, we'll see) ● Synchronous I/O is faster ● Coding is much simpler – You code as if you only have one client at a time ● Make better use of multi-core machines   14 C2008 Paul Tyma
  • 15. All the reasons (for reference) ● nio: Asynchronous I/O is faster ● io: Synchronous I/O is faster ● nio: Thread context switching is slow ● io: Coding is much simpler ● nio: Threads take up too much memory ● io: Make better use of multi-cores ● nio: Synchronization among threads will kill you  ● nio: Thread-per-connection C2008 Paul Tyma scale 15 does not
  • 16. NIO vs. IO which is faster ● Forget threading a moment ● single sender, single receiver ● For a tangential purpose I was benchmarking NIO and IO – simple “blast data” benchmark ● I could only get NIO to transfer data up to about 75% of IO's speed – asynchronous (blocking NIO was just as fast) – I blamed myself because we all know asynchronous is faster than synchronous right?   16 C2008 Paul Tyma
  • 17. NIO vs. IO ● Started doing some emailing and googling looking for benchmarks, experiential reports – Yes, everyone knew NIO was faster – No, no one had actually personally tested it ● Started formalizing my benchmark then found ● http://www.theserverside.com/discussions/thread.tss?thread_id=2 ● http://www.realityinteractive.com/rgrzywinski/archives/000096.ht   17 C2008 Paul Tyma
  • 18. Excerpts ● Blocking model was consistently 25-35% faster than using NIO selectors. Lot of techniques suggested by EmberIO folks were employed - using multiple selectors, doing multiple (2) reads if the first read returned EAGAIN equivalent in Java. Yet we couldn't beat the plain thread per connection model with Linux NPTL. ● To work around not so performant/scalable poll() implementation on Linux's we tried using epoll with Blackwidow JVM on a 2.6.5 kernel. while epoll improved the over scalability, the performance still remained 25% below the vanilla thread per connection model. With epoll we needed lot fewer threads to get to the best performance mark that we could get out of NIO. Rahul Bhargava, CTO Rascal Systems   18 C2008 Paul Tyma
  • 19. Asynchronous (simplified) 1)Make system call to selector 2)if nothing to do, goto 1 3)loop through tasks a)if its an OP_ACCEPT, system call to accept the connection, save the key b)if its an OP_READ, find the key, system call to read the data c)if more tasks goto 3 4)goto 1   19 C2008 Paul Tyma
  • 20. Synchronous (simplified) 1)Make system call to accept connection (thread blocks there until we have one) 2)Make system call to read data (thread blocks there until it gets some) 3)goto 2   20 C2008 Paul Tyma
  • 21. Straight Throughput *do not compare charts against each other Linux 2.6 throughput Windows XP Throughput 1000 80 75 900 70 800 65 60 700 55 50 600 45 MB/s MB/s 500 40 35 400 30 300 25 20 200 15 10 100 5 0 0 nio server io server nio server io server   21 C2008 Paul Tyma
  • 22. All the reasons ● nio: Asynchronous I/O is faster ● io: Synchronous I/O is faster ● nio: Thread context switching is slow ● io: Coding is much simpler ● nio: Threads take up too much memory ● io: Make better use of multi-cores ● nio: Synchronization among threads will kill you  ● nio: Thread-per-connection C2008 Paul Tyma scale 22 does not
  • 23. Multithreading   23 C2008 Paul Tyma
  • 24. Multithreading ● Hard – You might be saying “nah, its not bad” ● Let me rephrase – Hard, because everyone thinks it isn't ● http://today.java.net/pub/a/today/2007/06/28/extending-reentrantreadwritelock.html ● http://www.ddj.com/java/199902669?pgno=3 ● Much like generics however, you can't avoid it – good news is that the rewards are significant – Inherently takes advantage of multi-core systems   24 C2008 Paul Tyma
  • 25. Multithreading ● Linux 2.4 and before – Threading was quite abysmal – few hundred threads max ● Windows actually quite good ● up to 16000 threads – jvm limitations   25 C2008 Paul Tyma
  • 26. In walks NPTL ● Threading library standard in Linux 2.6 – linux 2.4 by option ● Idle thread cost is near zero ● context-switching is much much faster ● Possible to run many (many) threads ● http://en.wikipedia.org/wiki/Native_POSIX_Thread_Library   26 C2008 Paul Tyma
  • 27. Thread context-switching is expensive ● Lower lines are faster – JDK1.6, Core duo ● Blue line represents up-to-1000 threads competing for the CPUs (core duo) ● notice behavior between 1 and 2 threads ● 1 or 1000 all fighting for the CPU, context switching doesn't cost much at all HashMap Gets ­ 0% Writes ­ Core Duo 2500 2250 2000 1750 1500 milliseconds HashMap 1250 ConcHashMap Cliff 1000 Tricky 750 500 250 0 1 2 5 100 500 1000 number of threads   27 C2008 Paul Tyma
  • 28. Synchronization is Expensive ● With only one thread, uncontended synchronization is cheap ● Note how even just 2 threads magnifies synch cost HashMap Gets ­ O% Writes ­ Core Duo Smaller is Faster 11000 10500 10000 9500 9000 8500 8000 7500 HashMap 7000 SyncHashMap 6500 Hashtable milliseconds 6000 5500 ConcHashMap 5000 Cliff 4500 Tricky 4000 3500 3000 2500 2000 1500 1000 1 2 5 100 500 1000 number of threads   28 C2008 Paul Tyma
  • 29. 4 core Opteron ● Synchronization magnifies method call overhead ● more cores, more “sink” in line ● non-blocking data structures do very well - we don't always need explicit synchronization HashMap Gets ­ 0% writes Smaller is faster 7000 6500 6000 5500 5000 HashMap 4500 SyncHashMap MS to completion 4000 Hashtable 3500 ConcHashMap Cliff 3000 Tricky 2500 2000 1500 1000 500 1 2 5 100 500 1000   29 number of threads C2008 Paul Tyma
  • 30. How about a lot more cores? guess: how many cores standard in 3 years? Azul 768core ­ 0% writes 22500 20000 17500 15000 HashMap SyncHashMap 12500 Hashtable ConcHashMap 10000 Cliff Tricky 7500 5000 2500 0 1 2 5 100 500 1000   30 C2008 Paul Tyma
  • 31. Threading Summary ● Uncontended Synchronization is cheap – and can often be free ● Contended Synch gets more expensive ● Nonblocking datastructures scale well ● Multithreaded programming style is quite viable even on single core   31 C2008 Paul Tyma
  • 32. All the reasons ● io: Synchronous I/O is faster ● nio: Thread context switching is slow ● io: Coding is much simpler ● nio: Threads take up too much memory ● io: Make better use of multi-cores ● nio: Synchronization among threads will kill you ● nio: Thread-per-connection does not scale   32 C2008 Paul Tyma
  • 33. Many possible NIO server configurations ● True single thread – Does selects – reads – writes – and backend data retrieval/writing ● from/to db ● from/to disk ● from/to cache   33 C2008 Paul Tyma
  • 34. Many possible NIO server configurations ● True single thread – This doesn't take advantage of multicores – Usually just one selector ● Usually use a threadpool to do backend work ● NIO must lock to give writes back to net thread ● Interview question: – What's harder, synchronizing 2 threads or synchronizing 1000 threads?   34 C2008 Paul Tyma
  • 35. All the reasons ● io: Synchronous I/O is faster ● io: Coding is much simpler ● nio: Threads take up too much memory CHOOSE ● io: Make better use of multi-cores ● nio: Synchronization among threads will kill you   35 C2008 Paul Tyma
  • 36. The story of Rob Van Behren (as remembered by me from a lunch with Rob) ● Set out to write a high-performance asynchronous server system ● Found that when switching between clients, the code for saving and restoring values/state was difficult ● Took a step back and wrote a finely-tuned, organized system for saving and restoring state between clients ● When he was done, he sat back and realized he had written the foundation for a threading package   36 C2008 Paul Tyma
  • 37. Synchronous I/O state is kept in control flow   // exceptions and such left out   InputStream inputStream = socket.getInputStream();   OutputStream outputStream = socket.getOutputStream();   String command = null;   do {        command = inputStream.readUTF();    } while ((!command.equals(“HELO”) && sendError()); do {  command = inputStream.readUTF(); } while ((!command.startsWith(“MAIL FROM:”) && sendError()); handleMailFrom(command); do {  command = inputStream.readUTF(); } while ((!command.startsWith(“RCPT TO:”) && sendError()); handleRcptTo(command);   do {  command = inputStream.readUTF(); } while ((!command.equals(“DATA”) && sendError()); handleData();   37 C2008 Paul Tyma
  • 38. Server in Action   38 C2008 Paul Tyma
  • 39. Our 2 server designs (synchronous multithreaded) ● One thread per connection – All code is written as if your server can only handle one connection – And all data structures can be manipulated by many other threads – Synchronization can be tricky ● Reads are blocking – Thus when waiting for a read, a thread is completely idle – Writing is blocking too, but is not typically significant ● Hey, what about scaling?   39 C2008 Paul Tyma
  • 40. Synchronous Multithreaded ● Mailinator server: – Quad opteron, 2GB of ram, 100Mbps ethernet, linux 2.6, java 1.6 – Runs both HTTP and SMTP servers ● Quiz: – How many threads can a computer run? – How many threads should a computer run?   40 C2008 Paul Tyma
  • 41. Synchronous Multithreaded ● Mailinator server: – Quad opteron, 2GB of ram, 100Mbps ethernet, linux 2.6, java 1.6 ● How many threads can a computer run? – Each thread in java x32 requires 48k of stack space ● linux java limitation, not OS (windows = 2k?) – option -Xss:48k (512k by default) – ~41666 stack frames – not counting room for OS, java, other data, C2008 Paul Tyma   41 etc
  • 42. Synchronous Multithreaded ● Mailinator server: – Quad opteron, 2GB of ram, 100Mbps ethernet, linux 2.6, java 1.6 ● How many threads should a computer run? – Similar to asking, how much should you eat for dinner? ● usual answer: until you've had enough ● usual answer: until you're full ● fair answer: until you're sick ● odd answser: until you're dead   42 C2008 Paul Tyma
  • 43. Synchronous Multithreaded ● Mailinator server: – Quad opteron, 2GB of ram, 100Mbps ethernet, linux 2.6, java 1.6 ● How many threads should a computer run? – “Just enough to max out your CPU(s)” ● replace “CPU” with any other resource – How many threads should you run for 100% cpu bound tasks? ● maybe #ofCores+1 or so   43 C2008 Paul Tyma
  • 44. Synchronous Multithreaded ● Note that servers must pick a saturation point ● Thats either CPU or network or memory or something ● Typically serving a lot of users well is better than serving a lot more very poorly (or not at all) ● You often need “push back” such that too much traffic comes all the way back to the front   44 C2008 Paul Tyma
  • 45. Acceptor Threads Task Queue put fetch Worker Threads how big should the queue be? should it be unbounded? how many worker threads? what happens if it fills?   45 C2008 Paul Tyma
  • 46. Design Decisions   46 C2008 Paul Tyma
  • 47. ThreadPools ● Executors.newCachedThreadPool = evil, die die die ● Tasks goto SynchronousQueue – i.e., direct hand off from task-giver to new thread ● unused threads eventually die (default: 60sec) ● new threads are created if all existing are busy – but only up to MAX_INT threads (snicker) ● Scenario: CPU is pegged with work so threads aren't finishing fast, more tasks arrive, more threads are created to handle new work – when CPU is pegged, more threads is not what you need   47 C2008 Paul Tyma
  • 48. Blocking Data Structures ● BlockingQueue – canonical “hand-off” structure – embedded within Executors – Rarely want LinkedBlockingQueue ● i.e. more commonly use ArrayBlockingQueue ● removals can be blocking ● insertions can wake-up sleeping threads ● In IO we can hand the worker threads the socket   48 C2008 Paul Tyma
  • 49. NonBlocking Data Structures ● ConcurrentLinkedQueue – concurrent linkedlist based on CAS – elegance is downright fun – No data corruption or blocking happens regardless of the number of threads adding or deleting or iterating – Note that iteration is “fuzzy”   49 C2008 Paul Tyma
  • 50. NonBlocking Data Structures ● ConcurrentHashMap – Not quite as concurrent – non-blocking reads – stripe-locked writes – Can increase parallelism in constructor ● Cliff Click's NonBlockingHashMap – Fully non-blocking – Does surprisingly well with many writes – Also has NonBlockingLongHashMap (thanks Cliff!)   50 C2008 Paul Tyma
  • 51. BufferedStreams ● BufferedOutputStream – simply creates an 8k buffer – requires flushing – can immensely improve performance if you keep sending small sets of bytes – The native call to do a send of a byte wraps it in an array and then sends that – Look at BufferedStreams like adding a memory copy between your code and the send   51 C2008 Paul Tyma
  • 52. BufferedStreams ● BufferedOutputStream – If your sends are broken up, use a buffered stream – if you already package your whole message into a byte array, don't buffer ● i.e., you already did – Default buffer in BufferedOutputStream is 8k – what if your message is bigger?   52 C2008 Paul Tyma
  • 53. Bytes ● Java programmers seem to have a fascination with Strings – fair enough, they are a nice human abstraction ● Can you keep your server's data solely in bytes? ● Object allocation is cheap but a few million objects are probably measurable ● String manipulation is cumbersome internally ● If you can stay with byte arrays, it won't hurt ● Careful with autoboxing too   53 C2008 Paul Tyma
• 54. Papers (indirect references) ● “Why Events Are a Bad Idea (for High-Concurrency Servers)”, Rob von Behren ● Dan Kegel's “The C10K Problem” – somewhat dated now but still great depth   54 C2008 Paul Tyma
  • 55. Designing Servers   55 C2008 Paul Tyma
• 56. A static Web Server ● HTTP is stateless – very nice, one request, one response – Many clients, many short connections – Interestingly, for normal GET operations (Ajax too) we only block while the request comes in; after that, the clients are the ones blocked, waiting for our reply – Threads must cooperate on the cache   56 C2008 Paul Tyma
• 57. Thread per request [diagram: Client → Worker threads (Worker, Worker, Worker) → cache/disk] ● Create a fixed number of threads in the beginning ● (not in a thread pool) ● All threads block at accept(), then go process the client ● Watch socket timeouts ● Catch all exceptions   57 C2008 Paul Tyma
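The fixed-thread design on this slide can be sketched as follows; start(), the ephemeral port, and the "ok" reply are all illustrative stand-ins for serving from the cache (ServerSocket.accept is safe to call from many threads at once):

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class AcceptPool {
    // A fixed set of worker threads, each blocked in accept();
    // the kernel hands each new connection to exactly one of them.
    static ServerSocket start(int workers) throws IOException {
        ServerSocket server = new ServerSocket(0); // ephemeral port for the sketch
        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                while (!server.isClosed()) {
                    try (Socket client = server.accept()) {
                        client.setSoTimeout(30_000); // watch socket timeouts
                        PrintWriter out =
                                new PrintWriter(client.getOutputStream(), true);
                        out.println("ok"); // stand-in for serving from the cache
                    } catch (Exception e) {
                        // catch all exceptions: a bad client must not kill the worker
                    }
                }
            });
            t.setDaemon(true);
            t.start();
        }
        return server;
    }
}
```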
• 58. Thread per request [diagram: Clients → Acceptor creates new Thread → Workers → cache/disk] ● Thread creation cost is historically expensive ● Load is tempered by keeping track of the number of threads alive ● not terribly precise   58 C2008 Paul Tyma
• 59. Thread per request [diagram: Clients → Acceptor → Queue → Workers → cache/disk] ● Executors provide many knobs for thread tuning ● Handle dead threads ● Queue size can indicate load (to a degree) ● Acceptor threads can help out easily   59 C2008 Paul Tyma
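A sketch of the Executor version's knobs; the pool sizes and queue bound are made-up numbers, and CallerRunsPolicy is one way to let the acceptor "help out" when the queue fills:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class TunedPool {
    static ThreadPoolExecutor newPool() {
        // bounded queue: its depth doubles as a rough load indicator
        BlockingQueue<Runnable> work = new ArrayBlockingQueue<>(100);
        // the pool replaces any worker killed by an uncaught exception
        return new ThreadPoolExecutor(
                10, 50,               // core and maximum worker threads
                60, TimeUnit.SECONDS, // idle extra workers die after a minute
                work,
                // queue and pool full: the accepting thread runs the task itself
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    public static void main(String[] args)
            throws InterruptedException, ExecutionException {
        ThreadPoolExecutor pool = newPool();
        int answer = pool.submit(() -> 42).get(); // stand-in for client work
        System.out.println(answer); // 42
        pool.shutdown();
    }
}
```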
• 60. A static Web Server ● Given HTTP is request-reply – The longer transactions take, the more threads you need (because more are waiting on the backend) – For a smartly cached static webserver on a 2-core machine, 15-20 threads saturated the system – Add a database backend or other complex processing per transaction and the number of threads needed increases linearly   60 C2008 Paul Tyma
  • 61. SMTP Server a much more chatty protocol   61 C2008 Paul Tyma
• 62. An SMTP server ● Very conversational protocol ● Server spends a lot of time waiting for the next command (as in, many milliseconds) ● Many threads simply asleep ● A few hundred threads very common ● A few thousand not uncommon – (the JVM allows a maximum of ~32k)   62 C2008 Paul Tyma
  • 63. An SMTP server ● All threads must cooperate on storage of messages – (keep in mind the concurrent data structures!) ● Possible to saturate the bandwidth after a few thousand threads – or the disk   63 C2008 Paul Tyma
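The "threads asleep waiting for the next command" point can be sketched as a per-connection loop; converse() and the replies are a toy stand-in for a real SMTP implementation, with readLine() as the blocking point where the thread sleeps:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;

public class SmtpWorker {
    // Run by one thread per connection; readLine() blocks (the thread sleeps)
    // while the client composes its next command, often for many milliseconds.
    static String converse(BufferedReader in, PrintWriter out) throws IOException {
        out.println("220 ready");
        StringBuilder log = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) { // the thread parks here
            log.append(line).append(';');
            if (line.equalsIgnoreCase("QUIT")) {
                out.println("221 bye");
                break;
            }
            out.println("250 ok"); // toy acknowledgement for every other command
        }
        return log.toString();
    }
}
```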
• 64. Coding ● Multithreaded server coding is more intuitive – You simply follow the flow of what's going to happen to one client ● Non-blocking data structures are very fast ● Immutable data is fast ● Sticking with bytes-in, bytes-out is nice – might need more utility methods   64 C2008 Paul Tyma
  • 65. Questions?   65 C2008 Paul Tyma