2. Content
• What is Memcached
• Usage model
• Measuring performance
• Baseline performance & scalability
• Performance root cause
• Base transaction flow
• Optimization goals, design considerations
• Optimized transaction flow
• Optimization details
• Optimized version performance
• Summary
3. What is Memcached?
• Open Source distributed memory caching system
− Typically serves as a cache for persistent databases
− In-memory key-value store for quick data access
− For a particular “key” a “value” is stored/deleted/retrieved etc.
− Provides a networked data caching API that is simple to use and set up
• Used by many companies with web centric businesses
• Most common usage model - web data caching
− Original data resides in persistent database
− Database queries are expensive
− Memcached caches the data to provide low latency access
− Helps reduce the load on the database
• Computational cache
• Temporary object store
4. Web data caching usage model
• Memcached tier acts as a cache for the database tier
− Cache is spread over several memcached servers
• Client requests the “value” associated with a “key”
• A “GET” request for “key” sent to memcached
• If “key” found
− Memcached returns “value” for “key”
• If “key” not found
− Persistent database is queried for “key”
− “value” from database is returned to client
− “SET” request sent to MC with “key” & “value”
• Key-value pair stays in cache unless
− It is evicted because of cache LRU policies
− Explicitly removed by a “DELETE” request
• Typical operations
− GET, SET, DELETE, STATS, REPLACE, etc.
• Most frequent transaction is “GET”
− Impacts perf of most common use cases
5. Measuring performance
• Measure perf of most important transaction - “get”
• Best perf = max “get” Requests Per Sec (RPS) under SLA
− SLA (Service Level Agreement) : Average “get” latency <= 1 ms
• Measurement configuration is “client-server”
− Run memcached on one or more servers
− Run load generator(s) on client machine(s) to send requests to MC servers
− Load generator keeps track of transactions and reports results
• Process
− Load gen sends “set” requests to prime cache with key-value pairs
− For each RPS level in a range, do the following until avg latency > 1 ms:
− Send “get”s for random keys for 60 secs, calculate average latency
• S/W and H/W configuration
− Open Source Memcached V 1.6 base and optimized
− Open Source Mcblaster load generator
− Intel® Xeon® E5-2660 2.2 GHz, 10GB NIC, 64 GB memory
6. Baseline performance & core scalability
• Intel® Xeon® E5-2660 2.2 GHz, 10GB NIC, 64 GB memory
• Intel® Turbo Boost Technology ON, Intel® Hyper-Threading Technology OFF
No scalability beyond 3 cores, degrades beyond 4
Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance
tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions.
Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in
fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more
information go to http://www.intel.com/performance. Configuration: Intel® Xeon® E5-2660 2.2 GHz, 10GB NIC, 64 GB memory,
Intel® Turbo Boost Technology ON, Intel® Hyper-Threading Technology OFF
7. Performance root cause
• Profile during “gets” shows lots of time spent in locks
• Drill down into code shows coarse grained global cache locks
− Held for most of a thread’s execution time
• Removing the global locks & measuring “gets” showed substantial improvement
− Unsafe, done only as a proof of concept
• “Top” shows unbalanced CPU core utilization, possibilities are:
− Sub-optimal network packet handling and distribution
− Thread migration between cores
8. Transaction flow
• Incoming requests from clients
• Libevent distributes them to MC threads
− # of MC threads = # of cores
− No thread affinity
• Threads do key hashing in parallel
• Hash table processing to
− Find place for new item (key-value pair)
− Find location of existing item
• LRU processing to maintain cache policy
− Move item to front of list, indicating most recently accessed
• A global cache lock around hash table and LRU processing
− Serializes all transactions on all threads
− This is the key bottleneck to scalability
• Final responses handled in parallel
9. The hash table
• Hash table is arranged as an array of buckets
• Each bucket has a singly linked list as a hash chain
• The hashed key is used to find the bucket it belongs in
• Item (key-value pair) is then inserted into or retrieved from that bucket’s hash chain
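As a sketch of the structure described above (the item layout and the FNV-1a hash are illustrative assumptions for this example, not memcached's actual code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical item layout: each bucket heads a singly linked hash chain. */
typedef struct item {
    struct item *h_next;        /* next item in this bucket's chain */
    char key[32];
    char value[64];
} item;

#define NBUCKETS 1024           /* power of two so masking works as modulo */
static item *buckets[NBUCKETS];

/* Toy FNV-1a hash; memcached itself uses a different hash function. */
static uint32_t hash_key(const char *key) {
    uint32_t h = 2166136261u;
    while (*key) { h ^= (uint8_t)*key++; h *= 16777619u; }
    return h;
}

/* Insert at the head of the bucket's chain. */
static void ht_set(item *it) {
    uint32_t b = hash_key(it->key) & (NBUCKETS - 1);
    it->h_next = buckets[b];
    buckets[b] = it;
}

/* Walk the chain of the bucket the key hashes to. */
static item *ht_get(const char *key) {
    uint32_t b = hash_key(key) & (NBUCKETS - 1);
    for (item *it = buckets[b]; it != NULL; it = it->h_next)
        if (strcmp(it->key, key) == 0)
            return it;
    return NULL;
}
```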
10. The LRU
• LRU - Least Recently Used cache management scheme
− Cache capacity is finite - evict old items to make room for new ones
− LRU policy determines eviction order of cache items
− Oldest active cache item is evicted first
• Uses a doubly linked list for quick manipulation
− Head has most recently used item
− A GET removes the item from its current position & moves it to the head
− On eviction the tail is checked for oldest item
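The move-to-head operation the slide describes can be sketched as follows; the struct and function names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical doubly linked LRU list, as described above. */
typedef struct lru_item {
    struct lru_item *prev, *next;
    const char *key;
} lru_item;

static lru_item *lru_head;      /* most recently used */
static lru_item *lru_tail;      /* eviction candidate */

static void lru_unlink(lru_item *it) {
    if (it->prev) it->prev->next = it->next; else lru_head = it->next;
    if (it->next) it->next->prev = it->prev; else lru_tail = it->prev;
    it->prev = it->next = NULL;
}

static void lru_push_head(lru_item *it) {
    it->prev = NULL;
    it->next = lru_head;
    if (lru_head) lru_head->prev = it; else lru_tail = it;
    lru_head = it;
}

/* A GET "touches" the item: unlink from its position, relink at head. */
static void lru_touch(lru_item *it) {
    lru_unlink(it);
    lru_push_head(it);
}
```

Because every GET mutates both neighbors' pointers, concurrent touches can corrupt the list, which is exactly why the baseline serializes them under the global lock.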
11. Why the global lock
• Linked lists are used in both the hash table & the LRU
• Corruption can occur if the lock is removed
− Example: two nearby items being removed concurrently
− Higher chance of corruption in the LRU because of its doubly linked list
12. Optimization goals, design considerations
• Goals
− Must scale well with larger core counts
− Hash distribution should have little effect on perf
− Same performance accessing 1 unique key or 100k unique keys
− Changes to LRU must maintain/increase hit rates
− ~90% with test data set
• Implementation considerations
− Any lock removal or reduction should be safe
− No additional data should be used for cache items
− Millions to billions of cache items in a fully populated instance
− Even a single additional 64-bit field per item would reduce usable memory considerably, leading to a reduced hit rate
− Focus on GETs for best performance
− Most memcached instances are read dominated
− New design should account for this and optimize for read traffic
− Transaction ordering not guaranteed – just like the original implementation
13. Optimized transaction flow
• Original: global lock serializes hash table and LRU operations
• Optimized: non-blocking GETs using a “Bag” LRU scheme; better parallelization for SET/DELETE with striped locks
14. SET/DEL optimization - parallel hash table
• Uses striped locks instead of a global lock
− Fine grain collection of locks instead of a single global lock
• Makes use of a fixed-size, shared collection of locks for
the entire hash table
− Allows for a highly scalable hash table solution
− Fixed overhead
• Number of locks is a power of 2 so the lock can be determined quickly
− Bucket index is bitwise ANDed with (number of locks - 1) to select the lock
• Not used for GETs
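A minimal sketch of the striped-lock idea, assuming pthread mutexes and an illustrative lock count of 16:

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical striped-lock table: a fixed, power-of-two number of
   mutexes shared across all hash buckets. */
#define NLOCKS 16   /* power of two, so "& (NLOCKS - 1)" equals "% NLOCKS" */
static pthread_mutex_t stripes[NLOCKS];

static void stripes_init(void) {
    for (int i = 0; i < NLOCKS; i++)
        pthread_mutex_init(&stripes[i], NULL);
}

/* Map a bucket index to its lock with a single AND - no division. */
static uint32_t stripe_index(uint32_t bucket) {
    return bucket & (NLOCKS - 1);
}

/* A SET/DELETE on a bucket would bracket its chain update like this: */
static void locked_update(uint32_t bucket /*, ... item arguments ... */) {
    pthread_mutex_lock(&stripes[stripe_index(bucket)]);
    /* ... modify the bucket's hash chain ... */
    pthread_mutex_unlock(&stripes[stripe_index(bucket)]);
}
```

Two SETs contend only if their buckets map to the same stripe, so writers to different stripes proceed in parallel while the lock table's memory footprint stays constant.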
15. SET/DEL optimization - parallel hash table (contd.)
• Each lock services a fixed set of buckets
• The number of locks is chosen to balance parallelism against lock maintenance overhead
• Multiple buckets can be manipulated in parallel
16. GET optimization – removing the global lock
• No global lock during hash table processing for GET
• With no global lock, two situations must be handled
− Expansion of hash table during a GET
− Hash table expands if there are a lot more items than buckets can handle
− SET/DEL of an item during a GET
• Handling hash table expansion during GET
− If expanding then wait for it to finish before looking up hash chain
− If not expanding then find data in hash chain and return it
• Handling SET/DEL during a GET
− If hash table expanding, wait to finish before modifying hash chain
− Modify pointers in the right order using atomic operations to ensure correct hash chain traversal for GETs
• A GET may still happen while the item is being modified (SET/DEL/REPLACE)
− Is that a problem?
− No, as long as traversal is correct, because operation order is not guaranteed anyway
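The pointer-ordering rule can be illustrated with C11 atomics. This is a simplified sketch of head insertion only (the node layout is hypothetical): the new node is fully linked before a compare-and-swap publishes it, so a concurrent lockless GET sees either the old chain or a complete new one, never a half-linked node.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

typedef struct node {
    struct node *_Atomic next;
    const char *key;
} node;

static node *_Atomic chain_head;

/* SET side: link the new node before publishing it. */
static void chain_insert(node *n) {
    node *old = atomic_load(&chain_head);
    do {
        atomic_store(&n->next, old);   /* new node points at current chain */
    } while (!atomic_compare_exchange_weak(&chain_head, &old, n));
    /* The CAS is the publication point; readers never see a gap. */
}

/* GET side: lockless traversal of the chain. */
static node *chain_find(const char *key) {
    for (node *it = atomic_load(&chain_head); it != NULL;
         it = atomic_load(&it->next))
        if (strcmp(it->key, key) == 0)
            return it;
    return NULL;
}
```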
17. GET optimization – Parallel Bag LRU
• Replaces the original doubly linked list LRU
• Basic concept is to group items with similar time stamps into “bags”
− As before, no ordering is guaranteed
• Has all the functionality of the original LRU
• Re-uses original item data structure – no additions
• SET to a bag uses atomic Compare and Swap operation
• GET from a bag is lockless
• DEL requests do nothing to the Bag LRU
• LRU cleanup is delegated to a “cleaner thread”
− Acts like “garbage collection/cleanup”
− Evicts expired items quickly
− Handles item cleanup from deletes
− Reorders cache items based on update time
− Adds additional Bags as needed
18. Parallel Bag LRU details – Bag Array
(Diagrams: original LRU list vs. Bag LRU bag array)
• A list of bags in chronological order
• Bags have list of items
• Newest bag has recently allocated or accessed items
• Alternate bag used by cleaner thread to avoid lock contention on inserts to newest bag
• Bag head has pointers to oldest and newest bags for quick access
19. Parallel Bag LRU details – Bags
• Each bag has a singly linked list of cache items
• SET causes new item to be inserted into “newest bag”
• GET updates timestamp & pointer to point to “newest bag”
• Evictions handled by cleaner thread
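A hedged sketch of these two paths, with hypothetical field and function names: SET pushes onto the newest bag with a CAS loop, while GET only writes a timestamp and a bag pointer, leaving the physical move to the cleaner thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <time.h>

struct bag;

/* Hypothetical fields reused by the Bag LRU: a chain pointer,
   the owning bag, and a last-access timestamp. */
typedef struct bag_item {
    struct bag_item *_Atomic bag_next;
    struct bag *_Atomic my_bag;
    time_t atime;
} bag_item;

typedef struct bag {
    bag_item *_Atomic head;     /* new insertions pushed here with CAS */
} bag;

static bag *_Atomic newest_bag;

/* SET: push the item onto the newest bag with a CAS loop - no lock. */
static void bag_insert(bag_item *it) {
    bag *nb = atomic_load(&newest_bag);
    bag_item *old = atomic_load(&nb->head);
    do {
        atomic_store(&it->bag_next, old);
    } while (!atomic_compare_exchange_weak(&nb->head, &old, it));
    atomic_store(&it->my_bag, nb);
}

/* GET: lockless "touch" - just update the timestamp and bag pointer;
   the cleaner thread physically moves the item later. */
static void bag_touch(bag_item *it) {
    it->atime = time(NULL);
    atomic_store(&it->my_bag, atomic_load(&newest_bag));
}
```

Because a GET writes only to its own item, concurrent GETs never contend, which is what makes the read path lock-free.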
20. Parallel Bag LRU – Cleaner Thread
• Periodically does housekeeping on the Bag LRU
− Currently every 5 secs
• Starts cleaning from the oldest bag’s oldest item
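One cleaning pass might look like the following sketch (types and names are illustrative; only expired-item eviction is shown, not delete handling, reordering, or bag allocation):

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical minimal shapes for the sketch. */
typedef struct citem {
    struct citem *next;
    time_t exptime;             /* absolute expiry; 0 = never expires */
} citem;

typedef struct cbag {
    struct cbag *newer;         /* bags form a chronological list */
    citem *head;
} cbag;

/* One pass over a bag: unlink items whose expiry time has passed.
   Returns the number of items evicted. */
static int clean_bag(cbag *b, time_t now) {
    int evicted = 0;
    citem **pp = &b->head;
    while (*pp) {
        if ((*pp)->exptime != 0 && (*pp)->exptime <= now) {
            *pp = (*pp)->next;  /* unlink the expired item */
            evicted++;
        } else {
            pp = &(*pp)->next;
        }
    }
    return evicted;
}

/* The cleaner thread would run such passes periodically, e.g.:
   for (;;) { clean_bag(oldest_bag, time(NULL)); sleep(5); } */
```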
21. Optimizations - Misc
• Used thread affinity to bind 1 memcached thread per core
• Configured NIC driver to evenly distribute incoming packets over CPUs
− 1 NIC queue per logical CPU, affinitized to a logical CPU
• Irqbalance, iptables services turned off
22. Optimized performance & core scaling
• Intel® Xeon® E5-2660 2.2 GHz, 10GB NIC, 64 GB memory
• Intel® Turbo Boost Technology ON, Intel® Hyper-Threading Technology OFF
Linear scaling with optimizations
23. Server capacity
• Intel® Xeon® E5-2660 2.2 GHz, 10GB NIC, 64 GB memory
Overall 900% gains vs. baseline
Turbo and HT boost performance by 31%
Configuration: Intel® Turbo Boost Technology OFF/ON, Intel® Hyper-Threading Technology OFF/ON
24. Efficiency and hit rate
• Hit rate measured with a synthetic benchmark increased slightly
− At ~90% - similar to that of original version
• Efficiency (transactions per watt) increased by 3.4X
− Mostly due to much higher RPS for little increase in power
− Power draw would be less in a production environment
• Intel® Xeon® E5-2660 2.2 GHz, 10GB NIC, 64 GB memory
Configuration: Intel® Turbo Boost Technology ON, Intel® Hyper-Threading Technology ON
25. Summary
• Base core/thread scalability hampered by locks
− No throughput scaling beyond 3 cores, degradation beyond 4
• Lockless “GETs” with Bag LRU improves scalability
− Linear up to the 16 cores measured
− No increase in average latency
− No loss in hit rate (~90%)
− Same performance for random and hot/repeated keys
• Striped locks parallelize hash table access for SET/DEL
• Bag LRU source code available on GitHub
− https://github.com/rajiv-kapoor/memcached/tree/bagLRU
28. Intel's compilers may or may not optimize to the same degree for non-Intel
microprocessors for optimizations that are not unique to Intel microprocessors.
These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other
optimizations. Intel does not guarantee the availability, functionality, or
effectiveness of any optimization on microprocessors not manufactured by Intel.
Microprocessor-dependent optimizations in this product are intended for use with
Intel microprocessors. Certain optimizations not specific to Intel
microarchitecture are reserved for Intel microprocessors. Please refer to the
applicable product User and Reference Guides for more information regarding the
specific instruction sets covered by this notice.
Notice revision #20110804
29. Legal Disclaimer
• Built-In Security: No computer system can provide absolute security under all conditions. Built-in security features
available on select Intel® Core™ processors may require additional software, hardware, services and/or an Internet
connection. Results may vary depending upon configuration. Consult your PC manufacturer for more details.
• Enhanced Intel SpeedStep® Technology - See the Processor Spec Finder at http://ark.intel.com or contact your Intel
representative for more information.
• Intel® Hyper-Threading Technology (Intel® HT Technology) is available on select Intel® Core™ processors. Requires
an Intel® HT Technology-enabled system. Consult your PC manufacturer. Performance will vary depending on the
specific hardware and software used. For more information including details on which processors support Intel HT
Technology, visit http://www.intel.com/info/hyperthreading.
• Intel® 64 architecture requires a system with a 64-bit enabled processor, chipset, BIOS and software. Performance
will vary depending on the specific hardware and software you use. Consult your PC manufacturer for more
information. For more information, visit http://www.intel.com/info/em64t
• Intel® Turbo Boost Technology requires a system with Intel Turbo Boost Technology. Intel Turbo Boost Technology
and Intel Turbo Boost Technology 2.0 are only available on select Intel® processors. Consult your PC manufacturer.
Performance varies depending on hardware, software, and system configuration. For more information, visit
http://www.intel.com/go/turbo
• Other Software Code Disclaimer
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
documentation files (the "Software"), to deal in the Software without restriction, including without limitation the
rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit
persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice (including the next paragraph) shall be included in all copies
or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT
NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
30. Risk Factors
The above statements and any others in this document that refer to plans and expectations for the second quarter, the year and the
future are forward-looking statements that involve a number of risks and uncertainties. Words such as “anticipates,” “expects,”
“intends,” “plans,” “believes,” “seeks,” “estimates,” “may,” “will,” “should” and their variations identify forward-looking statements.
Statements that refer to or are based on projections, uncertain events or assumptions also identify forward-looking statements.
Many factors could affect Intel’s actual results, and variances from Intel’s current expectations regarding such factors could cause
actual results to differ materially from those expressed in these forward-looking statements. Intel presently considers the following
to be the important factors that could cause actual results to differ materially from the company’s expectations. Demand could be
different from Intel's expectations due to factors including changes in business and economic conditions, including supply constraints
and other disruptions affecting customers; customer acceptance of Intel’s and competitors’ products; changes in customer order
patterns including order cancellations; and changes in the level of inventory at customers. Uncertainty in global economic and
financial conditions poses a risk that consumers and businesses may defer purchases in response to negative financial events, which
could negatively affect product demand and other related matters. Intel operates in intensely competitive industries that are
characterized by a high percentage of costs that are fixed or difficult to reduce in the short term and product demand that is highly
variable and difficult to forecast. Revenue and the gross margin percentage are affected by the timing of Intel product introductions
and the demand for and market acceptance of Intel's products; actions taken by Intel's competitors, including product offerings and
introductions, marketing programs and pricing pressures and Intel’s response to such actions; and Intel’s ability to respond quickly
to technological developments and to incorporate new features into its products. Intel is in the process of transitioning to its next
generation of products on 22nm process technology, and there could be execution and timing issues associated with these changes,
including product defects and errata and lower than anticipated manufacturing yields. The gross margin percentage could vary
significantly from expectations based on capacity utilization; variations in inventory valuation, including variations related to the
timing of qualifying products for sale; changes in revenue levels; segment product mix; the timing and execution of the
manufacturing ramp and associated costs; start-up costs; excess or obsolete inventory; changes in unit costs; defects or disruptions
in the supply of materials or resources; product manufacturing quality/yields; and impairments of long-lived assets, including
manufacturing, assembly/test and intangible assets. The majority of Intel’s non-marketable equity investment portfolio balance is
concentrated in companies in the flash memory market segment, and declines in this market segment or changes in management’s
plans with respect to Intel’s investments in this market segment could result in significant impairment charges, impacting
restructuring charges as well as gains/losses on equity investments and interest and other. Intel's results could be affected by
adverse economic, social, political and physical/infrastructure conditions in countries where Intel, its customers or its suppliers
operate, including military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and
fluctuations in currency exchange rates. Expenses, particularly certain marketing and compensation expenses, as well as
restructuring and asset impairment charges, vary depending on the level of demand for Intel's products and the level of revenue and
profits. Intel’s results could be affected by the timing of closing of acquisitions and divestitures. Intel's results could be affected by
adverse effects associated with product defects and errata (deviations from published specifications), and by litigation or regulatory
matters involving intellectual property, stockholder, consumer, antitrust, disclosure and other issues, such as the litigation and
regulatory matters described in Intel's SEC reports. An unfavorable ruling could include monetary damages or an injunction
prohibiting Intel from manufacturing or selling one or more products, precluding particular business practices, impacting Intel’s
ability to design its products, or requiring other remedies such as compulsory licensing of intellectual property. A detailed discussion
of these and other factors that could affect Intel’s results is included in Intel’s SEC filings, including the company’s most recent Form
10-Q, Form 10-K and earnings release.
Rev. 5/4/12
31. Summary
Memcached is a popular key-value caching service used by web
service delivery companies to reduce the latency of serving data
to consumers and reduce load on back-end database servers.
It has a scale-out architecture that easily supports increasing
throughput by simply adding more memcached servers, but at
the individual server level scaling up to higher core counts is
less rewarding. In this talk we introduce optimizations that
break through such scalability barriers and allow all cores in a
server to be used effectively. We explain new algorithms
implemented to achieve an almost 6x increase in throughput
while maintaining a 1ms average latency SLA by utilizing
concurrent data structures, a new cache replacement policy and
network optimizations.