Memory, Big Data, NoSQL and Virtualization
Plan
• Storage hierarchy
• CPU architecture
• The TLB
• The Huge Pages
• The Transparent Huge Pages
• VT-x (virtualization impact on memory access,
Couchbase benchmark, sysbench)
• The QPI Link (Impala benchmark)
• Hyperthreading (HPL/Linpack and HPCG)
• Containers vs VMs (Docker)
Why should we care?
• “Memory is the new disk!“
• “Disk is the new tape!”
• “Tape is …”
• Is it really that easy?
latency (nanoseconds) vs. scaled to "human time"
(ns | scaled so that 1 CPU cycle = 1 s)
1 cpu cycle 0.3 1s
L1 cache hit 0.9 3s
L2 2.8 9s
L3 12.9 43s
LMA 60 3m
RMA 120 7m
TLB Cache miss 240 13m
SSD disk IO 100,000 4d
Rotational disk IO 10,000,000 1y
Internet San Francisco to United Kingdom 81,000,000 8y
Storage hierarchies - It used to be like this:
[Bar chart: latency (nanoseconds), y-axis 0 to 10,000,000, for 1 CPU cycle, L1 cache hit, L2, L3, LMA, RMA, TLB cache miss and disk I/O; the SSD (100,000 ns) and rotational disk (10,000,000 ns) bars dwarf everything else.]
Storage hierarchies - Now it’s more like this:
[Bar chart: latency (nanoseconds), y-axis 0 to 300, for 1 CPU cycle (0.3), L1 cache hit (0.9), L2 (2.8), L3 (12.9), LMA (60), RMA (120) and TLB cache miss (240); with disks out of the picture, the memory-level differences dominate.]
CPU architecture - It used to be like this:
• Single core
• Linear memory access times
• Simple cache hierarchy
• Very small memory capacities
[Diagram: a single CPU with an L1 cache and a memory controller attached to one bank of memory.]
CPU architecture - Now it’s more like this:
• Multiple cores
• Multiple memory controllers
• QPI links
• More complex cache hierarchies
[Diagram: two sockets, A and B; each socket holds six cores with private L1 and L2 caches and a shared L3, plus its own memory controller and local memory; the two sockets are connected by a QPI link.]
Implications
• Algorithms no longer have to trade off computational efficiency for memory efficiency.
• Algorithms need to be parallel by design.
• The QPI link becomes an issue (with LMA = 1/2 RMA).
• TLB cache misses become an issue.
• Memory frequency and DIMM placement become an issue.
The cache hierarchies
[Diagram: two CPU sockets, each with cores C1 to Cn that have 64 KB L1 (0.9 ns) and 256 KB L2 (2.8 ns) caches, a shared 20 MB L3 (12.9 ns), registers/buffers at 0.3 ns (1 cycle), a 4-channel memory controller driving 4 x 16 GB DRAM at 60 ns, and a 40-lane PCIe controller; the sockets are connected by a QPI link (60 ns).]
QPI Link implications
• LMA = 1/2 the latency of RMA.
• Every request to 'remote' memory has to traverse the QPI link.
• For many applications, dual-CPU machines are worse than single-socket machines.
• Solutions: CPU affinity settings with Docker, numactl, numad, libnuma, numatop, PontusVision (see the libnuma sketch below).
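A minimal libnuma sketch of the last point, assuming a two-socket box where node 0 is the node we want to stay on (compile with -lnuma). It keeps both the thread and its working set on one NUMA node, so accesses stay local (LMA) instead of crossing the QPI link (RMA):

#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int node = 0;               /* assumption: node 0 is the one we want */
    size_t size = 1UL << 30;    /* 1 GiB working set, arbitrary for illustration */

    /* Run the calling thread only on CPUs that belong to the chosen node ... */
    numa_run_on_node(node);

    /* ... and back its memory with pages from the same node, so every access
       is local (~60 ns) rather than remote over QPI (~120 ns). */
    char *buf = numa_alloc_onnode(size, node);
    if (buf == NULL) {
        perror("numa_alloc_onnode");
        return 1;
    }
    memset(buf, 0, size);       /* touch the pages so they really get allocated locally */

    numa_free(buf, size);
    return 0;
}

The same effect can be imposed from outside the program with numactl, e.g. numactl --cpunodebind=0 --membind=0 <command>.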
[Bar chart: Impala score* for four configurations (1x E5-2430 with 32 GB RAM, 2x E5-2430 with 32 GB RAM, 1x E5-2690 with 128 GB RAM, 2x E5-2690 with 128 GB RAM); scores range from 5.41 to 7.7.]
Source: Bigstep & Cloudera benchmark done in 2014
What happens when a program tries to access a memory cell?
TLB Operation
[Diagram: the virtual address is split into a page number and an offset. The page number is looked up in the TLB: on a TLB hit the translation is used directly, on a TLB miss the page table in main memory must be consulted. The resulting physical address (tag + remainder) is then presented to the cache: a cache hit returns the value, a cache miss goes to main memory.]
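To make the cost concrete, a rough micro-benchmark sketch (not from the slides): it touches one byte per 4 KiB page across a 1 GiB buffer, so once the working set exceeds TLB reach almost every access needs a fresh translation and a page-table walk. The buffer size and repeat count are arbitrary choices for illustration:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define PAGE 4096UL

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    size_t pages = 1UL << 18;                 /* 256 Ki pages * 4 KiB = 1 GiB */
    volatile char *buf = malloc(pages * PAGE);
    if (!buf) return 1;

    /* Touch every page once so the whole buffer is mapped before measuring. */
    for (size_t p = 0; p < pages; p++) buf[p * PAGE] = 1;

    double t0 = seconds();
    long sum = 0;
    for (int rep = 0; rep < 10; rep++)
        for (size_t p = 0; p < pages; p++)
            sum += buf[p * PAGE];             /* one access per page: TLB-bound */
    double t1 = seconds();

    printf("%.1f ns per page-strided access (sum=%ld)\n",
           (t1 - t0) * 1e9 / (10.0 * pages), sum);
    return 0;
}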
How often does a TLB Miss occur?
Source: "Memory System Characterization of Big Data Workloads" by Martin Dimitrov et al., Intel Corp. (2013)
[Bar chart: instruction-TLB and data-TLB misses per thousand instructions for Hive aggregation (c), Hive join (c), NoSQL, Index, Sort (nc) and WC (nc) workloads; values range from about 0.1 up to 1.8 misses per thousand instructions.]
c: compressed data
nc: uncompressed data
The TLB and virtualization
• Impact: on big data workloads a TLB miss occurs about once or twice per 1,000 instructions (roughly every 1 µs).
• One TLB miss on bare metal = twice the DRAM latency.
• One TLB miss in a VM (with VT-x) = up to 12 times the DRAM latency.
• Solutions: use huge pages, avoid virtualization, disable transparent huge pages (a hedged huge-page sketch follows below).
"THP is not recommended for database workloads." (source: Red Hat performance tuning guide)
"[…] the TLB miss latency when using hardware assistance is significantly higher." (source: Methodology for Performance Analysis of VMware vSphere under Tier-1 Applications)
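As an illustration of the "use huge pages, disable THP" advice, a minimal Linux-only sketch: it tries to map a buffer with explicit 2 MiB pages (MAP_HUGETLB, which needs huge pages pre-reserved via vm.nr_hugepages) and, if that fails, falls back to 4 KiB pages while opting that range out of THP with madvise(MADV_NOHUGEPAGE). The 64 MiB size and the fallback policy are assumptions for illustration, not something the slides prescribe:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

int main(void) {
    size_t len = 64UL << 20;   /* 64 MiB, a multiple of the 2 MiB huge page size */

    /* Preferred path: back the region with explicit 2 MiB pages,
       i.e. 512x fewer TLB entries than 4 KiB pages. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

    if (p == MAP_FAILED) {
        /* Fallback: normal 4 KiB pages, with THP disabled for this range. */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        madvise(p, len, MADV_NOHUGEPAGE);
    }

    /* ... use the buffer ... */
    munmap(p, len);
    return 0;
}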
TLB and virtualization
Source: internal Bigstep benchmarks done in 2014
and presented at various events
[Bar charts: sysbench multi-threading performance, native vs. virtual (1 s vs. 5 s total time), and sysbench memory 1 TB read (1 M block size) plus write total time, native vs. virtual (25 s vs. 32 s).]
TLB and virtualization
[Bar chart: average requests/second with 16-byte and 512-byte records, Bigstep (bare metal) vs. AWS (VM based); reported values: 53,200, 68,840, 168,662 and 179,366 requests/s.]
• 2 x FMCI 4.16 (4 cores, 8 with HT, 16 GB RAM, CentOS 6.5)
• 2 x m3.2xlarge instances (8 cores, 30 GB RAM, RHEL 6.5)
• Note: AWS appears here only because it uses virtualisation; the same applies to any VM-based host.
Source: Bigstep benchmarks done in 2014 and presented at Couchbase Live and HUG London
A word on Intel’s Hyper-Threading
• Hyper-threading is a way of executing two instruction streams on the same core at the same time: while one thread waits for memory, the other can use the core's idle execution units.
• Is this twice the performance? In practice it is about the same, and sometimes worse.
• The caches are shared between HT 'cores'.
• Clouds sell a 'virtual core' that is actually a hyper-threaded core, i.e. roughly half of a real core's performance (see the sysfs sketch below for finding HT siblings).
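A small sketch (not from the slides) for telling HT siblings apart, so latency-critical threads can be pinned one per physical core; it reads the standard Linux sysfs topology files, and the upper bound of 64 logical CPUs is just an assumption:

#include <stdio.h>

int main(void) {
    char path[128], line[64];
    for (int cpu = 0; cpu < 64; cpu++) {
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list", cpu);
        FILE *f = fopen(path, "r");
        if (!f) break;                               /* no such CPU: stop */
        if (fgets(line, sizeof line, f))
            printf("cpu%d siblings: %s", cpu, line); /* e.g. "0,16" when HT is on */
        fclose(f);
    }
    return 0;
}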
Containers vs VMs
[Diagram: containers run guest processes on an isolation-enforcing layer directly on top of the host OS (Linux) and hardware; VMs run guest processes inside full guest OSes on top of a virtualization layer, the host OS and the hardware.]
Containers give you:
• Native-like cache efficiency
• No TLB miss amplification
• NUMA node affinity control (see the sched_setaffinity sketch below)
• Native performance
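A minimal sketch of the NUMA/CPU affinity point: a process inside a container can pin itself with sched_setaffinity() exactly as it would on bare metal; the assumption that CPUs 0-7 belong to socket A is illustrative only. Docker can impose the same constraint from outside with --cpuset-cpus and --cpuset-mems:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 8; cpu++)   /* CPUs 0-7: assumed to live on socket A */
        CPU_SET(cpu, &set);

    if (sched_setaffinity(0, sizeof set, &set) != 0) {  /* 0 = calling process */
        perror("sched_setaffinity");
        return 1;
    }
    /* With first-touch allocation, memory will now tend to come from the local node. */
    printf("pinned to CPUs 0-7\n");
    return 0;
}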
Containers vs VMs - isolation
Stress test        LXC           Xen
CPU stress         0             0
Memory             88.2%         0.9%
Disk stress        9%            0
Fork bomb          did not run   0
Network receiver   2.2%          0.9%
Network sender     10.3%         0.3%
Source: "Performance Evaluation of Container-based Virtualization for High Performance Computing Environments", Miguel G. et al., PUCRS 2014
The figures show how much application performance degrades when a different stress test runs in another VM/container on the same host.
Containers vs Native
[Bar chart: average response time (µs), smaller is better, for INSERT, SELECT and UPDATE; 1 node native vs. 1 node running 1 Docker container. Values lie between 10 and 21 µs, with the two setups within 1 to 2 µs of each other.]
Source: Bigstep's Cassandra benchmark presented at C* Summit London 2014
Network performance
• Network performance depends heavily on memory access speeds and offloading capabilities.
• If memory access is delayed, so is every network packet that passes through the virtual stack.
• In virtual hosts, switching is done in software and therefore inherits all of these issues.
• TOE and RDMA support are available in some clouds (including Bigstep).
Source: "Performance Evaluation of Container-based Virtualization for High Performance Computing Environments", Miguel G. et al., PUCRS 2014
Bare metal = no cloud goodies?
A new breed of "bare metal" clouds is emerging. Bigstep is one of them:
• Pay per use (actually per second)
• Single-tenant bare metal
• Brilliant performance
• Provisioning times of 2-3 minutes (the time it takes a server to boot)
• Stop and resume support
• Snapshot and rollback support
• Upgrades and downgrades with a reboot
• Low-latency bare-metal network
• UI with drag and drop
Key take-aways for Big Data workloads
• Start thinking in terms of memory and CPU architecture when sizing, operating and developing high-memory-footprint applications.
• Memory access times are the new performance metric; look for them.
• Avoid virtualization whenever possible.
• Check out the new "bare metal" cloud providers.
• Use Docker if you need consolidation ratios and better isolation.
• Use numatop to check RMA-to-LMA ratios; run numad the way you run irqbalance; control placement manually with numactl if required.
• Always use huge pages; disable THP for databases.