What to Expect from the Session
Understanding the factors that go into choosing an EC2 instance
Defining system performance and how it is characterized for different workloads
How Amazon EC2 instances deliver performance while providing flexibility and agility
How to make the most of your EC2 instance experience through the lens of several instance types
What’s a Virtual CPU? (vCPU)
A vCPU is typically a hyper-threaded physical core*
On Linux, “A” threads enumerated before “B” threads
On Windows, threads are interleaved
Divide vCPU count by 2 to get core count
Cores by EC2 & RDS DB Instance type:
https://aws.amazon.com/ec2/virtualcores/
* The “t” family is special
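A quick way to check from a shell (a sketch; assumes Linux with the util-linux lscpu tool):
$ nproc                                            # vCPU (thread) count
$ lscpu -p=CORE | grep -v '^#' | sort -u | wc -l   # physical core count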
Disable Hyper-Threading If You Need To
Useful for FPU heavy applications
Use ‘lscpu’ to validate layout
Hot offline the “B” threads
# as root: hot-offline the "B" threads (vCPUs 64-127 here)
for i in $(seq 64 127); do
  echo 0 > /sys/devices/system/cpu/cpu${i}/online
done
Set grub to initialize only the first half of all threads (the "A" threads):
maxcpus=64
[ec2-user@ip-172-31-7-218 ~]$ lscpu
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 4
NUMA node(s): 4
Model name: Intel(R) Xeon(R) CPU
Hypervisor vendor: Xen
Virtualization type: full
NUMA node0 CPU(s): 0-15,64-79
NUMA node1 CPU(s): 16-31,80-95
NUMA node2 CPU(s): 32-47,96-111
NUMA node3 CPU(s): 48-63,112-127
Resource Allocation
All resources assigned to you are dedicated to your instance with no overcommitment*
All vCPUs are dedicated to you
Memory allocated is assigned only to your instance
Network resources are partitioned to avoid “noisy neighbors”
Curious about the number of instances per host? Use “Dedicated Hosts” as a guide.
*Again, the “T” family is special
“Launching new instances and running tests in parallel is easy… [when choosing an instance] there is no substitute for measuring the performance of your full application.”
- EC2 documentation
Timekeeping Explained
Timekeeping in an instance is deceptively hard
gettimeofday(), clock_gettime(), QueryPerformanceCounter()
The TSC
CPU counter, accessible from userspace
Requires calibration, vDSO
Invariant on Sandy Bridge+ processors
Xen pvclock; does not support vDSO
On current generation instances, use TSC as clocksource
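To inspect and switch the clocksource on a running Linux instance (a sketch; these sysfs paths are standard, and no reboot is required):
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
$ sudo sh -c 'echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource'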
Benchmarking - Time Intensive Application
#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <unistd.h>
int main()
{
    time_t start, end;
    time(&start);
    for (int x = 0; x < 100000000; x++) {
        /* a little floating-point math... */
        float f = 123456789.0f;
        float g = 123456789.0f;
        float h = f * g;
        (void)h; /* keep the result "used" */
        /* ...and a gettimeofday() call on every iteration */
        struct timeval tv;
        gettimeofday(&tv, NULL);
    }
    time(&end);
    double dif = difftime(end, start);
    printf("Elapsed time is %.2lf seconds.\n", dif);
    return 0;
}
P-state and C-state control
c4.8xlarge, d2.8xlarge, m4.10xlarge, m4.16xlarge, p2.16xlarge, x1.16xlarge, x1.32xlarge
By entering deeper idle states, non-idle cores can achieve up to 300MHz higher clock frequencies
But… deeper idle states require more time to exit, may not be appropriate for latency-sensitive workloads
Limit C-state by adding “intel_idle.max_cstate=1” to grub
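One way to apply it (a sketch; assumes a GRUB2 layout as on Amazon Linux or RHEL7):
$ sudo sed -i 's/^GRUB_CMDLINE_LINUX="/&intel_idle.max_cstate=1 /' /etc/default/grub
$ sudo grub2-mkconfig -o /boot/grub2/grub.cfg   # then reboot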
Tip: P-state control for AVX2
If an application makes heavy use of AVX2 on all cores, the processor may attempt to draw more power than it should
Processor will transparently reduce frequency
Frequent changes of CPU frequency can slow an application
sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"
See also: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/processor_state_control.html
Review: T2 Instances
Lowest cost EC2 instance at $0.0065 per hour
Burstable performance
Fixed allocation enforced with CPU credits
Model      vCPU  Baseline  CPU Credits/Hour  Memory (GiB)  Storage
t2.nano    1     5%        3                 0.5           EBS Only
t2.micro   1     10%       6                 1             EBS Only
t2.small   1     20%       12                2             EBS Only
t2.medium  2     40%**     24                4             EBS Only
t2.large   2     60%**     36                8             EBS Only
General purpose, web serving, developer environments, small databases
How Credits Work
A CPU credit provides the performance of a full CPU core for one minute
An instance earns CPU credits at a steady rate
An instance consumes credits when active
Credits expire (leak) after 24 hours
(Chart: the credit balance fills at the baseline rate while idle and drains at the burst rate when active)
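To watch the credit metrics from the CLI (a sketch; the instance ID and time window are placeholders, and CPUCreditUsage can be swapped in for the metric name):
$ aws cloudwatch get-metric-statistics --namespace AWS/EC2 \
    --metric-name CPUCreditBalance \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --start-time 2016-11-01T00:00:00Z --end-time 2016-11-02T00:00:00Z \
    --period 300 --statistics Average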
Review: X1 Instances
Largest memory instance with 2 TB of DRAM
Quad socket, Intel E7 processors with 128 vCPUs
Model        vCPU  Memory (GiB)  Local Storage   Network
x1.16xlarge  64    976           1x 1920GB SSD   10Gbps
x1.32xlarge  128   1952          2x 1920GB SSD   20Gbps
In-memory databases, big data processing, HPC workloads
NUMA
Non-uniform memory access
Each processor in a multi-CPU system has local memory that is accessible through a fast interconnect
Each processor can also access memory from other CPUs, but local memory access is a lot faster than remote memory
Performance is related to the number of CPU sockets and how they are connected - Intel QuickPath Interconnect (QPI)
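To see the NUMA layout an instance presents (a sketch; numactl and numastat ship with most distributions):
$ numactl --hardware   # NUMA nodes, their CPUs, and per-node memory
$ numastat             # per-node allocation hit/miss counters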
Tip: Kernel Support for NUMA Balancing
An application will perform best when the threads of its processes are accessing memory on the same NUMA node.
NUMA balancing moves tasks closer to the memory they are accessing.
This is all done automatically by the Linux kernel when automatic NUMA balancing is active: version 3.8+ of the Linux kernel.
Windows support for NUMA first appeared in the Enterprise and Data Center SKUs of Windows Server 2003.
Set “numa=off” or use numactl to reduce NUMA paging if your application uses more memory than will fit on a single socket or has threads that move between sockets (see the sketch below)
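A sketch of both options on Linux (the application path is a placeholder):
$ cat /proc/sys/kernel/numa_balancing      # 1 = automatic balancing active (3.8+)
$ sudo sysctl -w kernel.numa_balancing=0   # turn it off at runtime
$ numactl --cpunodebind=0 --membind=0 /path/to/application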
Operating Systems Impact Performance
Memory intensive web application
Created many threads
Rapidly allocated/deallocated memory
Compared performance of RHEL6 vs RHEL7
Noticed a high amount of “system” time in top
Found a benchmark tool (ebizzy) with a similar performance profile
Traced its performance with “perf”
On RHEL6
[ec2-user@ip-172-31-12-150-RHEL6 ebizzy-0.3]$ sudo perf stat ./ebizzy -S 10
12,409 records/s
real 10.00 s
user 7.37 s
sys 341.22 s
Performance counter stats for './ebizzy -S 10':
361458.371052 task-clock (msec) # 35.880 CPUs utilized
10,343 context-switches # 0.029 K/sec
2,582 cpu-migrations # 0.007 K/sec
1,418,204 page-faults # 0.004 M/sec
10.074085097 seconds time elapsed
On RHEL7
[ec2-user@ip-172-31-7-22-RHEL7 ~]$ sudo perf stat ./ebizzy-0.3/ebizzy -S 10
425,143 records/s
real 10.00 s
user 397.28 s
sys 0.18 s
Performance counter stats for './ebizzy-0.3/ebizzy -S 10':
397515.862535 task-clock (msec) # 39.681 CPUs utilized
25,256 context-switches # 0.064 K/sec
2,201 cpu-migrations # 0.006 K/sec
14,109 page-faults # 0.035 K/sec
10.017856000 seconds time elapsed
Up from 12,409 records/s!
Page faults down from 1,418,204!
Hugepages
Disable Transparent Hugepages
# echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
# echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
Use Explicit Huge Pages
$ sudo mkdir /dev/hugetlbfs
$ sudo mount -t hugetlbfs none /dev/hugetlbfs
$ sudo sysctl -w vm.nr_hugepages=10000
$ HUGETLB_MORECORE=yes LD_PRELOAD=libhugetlbfs.so numactl --cpunodebind=0 \
    --membind=0 /path/to/application
See also: https://lwn.net/Articles/375096/
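To confirm the pool was actually reserved (a quick check; assumes the default 2 MiB huge page size):
$ grep HugePages /proc/meminfo   # HugePages_Total should report 10000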
Split Driver Model
(Diagram: an application in a guest domain makes I/O calls through a frontend driver; a backend driver in the driver domain hands them to the real device driver and the physical hardware - CPU, memory, storage device - while the VMM provides virtual CPUs, virtual memory, and CPU scheduling; numbered steps 1-5 trace the path.)
Granting in pre-3.8.0 Kernels
Requires “grant mapping” prior to 3.8.0
Grant mappings are expensive operations due to TLB flushes
Inter-domain I/O (diagram: an instance’s read(fd, buffer,…) reaches the SSD via the I/O domain):
(1) Grant memory
(2) Write to ring buffer
(3) Signal event
(4) Read ring buffer
(5) Map grants
(6) Read or write grants
(7) Unmap grants
Granting in 3.8.0+ Kernels, Persistent and Indirect
Grant mappings are set up in a pool one time
Data is copied in and out of the grant pool
(Diagram: the instance’s read(fd, buffer…) is served by copying data to and from a persistent grant pool shared with the I/O domain’s SSD.)
2009 – Longer ago than you think
Avatar was the top movie in the theaters
Facebook overtook MySpace in active users
President Obama was sworn into office
The 2.6.32 Linux kernel was released
Tip: Use 3.10+ kernel
Amazon Linux AMI 2013.09 or later
Ubuntu 14.04 or later
RHEL/CentOS 7 or later
Etc.
Device Pass Through: Enhanced Networking
SR-IOV eliminates need for driver domain
Physical network device exposes virtual function to instance
Requires a specialized driver, which means:
Your instance OS needs to know about it
EC2 needs to be told your instance can use it
After Enhanced Networking
(Diagram: the guest’s NIC driver talks directly to a virtual function of the SR-IOV network device, bypassing the driver domain; the VMM still provides virtual CPUs, virtual memory, and CPU scheduling; numbered steps 1-3 trace the shortened path.)
Elastic Network Adapter
Next Generation of Enhanced Networking
Hardware Checksums
Multi-Queue Support
Receive Side Steering
20Gbps in a Placement Group
New Open Source Amazon Network Driver
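To confirm an instance is using ENA (a sketch; the interface name is an assumption):
$ ethtool -i eth0   # reports "driver: ena" when ENA is active
$ modinfo ena       # version of the installed driver module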
Network Performance
20 Gigabit & 10 Gigabit
Measured one-way, double that for bi-directional (full duplex)
High, Moderate, Low – A function of the instance size and EBS optimization
Not all created equal – Test with iperf if it’s important! (sketch below)
Use placement groups when you need high and consistent instance to instance bandwidth
All traffic limited to 5 Gb/s when exiting EC2
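A sketch of an instance-to-instance test with iperf3 (the server’s private IP and stream count are placeholders):
$ iperf3 -s                          # on the receiving instance
$ iperf3 -c 172.31.0.10 -P 8 -t 60   # on the sender: 8 parallel streams for 60s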
EBS Performance
Instance size affects throughput
Match your volume size and type to your instance
Use EBS optimization if EBS performance is important
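For instance types where it is optional, EBS optimization can be requested at launch (a sketch; the AMI ID is a placeholder):
$ aws ec2 run-instances --image-id ami-12345678 \
    --instance-type c3.xlarge --ebs-optimized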
Summary: Getting the Most Out of EC2 Instances
Choose HVM AMIs
Timekeeping: use TSC
C-state and P-state controls
Monitor T2 CPU credits
Use a modern Linux OS
NUMA balancing
Persistent grants for I/O performance
Enhanced networking
Profile your application
Virtualization Themes
Bare metal performance goal, and in many scenarios already there
History of eliminating hypervisor intermediation and driver domains
Hardware assisted virtualization
Scheduling and granting efficiencies
Device pass through
Next Steps
Visit the Amazon EC2 documentation
Launch an instance and try your app!
Presentation built to be a deep dive
Going into the depths of how EC2 works
Highlight actionable things
Get the most performance out of your instances
Talk a bit about how to choose an EC2 instance
When it comes to performance
Making sure you’re picking the right one is as important as the tuning tips that I’m also going to talk about
EC2 is a big subject
Talk about the Purchase Options
APIs & SDKs
Networking
Talk today about the instances themselves
How they operate
Features
Options when you go to launch
Other Topics
List Recommended sessions at the end
Let’s start at the basics
What is an EC2 instance?
They are virtual machines
Guests
On a Hypervisor
On physical hardware
Launched in 2006
“an instance”
Didn’t have a name
Didn’t get any choices
Like the Model T – any color as long as it’s black
Eventually gave it a name
M1 instance
Customers wanted more choice
We’ve been iterating and growing ever since.
Not only adding instances, but changing how EC2 works
Launched the cc2 in 2011
Placement groups
Bandwidth and latency
Hardware assisted virtualization
Exposes more of the underlying hardware
Lets you get even more performance
EC2 is always growing and changing based on customer feedback
Always check our documentation for the latest as you’re building out
How we do things today may be different in the future
Go over how we talk about instances and name them
Get on the same page
First letter is the family
Stands for what it’s suited for or what resources it has
C for compute
R for Ram
I for IOPS
Number is the generation
Like a version number
Last is the instance size
T-Shirt size
You’ve got a lot of choices and flexibility when you go to launch
It can seem overwhelming
Trying to pick the right instance
Looking at just the families
First find what your application is constrained by
If you need memory, start with R3
CPU, go with C4
If balanced, look at general purpose M or T
Perspective of your constraint, it’s easy to pick the right family
Test to find the right size within that family
If you need a little help, check the documentation
List of workloads for each family.
When you’re looking at instances, you’ll see something called vCPUs
On modern instances not in the T family
A hyperthreaded core
Hyperthreading is great to increase performance
It kind of lets your CPU do two things at once
Like waiting on IO
Real core count
Divide by two
Visit link, used for licensing.
To give a visual representation…
Output of lstopo on m4.10xlarge
Linux utility for enumerating hardware
Can run on any instance or physical server
Shows graphical output of hardware configuration
Sockets
Memory on each socket
L1-3 Cache
CPU thread to core mapping
Case of m4.10xlarge
40 threads
20 cores
Some applications don’t benefit from hyperthreading
Context switching may decrease performance
Typically compute heavy apps
Financial calculations & engineering simulations
These apps usually disable hyperthreading
If you’re not sure or don’t typically disable hyperthreading, don’t worry
If you do, try running with it disabled on EC2 and see if it improves performance
Easy on Linux, harder on Windows
Linux
The first set of threads on each core is listed first, and the second or B threads are listed after that
Disable the last half, which will be all the B threads
Two ways
Online
Great for no reboot
But it may cause instability
Disable processors where threads may be running
Won’t be persisted after a reboot
In grub,
Set maxcpus to match the physical core count
Safer – disabled when booting
But makes it harder when you switch sizes
Windows is harder
Interleaved
Have to use CPU affinity
Same m4.10xlarge with hyperthreading turned off
Only one CPU thread per core
Compared to the two that you saw earlier
Let’s dig into how instance sizes work
We build instances
Easy to scale vertically and horizontally
Look at the c4 family as an example
C4.8xlarge on the left
Largest instance size available
That single c4.8xlarge
Roughly equal to 2 c4.4xlarges
That c4.4xlarge has roughly half the
Number of vCPUS
Amount of ram
Available network bandwidth
Keeps following down the line
2x c4.4xlarge = 4x c4.2xlarge
And so on…
Reason is because of how we partition instances
Largest size is typically a full server
On the smaller ones you’re running a fraction of it depending on the size
Virtualization historically has a bad reputation
Usually used to manage over utilization of resources
More virtual machines than physical resources
We use virtualization for a lot of other reasons
Security & Isolation
Dedicate specific resources to specific customers
vCPUS as an example
With exception of T
When you’re assigned a vCPU
only customer using it
Not sharing with anyone else on the box
Same applies to Memory & Network
We build with the goal of providing a consistent experience
No matter what else is happening
Last thing I want to say about choosing your instance
Cheesy to quote documentation
Good sentiment
Easy to get an app up and running
Don’t run synthetic benchmarks
Install your app and send some realistic load
Examples:
Mobile App
HPC application
BI database
Use a real workload to understand how your app will behave
Digging deeper into the OS…
On all systems, time keeping is important
Used for things like
Processing interrupts
Getting the time and date
Measuring performance
Most AMIs on AWS use the Xen clock by default
Compatible with all instance types
The TSC is invariant on Sandy Bridge and later processors
Handled by bare metal
You’re talking to your processor
Not the hypervisor
And because of this, calls to it are going to be much quicker
To demonstrate this – simple application
It does two things
Performs a large number of get time of day calls
a bit of math
Don’t laugh at my code…
I’m a sysadmin, not a developer
Quick and dirty to test it out
These are results on the Xen clock source
Profiled with strace
Really great tool to use with any app, yours included
Shows the number of system calls made
& the time they took
gettimeofday takes the most time, with a lot of calls
Overall, the test took about 12 seconds to run with the Xen clock source
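A sketch of that profiling step (the source and binary names are placeholders for the benchmark shown earlier; -O0 keeps the loop from being optimized away):
$ gcc -O0 -std=c99 -o timetest timetest.c
$ strace -c ./timetest   # -c prints a per-syscall count/time summary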
On the same system
Switched clocksource to TSC
Reran the test
Results look a lot different
Gettimeofday doesn’t show up
Run time reduced to two seconds
This is extreme for a simple app
I’ve seen apps improve by as much as 40%
It’s an easy change to make on Linux
Do it while the system is running
First command shows available clock sources
Second shows the current clock source
Third would change it to TSC
On windows, it’s handled automatically
If you’re running a recently released EC2 instance
Can improve a lot of apps
JVM debugging
Performance tracing
SAP applications
Recent change to the platform
Added P- and C-state control with C4, now available on many more
First, let’s talk about C states
C states control the power savings features of a processor
Using c4.8xlarge as an example
Base clock speed of 2.9Ghz
Can turbo up to 3.5Ghz on one or two cores
Must let other cores idle down
Great when you need a few cores to have high frequencies
Letting them idle down
increase the time it takes for them to respond when you want to actually use them
So if you have an application where latency is important
You can limit how deep they’ll sleep
Setting cstate parameter in grub
You can use P state to set the desired running frequency of the cores
Some customers and some workloads
consistency is more important than performance
Some Game servers good example
Operate in loops
Loop needs to complete in the same time, every time
You can set the P-state to prevent the processor from scaling up and down
Operates at the same frequency all the time
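One way to hold a fixed frequency (a sketch; assumes the cpupower utility and the c4.8xlarge base clock of 2.9GHz):
$ sudo cpupower frequency-set -d 2.9GHz -u 2.9GHz   # pin min = max = base clock
$ cpupower frequency-info                           # verify the resulting policy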
Next I want to talk about T2 and why they’re special
T2 instance are great general purpose instances
Lowest cost instance available on AWS at ~1/2 a cent per hour for t2.nano
Great for workloads where CPU demand varies over time
Websites
Developer environments
Small database
You start with a baseline level of performance
That you can see in the chart above
The magic of T2 is that you earn credits when the instance is idle
Allows you to burst above the baseline
We launched T2 because we saw that most workloads aren’t using 100% of CPU all of the time
T2 family is a great way to
Still get the performance you need when you need it
Don’t pay for it when you don’t
Let’s talk about how credits work
You can think of credits in a T2 like a bucket
When you boot the instance
Start with enough credits for OS & Application
When your app is up and running, you’ll use credits when you use CPU
A single credit will let you run 100% of one core for one minute
When the work dies down and instance becomes idle
Earning new credits that will start to fill up the bucket
Credits also expire after 24 hours if unused
To monitor those instances
Cloudwatch Metrics
Two available
The one in Orange is the credit usage
Spikes when usage is high
Shows you how many credits you’re using per minute
The Blue is the Balance
Keep this above zero if you want more performance than baseline
Monitoring your credit balance lets you ensure you’re getting consistent performance on a T2
This is what you’ll want to hook onto if you’re using Auto Scaling
Recently launched the X1
Biggest instance
2TB of RAM
128 Virtual CPUs
Great for apps that need a huge memory footprint
Good for:
In memory databases
big data processing
some HPC
When you have that much memory
Managing is important
On any system with multiple sockets
Memory attached to local socket will be faster than remote
Concept is called NUMA
On Intel, there’s a QPI between sockets
It’s the bus that transfer memory from one to another
Look at the r3.8xlarge as an example
Two sockets
122GB of ram on each socket
Between are 2 QPI links
Application on the left reading from the right
Will go over the QPI
Fast, but not as fast as what’s attached directly
When you go to X1, things are more complex
X1 is a 4 socket system
Numa is more important
Compared to an r3.8xlarge
More memory per socket
Only one QPI between sockets
Memory transfers from one zone to another are going to take longer on X1
So what can we do?
If you’ve ever watched top on a linux system
shows threads moving from one core to another
Process scheduling to make sure work is balanced
Around 3.8 kernel, started to use NUMA affinity
Will try to keep processes in same NUMA zone
Will also try to move memory around to be close to the process
The downside is that this can actually slow down performance on some apps
Especially true if you have a large memory pool spanning sockets
The scheduler will be moving things around when it doesn’t need to be
To disable, set numa=off in grub
will disable memory transfers between zones
disable NUMA awareness for process scheduling
Alternative is to use numactl to lock processes to a specific zone
Only be reading and writing memory that’s local to them
Another thing to keep in mind
Operating system and libraries can effect application performance
Not only is running a modern Linux kernel important
Run as recent a distro as you can
Recent customer visit
Custom Application using a large amount of memory
EC2 performance not as good as on premise
Their app was very complex and it was hard to get quick results when making changes
Found a benchmark tool (ebizzy) with similar behavior to test
Results of ebizzy on RHEL6
Used perf to profile and see what’s happening at a system level
Generated 12,000 requests/second
Lots of time in system space
1.5 million page faults
Generated flame graphs to see what’s happening
Created by Brendan Gregg, check out his site for more information
A really good way to understand
Paths the code is taking on a system
Time spent in specific calls
You can see ebizzy on the bottom
Making lots of madvise calls
Ending up with a Xen hypercall
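A sketch of generating one with perf and Brendan Gregg’s FlameGraph scripts (sampling rate and duration are arbitrary):
$ git clone https://github.com/brendangregg/FlameGraph
$ sudo perf record -F 99 -a -g -- sleep 30
$ sudo perf script | FlameGraph/stackcollapse-perf.pl | \
    FlameGraph/flamegraph.pl > flame.svg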
Compiled the same app on RHEL7 and tested on same instance type
Saw significantly better performance
RPS went from 12,000 – 425,000
Page faults went from 1.5 million to only 14,000
What happened?
This is where flamegraphs really shine
Same exact flame graph
Same code
Same run type
Only difference is the OS version
What the flamegraph showed us
glibc changed the path of memory calls on RHEL7
Instead of a long madvise with a trip to the hypervisor
A single Intel-optimized call for memory management
Recompile when moving to a different OS; it can make a big difference
Last memory related tip is to Disable transparent huge pages
Huge pages are a really big subject with a lot of different options
See article
It goes into detail about all the different options
Transparent is enabled by default on most recent distributions
Disabling transparent and using explicit
Can help significantly for apps that are accessing a lot of memory
Next, let’s talk about IO
We have a few families that are optimized for IO
I2 – IOPS – SSD Based
D2 – Dense storage – Magnetic
Need a modern kernel to get best storage performance
Reason is split driver model
Application on left doing some disk IO
Talks to the front end driver
Then back end
Then real driver
Then hardware
Data transfer happens through shared pages
Need permissions to be granted and released
Granting had lots of overhead in early kernels
Every time it needs to write to disk
Talks to VMM
Get permission to write to device
Fill a buffer with the data
Pass to backend
Wait for data to be written
Remove the grant
Really expensive process, lots of buffer flushing
Gets worse the more CPUs you have
Persistent grants created to solve this.
Permission to write is reused for all transactions between front and back
Grants don’t need to be unmapped
Translation buffer never flushed
Much better performance for IO operations
Validating grants is easy
Run dmesg and grep for blkfront
This is i2.8xlarge
All volumes have persistent grants enabled.
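A sketch of that check; each EBS volume’s line should include “persistent grants: enabled”:
$ dmesg | grep -i blkfront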
If I haven’t said it enough
Using a modern kernel is really important
Many customers still use CentOS 6
Just by switching OSes to 3.10
Seen as much as a 60% improvement
The 2.6 kernel in CentOS 6 was released in 2009
Long time ago in the cloud computing world
Please use a modern Kernel & OS.
Along the same lines as the split driver model
Released enhanced networking with C3
Uses Single Root IO Virtualization – SRIOV
Physical device exposed to instance itself
Has a few requirements
Needs a special driver installed in the OS
EC2 needs to be told to expose it that way
Network path is much simpler
Packets don’t have to go through the VMM
Higher packets per second
Decreased jitter – talking to bare metal
It’s free on all supported instances
Enabled by default in many AMIs
Highly recommend it if you’re touching the network
And we’re not done improving the network
Still making constant improvements
Latest is with a new Network Adapter
Launched with the X1
Called Elastic Network Adapter – ENA
Built a new Amazon-developed open source driver
Will grow with us as we’re adding new features to the network
Built to handle throughputs up to 20 Gigabits/second
This + Hardware checksums & RSS make it the fastest network available on AWS today.
Touch briefly on network
Touch on a few points about network performance
Attend the Deep dive to learn more
Easy to forget that network can be a bottleneck on smaller instance types
Customer doing S3 performance testing
Not getting good performance
Found out all network traffic was going through a T2 NAT
Largest instances should get closer to 5Gbps when leaving EC2 and talking to things like S3
When we list 10 & 20 Gigabit bandwidth
Instance bandwidth is bi-directional
On p2, X1, and m4
20Gb in and Out at the same time
But you need placement groups and multiple TCP streams
Just like network throughput, EBS throughput is function of size of the instance
Larger instance, more EBS traffic
EBS optimization by default on newest instances
Don’t have to worry about network and EBS competing
Look at the EBS documentation
Table of every EBS optimized instance
Throughput and max IOPS
Great place to go to look for specific performance out of EBS
In conclusion
Lots of things
Getting the most out of it
At bare minimum
Benchmark your app
Use a modern OS
Monitor Cloudwatch
Use enhanced networking
Goal is to make virtualization as transparent as possible
Eliminate any inefficiencies it may cause
Goal of bare metal like performance
Already there in a lot of ways
So if you have any questions, the EC2 documentation is a great resource and covers even more than I could today. Otherwise, launch an instance and start testing your app. Thank you!