Amazon EC2 provides a broad selection of instance types to deliver high performance for a diverse mix of applications. In this session, we give an overview of the drivers of system performance and discuss in depth how Amazon EC2 instances deliver that performance while also providing elasticity and complete control over your infrastructure. We also detail best practices and share performance tips for getting the most out of your Amazon EC2 instances.
2. What to Expect from the Session
Understanding the factors that go into choosing an EC2 instance
Defining system performance and how it is characterized for different workloads
How Amazon EC2 instances deliver performance while providing flexibility and agility
How to make the most of your EC2 instance experience through the lens of several instance types
8. Choices and Flexibility
Choice of Processor
Memory
Storage Options
Accelerated Graphics
Burstable Performance
9. Hiring a Server
Servers are hired to do jobs
Performance is measured differently depending on the job
10. Performance Factors
Resource | Performance factors | Key indicators
CPU | Sockets, number of cores, clock frequency, bursting capability | CPU utilization, run queue length
Memory | Memory capacity | Free memory, anonymous paging, thread swapping
Network interface | Max. bandwidth, packet rate | Receive throughput, transmit throughput over max. bandwidth
Disks | Input/output operations per second, throughput | Wait queue length, device utilization, device errors
11. Resource Utilization
For given performance, how efficiently are resources being used?
Something at 100% utilization can’t accept any more work
Low utilization can indicate more resource is being purchased than needed
12. Example: Web Application
MediaWiki installed on Apache with 140 pages of content
Load increased in intervals over time
17. “Launching new instances and running tests in parallel is easy…[when choosing an instance] there is no substitute for measuring the performance of your full application.”
- EC2 documentation
18. How Not to Choose an EC2 Instance
Brute force testing
Ignoring metrics
Favoring old generation instances
Guessing based on what you already have
20. Instance Selection = Performance Tuning
Give back instances as easily as you can acquire new ones
Find an ideal instance type and workload combination
EC2 Instance Pages provide “Use Case” Guidance
With Amazon Elastic Block Store, storage and instance size don’t need to be coupled
22. Choosing the Right Size
Understand your unit of work
Web request
Database/table
Batch process
What are that unit’s requirements?
CPU threads
Memory constraints
Disk and network
What are its availability requirements?
23. CPU Instructions and Protection Levels
A CPU has at least two protection levels.
Privileged instructions can’t be executed in user mode, which protects the system. Applications leverage system calls to the kernel.
[Diagram: Application, Kernel]
24. x86 CPU Virtualization: Prior to Intel VT-x
[Diagram: Application, Kernel (PV), VMM]
Binary translation for privileged instructions
Paravirtualization (PV)
PV requires going through the VMM, adding latency
Applications that are system call bound are most affected
25. x86 CPU Virtualization: After Intel VT-x
[Diagram: Application, Kernel (PV-HVM), VMM]
Hardware-assisted virtualization (HVM)
PV-HVM uses PV drivers opportunistically for operations that are slow when emulated, e.g., network and block I/O
27. Time Keeping Explained
Time keeping in an instance is deceptively hard
gettimeofday(), clock_gettime(), QueryPerformanceCounter()
The TSC
CPU counter, accessible from userspace
Requires calibration, vDSO
Invariant on Sandy Bridge+ processors
Xen pvclock; does not support vDSO
On current generation instances, use TSC as clocksource
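The clocksource advice above can be checked from inside a Linux instance. The following is a minimal sketch, assuming the standard Linux sysfs files `current_clocksource` and `available_clocksource`; the function names `read_clocksource` and `needs_tsc_switch` are hypothetical helpers introduced here for illustration.

```python
from pathlib import Path

# Standard Linux sysfs directory for clocksource settings.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(base: Path = CLOCKSOURCE_DIR) -> dict:
    """Return the active clocksource and the list the kernel offers.

    Values fall back to None / [] when the files cannot be read,
    for example on a non-Linux system.
    """
    def read(name):
        try:
            return (Path(base) / name).read_text().strip()
        except OSError:
            return None

    current = read("current_clocksource")
    available = read("available_clocksource")
    return {
        "current": current,
        "available": available.split() if available else [],
    }


def needs_tsc_switch(info: dict) -> bool:
    """True when the kernel offers TSC but another clocksource is active."""
    return "tsc" in info["available"] and info["current"] != "tsc"


if __name__ == "__main__":
    info = read_clocksource()
    print("current:", info["current"])
    print("available:", ", ".join(info["available"]) or "unknown")
    if needs_tsc_switch(info):
        print("TSC is available but is not the active clocksource")
```

The sketch only reports the state. Switching is done by writing `tsc` to `current_clocksource` as root, which is left out here on purpose.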
29. Review: C4 Instances
Custom Intel E5-2666 v3 at 2.9 GHz
P-state and C-state controls
Model vCPU Memory (GiB) EBS (Mbps)
c4.large 2 3.75 500
c4.xlarge 4 7.5 750
c4.2xlarge 8 15 1,000
c4.4xlarge 16 30 2,000
c4.8xlarge 36 60 4,000
Batch and HPC Workloads, Game Servers, Ad Serving, and High Traffic Web Servers
30. What’s New in C4: P-State and C-State Control
Intel Turbo Boost up to 3.5 GHz
By entering deeper idle states, non-idle cores can achieve up to 300 MHz higher clock frequencies
But…deeper idle states require more time to exit and may not be appropriate for latency-sensitive workloads
31. Tip: P-State Control for AVX2
If an application makes heavy use of AVX2 on all cores, the processor may attempt to draw more power than it should
Processor will transparently reduce frequency
Frequent changes of CPU frequency can slow an application
32. Review: T2 Instances
Lowest-cost EC2 instance at $0.0065 per hour
Burstable performance
Fixed allocation enforced with CPU credits
Model vCPU Baseline CPU Credits/Hour Memory (GiB) Storage
t2.nano 1 5% 3 0.5 EBS Only
t2.micro 1 10% 6 1 EBS Only
t2.small 1 20% 12 2 EBS Only
t2.medium 2 40%** 24 4 EBS Only
t2.large 2 60%** 36 8 EBS Only
General Purpose, Web Serving, Developer Environments, Small Databases
33. How Credits Work
A CPU credit provides the performance of a full CPU core for one minute
An instance earns CPU credits at a steady rate
An instance consumes credits when active
Credits expire (leak) after 24 hours
[Figure: credit balance over time, with baseline rate and burst rate]
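The credit rules above imply some simple arithmetic: because one credit is one core-minute, an instance earning N credits per hour can sustain N/60 of a core indefinitely, which matches the Baseline column of the T2 table. The following is a rough sketch of that arithmetic. The function names are hypothetical, and modelling the 24-hour expiry as a cap on the balance is a simplifying assumption, not a description of how EC2 accounts for credits internally.

```python
def baseline_percent(credits_per_hour: float) -> float:
    """Sustainable share of one core: credits are core-minutes, an hour has 60."""
    return credits_per_hour * 100 / 60


def max_balance(credits_per_hour: float, expiry_hours: int = 24) -> float:
    """Assumption: with 24-hour expiry, at most one day of earnings can be held."""
    return credits_per_hour * expiry_hours


def simulate_balance(credits_per_hour, usage_percent_per_minute, start_balance=0.0):
    """Track the credit balance minute by minute.

    usage_percent_per_minute holds CPU usage as a percentage of ONE core
    (so a busy 2-vCPU instance can reach 200). The balance is clamped
    between zero and the assumed cap.
    """
    earned_per_minute = credits_per_hour / 60
    cap = max_balance(credits_per_hour)
    balance = start_balance
    history = []
    for usage in usage_percent_per_minute:
        balance += earned_per_minute - usage / 100
        balance = min(cap, max(0.0, balance))
        history.append(balance)
    return history


if __name__ == "__main__":
    # Credits/hour values taken from the T2 table above.
    for model, rate in [("t2.nano", 3), ("t2.micro", 6), ("t2.small", 12)]:
        print(model, "baseline:", baseline_percent(rate), "% of one core")
    # A t2.micro with 30 banked credits bursting at 100% for 30 minutes.
    print("balance after burst:", simulate_balance(6, [100] * 30, 30.0)[-1])
```

In this simplified model the balance stays flat when usage equals the baseline, drains during a burst, and refills while the instance is idle.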
35. Tip: How to Interpret Steal Time
Fixed CPU allocations can be offered through CPU caps
Steal time happens when CPU cap is enforced
Leverage Amazon CloudWatch metrics
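To see steal time from inside a Linux instance, one option is to sample the aggregate `cpu` line of `/proc/stat` twice and compare the counters. The following is a minimal sketch, assuming the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); the helper names are hypothetical.

```python
import time

# Field order of the aggregate "cpu" line in /proc/stat on Linux.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> dict:
    """Parse the aggregate 'cpu' line into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Pad with zeros if the kernel reports fewer fields than expected.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def steal_percent(before: dict, after: dict) -> float:
    """Share of elapsed CPU ticks between two samples that was stolen."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    return 100.0 * deltas["steal"] / total if total else 0.0


def read_cpu_times(path: str = "/proc/stat") -> dict:
    """Read the first line of /proc/stat, which is the aggregate cpu line."""
    with open(path) as stat_file:
        return parse_cpu_line(stat_file.readline())


if __name__ == "__main__":
    try:
        first = read_cpu_times()
        time.sleep(1)
        second = read_cpu_times()
        print(f"steal over the last second: {steal_percent(first, second):.1f}%")
    except OSError:
        print("/proc/stat is not readable on this system")
```

On an instance with a fixed CPU allocation, a high steal percentage may simply mean the cap is being enforced, so it is worth reading alongside the CloudWatch metrics for the instance before drawing conclusions.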
36. Announced: X1 Instances
Largest memory instance with 2 TB of DRAM
Quad socket, Intel E7 processors with 128 vCPUs
Model vCPU Memory (GiB) Local Storage
x1.32xlarge 128 1952 2x 1920 GB
In-Memory Databases, Big Data Processing, HPC Workloads
37. NUMA
Non-uniform memory access
Each processor in a multi-CPU system has local memory that is accessible through a fast interconnect
Each processor can also access memory from other CPUs, but local memory access is a lot faster than remote memory
Performance is related to the number of CPU sockets and how they are connected (Intel QuickPath Interconnect, QPI)
40. Tip: Kernel Support for NUMA Balancing
An application will perform best when the threads of its processes are accessing memory on the same NUMA node.
NUMA balancing moves tasks closer to the memory they are accessing.
This is all done automatically by the Linux kernel when automatic NUMA balancing is active: version 3.8+ of the Linux kernel.
Windows support for NUMA first appeared in the Enterprise and Data Center SKUs of Windows Server 2003.
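Whether automatic NUMA balancing is active can be checked on a running Linux instance. The following is a small sketch, assuming the `kernel.numa_balancing` sysctl file and the `/sys/devices/system/node` directory found on Linux; both may be absent on kernels built without NUMA support, and the function names are hypothetical.

```python
from pathlib import Path


def numa_balancing_state(path="/proc/sys/kernel/numa_balancing") -> str:
    """Return 'enabled', 'disabled', or 'unavailable' for automatic NUMA balancing."""
    try:
        raw = Path(path).read_text().strip()
    except OSError:
        # The sysctl is missing: not Linux, or the kernel lacks the feature.
        return "unavailable"
    return "disabled" if raw == "0" else "enabled"


def numa_node_count(root="/sys/devices/system/node") -> int:
    """Count nodeN directories; 0 means the topology is not exposed."""
    root = Path(root)
    if not root.is_dir():
        return 0
    return sum(
        1
        for entry in root.iterdir()
        if entry.name.startswith("node") and entry.name[4:].isdigit()
    )


if __name__ == "__main__":
    print("NUMA nodes:", numa_node_count())
    print("automatic NUMA balancing:", numa_balancing_state())
```

Balancing only matters when more than one node is reported, as on multi-socket instance types.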
41. Review: I2 Instances
16 vCPU: 3.2 TB SSD; 32 vCPU: 6.4 TB SSD
365 K random read IOPS for 32 vCPU instance
Model vCPU Memory (GiB) Storage (GB) Read IOPS Write IOPS
i2.xlarge 4 30.5 1 x 800 SSD 35,000 35,000
i2.2xlarge 8 61 2 x 800 SSD 75,000 75,000
i2.4xlarge 16 122 4 x 800 SSD 175,000 155,000
i2.8xlarge 32 244 8 x 800 SSD 365,000 315,000
NoSQL Databases, Clustered Databases, Online Transaction Processing (OLTP)
42. Split-Driver Model
[Diagram: a driver domain (back-end driver, device driver) and guest domains (application, sockets, front-end driver) running on the VMM (virtual CPU, virtual memory, CPU scheduling) over the hardware (physical CPU, physical memory, network device); numbered callouts 1 to 5]
43. Granting in Pre-3.8.0 Kernels
Requires “grant mapping” prior to 3.8.0
Grant mappings are expensive operations due to TLB flushes
[Diagram: read(fd, buffer,…) between the instance and the I/O domain]
44. Granting in 3.8.0+ Kernels, Persistent and Indirect
Grant mappings are set up in a pool once
Data is copied in and out of the grant pool
[Diagram: read(fd, buffer…) with data copied to and from the grant pool]
45. 2009–Longer Ago Than You Think
Avatar was the top movie in the theaters
Facebook overtook MySpace in active users
President Obama was sworn into office
The 2.6.32 Linux kernel was released
46. Tip: Use 3.8+ Kernel
Amazon Linux 13.09 or later
Ubuntu 14.04 or later
RHEL/CentOS 7 or later
Etc.
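A quick way to apply this tip is to compare the running kernel release against 3.8. The following is a minimal sketch using Python's `platform.release()`; the helper names are hypothetical, and note that a version check is only a proxy, since distribution kernels sometimes backport features to older version numbers.

```python
import platform
import re


def parse_kernel_release(release: str) -> tuple:
    """Extract (major, minor) from a release string such as '3.10.0-327.el7.x86_64'."""
    match = re.match(r"(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2) or 0)


def meets_minimum_kernel(release: str, minimum: tuple = (3, 8)) -> bool:
    """True when the release is at or above the minimum (major, minor)."""
    return parse_kernel_release(release) >= minimum


if __name__ == "__main__":
    release = platform.release()
    verdict = "meets" if meets_minimum_kernel(release) else "is older than"
    print(f"kernel {release} {verdict} 3.8")
```

Tuples compare element by element, so `(3, 10)` correctly sorts above `(3, 8)`, which a plain string comparison would get wrong.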
47. Device Passthrough: Enhanced Networking
SR-IOV eliminates need for driver domain
Physical network device exposes virtual function to instance
Requires a specialized driver, which means:
Your instance OS needs to know about it
EC2 needs to be told your instance can use it
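The first requirement above, that the instance OS knows about the device, can be checked on Linux by looking at which kernel driver is bound to the network interface. The following is a sketch under stated assumptions: it relies on the sysfs `device/driver` symlink under `/sys/class/net`, and it treats `ixgbevf` (the Intel 82599 virtual function driver) and `ena` as the enhanced networking drivers. The helper names are hypothetical, and the EC2-side setting is not covered.

```python
import os

# Assumption: driver names treated here as indicating enhanced networking.
ENHANCED_NETWORKING_DRIVERS = {"ixgbevf", "ena"}


def nic_driver(interface: str, sysfs_root: str = "/sys/class/net"):
    """Return the kernel driver bound to an interface, or None if not exposed."""
    link = os.path.join(sysfs_root, interface, "device", "driver")
    if not os.path.exists(link):
        # Virtual interfaces such as loopback have no backing device.
        return None
    return os.path.basename(os.path.realpath(link))


def uses_enhanced_networking(interface: str, sysfs_root: str = "/sys/class/net") -> bool:
    """True when the interface is bound to one of the assumed SR-IOV drivers."""
    return nic_driver(interface, sysfs_root) in ENHANCED_NETWORKING_DRIVERS


if __name__ == "__main__":
    root = "/sys/class/net"
    if os.path.isdir(root):
        for name in sorted(os.listdir(root)):
            print(name, nic_driver(name), uses_enhanced_networking(name))
    else:
        print("no sysfs network directory on this system")
```

An interface that reports a different driver is most likely still going through the driver domain path rather than device passthrough.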
48. After Enhanced Networking
[Diagram: driver domain and guest domains on the VMM (virtual CPU, virtual memory, CPU scheduling) over the hardware (physical CPU, physical memory, SR-IOV network device); each guest domain has an application, sockets, and its own NIC driver; numbered callouts 1 to 3]
49. Elastic Network Adapter
Next Generation of Enhanced Networking
Hardware checksums
Multi-Queue support
Receive-side steering
20 Gbps in a placement group
New Open Source Amazon Network Driver
50. EBS Performance
Instance size matters
Match your volume size and type to your instance
Use EBS optimization if EBS performance is important
51. Summary: Getting the Most Out of EC2 Instances
Choose HVM AMIs
Time keeping: use TSC
C state and P state controls
Monitor T2 CPU credits
Use a modern Linux kernel
NUMA balancing
Persistent grants for I/O performance
Enhanced networking
53. Virtualization Themes
Bare metal performance goal, and in many scenarios already there
History of eliminating hypervisor intermediation and driver domains
Hardware-assisted virtualization
Scheduling and granting efficiencies
Device passthrough
54. Next Steps
Visit the Amazon EC2 documentation
Launch an instance and try your app!