Deep Dive on Amazon EC2

  1. Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization Best Practices. By Androski Spicer, Solutions Architect. © 2016 Amazon Web Services, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon Web Services, Inc.
  2. What to Expect from the Session  Understanding of the factors that go into choosing an EC2 instance  Defining system performance and how it is characterized for different workloads  How Amazon EC2 instances deliver performance while providing flexibility and agility  How to make the most of your EC2 instance experience through the lens of several instance types
  3. API EC2 EC2 Amazon Elastic Compute Cloud is Big Instances Networking Purchase options
  4. Host Server Hypervisor Guest 1 Guest 2 Guest n Amazon EC2 Instances
  5. In the past  First launched in August 2006  M1 instance  “One size fits all” M1
  6. Amazon EC2 Instances History 2006 2008 2010 2012 2014 2016 m1.small m1.large m1.xlarge c1.medium c1.xlarge m2.xlarge m2.4xlarge m2.2xlarge cc1.4xlarge t1.micro cg1.4xlarge cc2.8xlarge m1.medium hi1.4xlarge m3.xlarge m3.2xlarge hs1.8xlarge cr1.8xlarge c3.large c3.xlarge c3.2xlarge c3.4xlarge c3.8xlarge g2.2xlarge i2.xlarge i2.2xlarge i2.4xlarge i2.8xlarge m3.medium m3.large r3.large r3.xlarge r3.2xlarge r3.4xlarge r3.8xlarge t2.micro t2.small c4.large c4.xlarge c4.2xlarge c4.4xlarge c4.8xlarge d2.xlarge d2.2xlarge d2.4xlarge d2.8xlarge g2.8xlarge t2.large m4.large m4.xlarge m4.2xlarge m4.4xlarge m4.10xlarge x1.32xlarge t2.nano m4.16xlarge p2.xlarge p2.8xlarge p2.16xlarge
  7. Instance naming, using c4.xlarge as the example: “c” is the instance family, “4” is the instance generation, “xlarge” is the instance size
  8. EC2 Instance Families  General purpose: M4  Compute optimized: C3, C4  Storage and I/O optimized: I2, D2  GPU optimized: G2, P2  Memory optimized: R3, X1
  9. What’s a Virtual CPU? (vCPU)  A vCPU is typically a hyper-threaded physical core*  On Linux, “A” threads enumerated before “B” threads  On Windows, threads are interleaved  Divide vCPU count by 2 to get core count  Cores by EC2 & RDS DB Instance type: * The “t” family is special
  10. Disable Hyper-Threading If You Need To  Useful for FPU heavy applications  Use ‘lscpu’ to validate layout  Hot offline the “B” threads
        for i in `seq 64 127`; do
          echo 0 > /sys/devices/system/cpu/cpu${i}/online
        done
       Set grub to only initialize the first half of all threads
        maxcpus=63
      [ec2-user@ip-172-31-7-218 ~]$ lscpu
      CPU(s):                128
      On-line CPU(s) list:   0-127
      Thread(s) per core:    2
      Core(s) per socket:    16
      Socket(s):             4
      NUMA node(s):          4
      Model name:            Intel(R) Xeon(R) CPU
      Hypervisor vendor:     Xen
      Virtualization type:   full
      NUMA node0 CPU(s):     0-15,64-79
      NUMA node1 CPU(s):     16-31,80-95
      NUMA node2 CPU(s):     32-47,96-111
      NUMA node3 CPU(s):     48-63,112-127
  11. Instance sizing: c4.8xlarge ≈ 2 × c4.4xlarge ≈ 4 × c4.2xlarge ≈ 8 × c4.xlarge
  12. Resource Allocation  All resources assigned to you are dedicated to your instance with no over commitment*  All vCPUs are dedicated to you  Memory allocated is assigned only to your instance  Network resources are partitioned to avoid “noisy neighbors”  Curious about the number of instances per host? Use “Dedicated Hosts” as a guide. *Again, the “T” family is special
  13. “Launching new instances and running tests in parallel is easy…[when choosing an instance] there is no substitute for measuring the performance of your full application.” - EC2 documentation
  14. Timekeeping Explained  Timekeeping in an instance is deceptively hard  gettimeofday(), clock_gettime(), QueryPerformanceCounter()  The TSC  CPU counter, accessible from userspace  Requires calibration, vDSO  Invariant on Sandy Bridge+ processors  Xen pvclock; does not support vDSO  On current generation instances, use TSC as clocksource
  15. Benchmarking - Time Intensive Application
      #include <sys/time.h>
      #include <time.h>
      #include <stdio.h>
      #include <unistd.h>

      int main() {
          time_t start, end;
          time(&start);
          for (int x = 0; x < 100000000; x++) {
              float f;
              float g;
              float h;
              f = 123456789.0f;
              g = 123456789.0f;
              h = f * g;                 /* a bit of math */
              struct timeval tv;
              gettimeofday(&tv, NULL);   /* the call being measured */
          }
          time(&end);
          double dif = difftime(end, start);
          printf("Elasped time is %.2lf seconds.\n", dif);
          return 0;
      }
  16. Using the Xen Clock Source
      [centos@ip-192-168-1-77 testbench]$ strace -c ./test
      Elasped time is 12.00 seconds.
      % time     seconds  usecs/call     calls    errors syscall
      ------ ----------- ----------- --------- --------- ----------------
       99.99    3.322956           2   2001862           gettimeofday
        0.00    0.000096           6        16           mmap
        0.00    0.000050           5        10           mprotect
        0.00    0.000038           8         5           open
        0.00    0.000026           5         5           fstat
        0.00    0.000025           5         5           close
        0.00    0.000023           6         4           read
        0.00    0.000008           8         1         1 access
        0.00    0.000006           6         1           brk
        0.00    0.000006           6         1           execve
        0.00    0.000005           5         1           arch_prctl
        0.00    0.000000           0         1           munmap
      ------ ----------- ----------- --------- --------- ----------------
      100.00    3.323239               2001912         1 total
  17. Using the TSC Clock Source
      [centos@ip-192-168-1-77 testbench]$ strace -c ./test
      Elasped time is 2.00 seconds.
      % time     seconds  usecs/call     calls    errors syscall
      ------ ----------- ----------- --------- --------- ----------------
       32.97    0.000121           7        17           mmap
       20.98    0.000077           8        10           mprotect
       11.72    0.000043           9         5           open
       10.08    0.000037           7         5           close
        7.36    0.000027           5         6           fstat
        6.81    0.000025           6         4           read
        2.72    0.000010          10         1           munmap
        2.18    0.000008           8         1         1 access
        1.91    0.000007           7         1           execve
        1.63    0.000006           6         1           brk
        1.63    0.000006           6         1           arch_prctl
        0.00    0.000000           0         1           write
      ------ ----------- ----------- --------- --------- ----------------
      100.00    0.000367                    53         1 total
  18. Change with: Tip: Use TSC as clocksource
  19. P-state and C-state control  c4.8xlarge, d2.8xlarge, m4.10xlarge, m4.16xlarge, p2.16xlarge, x1.16xlarge, x1.32xlarge  By entering deeper idle states, non-idle cores can achieve up to 300MHz higher clock frequencies  But… deeper idle states require more time to exit, may not be appropriate for latency-sensitive workloads  Limit c-state by adding “intel_idle.max_cstate=1” to grub
  20. Tip: P-state control for AVX2  If an application makes heavy use of AVX2 on all cores, the processor may attempt to draw more power than it should  Processor will transparently reduce frequency  Frequent changes of CPU frequency can slow an application sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo" See also:
  21. Review: T2 Instances  Lowest cost EC2 instance at $0.0065 per hour  Burstable performance  Fixed allocation enforced with CPU credits
      Model      vCPU  Baseline  CPU Credits / Hour  Memory (GiB)  Storage
      t2.nano    1     5%        3                   0.5           EBS Only
      t2.micro   1     10%       6                   1             EBS Only
      t2.small   1     20%       12                  2             EBS Only
      t2.medium  2     40%**     24                  4             EBS Only
      t2.large   2     60%**     36                  8             EBS Only
      General purpose, web serving, developer environments, small databases
  22. How Credits Work  A CPU credit provides the performance of a full CPU core for one minute  An instance earns CPU credits at a steady rate  An instance consumes credits when active  Credits expire (leak) after 24 hours [Diagram labels: baseline rate, credit balance, burst rate]
  23. Tip: Monitor CPU credit balance
  24. Review: X1 Instances  Largest memory instance with 2 TB of DRAM  Quad socket, Intel E7 processors with 128 vCPUs
      Model        vCPU  Memory (GiB)  Local Storage   Network
      x1.16xlarge  64    976           1x 1920GB SSD   10Gbps
      x1.32xlarge  128   1952          2x 1920GB SSD   20Gbps
      In-memory databases, big data processing, HPC workloads
  25. NUMA  Non-uniform memory access  Each processor in a multi-CPU system has local memory that is accessible through a fast interconnect  Each processor can also access memory from other CPUs, but local memory access is a lot faster than remote memory  Performance is related to the number of CPU sockets and how they are connected - Intel QuickPath Interconnect (QPI)
  26. r3.8xlarge [Diagram: two sockets, each with 16 vCPUs and 122GB of local memory, connected by QPI]
  27. x1.32xlarge [Diagram: four sockets, each with 32 vCPUs and 488GB of local memory, connected to one another by QPI links]
  28. Tip: Kernel Support for NUMA Balancing  An application will perform best when the threads of its processes are accessing memory on the same NUMA node.  NUMA balancing moves tasks closer to the memory they are accessing.  This is all done automatically by the Linux kernel when automatic NUMA balancing is active: version 3.8+ of the Linux kernel.  Windows support for NUMA first appeared in the Enterprise and Data Center SKUs of Windows Server 2003.  Set “numa=off” or use numactl to reduce NUMA paging if your application uses more memory than will fit on a single socket or has threads that move between sockets
  29. Operating Systems Impact Performance  Memory intensive web application  Created many threads  Rapidly allocated/deallocated memory  Comparing performance of RHEL6 vs RHEL7  Notice high amount of “system” time in top  Found a benchmark tool (ebizzy) with a similar performance profile  Traced its performance with “perf”
  30. On RHEL6
      [ec2-user@ip-172-31-12-150-RHEL6 ebizzy-0.3]$ sudo perf stat ./ebizzy -S 10
      12,409 records/s
      real  10.00 s
      user   7.37 s
      sys  341.22 s

       Performance counter stats for './ebizzy -S 10':

          361458.371052 task-clock (msec)   #   35.880 CPUs utilized
                 10,343 context-switches    #    0.029 K/sec
                  2,582 cpu-migrations      #    0.007 K/sec
              1,418,204 page-faults         #    0.004 M/sec

           10.074085097 seconds time elapsed
  31. RHEL6 Flame Graph Output
  32. On RHEL7
      [ec2-user@ip-172-31-7-22-RHEL7 ~]$ sudo perf stat ./ebizzy-0.3/ebizzy -S 10
      425,143 records/s
      real  10.00 s
      user 397.28 s
      sys    0.18 s

       Performance counter stats for './ebizzy-0.3/ebizzy -S 10':

          397515.862535 task-clock (msec)   #   39.681 CPUs utilized
                 25,256 context-switches    #    0.064 K/sec
                  2,201 cpu-migrations      #    0.006 K/sec
                 14,109 page-faults         #    0.035 K/sec

           10.017856000 seconds time elapsed

      Slide callouts: records/s: “Up from 12,400 records/s!”; page-faults: “Down from 1,418,204!”
  33. RHEL7 Flame Graph Output
  34. Hugepages  Disable Transparent Hugepages
        # echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
        # echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
       Use Explicit Huge Pages
        $ sudo mkdir /dev/hugetlbfs
        $ sudo mount -t hugetlbfs none /dev/hugetlbfs
        $ sudo sysctl -w vm.nr_hugepages=10000
        $ HUGETLB_MORECORE=yes numactl --cpunodebind=0 --membind=0 /path/to/application
      See also:
  35. Split Driver Model [Diagram: two guest domains (application, sockets, frontend driver) and a driver domain (backend driver, device driver) running on the VMM (virtual CPU, virtual memory, CPU scheduling) above the hardware (physical CPU, physical memory, storage device); the I/O path is numbered 1 to 5]
  36. Granting in pre-3.8.0 Kernels  Requires “grant mapping” prior to 3.8.0  Grant mappings are expensive operations due to TLB flushes  Inter-domain I/O for read(fd, buffer,…): (1) Grant memory (2) Write to ring buffer (3) Signal event (4) Read ring buffer (5) Map grants (6) Read or write grants (7) Unmap grants [Diagram: instance, I/O domain, SSD]
  37. Granting in 3.8.0+ Kernels, Persistent and Indirect  Grant mappings are set up in a pool one time  Data is copied in and out of the grant pool SSD read(fd, buffer…) I/O domain Instance Grant pool Copy to and from grant pool
  38. Validating Persistent Grants
      [ec2-user@ip-172-31-4-129 ~]$ dmesg | egrep -i 'blkfront'
      Blkfront and the Xen platform PCI driver have been compiled for this kernel: unplug emulated disks.
      blkfront: xvda: barrier or flush: disabled; persistent grants: enabled; indirect descriptors: enabled;
      blkfront: xvdb: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
      blkfront: xvdc: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
      blkfront: xvdd: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
      blkfront: xvde: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
      blkfront: xvdf: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
      blkfront: xvdg: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
      blkfront: xvdh: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
      blkfront: xvdi: flush diskcache: enabled; persistent grants: enabled; indirect descriptors: enabled;
  39. 2009 – Longer ago than you think  Avatar was the top movie in the theaters  Facebook overtook MySpace in active users  President Obama was sworn into office  The 2.6.32 Linux kernel was released Tip: Use 3.10+ kernel  Amazon Linux 2013.09 or later  Ubuntu 14.04 or later  RHEL/CentOS 7 or later  Etc.
  40. Device Pass Through: Enhanced Networking  SR-IOV eliminates need for driver domain  Physical network device exposes virtual function to instance  Requires a specialized driver, which means:  Your instance OS needs to know about it  EC2 needs to be told your instance can use it
  41. After Enhanced Networking [Diagram: guest domains (application, sockets, NIC driver) talk directly to the SR-IOV network device in hardware, bypassing the driver domain; the VMM (virtual CPU, virtual memory, CPU scheduling) sits above the physical CPU and physical memory; the network path is numbered 1 to 3]
  42. Elastic Network Adapter  Next Generation of Enhanced Networking  Hardware Checksums  Multi-Queue Support  Receive Side Steering  20Gbps in a Placement Group  New Open Source Amazon Network Driver
  43. Network Performance  20 Gigabit & 10 Gigabit  Measured one-way, double that for bi-directional (full duplex)  High, Moderate, Low – A function of the instance size and EBS optimization  Not all created equal – Test with iperf if it’s important!  Use placement groups when you need high and consistent instance to instance bandwidth  All traffic limited to 5 Gb/s when exiting EC2
  44. EBS Performance  Instance size affects throughput  Match your volume size and type to your instance  Use EBS optimization if EBS performance is important
  45. Summary: Getting the Most Out of EC2 Instances  Choose HVM AMIs  Timekeeping: use TSC  C state and P state controls  Monitor T2 CPU credits  Use a modern Linux OS  NUMA balancing  Persistent grants for I/O performance  Enhanced networking  Profile your application
  46. Virtualization Themes  Bare metal performance goal, and in many scenarios already there  History of eliminating hypervisor intermediation and driver domains  Hardware assisted virtualization  Scheduling and granting efficiencies  Device pass through
  47. Next Steps  Visit the Amazon EC2 documentation  Launch an instance and try your app!
  48. Thank You!
  49. Remember to complete your evaluations!
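
The credit rules on slides 21 and 22 lend themselves to a little arithmetic. The following is a minimal sketch, not an AWS tool: it takes the credits-per-hour figures from the T2 table and the rule that one credit equals one core at 100% for one minute. The function names are hypothetical, the maximum balance is an assumption derived from the 24-hour expiry on slide 22, and launch credits are ignored.

```python
# Credits earned per hour, copied from the T2 table on slide 21.
CREDITS_PER_HOUR = {
    "t2.nano": 3,
    "t2.micro": 6,
    "t2.small": 12,
    "t2.medium": 24,
    "t2.large": 36,
}


def baseline_fraction(model):
    """Fraction of one core that can be sustained forever.

    One credit is one core-minute, so credits per hour / 60 minutes
    reproduces the "Baseline" column (e.g. 6 / 60 = 10% for t2.micro).
    """
    return CREDITS_PER_HOUR[model] / 60.0


def max_balance(model):
    """Largest balance an instance can hold.

    Assumption: because credits expire after 24 hours, the balance can
    never exceed 24 hours of earnings.
    """
    return CREDITS_PER_HOUR[model] * 24


def burst_minutes(model, balance, busy_cores=1.0):
    """Minutes the instance can keep `busy_cores` cores at 100%.

    Each minute spends `busy_cores` credits and earns credits/hour / 60,
    so the balance drains at the difference between the two rates.
    """
    net_drain = busy_cores - CREDITS_PER_HOUR[model] / 60.0
    if net_drain <= 0:
        return float("inf")  # at or below baseline: the balance never runs out
    return balance / net_drain


if __name__ == "__main__":
    for model in sorted(CREDITS_PER_HOUR):
        print(model, baseline_fraction(model), max_balance(model))
```

Under these assumptions, a t2.micro that has banked a full 144 credits could hold one core at 100% for roughly 160 minutes before dropping back to its 10% baseline, which is why the deck recommends watching the credit balance metric.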

Editor's notes

  1. Let’s start with the basics. What is an EC2 instance? Instances are virtual machines: guests running on a hypervisor, on physical hardware.
  2. This presentation is built to be a deep dive into how EC2 works. It highlights actionable things you can do to get the most performance out of your instances. I’ll also talk a bit about how to choose an EC2 instance: when it comes to performance, picking the right one is as important as the tuning tips that I’m also going to talk about.
  3. EC2 is a big subject. We could talk about purchase options, APIs and SDKs, or networking. Today we talk about the instances themselves: how they operate, their features, and your options when you go to launch. For other topics, there is a list of recommended sessions at the end.
  4. Let’s start with the basics. What is an EC2 instance? Instances are virtual machines: guests running on a hypervisor, on physical hardware.
  5. EC2 launched in 2006 with “an instance”. It didn’t have a name, and you didn’t get any choices. Like the Model T: any color as long as it’s black. Eventually we gave it a name, the M1 instance. Customers wanted more choice.
  6. We’ve been iterating and growing ever since, not only adding instances but changing how EC2 works. We launched the cc2 in 2011. Placement groups: bandwidth and latency. Hardware assisted virtualization exposes more of the underlying hardware and lets you get even more performance. EC2 is always growing and changing based on customer feedback. Always check our documentation for the latest as you’re building out; how we do things today may be different in the future.
  7. Let’s go over how we talk about instances and name them, to get on the same page. The first letter is the family; it stands for what the instance is suited for or what resources it has: C for compute, R for RAM, I for IOPS. The number is the generation, like a version number. Last is the instance size, like a T-shirt size.
  8. You’ve got a lot of choices and flexibility when you go to launch, and trying to pick the right instance can seem overwhelming. Looking at just the families, first find what your application is constrained by. If you need memory, start with R3. If CPU, go with C4. If balanced, look at general purpose M or T. From the perspective of your constraint, it’s easy to pick the right family. Then test to find the right size within that family. If you need a little help, check the documentation, which has a list of workloads for each family.
  9. When you’re looking at instances, you’ll see something called vCPUs. On modern instances outside the T family, a vCPU is a hyperthreaded core. Hyperthreading is great for increasing performance: it kind of lets your CPU do two things at once, for example while waiting on I/O. To get the real core count, divide by two. Visit the link for core counts, which are used for licensing.
  10. To give a visual representation, this is the output of lstopo on an m4.10xlarge. lstopo is a Linux utility for enumerating hardware that can run on any instance or physical server. It shows graphical output of the hardware configuration: sockets, memory on each socket, L1 to L3 cache, and the CPU thread to core mapping. In the case of the m4.10xlarge: 40 threads, 20 cores.
  11. Some applications don’t benefit from hyperthreading; context switching may decrease performance. These are typically compute heavy apps such as financial calculations and engineering simulations, and they usually disable hyperthreading. If you’re not sure, or don’t typically disable hyperthreading, don’t worry. If you do, try running with it disabled on EC2 and see if it improves performance. It’s easy on Linux and harder on Windows. On Linux, the first set of threads on each core is listed first, and the second, or B, threads are listed after that, so you disable the last half, which will be all the B threads. There are two ways. Online: great because it needs no reboot, but it may cause instability, since you disable processors where threads may be running, and it won’t persist after a reboot. In grub: set maxcpus to match the physical CPU count minus 1. This is safer, since the threads are disabled when booting, but it makes things harder when you switch sizes. Windows is harder: threads are interleaved, so you have to use CPU affinity.
  12. The same m4.10xlarge with hyperthreading turned off: only one CPU thread per core, compared to the two that you saw earlier.
  13. Let’s dig into how instance sizes work. We build instances to be easy to scale vertically and horizontally. Look at the C4 family as an example, with the c4.8xlarge, the largest instance size available, on the left. That single c4.8xlarge is roughly equal to two c4.4xlarges. A c4.4xlarge has roughly half the number of vCPUs, amount of RAM, and available network bandwidth. This keeps following down the line: 2x c4.4xlarge = 4x c4.2xlarge, and so on.
  14. The reason is how we partition instances. The largest size is typically a full server; on the smaller ones you’re running on a fraction of it, depending on the size. Virtualization historically has a bad reputation because it was usually used to manage over-utilization of resources: more virtual machines than physical resources. We use virtualization for a lot of other reasons, such as security and isolation, and to dedicate specific resources to specific customers. Take vCPUs as an example: with the exception of T, when you’re assigned a vCPU you’re the only customer using it, not sharing with anyone else on the box. The same applies to memory and network. We build with the goal of providing a consistent experience no matter what else is happening.
  15. The last thing I want to say about choosing your instance: it’s cheesy to quote documentation, but it’s a good sentiment. It’s easy to get an app up and running, so don’t run synthetic benchmarks. Install your app and send it some realistic load. Examples: a mobile app, an HPC application, a BI database. Use a real workload to understand how your app will behave.
  16. Digging deeper into the OS: on all systems, timekeeping is important. It’s used for things like processing interrupts, getting the time and date, and measuring performance. Most AMIs on AWS use the Xen clock by default; it is compatible with all instance types. The TSC is invariant on Sandy Bridge and later processors and is handled by bare metal: you’re talking to your processor, not the hypervisor, and because of this, calls to it are going to be much quicker.
  17. To demonstrate this, here’s a simple application. It does two things: it performs a large number of gettimeofday calls and a bit of math. Don’t laugh at my code; I’m a sysadmin, not a developer. It’s quick and dirty, just to test this out.
  18. These are the results on the Xen clock source, profiled with strace. It’s a really great tool to use with any app, yours included; it shows the number of system calls made and the time they took. gettimeofday takes the most time, with a lot of calls. Overall, the test took about 12 seconds to run with the Xen clock source.
  19. On the same system, I switched the clocksource to TSC and reran the test. The results look a lot different: gettimeofday doesn’t show up, and the run time is reduced to two seconds. This is extreme for a simple app; I’ve seen apps improve by as much as 40%.
  20. It’s an easy change to make on Linux, and you can do it while the system is running. The first command shows the available clock sources, the second shows the current clock source, and the third would change it to TSC. On Windows, it’s handled automatically if you’re running a recently released EC2 instance. This can improve a lot of apps: JVM debugging, performance tracing, SAP applications.
  21. A recent change to the platform added P-state and C-state control, first with C4 and now available on many more instances. First, let’s talk about C-states, which control the power saving features of a processor. Using the c4.8xlarge as an example: it has a base clock speed of 2.9 GHz and can turbo up to 3.5 GHz on one or two cores, but it must let other cores idle down. That’s great when you need a few cores to have high frequencies. However, letting cores idle down increases the time it takes for them to respond when you actually want to use them. So if you have an application where latency is important, you can limit how deep they’ll sleep by setting the C-state parameter in grub.
  22. You can use P-states to set the desired running frequency of the cores. For some customers and some workloads, consistency is more important than performance. Some game servers are a good example: they operate in loops, and a loop needs to complete in the same time, every time. You can set the P-state to prevent the processor from scaling up and down, so it operates at the same frequency all the time.
  23. Next I want to talk about T2 and why it’s special. T2 instances are great general purpose instances and the lowest cost instances available on AWS, at about half a cent per hour for the t2.nano. They’re great for workloads where CPU demand varies over time: websites, developer environments, small databases. You start with a baseline level of performance that you can see in the chart above. The magic of T2 is that you earn credits when the instance is idle, which allows you to burst above the baseline. We launched T2 because we saw that most workloads aren’t using 100% of the CPU all of the time. The T2 family is a great way to still get the performance you need when you need it, and not pay for it when you don’t.
  24. Let’s talk about how credits work. You can think of the credits in a T2 like a bucket. When you boot the instance, you start with enough credits for the OS and application. When your app is up and running, you’ll use credits when you use CPU; a single credit lets you run 100% of one core for one minute. When the work dies down and the instance becomes idle, you’re earning new credits that start to fill up the bucket. Credits also expire after 24 hours if unused.
  25. To monitor those instances, use CloudWatch metrics. Two are available. The one in orange is the credit usage: it spikes when usage is high and shows you how many credits you’re using per minute. The blue is the balance; keep this above zero if you want more performance than baseline. Monitoring your credit balance lets you ensure you’re getting consistent performance on a T2, and it’s what you’ll want to hook on if you’re using Auto Scaling.
  26. We recently launched the X1, our biggest instance: 2 TB of RAM and 128 virtual CPUs. It’s great for apps that need a huge memory footprint: in-memory databases, big data processing, some HPC.
  27. When you have that much memory, managing it is important. On any system with multiple sockets, memory attached to the local socket will be faster than remote memory. This concept is called NUMA. On Intel, there’s a QPI between sockets; it’s the bus that transfers memory from one to another.
  28. Look at the r3.8xlarge as an example: two sockets, 122 GB of RAM on each socket, and between them 2 QPI links. An application on the left reading from the right will go over the QPI: fast, but not as fast as what’s attached directly.
  29. When you go to X1, things are more complex. X1 is a 4 socket system, so NUMA is more important. Compared to an r3.8xlarge, there’s more memory per socket and only one QPI between sockets, so memory transfers from one zone to another are going to take longer on X1.
  30. So what can we do? If you’ve ever watched top on a Linux system, it shows threads moving from one core to another; that’s process scheduling making sure work is balanced. Around the 3.8 kernel, the scheduler started to use NUMA affinity: it will try to keep processes in the same NUMA zone, and will also try to move memory around to be close to the process. The downside is that this can actually slow down performance on some apps, especially if you have a large memory pool spanning sockets; the scheduler will be moving things around when it doesn’t need to. To disable it, set numa=off in grub, which will disable memory transfers between zones and disable NUMA awareness for process scheduling. The alternative is to use numactl to lock processes to a specific zone, so they will only be reading and writing memory that’s local to them.
  31. Another thing to keep in mind: the operating system and libraries can affect application performance. Running a modern Linux kernel is important; run as recent a distro as you can. On a recent customer visit, a custom application using a large amount of memory was not performing as well on EC2 as on premises. Their app was very complex and it was hard to get quick results when making changes, so we found a benchmark tool (ebizzy) with similar behavior to test with.
  32. Results of ebizzy on RHEL6. We used perf to profile and see what’s happening at a system level. It generated about 12,000 requests/second, with lots of time in system space and 1.5 million page faults.
  33. We generated flame graphs to see what’s happening. Flame graphs were created by Brendan Gregg; check out his site for more information. They’re a really good way to understand the paths the code is taking through a system and the time spent in specific calls. You can see ebizzy on the bottom, making lots of madvise calls that end up with a Xen hypercall.
  34. We compiled the same app on RHEL7 and tested on the same instance type, and saw significantly better performance. RPS went from 12,000 to 425,000, and page faults went from 1.5 million to only 14,000.
  35. What happened? This is where flame graphs really shine. Same exact flame graph, same code, same run type; the only difference is the OS version. What the flame graph showed us: glibc changed the path of memory calls on RHEL7. Instead of a long madvise with a trip to the hypervisor, there’s a single Intel optimized call for memory management. Recompile when moving to a different OS; it can make a big difference.
  36. The last memory related tip is to disable transparent huge pages. Huge pages are a really big subject with a lot of different options; see the article, which goes into detail about all of them. Transparent huge pages are enabled by default on most recent distributions. Disabling them and using explicit huge pages can help significantly for apps that are accessing a lot of memory.
  37. Next, let’s talk about I/O. We have a few families that are optimized for I/O: I2 for IOPS, SSD based, and D2 for dense storage, magnetic. You need a modern kernel to get the best storage performance. The reason is the split driver model. The application on the left doing some disk I/O talks to the frontend driver, then the backend, then the real driver, then the hardware. Data transfer happens through shared pages, which need permissions to be granted and released.
  38. Granting had lots of overhead in early kernels. Every time the instance needs to write to disk, it talks to the VMM, gets permission to write to the device, fills a buffer with the data, passes it to the backend, waits for the data to be written, and removes the grant. It’s a really expensive process with lots of buffer flushing, and it gets worse the more CPUs you have.
  39. Persistent grants were created to solve this. Permission to write is reused for all transactions between front and back, so grants don’t need to be unmapped and the translation buffer is never flushed. The result is much better performance for I/O operations.
  40. Validating grants is easy: run dmesg and grep for blkfront. This is an i2.8xlarge; all volumes have persistent grants enabled.
  41. If I haven’t said it enough, using a modern kernel is really important. Many customers still use CentOS 6. Just by switching to an OS with a 3.10 kernel, I’ve seen as much as a 60% improvement. The 2.6 kernel in CentOS 6 was released in 2009, a long time ago in the cloud computing world. Please use a modern kernel and OS.
  42. Along the same lines as the split driver model, we released enhanced networking with C3. It uses Single Root I/O Virtualization (SR-IOV): the physical device is exposed to the instance itself. It has a few requirements: it needs a special driver installed in the OS, and EC2 needs to be told to expose it that way.
  43. The network path is much simpler. Packets don’t have to go through the VMM, giving higher packets per second and decreased jitter, since you’re talking to bare metal. It’s free on all supported instances and enabled by default in many AMIs. I highly recommend it if you’re touching the network.
  44. And we’re not done improving the network; we’re still making constant improvements. The latest is a new network adapter, launched with the X1, called Elastic Network Adapter (ENA). We built a new, Amazon developed, open source driver that will grow with us as we add new features to the network. It’s built to handle throughputs up to 20 gigabits/second. This, plus hardware checksums and RSS, makes it the fastest network available on AWS today.
  45. I’ll touch briefly on a few points about network performance; attend the deep dive to learn more. It’s easy to forget that the network can be a bottleneck on smaller instance types. One customer doing S3 performance testing was not getting good performance and found out all network traffic was going through a T2 NAT. The largest instances should get closer to 5 Gbps when leaving EC2 and talking to things like S3. When we list 10 and 20 gigabit bandwidth, instance bandwidth is bi-directional: on P2, X1, and M4, that’s 20 Gb in and out at the same time. But you need placement groups and multiple TCP streams.
  46. Just like network throughput, EBS throughput is a function of the size of the instance: larger instance, more EBS traffic. EBS optimization is on by default on the newest instances, so you don’t have to worry about network and EBS competing. Look at the EBS documentation for a table of every EBS optimized instance, with throughput and max IOPS. It’s a great place to look for the specific performance you can get out of EBS.
  47. In conclusion, there are lots of things you can do to get the most out of EC2. At bare minimum: benchmark your app, use a modern OS, monitor CloudWatch, and use enhanced networking.
  48. The goal is to make virtualization as transparent as possible and eliminate any inefficiencies it may cause. The goal is bare metal like performance, and we’re already there in a lot of ways.
  49. So if you have any questions, the EC2 documentation is a great resource and covers even more than I could today. Otherwise, launch an instance and start testing your app. Thank you!
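
The hyperthreading notes (slide 10 and note 11) describe Linux listing the “A” threads of each core before the “B” threads. As a minimal sketch of that rule, the helpers below expand the CPU lists that lscpu prints per NUMA node and split them into A and B halves. The function names are hypothetical; the sketch only builds the sysfs paths from slide 10 and writes nothing, and it assumes the A-before-B enumeration holds on the instance in question (check with lscpu first).

```python
def parse_cpu_list(spec):
    """Expand a kernel CPU list such as "0-15,64-79" into sorted ints."""
    cpus = set()
    for part in spec.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def split_threads(node_cpus):
    """Split one NUMA node's CPUs into (A threads, B threads).

    Assumption: Linux enumeration, where the lower-numbered half of a
    node's CPUs are the A threads and the upper half their B siblings.
    """
    half = len(node_cpus) // 2
    return node_cpus[:half], node_cpus[half:]


def online_paths(cpus):
    """sysfs files that slide 10 writes 0 to in order to offline a CPU."""
    return ["/sys/devices/system/cpu/cpu%d/online" % cpu for cpu in cpus]


if __name__ == "__main__":
    # NUMA node0 line from the x1.32xlarge lscpu output on slide 10.
    a_threads, b_threads = split_threads(parse_cpu_list("0-15,64-79"))
    print("keep:", a_threads)
    print("offline:", online_paths(b_threads))
```

For the node0 line on slide 10 this keeps CPUs 0 to 15 and marks 64 to 79 for offlining, which agrees with the `seq 64 127` loop on the slide once all four nodes are processed.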