Learn how to monitor your database performance closely and troubleshoot database issues quickly using a variety of features provided by Amazon RDS and MySQL including database events, logs, and engine-specific features. You also learn about the security best practices to use with Amazon RDS for MySQL. In addition, you learn about how to effectively move data between Amazon RDS and on-premises instances. Lastly, you learn the latest about MySQL 5.6 and how you can take advantage of its newest features with Amazon RDS.
11. Error Log
• Archived every 5 min
• Retained for 24 hours
• Example: Unable to start MySQL
Sample Log content:
InnoDB: Initializing buffer pool, size = 6.0G
InnoDB: Completed initialization of buffer pool
InnoDB: Fatal error: cannot allocate memory for the buffer pool
• Action: Audit memory parameters (e.g., innodb_buffer_pool_size)
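The memory audit suggested above can be sketched as a simple sanity check. This is a minimal illustration, not an RDS feature: the function name, the 7.5 GiB instance size, and the 75% ceiling are assumptions chosen for the example.

```python
GIB = 1024 ** 3

def buffer_pool_fits(buffer_pool_bytes, instance_memory_bytes, max_fraction=0.75):
    """Return True if the buffer pool stays within max_fraction of instance memory."""
    if instance_memory_bytes <= 0:
        raise ValueError("instance_memory_bytes must be positive")
    return buffer_pool_bytes <= instance_memory_bytes * max_fraction

# The 6.0G pool from the sample log, on a hypothetical 7.5 GiB instance.
print(buffer_pool_fits(6 * GIB, 7.5 * GIB))
```

A check like this could run before a parameter group change is applied, so an oversized innodb_buffer_pool_size is caught before the instance fails to start.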
12. Slow Query Log
• Download from the AWS Management Console
• Access from tables
• Connect the dots
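Once a slow query log file has been downloaded, the entries worth a closer look can be pulled out with a few lines of code. The sketch below assumes the standard "# Query_time: ... Lock_time: ..." header line that MySQL writes for each entry; the sample text and the one-second threshold are made up for illustration.

```python
import re

# Matches the per-entry header line of the MySQL slow query log.
QUERY_TIME = re.compile(
    r"^# Query_time: (?P<query>[\d.]+)\s+Lock_time: (?P<lock>[\d.]+)"
)

def slow_entries(log_text, min_seconds=1.0):
    """Yield (query_time, lock_time) for entries at or above min_seconds."""
    for line in log_text.splitlines():
        m = QUERY_TIME.match(line)
        if m and float(m.group("query")) >= min_seconds:
            yield float(m.group("query")), float(m.group("lock"))

# Illustrative sample only; not taken from a real instance.
SAMPLE = """\
# Query_time: 12.500000  Lock_time: 0.000100 Rows_sent: 1  Rows_examined: 900000
SELECT ...;
# Query_time: 0.200000  Lock_time: 0.000000 Rows_sent: 1  Rows_examined: 10
SELECT ...;
"""

print(list(slow_entries(SAMPLE)))
```

For anything beyond a quick filter, pt-query-digest from Percona Toolkit is the more complete option.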
21. Who we are, What we do
• Optaros is a global digital commerce service partner
• Hosting and support for multiple customers
• New and emerging shopping models
– Flash sales
– Private event retailing
• High traffic “Daily Deal” sites
– 5 million unique visitors
– 2,000 page views/second
– 15 add-to-carts/second
– 3 orders/second
• Using AWS since 2009, RDS since 2010
22. Private Event Retailing (PER)
• “Daily Deal” or “Private Sales”
• 24, 48, or 72 hour events
• Massive discounts designed to entice customers
• Invitation only
– Customers are selected based on purchase history
– Email blast is sent as the event starts
• Users can “reserve” items for a limited time by
adding them to their cart
• “Cyber Monday every Monday”
25. RDS in E-Commerce
• Highly transactional, ACID is a must
• Highly available
– Multi-AZ: fail-over, on-the-fly changes to RDS instances
• Massive write and read-intensive loads
– Writes: sign-up, add to cart, checkout – Provisioned IOPS
– Reads: catalog browsing, stock availability – read replicas
• Operational efficiency
– High/low peak traffic ratio is huge, sometimes as high as 100:1
– 50+ database servers with 5 devops engineers
26. Tools & Techniques
• Jenkins
– Event prep automation
• CloudFormation
– Environment management
• CloudWatch for metrics
– And Graphite for good measure
• Percona toolkit
– http://www.percona.com/software/percona-toolkit
• MONyog
• Optaros Cloud Console
– Database monitor
28. Jenkins
• We have automated jobs to “Scale up” the
infrastructure:
– Frontend servers – increase auto-scaling array to 30+
– Start up to 10 extra cache machines
– RDS read replicas – start 4 read replicas in parallel
• Jobs complete within 30 minutes – used to take
a lot longer before parallel read replica creation
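The parallel replica start-up can be sketched with a thread pool. This is not the Jenkins job from the talk: the replica naming scheme and the create_replica callable are hypothetical, and in practice the callable would wrap a real API call such as boto3's create_db_instance_read_replica.

```python
from concurrent.futures import ThreadPoolExecutor

def create_replicas_in_parallel(source_id, count, create_replica):
    """Start `count` replicas of `source_id` concurrently; return their names."""
    names = [f"{source_id}-replica{i}" for i in range(count)]
    with ThreadPoolExecutor(max_workers=count) as pool:
        # pool.map keeps input order and re-raises the first failure.
        list(pool.map(lambda name: create_replica(source_id, name), names))
    return names

def fake_create(source_id, replica_id):
    # Stand-in for a real API call; it only records what was requested.
    requested.append((source_id, replica_id))

requested = []
print(create_replicas_in_parallel("prod", 4, fake_create))
```

Issuing the four requests at once instead of one after another is what shortens the overall scale-up time, since each replica takes a while to become available.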
29. AWS CloudFormation
• Keep your RDS parameter groups, security
groups and network ACLs in sync across
environments
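Keeping parameter groups in sync comes down to comparing parameter maps between environments. The sketch below is an assumption about how such a comparison could look, not the stack tool used on this slide; the environment names are hypothetical and the values mirror the query_cache_size change in the stack diff.

```python
def diff_parameters(current, desired):
    """Return {name: (current_value, desired_value)} for each differing parameter."""
    changes = {}
    for name in sorted(set(current) | set(desired)):
        old, new = current.get(name), desired.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes

# Hypothetical environments; a missing parameter shows up as None.
staging = {"query_cache_size": 33554432, "thread_cache_size": 32}
prod = {"query_cache_size": 65554432, "thread_cache_size": 32}

print(diff_parameters(staging, prod))
```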
sorin-macbook:stacks sorin$ stack -d cross-client-tools-prod.rb
@@ -7188,7 +7188,7 @@
   "innodb_purge_threads": 1,
   "max_allowed_packet": 20971520,
   "max_connect_errors": "10000",
-  "query_cache_size": 33554432,
+  "query_cache_size": 65554432,
   "thread_cache_size": 32,
   "tx_isolation": "READ-COMMITTED"
30. Amazon CloudWatch and Graphite
• Graphite is our central system for metrics
– Pull RDS data from CloudWatch into Graphite
– Parse InnoDB and system variables and push to Graphite
– Application and system metrics go in there as well
• Single dashboard for the whole application
• Graphite’s API is polled by other alerting and
monitoring systems as well
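Pushing a metric into Graphite can be sketched with its plaintext protocol, where each line is "path value timestamp". The metric path and the numbers below are hypothetical, and fetching the value from CloudWatch is left out of the sketch.

```python
import socket
import time

def graphite_line(path, value, timestamp=None):
    """Format one metric as a Graphite plaintext line: path, value, timestamp."""
    ts = int(time.time() if timestamp is None else timestamp)
    return f"{path} {value} {ts}\n"

def send_to_graphite(lines, host, port=2003):
    """Send pre-formatted lines to a Carbon plaintext listener (default port 2003)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))

# Hypothetical metric path and values, printed rather than sent.
print(graphite_line("rds.prod.cpu_utilization", 42.5, 1384300800), end="")
```

Using one naming scheme for RDS, InnoDB, and application metrics is what makes the single dashboard possible.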
34. MONyog
• Commercial app for MySQL management
• Monitors and alerts on key metrics
• Useful diagnostics
– Caches
– Deadlocks
– Temporary tables
– etc.
• Advice on best practices
35. MONyog alert
Server: prod rds-read-replica0
Sampling timeframe: All Time/Current
Name: Currently running threads
Group: Current Connections
Type: Critical
Threshold: 500
Value: 1204
Advice: If the database is overloaded, you'll get an increased number of running queries. Occasional spikes are OK for a very short period of time. Too many active threads indicate that:
1. MySQL is taking too much time to process your requests.
2. You are continuously retrieving/updating large datasets.
Make sure that queries are tuned to use indexes. Execute SHOW FULL PROCESSLIST to find queries that are getting locked continuously. Try isolating long-running queries by enabling the slow query log.
36. Percona Toolkit
• http://percona.com/software/percona-toolkit
• pt-query-digest in particular
– Can be used on the slow query log or a tcpdump file
– Since you can’t get shell access to the RDS instances, capture the traffic on your application server instead
– # tcpdump -i eth0 port 3306 -s 65535 -x -n -q -tttt > tcpdump.out
– # pt-query-digest --type=tcpdump tcpdump.out
• pt-table-checksum won’t work
– It requires special privileges
– Fortunately, it’s really easy to rebuild read replicas
– sync_binlog can be a problem when using read replicas
– Less of a problem with MySQL 5.6 crash-safe slaves
37. In-House Database Monitor
• “Snapshot” InnoDB status and process list every
10 seconds
• Go back in time up to 7 days
• Helps identify contentions, rogue queries, etc.
• Uses Amazon S3 for storage
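The slides do not describe how the monitor is built. As a rough sketch of the storage side, each snapshot could be written under a time-based key so that going back in time becomes a key lookup and anything older than 7 days can be pruned. The key layout and names below are hypothetical, and the S3 upload itself is omitted.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)

def snapshot_key(instance_id, taken_at):
    """Build a sortable object key from the instance name and capture time."""
    return f"{instance_id}/{taken_at:%Y/%m/%d/%H%M%S}.txt"

def expired(taken_at, now):
    """True if a snapshot is older than the 7-day retention window."""
    return now - taken_at > RETENTION

# Hypothetical instance name and capture time.
now = datetime(2013, 11, 13, 12, 0, 10, tzinfo=timezone.utc)
print(snapshot_key("prod-db", now))
print(expired(now - timedelta(days=8), now))
```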
41. Up Next
• Manage read replicas using CloudFormation
• Use Provisioned IOPS more for lower latency
• Upgrade more environments to MySQL 5.6
• Better disaster recovery – cross-region DB snapshot
43. Titans Group
• VAS (Value Added Services) provider for mobile
and fixed-line carriers and ISPs
• White label personal cloud, mobile security and
mobile learning products
• Over 10 million active users in 17 countries in
Latin America
44. Carrier billing platform
• Complex business rules (trial and subscription
periods, bundle, self-renewal)
• Lots of safeguards to prevent overcharge
• High volume, high value data
• Uptime counts: lost transaction is lost revenue
• Transactions concentrated in some days of the
month
• Many different regulatory issues for logging, privacy
and data retention
46. Before
• Single pair of on-premises MySQL servers in
master-slave configuration
• Less than 100k transactions a day but growing
fast
• No full-time DBA
• Rapidly iterating the application (while
converting from PHP to Python)
47. Problems
• Upgrading memory, CPU and storage (SSD)
and still hitting hardware bottlenecks
• Database for queues (please, don't!)
48. The turning point
• AWS announces Provisioned IOPS Storage for
RDS in September 2012
• Let's migrate!
49. Migrating from on-premises to RDS
• Then: dump from MySQL and load on RDS,
replay binary logs on RDS (downtime)
• Percona Toolkit pt-table-sync for sanity checks
• Now (much easier!): RDS as slave, promote
slave to master (almost online)
51. After
• Several RDS instances
• Specialized databases by function (contracts,
transactions, whitelists, blacklists)
• Several million transactions a day and still
growing fast
• Still no full-time DBA
52. How RDS helped us
• Focus on application versus focus on database
operation
• Easy scaling up
• Multi-AZ - High availability (99.95% Uptime SLA)
• Read Replica – for read load and ad hoc analysis
• Snapshots - For testing and archival
• Tagging - Cost reporting by product and client