Maintaining a constantly updated large data set is a big challenge not only for database administrators but also for developers, because it is hard to maintain and expand. The challenge grows when the requirement is to serve real-time data to heavy-traffic websites.
In this presentation, we first examine the initial characteristics of AOL’s Real Time News system, the design strategy, and how MySQL fits into the overall architecture. We then review the issues encountered and the solutions applied when the system characteristics changed due to ever growing data set size and new query patterns.
In addition to common MySQL design, trouble-shooting, and performance tuning techniques, we will also share a heuristic algorithm implemented in the application servers to reduce the response time of complex queries from hours to a few milliseconds.
1. Building and Deploying Large Scale Real Time News System with MySQL and Distributed Cache
Presented to MySQL Conference, Apr. 13, 2011
2. Who am I?
Tao Cheng <tao.cheng@teamaol.com>, AOL Real Time News (RTN).
Worked on Mail and Browser clients in the '90s and then moved to web backend servers.
Not an expert, but happy to share my experience and brainstorm solutions.
Presentation for
[CLIENT]
3. Agenda
AOL Real Time News (RTN): what is it?
Requirements
Technical solutions with focus on MySQL
Deployment Topology
Operational Monitoring
Metrics Collection
4. Agenda
Tips for query tuning and optimization
Heuristic Query Optimization Algorithm
Lessons learned
Q & A
5. Real Time News : background
AOL deployed its large scale Real Time News (RTN) system in 2007.
This system ingests and processes news from 30,000 sources every second around the clock. Today, its data store, MySQL, has accumulated several billion rows and terabytes of data.
Nevertheless, news is delivered to end users in close to real time. This presentation shares how it is done and the lessons learned.
Presentation for
AOLU Un-University
6. Brief Intro: sample features
Data presentation: return most recent news in
flat view – most recent news about an entity. An entity could be a person, a company, a sports team, etc.
topic clusters – most recent news grouped by topics. A topic is a group of news about an event, headline news, etc.
News filtering by
source types such as news, blogs, press releases, regional, etc.
relevancy level (high, medium, low, etc.) to the entities.
Data delivery: push (to subscribers) and pull
Search by entities, categories (National, Sports, Finance, etc.), topics, document ID, etc.
7. Requirements for Phase I (2006)
Commodity hardware: 4 CPUs, 16 GB memory, 600 GB disk space.
Data ingestion rate = 250K docs/day; average document size = 5 KB.
Data retention period: 7 days to forever
Est. data set size: (1.25 GB/day or 456 GB/year) + space for indexes, schema change, and optimization.
Response time: < 30 milliseconds/query
Throughput: > 400 queries/sec/server
Uptime: 99.999%
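The sizing figures above can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal units (1 GB = 1,000,000 KB) and a 365-day year, and the variable names are illustrative rather than taken from the RTN system. It also derives the downtime budget implied by the uptime target.

```python
# Back-of-the-envelope sizing from the Phase I requirements (decimal units assumed).
DOCS_PER_DAY = 250_000    # ingestion rate from the slide
AVG_DOC_KB = 5            # average document size from the slide
UPTIME_TARGET = 0.99999   # uptime requirement from the slide

gb_per_day = DOCS_PER_DAY * AVG_DOC_KB / 1_000_000   # KB -> GB
gb_per_year = gb_per_day * 365

# Downtime budget implied by the uptime target, in minutes per year.
minutes_per_year = 365 * 24 * 60
downtime_minutes = (1 - UPTIME_TARGET) * minutes_per_year

print(f"{gb_per_day:.2f} GB/day, {gb_per_year:.0f} GB/year")
print(f"allowed downtime: {downtime_minutes:.2f} minutes/year")
```

The raw figure excludes the extra space for indexes, schema change, and optimization that the slide calls out separately.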
8. Solutions: MySQL + Bucky
MySQL
Serve raw/distinct queries
Back fill
Bucky Technology (AOL’s distributed cache & computing framework)
Write ahead cache: pre-compute query results and push them into cache.
Messaging (optional): push data directly to subscribers
Updates are pushed to data consumers or browsers via AIM Complex.
Updates go to both database and cache.
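The write-ahead cache idea can be illustrated with a toy model. This is a sketch under stated assumptions, not Bucky's actual API: plain dictionaries stand in for MySQL and for the distributed cache, and the class and method names (`WriteAheadStore`, `ingest`, `latest_news`) are hypothetical. It also assumes "back fill" means rebuilding a missing cache entry from the database.

```python
class WriteAheadStore:
    """Toy model: every update goes to the database and to the cache."""

    def __init__(self, top_n=3):
        self.db = {}       # entity_id -> list of (timestamp, doc_id); stands in for MySQL
        self.cache = {}    # entity_id -> precomputed "latest N" list; stands in for the cache
        self.top_n = top_n

    def _latest(self, entity_id):
        # Newest first, limited to top_n documents.
        rows = sorted(self.db.get(entity_id, []), reverse=True)[: self.top_n]
        return [doc_id for _, doc_id in rows]

    def ingest(self, entity_id, timestamp, doc_id):
        # 1. Write to the database.
        self.db.setdefault(entity_id, []).append((timestamp, doc_id))
        # 2. Pre-compute the query result and push it into the cache.
        self.cache[entity_id] = self._latest(entity_id)

    def latest_news(self, entity_id):
        # Reads are served from the cache; on a miss, back fill from the database.
        if entity_id not in self.cache:
            self.cache[entity_id] = self._latest(entity_id)
        return self.cache[entity_id]


store = WriteAheadStore()
for ts, doc in [(1, "d1"), (2, "d2"), (3, "d3"), (4, "d4")]:
    store.ingest("acme", ts, doc)
print(store.latest_news("acme"))
```

Because results are computed at write time, a read is a single cache look-up; the database is only consulted when the cache entry is missing.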
10. Data Model: SOR vs. Query DB
Separate query from storage to keep tables small and queries fast.
System of Record (SOR): has all raw data
The authoritative data store; designed for data storage
Normalized schema: for simple key look-up; no table join.
Query DB – de-normalized for query speed
Avoid JOIN, reduce # of trips to DB, increase throughput.
Read/write small chunks of data at a time so the database can get requests out quickly and process more.
Use replication to achieve linear scalability for reads.
11. Design Strategies: partitioning (Why)
Dataset too big to fit on one host
Performance consideration: divide and conquer
Write: more masters (Nx) to take writes
Read: smaller tables + more (NxM) slaves to handle reads.
Fault tolerance – distribute the risk and reduce the impact of system failure
Easier maintenance – size does matter
Faster nightly backup, disaster recovery, schema change, etc.
Faster optimization – need optimization to reclaim disk space after deletion, rebuild indexes to improve query speed.
12. Design Strategies: partitioning (How)
Partition on most used keys (look at query patterns)
Document table – on document ID
Entity table – on entity ID
Simple hash on IDs – no partition map; thus no competition for read/write locks on yet another table
Managing growth: add another partition set
New documents are written into both old and new partition sets for a few weeks. Then, stop writing into the old partitions.
Queries go to the new partitions first and then the old ones if insufficient results are found.
Works great in our case but might not for everyone.
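The partitioning scheme above can be sketched in a few lines. The modulo hash, the partition count, and the names `partition_for`, `PartitionSet`, and `lookup` are assumptions made for illustration; the slides only say "simple hash on IDs". Dictionaries stand in for the MySQL partitions.

```python
NUM_PARTITIONS = 4  # hypothetical number of partitions per set


def partition_for(doc_id, num_partitions=NUM_PARTITIONS):
    """Simple hash on the ID: no partition map table to read or lock."""
    return doc_id % num_partitions


class PartitionSet:
    """A set of partitions; each dict stands in for one database partition."""

    def __init__(self, num_partitions=NUM_PARTITIONS):
        self.partitions = [dict() for _ in range(num_partitions)]

    def write(self, doc_id, doc):
        self.partitions[partition_for(doc_id, len(self.partitions))][doc_id] = doc

    def read(self, doc_id):
        return self.partitions[partition_for(doc_id, len(self.partitions))].get(doc_id)


def lookup(doc_id, new_set, old_set):
    """Query the new partition set first, then the old one if nothing is found."""
    doc = new_set.read(doc_id)
    return doc if doc is not None else old_set.read(doc_id)


old_set, new_set = PartitionSet(), PartitionSet()
old_set.write(5, "older story")    # written before the new set existed
new_set.write(9, "recent story")   # written after writes to the old set stopped
print(lookup(9, new_set, old_set), "|", lookup(5, new_set, old_set))
```

During the overlap period a document would be written to both sets, so the fallback to the old set is only needed for documents older than the new set.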
13. Schema design: De-normalization
Make query tables small:
put only essential attributes in the de-normalized tables
store long text attributes in separate tables.
De-normalization: how to store and match attributes
Single value attributes (1:1): document ID, short string, date time, etc. – one column, one row.
Multi-value attributes (1:many): tricky but feasible
Use multiple rows with composite index/key: (c1, c2, etc.)
One row, one column: CSV string, e.g., “id1, id2, id3” – SQL: “val like ‘%id2%’”
One row but multiple columns, e.g., group1, group2, etc. – SQL: group1=val1 OR group2=val2 ...
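One thing to watch with the CSV-string option is that a plain substring match can return false positives when one ID is a prefix of another. The sketch below illustrates this with Python's built-in sqlite3 module standing in for MySQL; the table and column names are hypothetical, and the CSV values are stored without spaces. Note that `||` is string concatenation in SQLite, whereas MySQL would typically use CONCAT().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_entities (doc_id INTEGER, entity_ids TEXT)")
conn.executemany(
    "INSERT INTO doc_entities VALUES (?, ?)",
    [(1, "id1,id2,id3"), (2, "id21,id4")],
)

# Naive substring match: also returns doc 2, because 'id21' contains 'id2'.
naive = [row[0] for row in conn.execute(
    "SELECT doc_id FROM doc_entities "
    "WHERE entity_ids LIKE '%id2%' ORDER BY doc_id")]

# Delimiter-aware match: wrap the column in commas so only whole IDs match.
exact = [row[0] for row in conn.execute(
    "SELECT doc_id FROM doc_entities "
    "WHERE ',' || entity_ids || ',' LIKE '%,id2,%' ORDER BY doc_id")]

print("naive:", naive, "exact:", exact)
```

Fixed-width or otherwise non-overlapping IDs avoid the problem without changing the query.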
14. Tips for indexing
Simple key – for metadata retrieval
Composite key – find matching documents
Start with low cardinality and most used columns
Order matters: (c1, c2, c3) != (c2, c3, c1)
InnoDB – all secondary indexes contain primary key
Make primary key short to keep index size small
Queries using a secondary index reference the primary key too.
Integer vs. string – comparison of numeric values is faster => index hash values of long strings instead.
Index length – title:varchar(255) => idx_title(32)
Enforce referential integrity on application side.
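The "index hash values of long strings" tip can be sketched as follows. This is an illustration only: CRC-32 via Python's zlib is one possible hash (the slides do not say which was used), a dictionary stands in for an index on a hypothetical integer `title_hash` column, and the function names are made up. Because different strings can share a hash, the look-up still compares the full string.

```python
import zlib


def title_hash(title):
    """32-bit integer hash of a long string, suitable for an integer index column."""
    return zlib.crc32(title.encode("utf-8"))


# hash -> list of (doc_id, title); stands in for an index on a title_hash column
index = {}


def insert(doc_id, title):
    index.setdefault(title_hash(title), []).append((doc_id, title))


def find_by_title(title):
    # Look up by the integer hash, then compare the full string to rule out collisions.
    return [doc_id for doc_id, t in index.get(title_hash(title), []) if t == title]


insert(1, "Markets rally as earnings beat expectations")
insert(2, "Local team wins championship after overtime thriller")
print(find_by_title("Markets rally as earnings beat expectations"))
```

The equivalent query would filter on both the hash column and the original string, so the integer index narrows the candidates and the string comparison confirms the match.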
15. MySQL configuration
Storage engine: InnoDB – row level locking
Table space – one file per table
Easier to maintain (schema change, optimization, etc.)
Character set: ‘UTF-8’
Disable persistent connection (5.0.x)
skip-character-set-client-handshake
Enable slow query log to identify bad queries.
System variables for memory buffer size
innodb_buffer_pool_size: data and indexes
sort_buffer_size, max_heap_table_size, tmp_table_size
Query cache size=0; tables are updated constantly
16. Runtime statistics (per server)
Average write rate:
daily: < 40 tps
max at 400 tps during recovery
Performs best when write rate < 100 tps
Query rate: 20~80 qps
Query response time – shorter when indexes and
data are in memory
75%: ~3 ms when qps < 15; ~2 ms when qps ~= 60
95%: 6~8 ms when qps < 15; 3~4 ms when qps ~= 60
CPU Idle %: > 99%.
18. Deployment Topology Consideration
• Minimum configuration: host/DC redundancy
• DC1: host 1 (master), host 3 (slave)
• DC2: host 2 (failover master), host 4 (slave)
• Data locality: significant when network latency is a concern (100 Mbps)
• 3,000 qps when DB is on remote host.
• 15,000 qps when DB is on local host.
• Linking dependent servers across data centers
• Push cross link up as far as possible (Topology 3): link to dependent servers in the same data center.
19. Deployment Topology 1: minimum config
[Figure: deployment diagram. Labels: Data Center 1 (DB, DB), Data Consumer, WWW, Data Center 2 (DB, DB).]
20. Topology 2: link across DCs (bad)
[Figure: deployment diagram. Labels: DB, Data Consumer, VIP, GSLB, WWW.]
21. Topology 3: link to same DC (better)
[Figure: deployment diagram. Labels: DB, Data Consumer, VIP, GSLB, WWW.]
22. Topology 4: use local UNIX socket
[Figure: deployment diagram. Labels: DB, Data Consumer, VIP, GSLB, WWW.]
23. Production Monitoring
Operational monitoring: logcheck, Scout/NOC alert, etc.
DB monitoring of replication failure, latency, read/write rate, performance metrics.
24. Metrics Collection
Graphing collected metrics: visualize and collate operational metrics.
Helps analyze and fine-tune server performance.
Helps trace production issues and identify points of failure.
What metrics are important?
Host: CPU, MEM, disk I/O, network I/O, # of processes, CPU swap/paging
Server: throughput, response time
Comparison: line up charts (throughput, response time, CPU, disk I/O) in the same time window.
28. Tuning and Optimizing Queries
Explain: mysql> explain SELECT ... FROM …
Watch out for tmp table usage, table scan, etc.
SQL_NO_CACHE
MySQL Query profiler
mysql> set profiling=1;
Linux OS Cache: leave enough memory on host
USE INDEX hint to choose INDEX explicitly
Use wisely: most of the time, MySQL chooses the right index for you. But when table size grows, index cardinality might change.
29. Important MySQL statistics
SHOW GLOBAL STATUS…
Qcache_free_blocks
Qcache_free_memory
Qcache_hits
Qcache_inserts
Qcache_lowmem_prunes
Qcache_not_cached
Qcache_queries_in_cache
Select_scan
Sort_scan
30. Important MySQL statistics (cont.)
Table_locks_waited
Innodb_row_lock_current_waits
Innodb_row_lock_time
Innodb_row_lock_time_avg
Innodb_row_lock_time_max
Innodb_row_lock_waits
Select_scan
Slave_open_temp_tables
31. Heuristic Query Optimization Algorithm
Primarily for complex cluster queries: find latest N topics and related stories.
Strategy: reduce the number of records the database needs to load from disk to perform a query.
Pick a default query range. If insufficient docs are returned, expand query range proportionally.
If none are returned => sparse data => drop the range and retry.
Save query range for future reference.
Result: reduce number of rows to process from millions to hundreds => cut query time down from minutes to less than 10 ms.
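The strategy above can be sketched roughly as follows. This is an illustration under several assumptions, not the production implementation: the query range is taken to be a time window in seconds, `FakeDB` stands in for MySQL, and the constants `DEFAULT_RANGE`, `MAX_BOUNDED_TRIPS`, and `MARGIN` as well as all function names are hypothetical. The cap of two bounded attempts is a reading of the "NumOfTripToDB >= 2?" check in the flow chart.

```python
DEFAULT_RANGE = 3600.0   # hypothetical default window, in seconds
MAX_BOUNDED_TRIPS = 2    # assumed cap on range-bounded attempts
MARGIN = 1.2             # hypothetical safety factor when prorating the range


class FakeDB:
    """Stands in for the database: rows are (timestamp, doc_id) pairs."""

    def __init__(self, rows):
        self.rows = sorted(rows, reverse=True)  # newest first

    def query(self, now, window=None):
        # window=None is the original, unbounded query.
        if window is None:
            return [doc for _, doc in self.rows]
        return [doc for ts, doc in self.rows if now - ts <= window]


def latest_docs(db, key, needed, now, saved_ratios):
    """Return the newest `needed` docs, bounding the query by a learned range."""
    ratio = saved_ratios.get(key)  # docs per second seen on an earlier query
    window = needed / ratio * MARGIN if ratio else DEFAULT_RANGE
    for _ in range(MAX_BOUNDED_TRIPS):
        docs = db.query(now, window)
        if len(docs) >= needed:
            saved_ratios[key] = len(docs) / window  # save for future use
            return docs[:needed]
        if not docs:
            break  # sparse data: drop the range
        # Insufficient results: expand the range proportionally.
        window = needed / (len(docs) / window) * MARGIN
    return db.query(now)[:needed]  # fall back to the original query


now = 100_000.0
db = FakeDB([(now - i * 100, f"d{i}") for i in range(100)])  # one doc per 100 s
ratios = {}
print(latest_docs(db, "entity-1", 10, now, ratios))
```

The saved docs-to-range ratio lets later queries for the same key start with a window that is already close to the right size, so most requests need a single bounded trip to the database.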
32. [Flow chart: query range look-up for cluster queries. Recoverable box labels, in reading order:]
Cluster query -> query range look-up; NumOfTripToDB = 0
Has query range? no: use default range. yes: compute docs-to-range ratio and prorate it to a range that would return a sufficient number of docs.
Bound query with the range and send it to DB; NumOfTripToDB++
Sufficient results from query engine? yes: compute docs-to-range ratio and save it back to the look-up table for future use; return query results to clients.
numOfResults == 0?
NumOfTripToDB >= 2?
Send original query to DB.
33. Lessons Learned
Always load test well ahead of launch (2 weeks) to avoid fire drills.
Don’t rely on cache solely. Database needs to be able to serve a reasonable amount of queries on its own.
Separate cache from applications to avoid cold start.
Keep transactions/queries simple and return fast.
Avoid table joins; limit them to 2 if really needed.
Avoid stored procedures: results are not cached; need DBA when altering implementation.
34. Lessons Learned (cont.)
Pag
e 34
Avoid using ‘offset’ in LIMIT clause; use application
based pagination instead.
Avoid ‘SQL_CALC_FOUND_ROWS’ in SELECT
If possible, exclude text/blob columns from query
results to avoid disk I/O.
Store text/blob in separate table to speed up backup,
optimization, and schema change.
Separate real time v.s. archive data for better
performance and easier maintenance.
Keep table size under control ( < 100 GB) ; optimized
periodically.
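The slides do not spell out what "application based pagination" looked like in RTN. One common way to avoid OFFSET is keyset (seek) pagination, where the application remembers the last key it saw and asks for rows beyond it. The sketch below shows that technique with Python's built-in sqlite3 module standing in for MySQL; the table, column, and function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (doc_id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [(i, f"story {i}") for i in range(1, 11)],
)


def fetch_page(conn, page_size, before_id=None):
    """Return the next page of newest-first rows, seeking by primary key."""
    if before_id is None:
        sql = "SELECT doc_id, title FROM docs ORDER BY doc_id DESC LIMIT ?"
        params = (page_size,)
    else:
        sql = ("SELECT doc_id, title FROM docs WHERE doc_id < ? "
               "ORDER BY doc_id DESC LIMIT ?")
        params = (before_id, page_size)
    return conn.execute(sql, params).fetchall()


page1 = fetch_page(conn, 4)
# The application keeps the last doc_id it saw and passes it back for the next page.
page2 = fetch_page(conn, 4, before_id=page1[-1][0])
print([row[0] for row in page1], [row[0] for row in page2])
```

With an offset, the database still has to read and discard all skipped rows; seeking by key lets it start directly at the right position in the index.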
35. Lessons Learned (cont.)
Pag
e 35
Put SQL statement (templates) in resource files so
you can tune it without binary change.
Set up replication in dev & qa to catch replication
issues earlier
Transactional (MySQL 5.0.x) v.s. data/mixed (5.1 or above)
Auto-increment + (INSERT.. ON DUPLICATE UPDATE…)
Date time column: default to NOW()
Oversized data: increase max_allowed_packet
Replication lag: transactions that involve index update/
deletion often take longer to complete.
Host and data center redundancy is important –
don’t put all eggs in one basket.
36. RTN 3 Redesign
Free text search with SOLR
Real time vs. archive shards.
1 minute latency w/o Ramdisk.
Asset DB partitioned – 5 rows/doc -> 25 rows/doc
Avoid (System) Virtual Machine; instead, stack high end hosts with processes that use different system resources (CPU, MEM, disk space, etc.)
Better network and system resource utilization – cost effective.
Data locality
More processors (< 12) help when under load.
37. Q&A
Questions or comments?
38. THANK YOU!!