TiDB and Amazon Aurora can be combined to provide analytics on transactional data without a separate data warehouse. The TiDB Data Migration (DM) tool migrates and replicates data from Aurora into TiDB for analytical queries: it performs a full data migration and then incrementally replicates binlog events from Aurora into TiDB. This allows running transactional and analytical workloads against the same dataset without building ETL pipelines.
3. Why Aurora?
● Amazon Aurora is popular
● Amazon Aurora is designed for OLTP workloads
● Amazon Aurora still uses the MySQL code base
● It would be cool to do some lightweight OLAP queries directly on the transactional database
● No need to involve a data warehouse!
● Example: a dashboard for each tenant in SaaS applications
6. Why Aurora is fast
● Fewer I/Os, less write amplification
● Smaller network packets
● Eventual consistency
7. Aurora Pros & Cons
Pros:
● 100% MySQL compatibility
● Fully managed
● Scalable reads
Cons:
● Single write point (to scale out writes, you still need sharding)
● The SQL layer is not designed for complex queries
● Readers are eventually consistent
● Memory size and storage size are not proportional
8. Aurora typical scenarios
● Read >> Write, and all writes can be handled by one node
● Data size is not too large
  ● Reason: performance fluctuates significantly if the gap between memory size and actual storage size is too large
● Readers don't require strong consistency
● Migration from legacy MySQL applications
9. What's TiDB
[Architecture diagram: applications (MySQL/MariaDB clients, ORMs, JDBC/ODBC) speak the MySQL wire protocol to a layer of stateless TiDB servers (the SQL engine); the TiDB servers issue key-value and coprocessor API calls to TiKV nodes (Store 1, Store 2, ..., Store N); a three-node PD cluster holds cluster metadata and receives heartbeats from the TiKV nodes]
10. TiDB is not a database middleware
                                          Sharding middleware   TiDB
ACID Transaction Support                  Mostly                Yes
Elastic Scaling                           No                    Yes
Complex Query (Join, Subquery, GROUP BY)  No                    Yes
Failover                                  Manual                Auto
MySQL Compatibility                       Low                   High
Max Capacity (Good performance)           Few TBs               200 TB+
12. Storage Physical Stack
Highly layered. Inside TiKV, from top to bottom:
● API (gRPC), exposing a Raw KV API and a Transactional KV API (see the sketch after this list)
● Transaction
● MVCC
● Multi-Raft (gRPC)
● RocksDB
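To make the two entry points concrete, here is a minimal illustrative sketch of the layering; the interface and method names are assumptions for illustration, not the actual TiKV client API. The Raw KV API skips the Transaction and MVCC layers, while the Transactional KV API goes through them (and through two-phase commit):

```go
package kv

// RawKV models TiKV's raw entry point: no transactions, no MVCC,
// just ordered byte-string keys and values.
type RawKV interface {
	Put(key, value []byte) error
	Get(key []byte) ([]byte, error)
}

// Txn models one transaction running above the MVCC layer.
type Txn interface {
	Put(key, value []byte) error
	Get(key []byte) ([]byte, error)
	Commit() error // two-phase commit in the real system
}

// TxnKV models the transactional entry point layered on the same storage.
type TxnKV interface {
	Begin() (Txn, error)
}
```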
13. Data organization within TiDB
[Diagram: one TiKV node (Store 1) holds Regions 1-4. All Regions share a single local RocksDB instance, which stores ordered key-value pairs: row keys such as t1_r1 -> v1, t1_r2 -> v2, ..., t5_r1, t5_r10, ..., and index keys such as t1_i1_1_1, t1_i1_2_2, t1_i6_1_3, ... Consecutive key ranges form the Regions; the key scheme is sketched below.]
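The key names in the diagram follow TiDB's table-to-key-value mapping: row keys are shaped like t{tableID}_r{rowID} and secondary-index keys like t{tableID}_i{indexID}_{indexedValue}_{rowID}, so a table's rows and each of its indexes form contiguous, ordered key ranges. A minimal sketch of that scheme, simplified to readable strings (the real encoding is a binary, memory-comparable format):

```go
package keys

import "fmt"

// RowKey builds the key for a table row, e.g. t1_r2
// (table 1, row 2). Real TiDB encodes these as binary,
// memory-comparable byte strings; this is a readable simplification.
func RowKey(tableID, rowID int64) string {
	return fmt.Sprintf("t%d_r%d", tableID, rowID)
}

// IndexKey builds the key for a secondary-index entry,
// e.g. t1_i1_1_1 (table 1, index 1, indexed value 1, row 1).
func IndexKey(tableID, indexID, indexedValue, rowID int64) string {
	return fmt.Sprintf("t%d_i%d_%d_%d", tableID, indexID, indexedValue, rowID)
}
```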
14. Data organization within TiDB
● Region (a bunch of key-value pairs, also called a Split)
  ● Default total size of one Region: 96 MB (configurable)
  ● Ordered
  ● Region is a logical concept
  ● Region meta: [Start key, End key) (see the sketch after this list)
● All Regions within a TiKV node share the same RocksDB instance
● Each Region is a Raft group
  ● Default: 3 replicas
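A minimal sketch of Region metadata and of routing a key to its Region; the type and field names are illustrative assumptions, not PD's actual schema:

```go
package region

import "sort"

// Region holds the metadata tracked for one Region:
// a half-open key range [StartKey, EndKey).
type Region struct {
	ID       uint64
	StartKey string // inclusive
	EndKey   string // exclusive; "" means +infinity
	Replicas int    // default 3: one Raft group per Region
}

// Locate returns the Region owning key, assuming regions are
// sorted by StartKey and together cover the whole key space.
func Locate(regions []Region, key string) Region {
	// The owner is the first Region whose EndKey lies beyond the key.
	i := sort.Search(len(regions), func(i int) bool {
		return regions[i].EndKey == "" || key < regions[i].EndKey
	})
	return regions[i]
}
```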
15. How scale-out works inside TiDB
[Diagram: TiKV Nodes 1-3 (Stores 1-3) each hold a replica of Region 1, the leader replica marked Region 1*; a three-node PD cluster oversees the stores]
Let's say the amount of data within Region 1 exceeds the threshold (default: 96 MB).
16. How scale-out works inside TiDB
[Diagram: same cluster as before]
Let's say the amount of data within Region 1 exceeds the threshold (default: 96 MB).
Region 1's leader: "I think I should split up Region 1."
17. How scale-out works inside TiDB
[Diagram: each of the three nodes now holds Region 1 and Region 2, with the leaders (Region 1*, Region 2*) on the same node]
Region 1 is split into two smaller Regions: the leader of Region 1 sends a Split command as a special log entry to its replicas via the Raft protocol. Once the Split command is committed by Raft, the Region has been successfully split (sketched below).
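A minimal sketch of that split path (types and function names are illustrative, not TiKV's actual internals): the leader periodically checks its Region's size, and when the size exceeds the threshold it proposes a Split command as an ordinary Raft log entry, so the split commits with the same quorum guarantees as normal writes.

```go
package split

const regionMaxSize = 96 << 20 // 96 MB default split threshold

// SplitCmd is replicated through Raft like any other log entry;
// once committed, every replica cuts its range at SplitKey.
type SplitCmd struct {
	RegionID    uint64
	SplitKey    string // new boundary: the old Region keeps [start, SplitKey)
	NewRegionID uint64
}

// leader abstracts the Region leader's view of its own data.
type leader interface {
	ApproxSize() uint64 // estimated bytes stored in the Region
	MiddleKey() string  // a key near the midpoint of the range
	ProposeToRaft(SplitCmd) error
}

// maybeSplit is run periodically by the Region leader.
func maybeSplit(l leader, regionID, newRegionID uint64) error {
	if l.ApproxSize() < regionMaxSize {
		return nil // nothing to do
	}
	cmd := SplitCmd{RegionID: regionID, SplitKey: l.MiddleKey(), NewRegionID: newRegionID}
	// The split takes effect only after a quorum commits this entry.
	return l.ProposeToRaft(cmd)
}
```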
18. How scale-out works inside TiDB
[Diagram: a new, empty TiKV Node 4 (Store 4) joins the cluster]
PD: "Hey, Node 1, create a new replica of Region 2 on Node 4, and transfer your leadership of Region 2 to Node 2."
19. How scale-out works inside TiDB
[Diagram: Region 2 now also has a replica on Node 4, and its leadership (Region 2*) has moved to Node 2; Node 1 still holds its old replica of Region 2]
20. How scale-out works inside TiDB
[Diagram: Node 1 has dropped its replica of Region 2; Region 2's three replicas now live on Nodes 2, 3, and 4]
PD: "OK, Node 1, delete your local replica of Region 2."
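Slides 18-20 together are one iteration of PD's replica-rebalancing loop. A minimal sketch of the idea (illustrative types; real PD scheduling also weighs store labels, leader balance, and many other constraints). Note the order: the new replica is added first and the old one removed last, so the Raft group never loses redundancy.

```go
package pd

// Operation is one scheduling step PD asks the stores to perform.
type Operation struct {
	Kind               string // "add-replica", "transfer-leader", "remove-replica"
	RegionID           uint64
	FromStore, ToStore uint64
}

// rebalance moves one replica of a Region off the busiest store and
// onto the emptiest one, transferring leadership away first if needed.
func rebalance(regionID, busiest, emptiest, leaderStore uint64) []Operation {
	ops := []Operation{
		{Kind: "add-replica", RegionID: regionID, FromStore: busiest, ToStore: emptiest},
	}
	if leaderStore == busiest {
		// A leader must step down before its replica can be removed.
		ops = append(ops, Operation{Kind: "transfer-leader", RegionID: regionID, FromStore: busiest})
	}
	ops = append(ops, Operation{Kind: "remove-replica", RegionID: regionID, FromStore: busiest})
	return ops
}
```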
22. TiDB Pros & Cons
Pros:
● Scales out well on both reads and writes, without manual sharding or sharding rules
● Full-featured SQL layer designed for distributed computing
● ACID semantics, transparent to the application layer
● MySQL compatibility
Cons:
● Not 100% MySQL compatible (full list here)
● Less economical when data is small
● Extra latency introduced by 2PC
● More complex to deploy than standalone MySQL
23. TiDB typical scenarios
● High-concurrency, large-data-volume OLTP applications
● Data volume is growing rapidly and will soon exceed the capacity of a single machine
  ● Typical size: 1 TB ~ 200 TB
● Migration from MySQL for existing applications
  ● No obvious access hotspot
● SaaS applications
  ● No obvious sharding key; the application needs multi-dimensional SQL queries
● Synchronizing multiple MySQL masters in real time via binlog, and using SQL for real-time data analysis in TiDB
46. Analytics on Read-Replica? (TPC-H Q14)
SELECT 100.00 * SUM(CASE WHEN p_type LIKE 'PROMO%'
                         THEN l_extendedprice * (1 - l_discount)
                         ELSE 0 END)
       / SUM(l_extendedprice * (1 - l_discount)) AS promo_revenue
FROM lineitem,
     part
WHERE l_partkey = p_partkey
  AND l_shipdate >= DATE '1996-12-01'
  AND l_shipdate < DATE '1996-12-01' + INTERVAL '1' MONTH;
47. Benchmark - TPC-H
TPC-H scale factor: dbgen -s 10
TiDB variables:
set global tidb_max_chunk_size=1024;
set global tidb_hashagg_partial_concurrency=16;
set global tidb_projection_concurrency=4;
set global tidb_distsql_scan_concurrency=20;
set global tidb_index_lookup_concurrency=20;
set global tidb_index_join_batch_size=25000;
set global tidb_index_lookup_join_concurrency=20;
set global tidb_hash_join_concurrency=10;
48. Benchmark - TPC-H
Query duration in seconds; the lower, the better.
Result in text
Note: parallel scanning is not enabled for Aurora, because Aurora in the China region doesn't have this feature yet.
50. TiDB DM: TiDB Data Migration Tool
TiDB DM (Data Migration) is an integrated data replication task management platform that supports:
● full data migration
● incremental data migration
from MySQL/MariaDB/Aurora MySQL into TiDB (the two phases are sketched below).
https://github.com/pingcap/dm
https://pingcap.com/docs/tools/dm/overview/
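Conceptually, a DM task runs in two phases: a consistent full dump and load, then tailing the upstream binlog from the position recorded at dump time. A minimal sketch of that control flow with hypothetical interfaces (this is not DM's actual code):

```go
package dm

// Position marks a point in the upstream binlog (file + offset),
// or a GTID set when GTID mode is enabled.
type Position struct {
	File   string
	Offset uint32
}

// Upstream abstracts the Aurora/MySQL source; Downstream the TiDB sink.
// All methods are hypothetical stand-ins for DM's internals.
type Upstream interface {
	DumpAll() (rows []string, pos Position, err error) // consistent full dump
	ReadBinlog(from Position) (<-chan string, error)   // ROW-format events
}

type Downstream interface {
	Load(rows []string) error
	Apply(event string) error
}

// Replicate performs the full migration, then endless incremental sync.
func Replicate(src Upstream, dst Downstream) error {
	rows, pos, err := src.DumpAll()
	if err != nil {
		return err
	}
	if err := dst.Load(rows); err != nil {
		return err
	}
	events, err := src.ReadBinlog(pos) // resume exactly where the dump ended
	if err != nil {
		return err
	}
	for ev := range events {
		if err := dst.Apply(ev); err != nil {
			return err
		}
	}
	return nil
}
```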
53. Using TiDB DM to replicate data from Aurora
[Diagram: one dm-master coordinating three dm-worker processes]
54. Demo
1. Launch an empty TiDB cluster
2. Import data to AWS Aurora (TPC-H sample data)
   a. http://download.pingcap.org/tispark-sample-data.tar.gz
3. Deploy DM
   a. configure `dm-worker`
   b. configure `dm-master`
4. Launch `dm-master`
5. Launch `dm-worker`
6. Use `dmctl` to create a synchronization task (AWS Aurora -> TiDB)
   a. Sample task: https://gist.github.com/c4pt0r/bcbf645ab475e5a603ad3d79d95c1757
7. Run the queries!
   a. https://github.com/catarinaribeir0/queries-tpch-dbgen-mysql
55. Checklist
● Enable binlog for Amazon Aurora, use "ROW" mode (a quick check is sketched after this list)
  ● https://aws.amazon.com/premiumsupport/knowledge-center/enable-binary-logging-aurora/
● Enable GTID for Amazon Aurora
  ● https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/mysql-replication-gtid.html#mysql-replication-gtid.configuring-aurora
● Make sure DM can connect to the Amazon Aurora master (check security group rules)
● Ignore some unsupported syntax
  ● ignore-checking-items: ["dump_privilege", "replication_privilege"]
  ● Example task.yaml: https://gist.github.com/c4pt0r/bcbf645ab475e5a603ad3d79d95c1757
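To verify the binlog and GTID items before starting DM, a quick preflight check can query the corresponding variables on the Aurora endpoint. A minimal sketch using Go's standard database/sql with the go-sql-driver/mysql driver; the DSN is a placeholder for your own endpoint and credentials:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL driver, works against Aurora
)

func main() {
	// Placeholder DSN: point it at your Aurora writer endpoint.
	db, err := sql.Open("mysql", "user:password@tcp(aurora-endpoint:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var binlogFormat, gtidMode string
	// DM needs ROW-format binlogs ...
	if err := db.QueryRow("SELECT @@global.binlog_format").Scan(&binlogFormat); err != nil {
		log.Fatal(err)
	}
	// ... and GTID enabled on the Aurora source.
	if err := db.QueryRow("SELECT @@global.gtid_mode").Scan(&gtidMode); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("binlog_format=%s (want ROW), gtid_mode=%s (want ON)\n", binlogFormat, gtidMode)
}
```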
56. Conclusion
● Aurora is a great alternative to MySQL for small and medium data sizes
● Due to its single-writer design, scaling out writes still requires middleware or manual sharding
● Aurora has difficulty handling complex queries because it reuses the MySQL optimizer
● Compared to Aurora, TiDB costs less at large data sizes
● TiDB scales both reads and writes elastically
● TiDB has a full-featured SQL layer designed for distributed computing
● TiDB can synchronize data in real time as Aurora's replica on AWS using TiDB DM
● TiDB's OLAP capabilities are a good complement to Aurora