In this initial meetup of the MySQL, MariaDB, MongoDB talks group (https://www.meetup.com/MySQL-MariaDB-and-MongoDB-talks-CZ/), we introduced our company, our data lake, and the technologies we use.
We also discussed MySQL from a scalability point of view, covering replication, cross-DC multi-master setups, Galera cluster, and backups, and described how we migrated from a master-slave setup to Galera cluster.
The next talk was about HBase, its architecture, and its use cases.
MySQL Meetup Prague - Modern Data Lake
1. MySQL / MongoDB Meetup
October 3, 2017, Prague
Agenda
- Introduction
- About Seznam and Sklik.cz from DB point of view
- Architecture and scaling of MySQL
- A glimpse into the world of HBase
- MongoDB from the DBA point of view (cancelled, sorry)
Next time (ca. March 2018)
- Call for papers is open!
10. Architecture and scaling of MySQL
Audience: Beginners
Michal Kuchta
Senior developer of Sklik, Seznam.cz
11. Common setup
• LAMP server
• Linux, Apache, MySQL, PHP
• Most common usage
• Everything on single machine
+ Easy to maintain
+ Cheap
- SPOF (single point of failure)
- Poor performance under high load
- I/O scheduling contention
- Memory split between application and DB
[Diagram: a single LAMP server]
12. Brute force scaling
• Split database and application
• One machine for all database operations
+ Database on its own dedicated hardware
+ Dedicated resources
+ Better optimization possibilities
- Another server to maintain
- Still SPOF
[Diagram: application server and a dedicated MySQL server]
13. Brute force scaling
• Dedicated database server
• 128 – 256 GB RAM
• SSD drives
■ Does anyone run MySQL on HDDs today?
• A lot of memory for InnoDB buffer pool - database runs from RAM
■ Does anyone use MyISAM today?
• Price? Around $19,000 per machine
14. Horizontal scaling
[Diagram: one master replicating to three slaves]
• Master – Slave replication
• Writes go to the master
• Reads go to the slaves
• Good if you have a high read load
• Statement-based vs. row-based binlog entries
+ Better performance for selects
(read scale-out)
+ Hot backup
+ Intentional replication delay possible (delayed slave)
- Does not scale writes
- Replication lag (asynchronous)
- Replication tends to break occasionally
- Needs manual failover, master is SPOF
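A minimal sketch of application-level read/write splitting under this topology, assuming the mysql-connector-python client; the host names, credentials, and schema are illustrative, not from the talk:

```python
# Read/write splitting sketch for a master-slave setup.
# Hosts and credentials are hypothetical placeholders.
import random
import mysql.connector

MASTER = {"host": "db-master", "user": "app", "password": "secret", "database": "sklik"}
SLAVES = [dict(MASTER, host=h) for h in ("db-slave1", "db-slave2", "db-slave3")]

def run_write(sql, params=()):
    # All writes go to the master.
    conn = mysql.connector.connect(**MASTER)
    try:
        cur = conn.cursor()
        cur.execute(sql, params)
        conn.commit()
    finally:
        conn.close()

def run_read(sql, params=()):
    # Reads are spread across the slaves; beware of replication lag --
    # a row written a moment ago may not be visible here yet.
    conn = mysql.connector.connect(**random.choice(SLAVES))
    try:
        cur = conn.cursor()
        cur.execute(sql, params)
        return cur.fetchall()
    finally:
        conn.close()
```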
15. Master – Master replication
[Diagram: DC 1 and DC 2, each with one master and three slaves; the masters replicate to each other]
• Introduce second DC
+ Geographical fault tolerance
+ “hot” backup
+ Maintenance in one DC does not affect traffic.
- Still only one “active” master
- Where is the master?
- Cross-DC lag
16. Scaling of writes - sharding
• Shard
• Same structure
• Different subset of data
• Horizontal scaling of writes
• Two approaches: Multitenancy routing, colocation routing
[Diagram: three shards, each with one master and three slaves]
17. Shard manager
We have to solve the routing of application requests to the correct shard.
18. [Diagram: the application asks the shard manager “Where is John’s data?”]
19. [Diagram: the shard manager answers “User John is on shard 2”]
20. [Diagram: the request is routed to shard 2]
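A minimal sketch of what such a shard-manager lookup could look like; the talk does not show code, so the directory table, hosts, and names below are hypothetical:

```python
# Hypothetical shard-manager lookup: a directory table maps each user
# to the shard that holds their data.
import mysql.connector

SHARDS = {
    1: {"host": "shard1-master", "user": "app", "password": "secret", "database": "sklik"},
    2: {"host": "shard2-master", "user": "app", "password": "secret", "database": "sklik"},
    3: {"host": "shard3-master", "user": "app", "password": "secret", "database": "sklik"},
}
SHARD_MANAGER = {"host": "shard-manager", "user": "app", "password": "secret",
                 "database": "routing"}

def shard_for_user(username):
    # Ask the shard manager which shard holds this user's data.
    conn = mysql.connector.connect(**SHARD_MANAGER)
    try:
        cur = conn.cursor()
        cur.execute("SELECT shard_id FROM user_shard WHERE username = %s", (username,))
        row = cur.fetchone()
        if row is None:
            raise KeyError(f"unknown user {username!r}")
        return row[0]
    finally:
        conn.close()

def connect_for_user(username):
    # Route the application request to the correct shard.
    return mysql.connector.connect(**SHARDS[shard_for_user(username)])
```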
21. Cross-shard relations
For example, a messaging center:
- Each message has a sender
- Each message has a recipient
- Potentially, each user is on a different shard
22. Cross-shard relations
Possible solution: Duplicate data
+ Good solution for static data (enums)
- Difficult to maintain consistency in case of updates
(solved at the application level)
23. Cross-shard relations
Possible solution: Common data in a separate database
+ Only one instance of the data
+ No consistency problems
- Potentially less performant; for this data we are back to the
single-database solution
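As a sketch of this approach (table and host names are hypothetical): the message row stays on a shard, while user names come from the common database, so no cross-shard join is needed:

```python
# "Common DB" sketch: shared reference data lives in one database,
# user-specific rows stay on their shard. All names are illustrative.
import mysql.connector

COMMON_DB = {"host": "common-db", "user": "app", "password": "secret",
             "database": "common"}

def fetch_message(message_id, shard_params):
    # The message row lives on the sender's shard...
    shard = mysql.connector.connect(**shard_params)
    cur = shard.cursor(dictionary=True)
    cur.execute("SELECT sender_id, recipient_id, body FROM message WHERE id = %s",
                (message_id,))
    msg = cur.fetchone()
    shard.close()

    # ...while display names are resolved from the common database.
    common = mysql.connector.connect(**COMMON_DB)
    cur = common.cursor()
    cur.execute("SELECT id, name FROM user WHERE id IN (%s, %s)",
                (msg["sender_id"], msg["recipient_id"]))
    names = dict(cur.fetchall())
    common.close()
    return {**msg, "sender": names[msg["sender_id"]],
            "recipient": names[msg["recipient_id"]]}
```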
[Diagram: three shards plus a separate common DB]
24. We use both solutions in Sklik; each is good for a different subset of the data.
25. Summary
+ Almost unlimited horizontal scaling
+ Good for high-load applications
- Bad for analytical queries across all shards
- Common data problem
- Routing on application level
- A lot of components to monitor and maintain
26. Balancing of shard load
- You can add another shard
- You can move data between shards
Problem: PK collisions
- No AUTO_INCREMENT
- ID allocation must be handled at the application level
- We use the shard manager to assign IDs (a sketch follows below)
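The talk does not detail the allocation mechanism, so this is only one plausible sketch: a sequence table in the shard manager's database, incremented atomically with MySQL's LAST_INSERT_ID(expr) trick:

```python
# Hypothetical ID allocation through the shard manager: a single
# counter row is incremented atomically, so IDs are unique across shards.
import mysql.connector

SHARD_MANAGER = {"host": "shard-manager", "user": "app", "password": "secret",
                 "database": "routing"}

def allocate_ids(entity, count=1):
    # Reserve `count` consecutive IDs for `entity` (e.g. "campaign").
    conn = mysql.connector.connect(**SHARD_MANAGER)
    try:
        cur = conn.cursor()
        # Atomic increment; LAST_INSERT_ID(expr) makes the new value
        # readable in the same session without a race.
        cur.execute(
            "UPDATE id_sequence SET next_id = LAST_INSERT_ID(next_id + %s) "
            "WHERE entity = %s", (count, entity))
        conn.commit()
        cur.execute("SELECT LAST_INSERT_ID()")
        last = cur.fetchone()[0]
        return range(last - count + 1, last + 1)
    finally:
        conn.close()
```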
28. [Diagram: Galera commit flow between Node 1 and Node 2 – the user runs BEGIN, Query 1–3, and COMMIT on Node 1; the transaction is transferred to the other nodes and certified (OK or ROLLBACK); the COMMIT result is returned to the user, while the transaction is applied asynchronously on Node 2 (physical commit). Certification adds additional time to each commit.]
29. Galera cluster – pros and cons
+ HA without manual failover
+ Read scaling
+ Write scaling
+ Automatic resync of failed nodes
- Conflict detection at commit
- InnoDB only (who uses MyISAM these days?)
- Difficult DDL statements (rolling schema upgrade)
- Maximum transaction size 2GB
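Because conflicts are detected only at commit time, the application must be ready to retry. A minimal sketch, assuming mysql-connector-python and relying on the fact that Galera reports certification failures to the client as a deadlock error (errno 1213):

```python
# Retry loop for Galera certification conflicts, which the server
# reports like a deadlock (MySQL errno 1213) at COMMIT time.
import mysql.connector
from mysql.connector import errorcode

def run_transaction(conn_params, statements, retries=3):
    for attempt in range(retries):
        conn = mysql.connector.connect(**conn_params)
        try:
            cur = conn.cursor()
            for sql, params in statements:
                cur.execute(sql, params)
            conn.commit()          # certification happens here
            return
        except mysql.connector.DatabaseError as exc:
            conn.rollback()
            if exc.errno != errorcode.ER_LOCK_DEADLOCK or attempt == retries - 1:
                raise              # not a certification conflict, or out of retries
        finally:
            conn.close()
```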
30. Migration from master-slave to Galera
[Diagram: Shard 1 – master-slave (one master, three slaves); Shard 2 – already Galera]
1. Prepare an empty shard based on Galera
2. Migrate all users to that shard
3. Drop the old shard and use its hardware for a new Galera shard
4. Move (some) users back to the original shard
31. [Diagram: step 2 – all users migrated from Shard 1 (master-slave) to the Galera-based Shard 2]
32. [Diagram: step 3 – both shards now run Galera]
33. [Diagram: step 4 – some users migrated back to Shard 1]
34. Migrating a cross-DC master-master shard to Galera
[Diagram: Shard 1 in DC 1 and in DC 2, each with one master and three slaves; the masters replicate master-master; traffic is served in both DCs]
1. Disconnect master-master replication between the DCs; traffic goes to DC 1.
2. Drop the shard at DC 2 and recreate it as a Galera cluster.
3. Reestablish master-master replication and let the Galera cluster catch up with DC 1.
4. Redirect traffic to DC 2.
5. Disconnect master-master replication between the DCs; traffic goes to DC 2.
6. Drop the shard at DC 1 and attach its nodes to the Galera cluster in DC 2; Galera node provisioning does the rest.
7. Reattach traffic to DC 1.
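As an illustration of step 1 (a sketch only; the exact procedure used in the talk is not shown), classic asynchronous replication between the DCs can be stopped with STOP SLAVE / RESET SLAVE ALL on each side:

```python
# Hypothetical step 1: disconnect master-master replication between DCs.
# Host names are illustrative; assumes classic asynchronous replication.
import mysql.connector

def disconnect_replication(host_params):
    conn = mysql.connector.connect(**host_params)
    cur = conn.cursor()
    cur.execute("STOP SLAVE")        # stop applying events from the other DC
    cur.execute("RESET SLAVE ALL")   # forget the replication configuration
    conn.close()

for dc_master in ({"host": "dc1-master", "user": "repl_admin", "password": "secret"},
                  {"host": "dc2-master", "user": "repl_admin", "password": "secret"}):
    disconnect_replication(dc_master)
```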
35. [Diagram: unchanged state; step 1 highlighted – disconnect master-master replication, traffic goes to DC 1]
36. [Diagram: step 2 – DC 2 rebuilt as a Galera quorum; traffic goes to DC 1]
37. [Diagram: step 3 – master-master replication reestablished between the DC 1 master and the Galera cluster; traffic still in DC 1]
38. [Diagram: step 4 – traffic now also served by the Galera cluster in DC 2]
39. [Diagram: step 5 – replication disconnected; traffic goes to DC 2]
40. [Diagram: step 6 – DC 1 hardware attached to the Galera cluster in DC 2; traffic in DC 2]
41. [Diagram: step 7 – one Galera cluster spanning both DCs; traffic served in both]
44. Backup: mysqldump vs. Percona XtraBackup
• mysqldump
+ Easy to use
+ Can back up only selected databases/tables
- No data consistency (without table locks or --single-transaction)
- Really slow
• Percona XtraBackup
+ Online backup of the whole tablespace
+ Strictly consistent
+ Only copies data + differential binlog
- InnoDB only (again, who uses MyISAM nowadays?)
- You cannot select certain databases/tables.
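A small sketch of driving both tools from a script; paths and database names are placeholders, and the flags shown (--single-transaction for mysqldump, --backup/--target-dir for xtrabackup) are the standard ones:

```python
# Sketch: invoking both backup tools from Python (paths are placeholders).
import subprocess

def dump_databases(databases, outfile):
    # mysqldump: easy, per-database; consistent for InnoDB only when
    # --single-transaction is used; slow for large datasets.
    with open(outfile, "wb") as out:
        subprocess.run(
            ["mysqldump", "--single-transaction", "--databases", *databases],
            stdout=out, check=True)

def xtrabackup_full(target_dir):
    # Percona XtraBackup: online, consistent copy of the whole tablespace.
    subprocess.run(
        ["xtrabackup", "--backup", f"--target-dir={target_dir}"],
        check=True)
```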
49. HBase – Sklik.cz – a real example
● 10 million keywords
● 120 statistical values per keyword, per day
● for a one-year period: 10M × 120 × 365 ≈ 440 billion values
● hundreds of thousands of users
50. What is HBase?
● NoSQL, a BigTable implementation (in Java)
● key-value, column-based (ColumnFamily)
● distributed, scalable
● fault tolerant
● strong consistency (CP)
● availability?
● petabytes of data
52. Tables and rows
Tables:
● contain rows
● defined during design
● “readable names”
Rows:
● keys, data
● sorted lexicographically
● binary data
[Diagram: rowkeys User02Key00, User02Key01, …, User02KeyZZ, …, UserXYKeyZY, UserXYKeyZZ in sorted order, each pointing to its data]
53. Columns
● qualifier
● sparse matrix
● key-value
● binary names and data
● even names can contain data
● sorted lexicographically
Example rows:
User02Key00: 2012/01/01 -> data, 2012/01/02 -> data
User02Key01: 2009/12/23 -> long data, 2010/12/23 -> data, key 1 -> value, key 2 -> long value
54. Versions
● every cell (column) can contain versioned data
● every value is versioned
● long integer (by default the write timestamp)
● again, arbitrary values
● sorted in descending order
● version count can be configured
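A sketch using the happybase Python client (assuming an HBase Thrift gateway; host, table, and column names are illustrative, not from the talk):

```python
# Versioned cells with happybase; "hbase-host" and the table are placeholders.
import happybase

conn = happybase.Connection("hbase-host")
table = conn.table("keyword_stats")

# Each put creates a new version of the cell; the version is a long,
# by default the write timestamp.
table.put(b"User02Key00", {b"stats:clicks": b"41"})
table.put(b"User02Key00", {b"stats:clicks": b"42"})

# Versions are returned newest first (descending order).
for value, ts in table.cells(b"User02Key00", b"stats:clicks",
                             versions=3, include_timestamp=True):
    print(ts, value)
```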
56. Column Family
● columns grouped into logical “units”
● separated physical storage
● ColumnFamily-based
● optimization
[Table: rowkey User02Key01 with two column families – Fulltext (keyAA=val1, keyAB=val2; 2011/12/23=data, 2012/12/23=data) and Context (keyAA=val2, keyBB=val2, …; 2011/12/23=data2)]
57. Data architecture in HBase
● data in tables – sparse matrix
● binary data
● regions
● columns; ColumnFamily
● versions
● no joins, no foreign keys
● variable schema
59. Sequential reading
● uses sorted rows and columns
● can be restricted to a column family
● filters – almost “endless” optimization possibilities
● very fast
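What sequential reading looks like through happybase (a sketch; the "User…Key…" rowkey layout follows the earlier slides, everything else is illustrative):

```python
# Sequential (range) scan with happybase; host and table are placeholders.
import happybase

conn = happybase.Connection("hbase-host")
table = conn.table("keyword_stats")

# Because rows are sorted lexicographically, all keywords of one user
# form a contiguous range that a single scan can read very quickly.
for rowkey, data in table.scan(row_start=b"User02",
                               row_stop=b"User03",
                               columns=[b"stats"]):   # restrict to one CF
    print(rowkey, data)
```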
62. HBase use case
Properties:
● lexicographically sorted rows and columns
● binary data and keys
● variable schema
● coprocessors
● sharded data
63. HBase use case
Pros:
● sequential reading
● variable schema
● data divided into collections
● really large amounts of data
Cons:
● transactional processing
● not enough HW
● variable queries
● random writes, a lot of updates (or deletes)
64. RDBMS vs. HBase schema
● entity-and-relationship description vs. query-first design
● data normalization vs. duplicated information
● emphasis on key design
● clustering
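To illustrate query-first key design (a hypothetical layout, not the one used in Sklik): compose the rowkey so that the dominant query maps to one sequential scan:

```python
# Query-first rowkey sketch: the key is composed so the most common query
# ("all stats for one user's keyword over a date range") becomes a single
# sequential scan. The field layout is illustrative.
def make_rowkey(user_id: int, keyword: str, day: str) -> bytes:
    # Zero-padding the user id keeps lexicographic order equal to numeric order.
    return f"{user_id:010d}|{keyword}|{day}".encode()

# All days for one user+keyword are contiguous, so a scan between these
# two keys reads exactly one year of that keyword's statistics.
start = make_rowkey(42, "mysql", "2017-01-01")
stop  = make_rowkey(42, "mysql", "2018-01-01")
```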