Verisure migrated their data warehouse from using Tungsten Replicator to native multi-source replication in MySQL 5.7 to simplify operations. They loaded data from production shards into the new data warehouse setup using XtraBackup backups and improved replication capacity with MySQL's parallel replication features. Some issues were encountered with replication lag reporting and crashes during the upgrade but most were resolved. Monitoring and management tools also required updates to support the new multi-source replication configuration.
5. Verisure
Verisure is Europe's leading provider of professionally
monitored home alarms and services for the connected and
protected home and business.
We believe it's a human right to feel safe and secure.
We connect and protect what really matters, our service
brings peace of mind to families and small business owners.
Thanks to our strong focus on quality and service, our
customers are among the most satisfied in the industry.
https://www.verisure.com/our-offer.html
5 / 46
7. Data Warehouse
Why the Data Warehouse setup?
Troubleshooting tool for 3rd-line support.
Not possible to have BI-optimized DDL in Prod.
BI teams have their own deploy structure/schedule.
Heavy data mining to follow up on:
Product quality
GSM usage/costs
Staging for upgrades
8. Data Warehouse
Getting started
First iteration was easy
Old prod hardware was kept as a Data Warehouse.
Then you add sharding
And things get a bit harder
Maybe we could use Tungsten?
14. Tungsten Replicator
Direct Mode
For legacy reasons, Tungsten's direct mode is used.
A separate host was configured to serve as the Tungsten host:
adds ~0.15 ms of round-trip time to the database
THL requires disk space:
Replication lag = a lot of disk space.
Ended up with several shard clusters with Tungsten instances
16. Tungsten Replicator
Bugs
Issue 960 (fixed in Tungsten Replicator 3.0):
When using statement-based replication with temporary
tables and a transaction is rolled back, the replicator
would fail to execute the ROLLBACK statement
... and would just commit the transaction that should have been rolled back.
Before the fix, replication broke a lot and shards had
to be rebuilt regularly.
17. Tungsten Replicator
Operational Overhead
Hard to operate for non-DBAs such as on-call staff
Hard ... even for DBAs
Custom Percona Toolkit plugin for Tungsten Replicator:
https://github.com/grypyrg/percona-toolkit-plugin-tungsten-replicator
$ pt-table-checksum -u checksum --no-check-binlog-format \
    --recursion-method=dsn=D=p,t=dsns --plugin=pt-plugin-tungsten_replicator.pl
Created plugin from /vagrant/pt-plugin-tungsten_replicator.pl.
PLUGIN get_slave_lag: Using Tungsten Replicator to check replication lag
Tungsten Replicator status of host node3 is OFFLINE:NORMAL, waiting
Replica node3 is stopped. Waiting.*
Tungsten Replicator status of host node3 is OFFLINE:NORMAL, waiting
Replica lag is 119 seconds on node3. Waiting.
TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE
07-03T10:49:54 0 0 2097152 7 0 213.238 app.large_table
19. Move to MySQL 5.7
Why?
Multi-source replication (MSR) to replace Tungsten Replicator:
built-in solution, operationally easy
replication capacity: parallel replication
less infrastructure required
easier to train on-call staff
A first step to validate and gain experience with
MySQL/Percona Server 5.7
20. Move to MySQL 5.7
Native replication replaces Tungsten
22. MySQL 5.7
Data Warehouse Queries
Collect queries (slowlog)
Replay with pt-upgrade on 2 DW hosts
23. MySQL 5.7
Data Warehouse Queries
A few queries were reported slower:
the optimizer sometimes prefers a worse index
to be investigated further
table: alarms
partitions: p201401,p201603,p201604
type: range
key: alarm_insid_sid_time_ix
key_len: 13
rows: 165
Extra: Using index condition; Using where; Using temporary; Using filesort
table: alarms
partitions: p201401,p201603,p201604
type: range
key: alarm_insid_time_ix
key_len: 9
rows: 8089
Extra: Using index condition; Using where
25. Multi Source Replication
Syntax
Create user
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.*
TO 'repluser05'@'192.168.204.10'
IDENTIFIED BY 'rFAQKARW8rLZ9b2Z';
Figure out where to start
cat xtrabackup_binlog_info
mysql-bin.203534 53973866
27. Multi Source Replication
Syntax
CHANGE MASTER TO MASTER_HOST='192.168.204.50',
MASTER_USER='repluser05',
MASTER_PASSWORD='rFAQKARW8rLZ9b2Z',
MASTER_LOG_FILE='mysql-bin.203534',
MASTER_LOG_POS=53973866
FOR CHANNEL 'host05';
SHOW SLAVE STATUS FOR CHANNEL 'host05'\G
STOP SLAVE IO_THREAD FOR CHANNEL 'host05';
RESET SLAVE FOR CHANNEL 'host05';
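Adding a second shard follows the same pattern with a different channel name. The sketch below is illustrative only: the channel name 'host06', the host address, the user, and the binlog coordinates are placeholders that do not come from this deck. It also assumes the table-based repositories that MySQL 5.7 requires for multiple channels have not been set yet.

```sql
-- MSR in 5.7 requires table-based repositories. Set these before any channel
-- is configured, or while all replication threads are stopped.
SET GLOBAL master_info_repository = 'TABLE';
SET GLOBAL relay_log_info_repository = 'TABLE';

-- Hypothetical second shard: take the file and position from that shard's
-- own xtrabackup_binlog_info.
CHANGE MASTER TO MASTER_HOST='192.168.204.60',
  MASTER_USER='repluser06',
  MASTER_PASSWORD='<password>',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=4
  FOR CHANNEL 'host06';

-- Start and inspect only the new channel; other channels are untouched.
START SLAVE FOR CHANNEL 'host06';
SHOW SLAVE STATUS FOR CHANNEL 'host06'\G
```

To make the repository settings survive a restart, they would also need to go into my.cnf.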
28. Multi Source Replication
Loading the data
At first you set up replication before the shard is in use
But sooner or later a reload is needed.
Challenges
Physical backups can't be used to merge several instances
mysqldump is not efficient for TB-sized databases
Loading the data must be fast, or replication will never
catch up (based on past experience with Tungsten).
Production is 5.6 and the DW is 5.7.
Partitioned tables are not supported for IMPORT TABLESPACE.
29. Multi Source Replication
Loading the data
Dump the data using xtrabackup
--export --prepare
Dump the schema using mysqldump
--no-data --triggers --routines
Restore the DDL
mysql < ddl.sql
Load the data
discard tablespace
cp
import tablespace
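The "Load the data" step could look roughly like this for a single table. This is a sketch, not the exact procedure used: `shard05.customers` is a hypothetical schema and table name, and the file names assume the `.ibd`/`.cfg` pair that a prepared `--export` backup of a 5.6 server produces.

```sql
-- Run on the 5.7 data warehouse after `mysql < ddl.sql` created the empty table.
-- shard05.customers is a hypothetical name used for illustration.
ALTER TABLE shard05.customers DISCARD TABLESPACE;

-- Outside MySQL: copy customers.ibd and customers.cfg from the prepared backup
-- into the shard05 directory of the datadir, owned by the mysql user.

ALTER TABLE shard05.customers IMPORT TABLESPACE;

-- Quick readability check that avoids scanning a TB-sized table.
SELECT * FROM shard05.customers LIMIT 1;
```

Because each table is handled independently, the discard/copy/import sequence can be scripted per table and run for several tables in parallel.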
30. Multi Source Replication
Loading the data: Tips and Tricks
5.5 -> 5.6
Tables with timestamps must be rebuilt to the new format
Requires an extra machine to use for the rebuild.
Load
ALTER TABLE FORCE
Dump and start the Load
5.6 -> 5.7
Tables must be created with row_format=COMPACT
ALTER TABLE ROW_FORMAT=COMPACT
31. Multi Source Replication
Loading the data: Tips and Tricks
5.6: Partitioned tables
Not supported, but
Import each partition as a separate table
Add to table using EXCHANGE PARTITION
Supported in 5.7, but no time to test yet...
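The 5.6 workaround of importing a partition as a standalone table and then swapping it in can be sketched as follows. The table `alarms` and partition `p201603` are borrowed from the EXPLAIN output earlier in the deck; the staging table name and the exported file names are assumptions for illustration.

```sql
-- Staging table with the same structure as the partitioned table, minus partitioning.
CREATE TABLE alarms_p201603 LIKE alarms;
ALTER TABLE alarms_p201603 REMOVE PARTITIONING;

ALTER TABLE alarms_p201603 DISCARD TABLESPACE;
-- Outside MySQL: copy the exported partition files (assumed to be named
-- alarms#P#p201603.ibd and .cfg) into the schema directory, renamed to
-- alarms_p201603.ibd and alarms_p201603.cfg.
ALTER TABLE alarms_p201603 IMPORT TABLESPACE;

-- Swap the loaded rows into the empty partition. The staging table receives
-- the partition's previous (empty) contents and can then be dropped.
ALTER TABLE alarms EXCHANGE PARTITION p201603 WITH TABLE alarms_p201603;
DROP TABLE alarms_p201603;
```

EXCHANGE PARTITION validates that every row belongs in the target partition, so a wrongly matched file should fail the exchange rather than silently corrupt the table.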
32. Multi Source Replication
Skipping a Trx, non-GTID:
mysql> SET GLOBAL sql_slave_skip_counter=1;
mysql> START SLAVE;
ERROR 3086 (HY000): When sql_slave_skip_counter > 0, it is not allowed to
start more than one SQL thread by using 'START SLAVE [SQL_THREAD]'.
Value of sql_slave_skip_counter can only be used by one SQL thread at a time.
Please use 'START SLAVE [SQL_THREAD] FOR CHANNEL' to start the SQL thread
which will use the value of sql_slave_skip_counter.
mysql> START SLAVE FOR CHANNEL 'one';
33. Multi Source Replication
Replication Filters
Replication filters cannot be configured per channel:
http://bugs.mysql.com/bug.php?id=80843
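What 5.7 does offer is a global filter that applies to all channels at once. A minimal sketch, assuming a hypothetical schema named `scratch` that should not be replicated from any shard:

```sql
-- Filters are global in 5.7: this applies to every channel, not just one.
-- The SQL threads must be stopped while the filter is changed.
STOP SLAVE SQL_THREAD;
CHANGE REPLICATION FILTER REPLICATE_WILD_IGNORE_TABLE = ('scratch.%');
START SLAVE SQL_THREAD;
```

Without a channel clause, STOP SLAVE and START SLAVE act on all channels, which matches the global scope of the filter.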
37. Replication Capacity Improvements
The new environment has lower replication capacity with the
largest shards.
Waiting for slave-parallel-type=LOGICAL_CLOCK
Waiting for the app to become ready for binlog_format=ROW
More in-depth analysis of the collected statistics is needed
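Once the prerequisites are in place, enabling parallel apply on the data warehouse could look like the sketch below. The worker count of 8 is an assumed value, not a recommendation from the deck. LOGICAL_CLOCK relies on commit information written by a 5.7 master, so it likely has no effect while the production shards are still on 5.6.

```sql
-- slave_parallel_type can only be changed while the SQL threads are stopped.
STOP SLAVE SQL_THREAD;
SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL slave_parallel_workers = 8;  -- assumed value, tune per workload
START SLAVE SQL_THREAD;

-- Each channel gets its own set of workers; list them to confirm.
SELECT CHANNEL_NAME, WORKER_ID, SERVICE_STATE
FROM performance_schema.replication_applier_status_by_worker;
```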
40. Crash: innodb_open_files > open_files_limit
http://bugs.mysql.com/bug.php?id=78981
Fixed in 5.6.30, 5.7.12, 5.8.0
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| innodb_open_files | 16384 |
| open_files_limit  | 8510  |
+-------------------+-------+
2015-10-27 10:20:33 5535 [ERROR] InnoDB: Trying to do i/o to a tablespace which
2015-10-27 10:20:33 7fa725a05700 InnoDB: Error: trying to access tablespace 11015
InnoDB: but the tablespace does not exist or is just being dropped.
2015-10-27 10:20:33 7fa725a05700 InnoDB: Operating system error number 24 in a fi
InnoDB: Error number 24 means 'Too many open files'.
InnoDB: Some operating system error numbers are described at
...
2015-10-27 10:20:33 7fa725a05700 InnoDB: Assertion failure in thread
140355867531008 in file buf0buf.cc line 2740
InnoDB: We intentionally generate a memory trap.
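On versions without the fix, the risky configuration shown above can be detected with a single query. This is a small sketch; the column alias `at_risk` is just an illustrative name.

```sql
-- at_risk = 1 means InnoDB may try to open more files than the process is allowed to.
SELECT @@global.innodb_open_files AS innodb_open_files,
       @@global.open_files_limit  AS open_files_limit,
       @@global.innodb_open_files > @@global.open_files_limit AS at_risk;
```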
41. Crash: Upgrade from 5.6 to 5.7 MSR
After the upgrade, replication channels get the same name
in MSR; this can also crash MySQL
https://bugs.mysql.com/bug.php?id=80302 -- Open :(
mysql> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 127.0.0.1
Master_Port: 11204
[..]
Channel_Name: master1
Master_TLS_Version:
*************************** 2. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 127.0.0.1
Master_Port: 13358
[..]
Channel_Name: master1
Master_TLS_Version:
2 rows in set (0.00 sec)
44. InnoTop Multi Source Support
Written by Johan Nilsson (Verisure)
Soon to be merged:
https://github.com/innotop/innotop/pull/129
[RO] Replication Status (? for help) 127.0.0.1, 3m, 1.93 QPS, 5/1/0 con/run/cac th
________________________________ Slave SQL Status ______________________________
Channel Master Master UUID On? TimeLag Catchup RPos Last
one localhost d7e93be0-0452-08002774c31b Yes 00:00 0.00 327
two localhost 5b9d58e4-0452-08002774c31b Yes 00:00 0.00 4
________________________________ Slave I/O Status _______________________________
Channel Master Master UUID On? File RSize Pos
two localhost 5b9d58e4-0472-08002774c31b No 57-co.bin.000003 154
one localhost d7e93be0-04b2-08002774c31b Yes 57-co.bin.000003 545
____________________________________________ Master Status _______________________
File Position Binlog Cache Executed GTID Set Server UUID
57-community-bin.000003 154 0.00% N/A b40426f3-045
45. Monitoring Tools
Our favorite things
mytop
innotop
Patch for channels
Icinga/Nagios
MRTG
Some MySQL metrics that are important to us.
Grafana/Graphite/collectd
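For monitoring checks that need per-channel state without parsing SHOW SLAVE STATUS, the 5.7 performance_schema replication tables are one option. A minimal sketch that a Nagios-style check or metrics collector might run (the column aliases are illustrative):

```sql
-- One row per channel: receiver (I/O) and applier (SQL) state plus the last I/O error.
SELECT c.CHANNEL_NAME,
       c.SERVICE_STATE      AS io_state,
       a.SERVICE_STATE      AS sql_state,
       c.LAST_ERROR_NUMBER  AS io_errno,
       c.LAST_ERROR_MESSAGE AS io_error
FROM performance_schema.replication_connection_status AS c
JOIN performance_schema.replication_applier_status AS a
  USING (CHANNEL_NAME)
ORDER BY c.CHANNEL_NAME;
```

These tables do not expose a lag value comparable to Seconds_Behind_Master, so lag would still have to come from SHOW SLAVE STATUS FOR CHANNEL or a heartbeat table.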