This talk gives an overview of the architecture, then covers the development data problem. It also covers some tools we use for data migrations and schema changes.
The Shard Revisited: Tools and Techniques Used at Etsy
1. The Shard Revisited
Tools and Techniques Used at Etsy
jgoulah@etsy.com / @johngoulah
Tuesday, November 12, 13
2.
A marketplace for people around the world to connect, buy, and sell unique goods
Etsy is the marketplace that we all make together, and our mission is to re-imagine commerce in ways that build a more fulfilling and lasting world
4.
this talk consists of the architecture, our dev data problem/solution, and other tools
big cluster, 35 shards
5. 6TB InnoDB buffer pool
30TB+ data stored
100K+ queries/sec avg
~1.8Gbps outbound (plain text)
99.9% queries under 1ms
1/3 RAM not dedicated to the pool (OS, disk, network buffers, etc)
6. ~100 MySQL servers
1100 15K rpm disks / 1600+ CPUs
Server Spec
HP DL 380 G8
96GB RAM
16 spindles / 2TB RAID 10
24 Core
16 x 146GB
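The "1/3 RAM not dedicated to the pool" note can be sanity-checked against the figures on these slides. This is a rough sketch only: the server count is approximate ("~100"), and it assumes every server has the 96GB spec shown.

```python
# Rough check of the buffer pool share, using figures from the slides.
SERVERS = 100            # "~100 MySQL servers" (approximate)
RAM_PER_SERVER_GB = 96   # from the server spec (assumed uniform across servers)
BUFFER_POOL_TB = 6       # total InnoDB buffer pool

total_ram_tb = SERVERS * RAM_PER_SERVER_GB / 1000   # decimal TB
pool_fraction = BUFFER_POOL_TB / total_ram_tb
other_fraction = 1 - pool_fraction                   # OS, disk, network buffers, etc.

print(f"total RAM ~{total_ram_tb:.1f}TB, "
      f"pool {pool_fraction:.1%}, other {other_fraction:.1%}")
```

Under these assumptions the pool takes a bit under two thirds of RAM, which is in line with the 1/3 left for everything else.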
8. Redundancy
the duplication of critical components of a system with the intention of increasing reliability
example: jet engines
25. Index Servers
have to be able to find the data, these simply exist to look up where the data is
to answer the question: what shard is the data on?
http://www.flickr.com/photos/mamsy/4175783446/sizes/l/in/photostream/
32. Globally Unique ID
can’t use auto-increment with a distributed system, so hand out globally unique IDs
33. CREATE TABLE `tickets` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `stub` char(1) NOT NULL default '',
  PRIMARY KEY (`id`),
  UNIQUE KEY `stub` (`stub`)
) ENGINE=MyISAM;
only MyISAM tables; leverage the MyISAM engine's lack of concurrency (table-level locking)
34. Ticket Generation
REPLACE INTO tickets (stub) VALUES ('a');
SELECT LAST_INSERT_ID();
since value ‘a’ exists, it replaces the row with the same value (and bumps the id)
if an old row in the table has the same value as a new row for a PK or a UNIQUE index, the old row is deleted before the new row is inserted
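The REPLACE behavior described above can be tried locally. The sketch below uses SQLite as a stand-in for MySQL/MyISAM (an assumption made so the example is self-contained): SQLite's `REPLACE INTO` also deletes the conflicting row before inserting, and `AUTOINCREMENT` keeps the id from being reused. The helper names `make_ticket_db` and `next_ticket` are hypothetical.

```python
import sqlite3

def make_ticket_db():
    """Create an in-memory tickets table (SQLite stand-in for the MyISAM table)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tickets ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " stub TEXT NOT NULL UNIQUE)"
    )
    return conn

def next_ticket(conn):
    """Replace the single 'a' row; the replacement row gets a fresh, larger id."""
    cur = conn.execute("REPLACE INTO tickets (stub) VALUES ('a')")
    conn.commit()
    return cur.lastrowid  # plays the role of SELECT LAST_INSERT_ID()

conn = make_ticket_db()
ids = [next_ticket(conn) for _ in range(3)]
rows = conn.execute("SELECT id, stub FROM tickets").fetchall()
print(ids, rows)
```

The table never grows past one row; only the id moves forward, which is what makes it usable as a ticket counter.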
35. Ticket Generation
REPLACE INTO tickets (stub) VALUES ('a');
SELECT LAST_INSERT_ID();
SELECT * FROM tickets;
id       stub
4589294  a
36. tickets A
auto-increment-increment = 2
auto-increment-offset = 1
tickets B
auto-increment-increment = 2
auto-increment-offset = 2
ODD:offset=1
EVEN: offset=2
http://openclipart.org/detail/94723/database-symbol-by-rg1024
37. tickets A
auto-increment-increment = 2
auto-increment-offset = 1
tickets B
auto-increment-increment = 2
auto-increment-offset = 2
NOT master-master
failure is ok, only lose last ticket id
can bring another server up with new offset
http://openclipart.org/detail/94723/database-symbol-by-rg1024
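A minimal simulation of the two ticket servers, to show why the odd/even offsets never collide and why losing one server is tolerable. `TicketServer` and `get_ticket` are hypothetical names, and the "try servers in order" client behavior is an assumption; the slides do not say how clients choose a server.

```python
class TicketServer:
    """Simulates one ticket DB with auto-increment-increment / auto-increment-offset."""

    def __init__(self, offset, increment=2):
        self.offset = offset
        self.increment = increment
        self.last_id = None
        self.up = True

    def next_id(self):
        if not self.up:
            raise ConnectionError("ticket server is down")
        if self.last_id is None:
            self.last_id = self.offset          # first id equals the offset
        else:
            self.last_id += self.increment      # then step by the increment
        return self.last_id

def get_ticket(servers):
    """Return an id from the first reachable server (assumed client policy)."""
    for server in servers:
        try:
            return server.next_id()
        except ConnectionError:
            continue
    raise RuntimeError("no ticket server available")

tickets_a = TicketServer(offset=1)   # hands out odd ids
tickets_b = TicketServer(offset=2)   # hands out even ids
odd = [tickets_a.next_id() for _ in range(3)]
even = [tickets_b.next_id() for _ in range(3)]
tickets_a.up = False                 # simulate losing server A
after_failure = get_ticket([tickets_a, tickets_b])
print(odd, even, after_failure)
```

Because the two sequences are disjoint, the servers need no coordination or replication between them, which matches the "NOT master-master" note.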
38. Shards
shards hold the majority of the data
http://www.flickr.com/photos/merrickb/63999750/sizes/o/in/photostream/
39. Object Hashing
....aka pinning data to one side of the shard
after we determine the shard we have to determine side A or side B given the replicant index
also helps keep connections to a (relative) minimum, since everything sharded by a specific object ID will then pick the same side
40. A / B
user_id : 500
so we know the shard, now which replicant
object id in this case is user_id
side a/b are replicants
54. A / B
user_id : 500 % (1) == 0
user_id : 501 % (1) == 0
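A small sketch of the pinning rule the slide implies: take the object id modulo the number of active sides. The function name `pick_side` and the list-of-sides representation are assumptions for illustration; the two-side case is inferred from the `% (1)` case shown on the slide.

```python
def pick_side(object_id, active_sides):
    """Pin an object to one side of its shard: object_id % (number of active sides)."""
    return active_sides[object_id % len(active_sides)]

# Both sides available: ids alternate between A and B.
both = [pick_side(uid, ["A", "B"]) for uid in (500, 501)]

# Only one side available: modulo 1 is always 0, so everything lands on it.
one = [pick_side(uid, ["A"]) for uid in (500, 501)]

print(both, one)
```

The same object id always maps to the same side for a given set of active sides, which is what keeps reads for one user consistent and connection counts down.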
55. Variants
variants are mirrors of the same data in different tables
http://www.flickr.com/photos/garibaldi/522196113/sizes/o/in/photostream/
64. DATA
sync prod to dev, until prod data gets too big
http://www.flickr.com/photos/uwwresnet/6280880034/sizes/l/in/photostream/
65. Some Approaches
subsets of data
generated data
subsets have to end somewhere (a shop has favorites that are connected to people, connected to shops, etc)
generated data can be time consuming to fake
67. Edge Cases
what about testing edge cases and difficult-to-diagnose bugs?
hard to model the same data set that produced a user-facing bug
http://www.flickr.com/photos/kalexanderson/6199793967/sizes/o/in/photostream/
68. Complexity
another issue is testing problems at scale, with complex and large gobs of data
a real social network ecosystem can be difficult to generate (favorites, follows)
(activity feed, “similar items” search gives better results in prod)
http://www.flickr.com/photos/doug88888/4687906267/sizes/o/in/photostream/
69. Copy prod data to dev ?
what most people do before data gets too big
almost 3 days to sync 30TB over a 1Gbps link, close to 10 hrs over 10Gbps
bringing the prod dataset to dev meant expensive hardware and maintenance;
keeping parity with prod and applying schema changes would take at least as long
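The sync times quoted above follow from simple arithmetic. The sketch below computes the theoretical line-rate lower bound, assuming decimal units (1TB = 10^12 bytes) and no protocol overhead; `transfer_hours` is a hypothetical helper.

```python
def transfer_hours(size_tb, link_gbps):
    """Hours to move size_tb terabytes over a link_gbps link at full line rate."""
    bits = size_tb * 1e12 * 8             # terabytes -> bits
    seconds = bits / (link_gbps * 1e9)    # bits / (bits per second)
    return seconds / 3600

hours_1g = transfer_hours(30, 1)
hours_10g = transfer_hours(30, 10)
print(f"1Gbps:  {hours_1g:.1f} h ({hours_1g / 24:.2f} days)")
print(f"10Gbps: {hours_10g:.1f} h")
```

At line rate this gives roughly 2.8 days on 1Gbps and under 7 hours on 10Gbps; real transfers rarely sustain line rate, which presumably accounts for the "close to 10 hrs" figure.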
70. instead....
Use Production
(sometimes)
so we did what we saw as the last resort - used production
not for greenfield development, more for mature features and diagnosing bugs
we still have a dev database but the data is sparse and unreliable
71.
goes without saying this can be dangerous, and people have to be aware they are doing it
http://instagram.com/p/d8nw9aNqlt/
http://www.flickr.com/photos/stuckincustoms/432361985/sizes/l/in/photostream/
79.
proxy hits all of the shards/index/tickets
http://www.oreillynet.com/pub/a/databases/2007/07/12/getting-started-with-mysql-proxy.html
80. explicitly enabled
% dev_proxy on
Dev-Proxy config is now ON. Use
'dev_proxy off' to turn it off.
Not on all the time
84.
read-write mode, needed for login and other things that write data
85. % ./bin/myscript
YOU CURRENTLY HAVE THE READ WRITE PROXY TURNED ON AND ARE RUNNING A CLI SCRIPT!!!
You must type the phrase 'read write proxy' and press enter to continue...
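A guard like the one shown on this slide can be sketched in a few lines. This is an illustration of the idea only, not Etsy's code: `confirm_read_write` and its parameters are hypothetical, and only the two message strings come from the slide.

```python
import sys

CONFIRM_PHRASE = "read write proxy"

def confirm_read_write(read_line=input, out=sys.stderr):
    """Warn loudly and return True only if the exact phrase is typed."""
    print("YOU CURRENTLY HAVE THE READ WRITE PROXY TURNED ON AND ARE "
          "RUNNING A CLI SCRIPT!!!", file=out)
    print(f"You must type the phrase '{CONFIRM_PHRASE}' and press enter "
          "to continue...", file=out)
    return read_line().strip() == CONFIRM_PHRASE
```

A script would call `confirm_read_write()` before doing any work and exit if it returns False. Requiring a full phrase rather than `y` makes it hard to confirm by reflex.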
86. known input/output
we know where all of the queries from dev originate
http://www.flickr.com/photos/medevac71/4875526920/sizes/l/in/photostream/
87. dangerous/unnecessary queries
(DEV) etsy_rw@jgoulah [test]> select * from fred_test;
ERROR 9001 (E9001): Selects from tables must have where clauses
-- filter dangerous queries - (queries without a WHERE)
-- remove unnecessary queries - (instead of DELETE, have a flag; ALTER statements don’t run from dev)
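The filtering rules in these notes can be sketched as a simple pre-check. This is an assumption-laden illustration, not the proxy's actual implementation: `check_query` and `QueryRejected` are hypothetical names, the matching is naive regex rather than real SQL parsing, and only the E9001 message comes from the slide (the ALTER message is made up).

```python
import re

class QueryRejected(Exception):
    """Raised when a dev query is blocked by the filter."""

def check_query(sql):
    """Return the statement if allowed, otherwise raise QueryRejected."""
    statement = sql.strip().rstrip(";")
    is_select = re.match(r"select\b", statement, re.IGNORECASE)
    has_from = re.search(r"\bfrom\b", statement, re.IGNORECASE)
    has_where = re.search(r"\bwhere\b", statement, re.IGNORECASE)
    if is_select and has_from and not has_where:
        # Message text taken from the slide.
        raise QueryRejected(
            "ERROR 9001 (E9001): Selects from tables must have where clauses")
    if re.match(r"alter\b", statement, re.IGNORECASE):
        # Hypothetical message; the slide only says ALTERs don't run from dev.
        raise QueryRejected("ALTER statements do not run from dev")
    return statement
```

For example, `check_query("select * from fred_test;")` raises, while `check_query("SELECT id FROM t WHERE id = 1")` passes through unchanged.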
89. 2013-04-22 18:05:43 485370821 devproxy -- /* DEVPROXY source=10.101.194.19:40198 uuid=c309e8db-ca32-4171-9c4a-6c37d9dd3361 [htSp8458VmHlC] [etsy_index_B] [browse.php] */ SELECT id FROM table;
date: 2013-04-22 18:05:43
thread id: 485370821
source ip: 10.101.194.19:40198
unique id generated by proxy: c309e8db-ca32-4171-9c4a-6c37d9dd3361
app request id: htSp8458VmHlC
dest. shard: etsy_index_B
script: browse.php
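Since every proxied query carries this comment, the log can be picked apart mechanically. The sketch below parses the example entry from this slide; the regular expression and field names are assumptions based on that single sample, with fields assumed to be separated by single spaces.

```python
import re

# Pattern inferred from the one log entry on the slide (an assumption).
LOG_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<thread_id>\d+) devproxy -- "
    r"/\* DEVPROXY source=(?P<source>\S+) "
    r"uuid=(?P<uuid>\S+) "
    r"\[(?P<request_id>[^\]]+)\] "
    r"\[(?P<shard>[^\]]+)\] "
    r"\[(?P<script>[^\]]+)\] \*/ "
    r"(?P<query>.*)"
)

line = (
    "2013-04-22 18:05:43 485370821 devproxy -- "
    "/* DEVPROXY source=10.101.194.19:40198 "
    "uuid=c309e8db-ca32-4171-9c4a-6c37d9dd3361 "
    "[htSp8458VmHlC] [etsy_index_B] [browse.php] */ "
    "SELECT id FROM table;"
)

fields = LOG_PATTERN.match(line).groupdict()
for name, value in fields.items():
    print(f"{name}: {value}")
```

With the fields separated, it becomes straightforward to group dev queries by script, shard, or request id.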
91. stealth data
hiding data from users
(favorites go on dev and prod shard, making sure test user/shops don’t show up in search)
http://www.flickr.com/photos/davidyuweb/8063097077/sizes/h/in/photostream/
92. overlays
An overlay is a local copy of production data
If there are overlays in place in dev, it will send the queries to the local db instead
(it does this by overriding the shard lookup on the index and checking for a table/pk pair).
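The overlay check described above amounts to a lookup before the normal shard lookup. A minimal sketch, assuming overlays are tracked as a set of (table, pk) pairs; `route_query`, `lookup_shard`, and the `"local"` return value are all hypothetical names.

```python
def route_query(table, pk, overlays, lookup_shard):
    """Send the query to the local db if an overlay exists, else to the real shard."""
    if (table, pk) in overlays:
        return "local"                  # overlay present: skip the index lookup
    return lookup_shard(table, pk)      # normal path: ask the index servers

# Example: shop 42 has been copied locally, shop 43 has not.
overlays = {("shops", 42)}
fake_index = lambda table, pk: "shard_7"   # stand-in for the index lookup

print(route_query("shops", 42, overlays, fake_index))
print(route_query("shops", 43, overlays, fake_index))
```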
94. Delayed Slaves
pt-slave-delay watches a slave and starts and stops its replication SQL thread as necessary to hold it behind the master by the requested delay
http://www.flickr.com/photos/xploded/141295823/sizes/o/in/photostream/
95. Delayed Slaves
4 hour delay behind master
produce row based binary logs
allow for quick recovery
role of the delayed slave
also source of BCP
(business continuity planning - prevention and recovery of threats)
96. pt-slave-delay --daemonize \
  --pid /var/run/pt-slave-delay.pid --log /var/log/pt-slave-delay.log \
  --delay 4h --interval 1m --nocontinue
last 3 options most important:
4h delay; interval is how frequently it should check whether the slave should be started or stopped
nocontinue - don’t continue replication normally on exit (don’t catch up with master)
user/pass omitted for brevity
112. Prevent disk from filling
High traffic objects (shops, users)
high traffic == disk usage and I/O util
113. Prevent disk from filling
High traffic objects (shops, users)
Shard rebalancing
rebalancing when adding new shards or shards fill unequally
126. Logical Shards
Writing data into the new shard, deleting data from the old shard, and then optimizing every single table is a large amount of work
Instead, we can run one MySQL process with many databases
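One way to picture logical shards: each logical shard is a database, and a map records which MySQL host currently serves it. The sketch below is purely illustrative; the shard names, host names, and `migrate` helper are hypothetical, and the actual data copy is left out.

```python
# Hypothetical mapping of logical shards (databases) to physical MySQL hosts.
logical_to_host = {
    "shard_001": "db-host-1",
    "shard_002": "db-host-1",
    "shard_003": "db-host-2",
    "shard_004": "db-host-2",
}

def host_for(logical_shard):
    """Look up which physical host currently serves a logical shard."""
    return logical_to_host[logical_shard]

def migrate(logical_shard, new_host):
    """Repoint a logical shard after its database has been copied to new_host."""
    logical_to_host[logical_shard] = new_host

before = host_for("shard_002")
migrate("shard_002", "db-host-3")   # move one whole database to a new machine
after = host_for("shard_002")
print(before, "->", after)
```

Rebalancing then means moving whole databases between hosts rather than rewriting rows and optimizing tables one by one.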
129. Advantages
• multi threaded slave
• simpler migrations
In MySQL 5.6 we have the multi-threaded slave, but it can only process in parallel across multiple MySQL schemas (databases).
The downside is that we have many more logical shards to maintain
143. COMMAND TIMINGS
===============
---------------------------------------------------------------------
+ HOST: worker19, USER: , DB: 2, TIME: 4
---------------------------------------------------------------------
select * from activity where owner_id = 7395036 and owner_type_id = 2 and deleted = 0 and creation_time >= 1382226430 and public = 1 order by creation_time desc limit 0,50
---------------------------------------------------------------------
+ HOST: worker27, USER: , DB: 2, TIME: 4
---------------------------------------------------------------------
SELECT * FROM shop_stats WHERE shop_id = 5902046 AND currency_code = 'USD' AND sales_year = 2012 AND id != 2432609442
146. qtop
we send queries over UDP from our ORM and stick them in a db to analyze later
request context: request id, logged in user-id, what script is executing
avoids the perf hit of the slow query log, and it's realtime across all shards because it originates from the client
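A client-side sketch of the idea: package the query with its request context and fire it at a collector over UDP without waiting for a reply. The JSON wire format, field names, and function names are assumptions for illustration; the slides do not describe the actual format.

```python
import json
import socket

def build_payload(sql, request_id, user_id, script):
    """Bundle a query with its request context (JSON format is an assumption)."""
    return json.dumps({
        "sql": sql,
        "request_id": request_id,
        "user_id": user_id,
        "script": script,
    }).encode("utf-8")

def send_query(sock, addr, payload):
    """Fire-and-forget: UDP adds no round trip and never blocks on the collector."""
    sock.sendto(payload, addr)

payload = build_payload("SELECT id FROM table", "htSp8458VmHlC", 500, "browse.php")
decoded = json.loads(payload)
print(decoded["script"], len(payload), "bytes")
```

To send, create a socket with `socket.socket(socket.AF_INET, socket.SOCK_DGRAM)` and call `send_query(sock, (host, port), payload)`, where host and port identify a collector of your choosing. Dropped datagrams are acceptable here because the data is only used for analysis.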