7. Big Data Analytics At Etsy.com
History of Etsy
Data architecture of Etsy through the years
Need for Data Analytics
Growth of “Big Data” service needs
Hadoop vs. other solutions
Vertica
Schlep and Autoschlep tools for data replication
8. About Etsy
Founded in 2005 in New York City by three NYU students
Etsy is the leading marketplace for handcrafted goods
More than 20 million users and one million sellers
Etsy will have sales of $1 billion this year
400 employees, 200 engineers
50 remote employees all over the world
Now has mobile and iPad applications
9. Etsy Stack
PHP front end (Rasmus Lerdorf, creator of PHP, is on staff)
Originally had PHP-->Python (Twisted Framework) middle layer-->PostgreSQL
Now PHP-->ORM (written in PHP)-->MySQL/PostgreSQL
10. Original Etsy Stack
PHP front end servers
Middle layer written in Python which bound Python functions to PostgreSQL stored
procedures; business logic in stored procedures and views
Started with one master PostgreSQL database (users, listings, sales, forums,
conversations, “showcases”)
When business increased, physically separated database components into separate
PostgreSQL servers:
Master (users, listings, transactions)
Forums (community message boards)
Convos (messaging between buyers and sellers)
Showcases (advertising slots purchased by sellers in various categories)
Listing view counts were kept in memory – when the system had to be restarted, sellers
lost the view counts for their listings! Solved by creating another PostgreSQL server just
for recording listing view counts.
Search originally accomplished with Full Text Indexing on PostgreSQL master database.
Did not scale, so introduced SOLR search in 2009 (inverted search lookup).
11. Problems with the original Etsy stack
Data architecture did not scale – one master database is not a scalable architecture
Stored procedures contained a lot of business logic, which resulted in higher database
CPU load than necessary
Difficult to hire engineers who could code both PHP front end and stored procedure logic
in PostgreSQL – thus hard to change business logic and difficult to deploy changes
Even hard to find engineers who could code Twisted framework in Python
Python middle layer was a “stored procedure routing system” (called “sprouter”) and
would bind PostgreSQL stored procedures and views to Python function calls
Worked at the OID (Object ID) level – bound PG objects to Python objects by using
OIDs
When PG objects were recreated, they would acquire a new OID, and the sprouter
binding would break
Any change required complete reboot of sprouter layer to acquire new OIDs
This was the “Version 2” architecture, which replaced a previous one that was even worse.
The development of Version 2 took 1.5 years and almost killed the company! WATCH
OUT FOR VERSION 2!!!
12. The NEW Etsy Stack and Architecture
Started by making key senior hires from Flickr who had scaled that site with sharding
Redesigned data architecture:
Remove business logic – no more stored procedures
Database now just for “dumb storage” of facts and dimensions
Get rid of sprouter layer
Create new PHP business logic layer, called EtsyORM
Use better PHP templating
AJAX interactive client side logic
Generally replace PostgreSQL with MySQL
Because Flickr people were more familiar with it
Sharded MySQL databases
Plays well with PHP (part of LAMP stack, mature technology)
Denormalized data; PostgreSQL data was fairly normalized to reduce data footprint
Shard along lines that make sense
Use universal dedicated ticket server to issue unique sequence values
Keep related data as close together as possible, and closeness of data is more
important than data footprint. Redundancy is OK!
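The dedicated ticket server is the same trick Flickr used: a single-row table whose auto-increment counter hands out globally unique IDs for all shards. A minimal sketch of the idea, using SQLite in place of MySQL (table and function names are illustrative, not Etsy's actual schema):

```python
import sqlite3

# Single-row "ticket" table. AUTOINCREMENT guarantees ids are never reused,
# so every REPLACE hands out a fresh, globally unique, increasing value.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE tickets (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,
        stub TEXT NOT NULL UNIQUE   -- always 'a': keeps the table at one row
    )
""")

def next_ticket(conn):
    """Issue the next unique sequence value (Flickr-style ticket server)."""
    # REPLACE deletes the old row and inserts a new one, bumping the counter.
    cur = conn.execute("REPLACE INTO tickets (stub) VALUES ('a')")
    conn.commit()
    return cur.lastrowid

ids = [next_ticket(db) for _ in range(3)]
print(ids)  # -> [1, 2, 3]: strictly increasing, and the table stays one row
```

Because the counter lives in one place, rows created on different shards never collide on id.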
13. Client Requests
[Architecture diagram: client requests go through EtsyORM, which uses the MySQL ticket
server and a lookup index to route queries to the PostgreSQL master and the MySQL shards]
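The lookup-index step in that request flow can be sketched as a small mapping from entity to shard: the ORM asks the index which shard holds a given user, then queries only that shard. All names below are hypothetical illustrations, not Etsy's actual EtsyORM code, and the modulo placement stands in for whatever placement policy is really used:

```python
# Hypothetical shard-routing sketch: a lookup index maps each user to the
# shard that holds their rows; the ORM then queries only that shard.

SHARDS = {0: "mysql-shard-0", 1: "mysql-shard-1", 2: "mysql-shard-2"}

# Lookup index: user_id -> shard id. In production this is itself a
# database, which is what makes later rebalancing possible.
lookup_index = {}

def assign_shard(user_id):
    """Place a new user on a shard (simple modulo placement for illustration)."""
    shard_id = user_id % len(SHARDS)
    lookup_index[user_id] = shard_id
    return shard_id

def shard_for(user_id):
    """Route a query: consult the lookup index, return the shard host."""
    return SHARDS[lookup_index[user_id]]

assign_shard(101)
print(shard_for(101))  # -> mysql-shard-2
```

Keeping the mapping in an index rather than a fixed hash is what lets individual users be moved when shards are rebalanced.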
14. New Data Architecture – Good and Not So Good
Good:
Sharded data is great for single record lookup
Load is spread out over many servers
Lookup is fast
Scaling data capability is easy because it is horizontal – just add more servers
No business logic in database, all logic instead is in ORM layer
Not so good:
Sharded data, while good for single record lookup, is not good for aggregation
Data is now spread out over several machines instead of concentrated in one
Data has to be periodically “rebalanced” across shards
Querying aggregated data now is harder – can't issue simple SQL commands, the data
has to be mapped and reduced
Writing the queries is now an engineer job because query needs to go through ORM
It is hard to do analysis on the business, because that requires a lot of aggregated data
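The "mapped and reduced" point above can be made concrete: an aggregate such as total sales is no longer one SQL statement, it must be computed per shard and combined in application code. A minimal sketch, with in-memory lists standing in for the per-shard MySQL queries:

```python
# Aggregating across shards: each shard computes a partial result ("map"),
# and the application combines the partials ("reduce"). The lists below
# stand in for per-shard SELECT SUM(amount) queries.
shard_sales = {
    "shard-0": [2.33, 1.22],
    "shard-1": [0.20, 0.20],
    "shard-2": [3.56],
}

def total_sales(shards):
    partials = [sum(amounts) for amounts in shards.values()]  # map step
    return round(sum(partials), 2)                            # reduce step

print(total_sales(shard_sales))  # -> 7.51
```

Every cross-shard report needs code like this, which is why analysis became an engineering task.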
15. Client Requests
[Architecture diagram repeated: EtsyORM, MySQL ticket server, lookup index,
PostgreSQL master, MySQL shards]
17. Current BI Architecture Problems
Problems with BI Server:
BI server has become like another master database, but with data from all sources
including shards
BI server is very overloaded with data replication tasks and report queries
BI server is PostgreSQL and not well suited for aggregation and analytics
BI Server is often very slow
Problems with Hadoop:
Hadoop is batch oriented – not good for “ad hoc” queries
Programming Hadoop jobs is tedious and often takes a lot of trial and error to get right
Creating Hadoop jobs is a specialized programming task that business analysts are not
able to perform
18. Etsy's solution to the BI analytics problem: Vertica
Etsy licensed Vertica to be its new BI server platform
Vertica is a licensed product, and was bought by Hewlett-Packard last year
Was designed at MIT by Prof. Michael Stonebraker, who is known as the “Father of
Postgres”
Vertica shares many internals with PostgreSQL. SQL parser and vsql command line
client are derived from PostgreSQL
Vertica is a “columnar store” database which is optimized for data analytics – it excels
at data aggregation
Vertica is a licensed product, but there is a free version (one node, 1 TB of data) which is
very useful
Vertica has a multiple-peer architecture; a typical installation has several nodes, each equal
to the others. It “shards” out of the box and handles all distribution of data
Vertica puts a copy of small tables on each node, but segments large tables across
nodes with internal hash algorithm. It does this seamlessly, so very easy to set up and
manage
Has very rich SQL language, with good analytic queries and such things as windowing.
Most queries that run on PostgreSQL BI run unchanged on Vertica.
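The windowing mentioned above is standard analytic SQL. The sketch below runs a window-function query of that style on SQLite (the table and data are made up); the same statement would run unchanged on Vertica or PostgreSQL:

```python
import sqlite3

# A window-function query of the kind Vertica's analytic SQL supports:
# a running total of fees per user. SQLite (3.25+) is used here only as
# a stand-in so the example is self-contained.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE charges (id INTEGER, user_id INTEGER, amount REAL)")
db.executemany(
    "INSERT INTO charges VALUES (?, ?, ?)",
    [(1, 101, 2.33), (2, 101, 1.22), (5, 128, 3.56)],
)

rows = db.execute("""
    SELECT user_id,
           amount,
           ROUND(SUM(amount) OVER (PARTITION BY user_id ORDER BY id), 2)
               AS running_total
    FROM charges
    ORDER BY user_id, id
""").fetchall()
for row in rows:
    print(row)  # -> (101, 2.33, 2.33), (101, 1.22, 3.55), (128, 3.56, 3.56)
```

Expressing this without window functions takes a self-join or application code, which is exactly the kind of query the BI PostgreSQL server struggled with.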
19. What is a Columnar Store database?
A traditional relational database stores rows, and those rows are indexed for fast retrieval of
records; a columnar store instead stores ordered columns, not rows
Vertica has no traditional indexes, although it has primary key and foreign key constraints;
it uses encoding instead (preferably run-length encoding, “RLE”)
Relational database stores rows:

id  user_id  charge_type  amount
1   101      sales fee    2.33
2   101      sales fee    1.22
3   56       listing fee  0.20
4   23       listing fee  0.20
5   128      sales fee    3.56

Columnar database (Vertica) stores each column separately, in sorted order, so repeated
values compress well with RLE – e.g. the sorted charge_type column becomes
(listing fee ×2, sales fee ×3)
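Run-length encoding on a sorted column like charge_type can be sketched in a few lines (an illustration of the idea only, not Vertica's internal implementation):

```python
from itertools import groupby

def rle_encode(column):
    """Run-length encode a column: store each distinct value once with its
    run count. Works best when the column is sorted, as in a column store."""
    return [(value, len(list(run))) for value, run in groupby(column)]

# The charge_type column from the example, sorted as a column store keeps it:
charge_type = ["listing fee", "listing fee",
               "sales fee", "sales fee", "sales fee"]
print(rle_encode(charge_type))  # -> [('listing fee', 2), ('sales fee', 3)]
```

Five values compress to two (value, count) pairs; on low-cardinality columns with billions of rows the savings, and the aggregation speedup, are dramatic.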
20. Getting Data To Vertica was a problem, but Etsy wrote a solution
There are no ETL (Extract, Transform, Load) tools for Vertica, except some support for
moving data from HDFS
No ETL for getting data from relational databases over to Vertica
Etsy had the requirement that we need to get all data from the MySQL shards and
PostgreSQL databases into Vertica for it to be useful to the business analysts
Etsy created two tools, schlep and autoschlep, to accomplish ETL from relational
databases, and we are going to open source them
21. About schlep
Schlep: Yiddish word meaning, “To carry a heavy load a long distance”
We built schlep into Vertica as a SQL function, so that it is easy for the analysts to use.
Schlep is overloaded and has 5 variants to allow additional options. It is simple to use:
> SELECT * FROM schlep(user, 'table_name');
This will move the table from the BI PostgreSQL database into Vertica. It does the following:
Connects to PostgreSQL BI and obtains the DDL for the table
Maps the data types to Vertica types
Creates the table with correct permissions on Vertica
Copies data over to Vertica by creating a psql COPY process and piping that into vsql
(Vertica) COPY process
Is very fast because Vertica does not check constraints by default when data is loaded
Schlep is a “one shot” static snapshot of the data. Once copied to Vertica, there is no
further update
Works with PostgreSQL right now, MySQL replication release in early November
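The COPY-to-COPY pipe described above can be sketched as below. This is a hypothetical reconstruction of the mechanism, not Etsy's actual schlep code; only the command builder is shown, since running it needs live psql and vsql clients:

```python
import shlex

def build_schlep_pipeline(table, pg_db="bi", vertica_db="analytics"):
    """Build the two halves of a schlep-style copy pipe:
    psql streams the table out as CSV, vsql streams it back in.
    (Hypothetical reconstruction; database names are made up.)"""
    dump = ["psql", "-d", pg_db, "-c",
            f"COPY {table} TO STDOUT WITH CSV"]
    load = ["vsql", "-d", vertica_db, "-c",
            f"COPY {table} FROM STDIN DELIMITER ','"]
    return dump, load

dump_cmd, load_cmd = build_schlep_pipeline("fees")
print(" ".join(map(shlex.quote, dump_cmd)), "|",
      " ".join(map(shlex.quote, load_cmd)))
```

Wiring the two together is a single subprocess pipe (psql stdout into vsql stdin), so the data streams between databases without ever touching disk.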
22. About autoschlep
Autoschlep is a system that allows incremental replication (trickle load) of data from a
source to Vertica.
Currently works with PostgreSQL, MySQL coming soon
Works by putting an “after” trigger on the source data table. Any data change (insert,
update, delete) is recorded by the trigger in a staging table on the source database
The autoschlep process is scheduled by cron, the autoschlep scheduler, or whatever
scheduling system you choose
Autoschlep then uses schlep to move the data from the staging table on the source
database and puts it in an identical staging table on Vertica. It then does a MERGE of
that data into the target table
Etsy has used schlep and autoschlep to move billions of records to Vertica and keep them
synchronized to within 15 minutes of the source data
Autoschlep is called this way:
/autoschlep.py schema_name table_name primary_key
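The trigger → staging table → merge pattern above can be demonstrated in miniature. SQLite stands in for both PostgreSQL (source) and Vertica (target), and INSERT OR REPLACE stands in for Vertica's MERGE; only insert and update triggers are shown, with deletes handled analogously. Schema and names are illustrative, not autoschlep's actual code:

```python
import sqlite3

# Source database: triggers record every change in a staging table.
src = sqlite3.connect(":memory:")
src.executescript("""
    CREATE TABLE fees (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE fees_staging (id INTEGER, amount REAL);
    CREATE TRIGGER fees_ai AFTER INSERT ON fees BEGIN
        INSERT INTO fees_staging VALUES (NEW.id, NEW.amount);
    END;
    CREATE TRIGGER fees_au AFTER UPDATE ON fees BEGIN
        INSERT INTO fees_staging VALUES (NEW.id, NEW.amount);
    END;
""")

# Target database (standing in for Vertica).
tgt = sqlite3.connect(":memory:")
tgt.execute("CREATE TABLE fees (id INTEGER PRIMARY KEY, amount REAL)")

def trickle_load(source, target):
    """Replay staged changes into the target (upsert), then clear staging."""
    staged = source.execute("SELECT id, amount FROM fees_staging").fetchall()
    target.executemany("INSERT OR REPLACE INTO fees VALUES (?, ?)", staged)
    source.execute("DELETE FROM fees_staging")

src.execute("INSERT INTO fees VALUES (1, 2.33)")
src.execute("INSERT INTO fees VALUES (2, 1.22)")
trickle_load(src, tgt)                               # initial load
src.execute("UPDATE fees SET amount = 9.99 WHERE id = 1")
trickle_load(src, tgt)                               # incremental load
print(tgt.execute("SELECT id, amount FROM fees ORDER BY id").fetchall())
# -> [(1, 9.99), (2, 1.22)]
```

Each scheduled run only moves the rows that changed since the last run, which is what keeps the target within minutes of the source.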
23. Where to Get Vertica and the Schlep Tools
Vertica has a FREE version that is quite powerful. It is limited to one node and
maximum 1 terabyte of data, but this can be very useful.