Growing Data Analytics at Etsy

Chris Bohn (“CB”)
Fort Ross (Russia) – Established in 1812 as a Tsarist outpost

Fort Ross

– Close to Russian River
– City of Sebastopol is nearby
Russian Hill, San Francisco
Big Data Analytics At Etsy.com



    History of Etsy

    Data architecture of Etsy through the years

    Need for Data Analytics

    Growth of “Big Data” service needs

    Hadoop vs. other solutions

    Vertica

    Schlep and Autoschlep tools for data replication
About Etsy




    Founded in 2005 in New York City by three NYU students

    Etsy is the leading marketplace for handcrafted goods

    More than 20 million users and one million sellers

    Etsy will have sales of $1 billion this year

    400 employees, 200 engineers

    50 remote employees all over the world

    Now has mobile and iPad applications
Etsy Stack




    PHP front end (Rasmus Lerdorf, creator of PHP, is on staff)

    Originally had PHP-->Python (Twisted Framework) middle layer-->PostgreSQL

    Now PHP-->ORM (written in PHP)-->MySQL/PostgreSQL
Original Etsy Stack

    PHP front end servers

  Middle layer written in Python which bound Python functions to PostgreSQL stored
procedures; business logic in stored procedures and views

 Started with one master PostgreSQL database (users, listings, sales, forums,
conversations, “showcases”)

 When business increased, physically separated database components into separate
PostgreSQL servers:
  
    Master (users, listings, transactions)
  
    Forums (community message boards)
  
    Convos (messaging between buyers and sellers)
  
    Showcases (advertising slots purchased by sellers in various categories)

Listing view counts kept in memory – problem when system had to be restarted, sellers
lost the view counts for listings! Solved by creating another PostgreSQL server just for
recording view counts for listings.

Search originally accomplished with Full Text Indexing on PostgreSQL master database.
Did not scale, so introduced SOLR search in 2009 (inverted search lookup).
Problems with the original Etsy stack

    Data architecture did not scale – one master database is not a scalable architecture

 Stored procedures contained a lot of business logic, which resulted in higher database
CPU load than necessary

  Difficult to hire engineers who could code both PHP front end and stored procedure logic
in PostgreSQL – thus hard to change business logic and difficult to deploy changes

    It was even hard to find engineers who could code in Python's Twisted framework

 Python middle layer was a “stored procedure routing system” (called “sprouter”) and
would bind PostgreSQL stored procedures and views to Python function calls
 
   Worked at the OID (Object ID) level – bound PG objects to Python objects by using
   OIDs
 
   When PG objects were recreated, they would acquire a new OID, and the sprouter
   binding would break
 
   Any change required complete reboot of sprouter layer to acquire new OIDs
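The OID-binding fragility above can be shown with a toy simulation. Names and the catalog model are invented for illustration; sprouter's real mechanism bound Python functions to PostgreSQL catalog OIDs:

```python
# Why OID binding was brittle: sprouter cached database objects by OID,
# but recreating an object assigns a new OID, so the cached binding
# silently goes stale. Illustrative simulation, not sprouter's code.

catalog = {}          # simulated pg catalog: object name -> current OID
next_oid = [1000]

def create_proc(name):
    """Create (or recreate) a stored procedure; each creation gets a new OID."""
    next_oid[0] += 1
    catalog[name] = next_oid[0]
    return catalog[name]

bound_oid = create_proc("get_user")   # sprouter binds at startup, by OID
create_proc("get_user")               # a deploy drops/recreates the proc -> new OID

stale = bound_oid != catalog["get_user"]   # True: the cached binding broke
```

This is why any change to a stored procedure forced a full sprouter restart: only a restart re-read the catalog and picked up the new OIDs.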

    This was the “Version 2” architecture, which replaced a previous one that was even
      worse. The development of Version 2 took 1.5 years and almost killed the company!
      WATCH OUT FOR VERSION 2!!!
The NEW Etsy Stack and Architecture

    Started with key senior position hires from Flickr who had scaled that site with sharding

    Redesigned data architecture:
    
      Remove business logic – no more stored procedures
    
      Database now just for “dumb storage” of facts and dimensions
    
      Get rid of sprouter layer
    
      Create new PHP business logic layer, called EtsyORM
    
      Use better PHP templating
    
      AJAX interactive client side logic



    Generally replace PostgreSQL with MySQL
    
      Because Flickr people were more familiar with it
    
      Sharded MySQL databases
    
      Plays well with PHP (part of LAMP stack, mature technology)



    Denormalized data; PostgreSQL data was fairly normalized to reduce data footprint
    
      Shard along lines that make sense
    
      Use universal dedicated ticket server to issue unique sequence values
    
      Keep related data as close together as possible, and closeness of data is more
      important than data footprint. Redundancy is OK!
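The ticket-server idea above can be sketched in a few lines. This is an illustrative simulation, not Etsy's code: the real ticket server is a dedicated MySQL instance, and shard assignment goes through a lookup index rather than the simple modulo shown here:

```python
# Sketch of a dedicated ticket server plus shard routing.
# The modulo scheme is illustrative; a lookup index allows rebalancing.

class TicketServer:
    """Issues globally unique, monotonically increasing IDs,
    so every shard can insert rows without ID collisions."""
    def __init__(self, start=0):
        self._next = start

    def next_id(self):
        self._next += 1
        return self._next

def shard_for(entity_id, num_shards):
    """Map an ID to a shard. Real systems often use a lookup table
    instead, so related data can be co-located and rebalanced."""
    return entity_id % num_shards

tickets = TicketServer()
user_id = tickets.next_id()        # unique across all shards
shard = shard_for(user_id, 8)      # which MySQL shard holds this user
```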
[Diagram: client requests go through EtsyORM, which consults a lookup index and the
MySQL ticket server and routes reads and writes to the PostgreSQL master and the
MySQL shards]
New Data Architecture – Good and Not So Good

Good:

    Sharded data is great for single record lookup
    
      Load is spread out over many servers
    
      Lookup is fast
    
      Scaling data capability is easy because it is horizontal – just add more servers
    
      No business logic in database, all logic instead is in ORM layer


Not so good:

    Sharded data, while good for single record lookup, is not good for aggregation
    
      Data is now spread out over several machines instead of concentrated in one
    
      Data has to be periodically “rebalanced” across shards
    
      Querying aggregated data now is harder – can't issue simple SQL commands, the data
      has to be mapped and reduced
    
      Writing the queries is now an engineer job because query needs to go through ORM
    
      It is hard to do analysis on the business, because that requires a lot of aggregated data
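The "mapped and reduced" point can be illustrated with a toy aggregate. The shard contents and the fees-per-seller query below are invented for illustration:

```python
# Why aggregation gets harder with shards: a query that was one SQL
# statement on the master now has to be mapped over every shard and
# the partial results reduced. Data here is illustrative.

shards = [
    [{"seller": 101, "amount": 2.33}, {"seller": 56, "amount": 0.20}],
    [{"seller": 101, "amount": 1.22}, {"seller": 23, "amount": 0.20}],
]

def map_shard(rows):
    """Per-shard partial aggregate (would run on each shard server)."""
    totals = {}
    for r in rows:
        totals[r["seller"]] = totals.get(r["seller"], 0.0) + r["amount"]
    return totals

def reduce_partials(partials):
    """Merge the per-shard partials into one global result."""
    out = {}
    for p in partials:
        for seller, amount in p.items():
            out[seller] = out.get(seller, 0.0) + amount
    return out

totals = reduce_partials(map_shard(s) for s in shards)
```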
Business Intelligence (BI) Data Architecture




[Diagram: the PostgreSQL master and the MySQL shards both replicate into a single
PostgreSQL BI database]
Current BI Architecture Problems

Problems with BI Server:

  BI server has become like another master database, but with data from all sources
including shards

    BI server is very overloaded with data replication tasks and report queries

    BI server is PostgreSQL and not well suited for aggregation and analytics

    BI Server is often very slow


Problems with Hadoop:

    Hadoop is batch oriented – not good for “ad hoc” queries

    Programming Hadoop jobs is tedious and often takes a lot of trial and error to get right

 Creating Hadoop jobs is a specialized programming task that business analysts are not
able to perform
Etsy's solution to the BI analytics problem: Vertica

Etsy licensed Vertica to be its new BI server platform

    Vertica is a licensed product, and was bought by Hewlett-Packard last year

 Was designed at MIT by Prof. Michael Stonebraker, who is known as the “Father of
Postgres”

  Vertica shares many internals with PostgreSQL. SQL parser and vsql command line
client are derived from PostgreSQL

  Vertica is a “columnar store” database which is optimized for data analytics – it excels
at data aggregation

 Vertica is licensed, but there is a free version (one node, 1 TB of data) which is very
useful

 Vertica has a multi-peer architecture; a typical installation has several nodes, each equal
to the others. It “shards” out of the box and handles all distribution of data

 Vertica puts a copy of small tables on each node, but segments large tables across
nodes with internal hash algorithm. It does this seamlessly, so very easy to set up and
manage

 Has a very rich SQL dialect, with good analytic queries and features such as windowing.
Most queries that run on the PostgreSQL BI server run unchanged on Vertica.
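A rough sketch of the segmentation idea, assuming hash-modulo placement (Vertica's actual internal hash is opaque to users; the node names here are made up):

```python
# How Vertica-style segmentation might distribute large-table rows
# across nodes by hash, while small tables are replicated to every
# node. Hash choice and node count are illustrative.

import hashlib

NODES = ["node1", "node2", "node3"]

def node_for(key):
    """Deterministically assign a row to one node by hashing its key."""
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

# Large table: each row lands on exactly one node (segmented).
placement = {row_id: node_for(row_id) for row_id in range(6)}

# Small table: a full copy on every node (unsegmented/replicated).
small_table_copies = {node: "full copy" for node in NODES}
```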
What is a Columnar Store database?
      A traditional relational database stores rows, and those rows are indexed for fast retrieval
      of records; a columnar store instead stores ordered columns, not rows.
      Vertica has no traditional indexes, although it has primary key and foreign key constraints;
      it relies on encoding (preferably run-length encoding, “RLE”).

     Relational database (row store):        Columnar database (Vertica), rows sorted
                                             so RLE can collapse repeated values:

     id  user_id  charge_type  amount        id  user_id  charge_type  amount
     1   101      sales fee    2.33          3    56      listing fee  0.20  \
     2   101      sales fee    1.22          4    23      listing fee  0.20  / RLE: 2 rows
     3    56      listing fee  0.20          1   101      sales fee    2.33  \
     4    23      listing fee  0.20          2   101      sales fee    1.22   | RLE: 3 rows
     5   128      sales fee    3.56          5   128      sales fee    3.56  /
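The RLE idea can be shown directly with a minimal run-length encoder over a sorted column:

```python
# Minimal run-length encoding sketch: after sorting a column, repeated
# values collapse to (value, count) pairs -- the compression that makes
# columnar stores like Vertica cheap to scan and aggregate.

def rle_encode(column):
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([value, 1])   # start a new run
    return runs

charge_type = ["listing fee", "listing fee",
               "sales fee", "sales fee", "sales fee"]
runs = rle_encode(charge_type)   # [['listing fee', 2], ['sales fee', 3]]
```

Five values compress to two runs; on a real column with millions of rows and few distinct values, the savings are dramatic.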
Getting Data To Vertica was a problem, but Etsy wrote a solution

 There are no ETL (Extract, Transform, Load) tools for Vertica, except some support for
moving data from HDFS

    No ETL for getting data from relational databases over to Vertica

 Etsy had the requirement that we need to get all data from the MySQL shards and
PostgreSQL databases into Vertica for it to be useful to the business analysts

 Etsy created two tools, schlep and autoschlep, to accomplish ETL from relational
databases, and we are going to open source them
About schlep

Schlep: Yiddish word meaning, “To carry a heavy load a long distance”

We built schlep into Vertica as a SQL function so that it is easy for analysts to use. Schlep is
overloaded and has 5 variants to allow additional options. It is simple to use:

> SELECT * FROM schlep(user, 'table_name');

This moves the table from the BI PostgreSQL database into Vertica. It does the following:

    Connects to PostgreSQL BI and obtains the DDL for the table

    Maps the data types to Vertica types

    Creates the table with correct permissions on Vertica

  Copies data over to Vertica by creating a psql COPY process and piping it into a vsql
(Vertica) COPY process

    Is very fast because Vertica does not check constraints by default when data is loaded

  Schlep is a “one shot” static snapshot of the data. Once copied to Vertica, there is no
further update

    Works with PostgreSQL right now, MySQL replication release in early November
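Two of the steps above (type mapping and the COPY pipe) might look roughly like this. The type map and command strings are illustrative assumptions, not schlep's actual code:

```python
# Sketch of two schlep steps: mapping PostgreSQL column types to
# Vertica types, and building the psql-to-vsql COPY pipe.
# Mappings and flags are illustrative, not schlep's real tables.

PG_TO_VERTICA = {
    "text": "varchar(65000)",
    "integer": "int",
    "bigint": "int",          # Vertica integers are 64-bit
    "numeric": "numeric",
    "timestamp without time zone": "timestamp",
}

def map_type(pg_type):
    """Translate a PostgreSQL type name to a Vertica type name."""
    return PG_TO_VERTICA.get(pg_type, "varchar(65000)")

def copy_pipe(table):
    """Stream rows out of PostgreSQL straight into Vertica:
    no intermediate file, so the copy is fast."""
    return (f'psql -c "COPY {table} TO STDOUT" | '
            f'vsql -c "COPY {table} FROM STDIN"')

cmd = copy_pipe("fees")
```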
About autoschlep


Autoschlep is a system that allows incremental replication (“trickle load”) of data from the
source to Vertica.

  Currently works with PostgreSQL, MySQL coming soon

  Works by putting an “after” trigger on the source data table. Any change (insert, update,
or delete) is recorded by the trigger in a staging table on the source database

  The autoschlep process is scheduled by cron, the autoschlep scheduler, or whatever
scheduling system you choose

  Autoschlep then uses schlep to move the data from the staging table on the source
database and puts it in an identical staging table on Vertica. It then does a MERGE of
that data into the target table

  Etsy has used schlep and autoschlep to move billions of records to Vertica and keep it
synchronized within 15 minutes of the source data



Autoschlep is called this way:

/autoschlep.py schema_name table_name primary_key
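The trigger/staging/MERGE cycle described above can be sketched with tables modeled as dicts keyed by primary key. This is an illustrative simulation of the merge semantics, not autoschlep's implementation:

```python
# Sketch of autoschlep's trickle load: changes captured by the source
# trigger into a staging table are periodically merged into the Vertica
# target table (upsert + delete). Data here is illustrative.

target = {1: {"amount": 2.33}, 2: {"amount": 1.22}}

# rows the source-side trigger recorded since the last run
staging = [
    ("update", 2, {"amount": 1.50}),
    ("insert", 3, {"amount": 0.20}),
    ("delete", 1, None),
]

def merge(target, staging):
    """MERGE the staged changes into the target table."""
    for op, pk, row in staging:
        if op == "delete":
            target.pop(pk, None)
        else:                 # insert or update both become an upsert
            target[pk] = row
    return target

merge(target, staging)
```

Because the staging table only holds rows that changed since the last run, each merge is small, which is how billions of rows stay within 15 minutes of the source.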
Where to Get Vertica and the Schlep Tools

Vertica has a FREE version that is quite powerful. It is limited to one node and
maximum 1 terabyte of data, but this can be very useful.

Weitere ähnliche Inhalte

Was ist angesagt?

Oracle 11g data warehouse introdution
Oracle 11g data warehouse introdutionOracle 11g data warehouse introdution
Oracle 11g data warehouse introdutionAditya Trivedi
 
Oracle Advanced Analytics
Oracle Advanced AnalyticsOracle Advanced Analytics
Oracle Advanced Analyticsaghosh_us
 
Overview of running R in the Oracle Database
Overview of running R in the Oracle DatabaseOverview of running R in the Oracle Database
Overview of running R in the Oracle DatabaseBrendan Tierney
 
Apache MetaModel - unified access to all your data points
Apache MetaModel - unified access to all your data pointsApache MetaModel - unified access to all your data points
Apache MetaModel - unified access to all your data pointsKasper Sørensen
 
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Simplilearn
 
The IBM Netezza datawarehouse appliance
The IBM Netezza datawarehouse applianceThe IBM Netezza datawarehouse appliance
The IBM Netezza datawarehouse applianceIBM Danmark
 

Was ist angesagt? (8)

Oracle 11g data warehouse introdution
Oracle 11g data warehouse introdutionOracle 11g data warehouse introdution
Oracle 11g data warehouse introdution
 
Oracle: DW Design
Oracle: DW DesignOracle: DW Design
Oracle: DW Design
 
Oracle Advanced Analytics
Oracle Advanced AnalyticsOracle Advanced Analytics
Oracle Advanced Analytics
 
Solr -
Solr - Solr -
Solr -
 
Overview of running R in the Oracle Database
Overview of running R in the Oracle DatabaseOverview of running R in the Oracle Database
Overview of running R in the Oracle Database
 
Apache MetaModel - unified access to all your data points
Apache MetaModel - unified access to all your data pointsApache MetaModel - unified access to all your data points
Apache MetaModel - unified access to all your data points
 
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
 
The IBM Netezza datawarehouse appliance
The IBM Netezza datawarehouse applianceThe IBM Netezza datawarehouse appliance
The IBM Netezza datawarehouse appliance
 

Ähnlich wie Growing Data Analytics at Etsy (Cristopher Bohn)

IBM Cloud Native Day April 2021: Serverless Data Lake
IBM Cloud Native Day April 2021: Serverless Data LakeIBM Cloud Native Day April 2021: Serverless Data Lake
IBM Cloud Native Day April 2021: Serverless Data LakeTorsten Steinbach
 
Databases in the Cloud - DevDay Austin 2017 Day 2
Databases in the Cloud - DevDay Austin 2017 Day 2Databases in the Cloud - DevDay Austin 2017 Day 2
Databases in the Cloud - DevDay Austin 2017 Day 2Amazon Web Services
 
SQLCAT: Tier-1 BI in the World of Big Data
SQLCAT: Tier-1 BI in the World of Big DataSQLCAT: Tier-1 BI in the World of Big Data
SQLCAT: Tier-1 BI in the World of Big DataDenny Lee
 
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2Amazon Web Services
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFAmazon Web Services
 
ArcReady - Architecting For The Cloud
ArcReady - Architecting For The CloudArcReady - Architecting For The Cloud
ArcReady - Architecting For The CloudMicrosoft ArcReady
 
the Data World Distilled
the Data World Distilledthe Data World Distilled
the Data World DistilledRTTS
 
The Adventure: BlackRay as a Storage Engine
The Adventure: BlackRay as a Storage EngineThe Adventure: BlackRay as a Storage Engine
The Adventure: BlackRay as a Storage Enginefschupp
 
Big Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileBig Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileRoy Kim
 
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)Amazon Web Services Korea
 
MySQL And Search At Craigslist
MySQL And Search At CraigslistMySQL And Search At Craigslist
MySQL And Search At CraigslistJeremy Zawodny
 
Designing big data analytics solutions on azure
Designing big data analytics solutions on azureDesigning big data analytics solutions on azure
Designing big data analytics solutions on azureMohamed Tawfik
 
Engineering practices in big data storage and processing
Engineering practices in big data storage and processingEngineering practices in big data storage and processing
Engineering practices in big data storage and processingSchubert Zhang
 
SPL_ALL_EN.pptx
SPL_ALL_EN.pptxSPL_ALL_EN.pptx
SPL_ALL_EN.pptx政宏 张
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore IndexSolidQ
 
Why databases cry at night
Why databases cry at nightWhy databases cry at night
Why databases cry at nightMichael Yarichuk
 
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...Amazon Web Services
 
RedisConf17 - IoT Backend with Redis and Node.js
RedisConf17 - IoT Backend with Redis and Node.jsRedisConf17 - IoT Backend with Redis and Node.js
RedisConf17 - IoT Backend with Redis and Node.jsRedis Labs
 

Ähnlich wie Growing Data Analytics at Etsy (Cristopher Bohn) (20)

Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
IBM Cloud Native Day April 2021: Serverless Data Lake
IBM Cloud Native Day April 2021: Serverless Data LakeIBM Cloud Native Day April 2021: Serverless Data Lake
IBM Cloud Native Day April 2021: Serverless Data Lake
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
Databases in the Cloud - DevDay Austin 2017 Day 2
Databases in the Cloud - DevDay Austin 2017 Day 2Databases in the Cloud - DevDay Austin 2017 Day 2
Databases in the Cloud - DevDay Austin 2017 Day 2
 
SQLCAT: Tier-1 BI in the World of Big Data
SQLCAT: Tier-1 BI in the World of Big DataSQLCAT: Tier-1 BI in the World of Big Data
SQLCAT: Tier-1 BI in the World of Big Data
 
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
ArcReady - Architecting For The Cloud
ArcReady - Architecting For The CloudArcReady - Architecting For The Cloud
ArcReady - Architecting For The Cloud
 
the Data World Distilled
the Data World Distilledthe Data World Distilled
the Data World Distilled
 
The Adventure: BlackRay as a Storage Engine
The Adventure: BlackRay as a Storage EngineThe Adventure: BlackRay as a Storage Engine
The Adventure: BlackRay as a Storage Engine
 
Big Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileBig Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI Mobile
 
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
 
MySQL And Search At Craigslist
MySQL And Search At CraigslistMySQL And Search At Craigslist
MySQL And Search At Craigslist
 
Designing big data analytics solutions on azure
Designing big data analytics solutions on azureDesigning big data analytics solutions on azure
Designing big data analytics solutions on azure
 
Engineering practices in big data storage and processing
Engineering practices in big data storage and processingEngineering practices in big data storage and processing
Engineering practices in big data storage and processing
 
SPL_ALL_EN.pptx
SPL_ALL_EN.pptxSPL_ALL_EN.pptx
SPL_ALL_EN.pptx
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore Index
 
Why databases cry at night
Why databases cry at nightWhy databases cry at night
Why databases cry at night
 
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
Serverless Analytics with Amazon Redshift Spectrum, AWS Glue, and Amazon Quic...
 
RedisConf17 - IoT Backend with Redis and Node.js
RedisConf17 - IoT Backend with Redis and Node.jsRedisConf17 - IoT Backend with Redis and Node.js
RedisConf17 - IoT Backend with Redis and Node.js
 

Mehr von Ontico

One-cloud — система управления дата-центром в Одноклассниках / Олег Анастасье...
One-cloud — система управления дата-центром в Одноклассниках / Олег Анастасье...One-cloud — система управления дата-центром в Одноклассниках / Олег Анастасье...
One-cloud — система управления дата-центром в Одноклассниках / Олег Анастасье...Ontico
 
Масштабируя DNS / Артем Гавриченков (Qrator Labs)
Масштабируя DNS / Артем Гавриченков (Qrator Labs)Масштабируя DNS / Артем Гавриченков (Qrator Labs)
Масштабируя DNS / Артем Гавриченков (Qrator Labs)Ontico
 
Создание BigData-платформы для ФГУП Почта России / Андрей Бащенко (Luxoft)
Создание BigData-платформы для ФГУП Почта России / Андрей Бащенко (Luxoft)Создание BigData-платформы для ФГУП Почта России / Андрей Бащенко (Luxoft)
Создание BigData-платформы для ФГУП Почта России / Андрей Бащенко (Luxoft)Ontico
 
Готовим тестовое окружение, или сколько тестовых инстансов вам нужно / Алекса...
Готовим тестовое окружение, или сколько тестовых инстансов вам нужно / Алекса...Готовим тестовое окружение, или сколько тестовых инстансов вам нужно / Алекса...
Готовим тестовое окружение, или сколько тестовых инстансов вам нужно / Алекса...Ontico
 
Новые технологии репликации данных в PostgreSQL / Александр Алексеев (Postgre...
Новые технологии репликации данных в PostgreSQL / Александр Алексеев (Postgre...Новые технологии репликации данных в PostgreSQL / Александр Алексеев (Postgre...
Новые технологии репликации данных в PostgreSQL / Александр Алексеев (Postgre...Ontico
 
PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)
PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)
PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)Ontico
 
Inexpensive Datamasking for MySQL with ProxySQL — Data Anonymization for Deve...
Inexpensive Datamasking for MySQL with ProxySQL — Data Anonymization for Deve...Inexpensive Datamasking for MySQL with ProxySQL — Data Anonymization for Deve...
Inexpensive Datamasking for MySQL with ProxySQL — Data Anonymization for Deve...Ontico
 
Опыт разработки модуля межсетевого экранирования для MySQL / Олег Брославский...
Опыт разработки модуля межсетевого экранирования для MySQL / Олег Брославский...Опыт разработки модуля межсетевого экранирования для MySQL / Олег Брославский...
Опыт разработки модуля межсетевого экранирования для MySQL / Олег Брославский...Ontico
 
ProxySQL Use Case Scenarios / Alkin Tezuysal (Percona)
ProxySQL Use Case Scenarios / Alkin Tezuysal (Percona)ProxySQL Use Case Scenarios / Alkin Tezuysal (Percona)
ProxySQL Use Case Scenarios / Alkin Tezuysal (Percona)Ontico
 
MySQL Replication — Advanced Features / Петр Зайцев (Percona)
MySQL Replication — Advanced Features / Петр Зайцев (Percona)MySQL Replication — Advanced Features / Петр Зайцев (Percona)
MySQL Replication — Advanced Features / Петр Зайцев (Percona)Ontico
 
Внутренний open-source. Как разрабатывать мобильное приложение большим количе...
Внутренний open-source. Как разрабатывать мобильное приложение большим количе...Внутренний open-source. Как разрабатывать мобильное приложение большим количе...
Внутренний open-source. Как разрабатывать мобильное приложение большим количе...Ontico
 
Подробно о том, как Causal Consistency реализовано в MongoDB / Михаил Тюленев...
Подробно о том, как Causal Consistency реализовано в MongoDB / Михаил Тюленев...Подробно о том, как Causal Consistency реализовано в MongoDB / Михаил Тюленев...
Подробно о том, как Causal Consistency реализовано в MongoDB / Михаил Тюленев...Ontico
 
Балансировка на скорости проводов. Без ASIC, без ограничений. Решения NFWare ...
Балансировка на скорости проводов. Без ASIC, без ограничений. Решения NFWare ...Балансировка на скорости проводов. Без ASIC, без ограничений. Решения NFWare ...
Балансировка на скорости проводов. Без ASIC, без ограничений. Решения NFWare ...Ontico
 
Перехват трафика — мифы и реальность / Евгений Усков (Qrator Labs)
Перехват трафика — мифы и реальность / Евгений Усков (Qrator Labs)Перехват трафика — мифы и реальность / Евгений Усков (Qrator Labs)
Перехват трафика — мифы и реальность / Евгений Усков (Qrator Labs)Ontico
 
И тогда наверняка вдруг запляшут облака! / Алексей Сушков (ПЕТЕР-СЕРВИС)
И тогда наверняка вдруг запляшут облака! / Алексей Сушков (ПЕТЕР-СЕРВИС)И тогда наверняка вдруг запляшут облака! / Алексей Сушков (ПЕТЕР-СЕРВИС)
И тогда наверняка вдруг запляшут облака! / Алексей Сушков (ПЕТЕР-СЕРВИС)Ontico
 
Как мы заставили Druid работать в Одноклассниках / Юрий Невиницин (OK.RU)
Как мы заставили Druid работать в Одноклассниках / Юрий Невиницин (OK.RU)Как мы заставили Druid работать в Одноклассниках / Юрий Невиницин (OK.RU)
Как мы заставили Druid работать в Одноклассниках / Юрий Невиницин (OK.RU)Ontico
 
Разгоняем ASP.NET Core / Илья Вербицкий (WebStoating s.r.o.)
Разгоняем ASP.NET Core / Илья Вербицкий (WebStoating s.r.o.)Разгоняем ASP.NET Core / Илья Вербицкий (WebStoating s.r.o.)
Разгоняем ASP.NET Core / Илья Вербицкий (WebStoating s.r.o.)Ontico
 
100500 способов кэширования в Oracle Database или как достичь максимальной ск...
100500 способов кэширования в Oracle Database или как достичь максимальной ск...100500 способов кэширования в Oracle Database или как достичь максимальной ск...
100500 способов кэширования в Oracle Database или как достичь максимальной ск...Ontico
 
Apache Ignite Persistence: зачем Persistence для In-Memory, и как он работает...
Apache Ignite Persistence: зачем Persistence для In-Memory, и как он работает...Apache Ignite Persistence: зачем Persistence для In-Memory, и как он работает...
Apache Ignite Persistence: зачем Persistence для In-Memory, и как он работает...Ontico
 
Механизмы мониторинга баз данных: взгляд изнутри / Дмитрий Еманов (Firebird P...
Механизмы мониторинга баз данных: взгляд изнутри / Дмитрий Еманов (Firebird P...Механизмы мониторинга баз данных: взгляд изнутри / Дмитрий Еманов (Firebird P...
Механизмы мониторинга баз данных: взгляд изнутри / Дмитрий Еманов (Firebird P...Ontico
 

Mehr von Ontico (20)

One-cloud — система управления дата-центром в Одноклассниках / Олег Анастасье...
One-cloud — система управления дата-центром в Одноклассниках / Олег Анастасье...One-cloud — система управления дата-центром в Одноклассниках / Олег Анастасье...
One-cloud — система управления дата-центром в Одноклассниках / Олег Анастасье...
 
Масштабируя DNS / Артем Гавриченков (Qrator Labs)
Масштабируя DNS / Артем Гавриченков (Qrator Labs)Масштабируя DNS / Артем Гавриченков (Qrator Labs)
Масштабируя DNS / Артем Гавриченков (Qrator Labs)
 
Создание BigData-платформы для ФГУП Почта России / Андрей Бащенко (Luxoft)
Создание BigData-платформы для ФГУП Почта России / Андрей Бащенко (Luxoft)Создание BigData-платформы для ФГУП Почта России / Андрей Бащенко (Luxoft)
Создание BigData-платформы для ФГУП Почта России / Андрей Бащенко (Luxoft)
 
Готовим тестовое окружение, или сколько тестовых инстансов вам нужно / Алекса...
Готовим тестовое окружение, или сколько тестовых инстансов вам нужно / Алекса...Готовим тестовое окружение, или сколько тестовых инстансов вам нужно / Алекса...
Готовим тестовое окружение, или сколько тестовых инстансов вам нужно / Алекса...
 
Новые технологии репликации данных в PostgreSQL / Александр Алексеев (Postgre...
Новые технологии репликации данных в PostgreSQL / Александр Алексеев (Postgre...Новые технологии репликации данных в PostgreSQL / Александр Алексеев (Postgre...
Новые технологии репликации данных в PostgreSQL / Александр Алексеев (Postgre...
 
PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)
PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)
PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)
 
Inexpensive Datamasking for MySQL with ProxySQL — Data Anonymization for Deve...
Inexpensive Datamasking for MySQL with ProxySQL — Data Anonymization for Deve...Inexpensive Datamasking for MySQL with ProxySQL — Data Anonymization for Deve...
Inexpensive Datamasking for MySQL with ProxySQL — Data Anonymization for Deve...
 
Опыт разработки модуля межсетевого экранирования для MySQL / Олег Брославский...
Опыт разработки модуля межсетевого экранирования для MySQL / Олег Брославский...Опыт разработки модуля межсетевого экранирования для MySQL / Олег Брославский...

Growing Data Analytics at Etsy (Christopher Bohn)

  • 1. Growing Data Analytics at Etsy Chris Bohn (“CB”)
  • 2.
  • 3.
  • 4. Fort Ross (Rossia) – Established 1812 as a Tsarist outpost
  • 5. Fort Ross – Close to Russian River – City of Sebastopol is nearby
  • 6. Russian Hill, San Francisco
  • 7. Big Data Analytics At Etsy.com
    – History of Etsy
    – Data architecture of Etsy through the years
    – Need for Data Analytics
    – Growth of “Big Data” service needs
    – Hadoop vs. other solutions
    – Vertica
    – Schlep and Autoschlep tools for data replication
  • 8. About Etsy
    – Founded in 2005 in New York City by three NYU students
    – Etsy is the leading marketplace for handcrafted goods
    – More than 20 million users and one million sellers
    – Etsy will have sales of $1 billion this year
    – 400 employees, 200 engineers
    – 50 remote employees all over the world
    – Now has mobile and iPad applications
  • 9. Etsy Stack
    – PHP front end (Rasmus Lerdorf, creator of PHP, is on staff)
    – Originally had PHP --> Python (Twisted framework) middle layer --> PostgreSQL
    – Now PHP --> ORM (written in PHP) --> MySQL/PostgreSQL
  • 10. Original Etsy Stack
    – PHP front end servers
    – Middle layer written in Python bound Python functions to PostgreSQL stored procedures; business logic lived in stored procedures and views
    – Started with one master PostgreSQL database (users, listings, sales, forums, conversations, “showcases”)
    – When business increased, database components were physically separated onto their own PostgreSQL servers:
      – Master (users, listings, transactions)
      – Forums (community message boards)
      – Convos (messaging between buyers and sellers)
      – Showcases (advertising slots purchased by sellers in various categories)
    – Listing view counts were kept in memory – when the system had to be restarted, sellers lost the view counts for their listings! Solved by creating another PostgreSQL server just for recording listing view counts
    – Search was originally accomplished with full text indexing on the PostgreSQL master database. It did not scale, so SOLR search (inverted index lookup) was introduced in 2009
  • 11. Problems with the original Etsy stack
    – Data architecture did not scale – one master database is not a scalable architecture
    – Stored procedures contained a lot of business logic, which resulted in higher database CPU load than necessary
    – Difficult to hire engineers who could code both the PHP front end and stored procedure logic in PostgreSQL – thus business logic was hard to change and changes were difficult to deploy
    – Even harder to find engineers who could code the Twisted framework in Python
    – The Python middle layer was a “stored procedure routing system” (called “sprouter”) that bound PostgreSQL stored procedures and views to Python function calls
    – It worked at the OID (Object ID) level – it bound PG objects to Python objects by their OIDs
    – When PG objects were recreated, they acquired a new OID, and the sprouter binding would break
    – Any change required a complete reboot of the sprouter layer to acquire new OIDs
    This was the “Version 2” architecture, which replaced a previous one that was even worse. The development of Version 2 took 1.5 years and almost killed the company! WATCH OUT FOR VERSION 2!!!
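The OID-binding failure mode described on this slide can be illustrated with a toy Python simulation. This is not Etsy's sprouter code; the catalog structure and procedure name are invented for illustration:

```python
# Toy simulation of the "sprouter" OID-binding failure mode: binding by object
# ID breaks as soon as a database object is dropped and recreated, because
# PostgreSQL assigns the recreated object a fresh OID.

next_oid = 1000
catalog = {}  # oid -> stored procedure name


def create_procedure(name):
    """Creating (or recreating) a procedure always assigns a fresh OID."""
    global next_oid
    next_oid += 1
    catalog[next_oid] = name
    return next_oid


def call_bound(oid):
    """Sprouter-style call: resolve the procedure by the OID captured at bind time."""
    if oid not in catalog:
        raise LookupError("binding broken: OID %d no longer exists" % oid)
    return "called %s" % catalog[oid]


# Bind a Python function to the stored procedure's current OID.
oid = create_procedure("get_user")
print(call_bound(oid))  # the binding works

# A deploy drops and recreates the procedure -> new OID, stale binding.
del catalog[oid]
create_procedure("get_user")

try:
    call_bound(oid)
except LookupError as e:
    print(e)  # the old OID is gone; sprouter would need a full reboot
```

This is why any schema change forced a complete sprouter restart: only a fresh bind pass could pick up the new OIDs.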
  • 12. The NEW Etsy Stack and Architecture
    – Started with key senior hires from Flickr who had scaled that site with sharding
    – Redesigned data architecture:
      – Remove business logic – no more stored procedures
      – Database now just for “dumb storage” of facts and dimensions
      – Get rid of the sprouter layer
      – Create a new PHP business logic layer, called EtsyORM
      – Use better PHP templating
      – AJAX interactive client-side logic
    – Generally replace PostgreSQL with MySQL
      – Because the Flickr people were more familiar with it
      – Sharded MySQL databases
      – Plays well with PHP (part of the LAMP stack, mature technology)
    – Denormalized data; the PostgreSQL data was fairly normalized to reduce data footprint
      – Shard along lines that make sense
      – Use a universal dedicated ticket server to issue unique sequence values
      – Keep related data as close together as possible; closeness of data is more important than data footprint. Redundancy is OK!
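The ticket-server-plus-sharding scheme above can be sketched in a few lines of Python. The odd/even two-server arrangement and the `shard_for` routing rule are assumptions based on the Flickr-style design the slide alludes to, not Etsy's actual implementation:

```python
# Sketch of a Flickr-style ticket server scheme: two servers hand out globally
# unique IDs (one issues odd numbers, the other even, so neither is a single
# point of failure), and each record lands on a shard chosen from its owner's
# user_id, keeping all of a user's data together.

NUM_SHARDS = 4  # illustrative shard count


class TicketServer:
    """Issues an ever-increasing sequence with a fixed stride and offset."""

    def __init__(self, offset, stride=2):
        self.current = offset
        self.stride = stride

    def next_id(self):
        ticket = self.current
        self.current += self.stride
        return ticket


odd_server = TicketServer(offset=1)   # issues 1, 3, 5, ...
even_server = TicketServer(offset=2)  # issues 2, 4, 6, ...


def shard_for(user_id):
    """Route all of a user's records to the same shard."""
    return user_id % NUM_SHARDS


listing_id = odd_server.next_id()
print(listing_id, shard_for(user_id=101))
```

Because IDs come from a dedicated service rather than per-shard auto-increment columns, they stay unique across every shard, and the modulo rule keeps related rows co-located.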
  • 13. [Architecture diagram: client requests --> EtsyORM --> MySQL ticket server, lookup index, PostgreSQL master, and MySQL shards]
  • 14. New Data Architecture – Good and Not So Good
    Good:
    – Sharded data is great for single record lookup
    – Load is spread out over many servers
    – Lookup is fast
    – Scaling data capability is easy because it is horizontal – just add more servers
    – No business logic in the database; all logic instead is in the ORM layer
    Not so good:
    – Sharded data, while good for single record lookup, is not good for aggregation
    – Data is now spread out over several machines instead of concentrated in one
    – Data has to be periodically “rebalanced” across shards
    – Querying aggregated data is now harder – you can't issue simple SQL commands; the data has to be mapped and reduced
    – Writing the queries is now an engineer's job because queries need to go through the ORM
    – It is hard to do analysis on the business, because that requires a lot of aggregated data
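The aggregation pain point can be made concrete: a query that is a single `SELECT SUM(amount)` on one database becomes a fan-out over every shard. A minimal map/reduce sketch with made-up shard contents:

```python
# Why aggregation is harder on shards: each shard can only answer for its own
# rows (map), and the caller has to combine the partial results (reduce).
# The shard contents below are invented illustration data.

shards = [
    [{"user_id": 101, "amount": 2.33}, {"user_id": 56, "amount": 0.20}],
    [{"user_id": 23, "amount": 0.20}],
    [{"user_id": 128, "amount": 3.56}, {"user_id": 101, "amount": 1.22}],
]

# Map: compute a per-shard partial sum.
partials = [sum(row["amount"] for row in shard) for shard in shards]

# Reduce: combine the partials into the global aggregate.
total = sum(partials)
print(round(total, 2))  # -> 7.51
```

On a single database this is one line of SQL an analyst can run; on shards it is application code an engineer has to write and maintain, which is exactly the gap the BI server tries to fill.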
  • 15. [Architecture diagram, repeated: client requests --> EtsyORM --> MySQL ticket server, lookup index, PostgreSQL master, and MySQL shards]
  • 16. Business Intelligence (BI) Data Architecture [Diagram: PostgreSQL master and MySQL shards feeding the PostgreSQL BI database]
  • 17. Current BI Architecture Problems
    Problems with the BI server:
    – The BI server has become like another master database, but with data from all sources including the shards
    – The BI server is very overloaded with data replication tasks and report queries
    – The BI server is PostgreSQL and not well suited for aggregation and analytics
    – The BI server is often very slow
    Problems with Hadoop:
    – Hadoop is batch oriented – not good for “ad hoc” queries
    – Programming Hadoop jobs is tedious and often takes a lot of trial and error to get right
    – Creating Hadoop jobs is a specialized programming task that business analysts are not able to perform
  • 18. Etsy's solution to the BI analytics problem: Vertica
    Etsy licensed Vertica to be its new BI server platform
    – Vertica is a licensed product, and was bought by Hewlett-Packard last year
    – It was designed at MIT by Prof. Michael Stonebraker, who is known as the “Father of Postgres”
    – Vertica shares many internals with PostgreSQL; the SQL parser and the vsql command line client are derived from PostgreSQL
    – Vertica is a “columnar store” database which is optimized for data analytics – it excels at data aggregation
    – Vertica is licensed, but there is a free version (one node, 1 TB of data) which is very useful
    – Vertica is a multiple-peer architecture; a typical installation has several nodes, each equal to the others. It “shards” out of the box and handles all distribution of data
    – Vertica puts a copy of small tables on each node, but segments large tables across nodes with an internal hash algorithm. It does this seamlessly, so it is very easy to set up and manage
    – It has a very rich SQL dialect, with good analytic queries and such things as windowing. Most queries that run on the PostgreSQL BI server run unchanged on Vertica.
  • 19. What is a Columnar Store database?
    – A traditional relational database stores rows, and those rows are indexed for fast retrieval of records; a columnar store instead stores ordered columns, not rows
    – Vertica has no traditional indexes, although it has primary key and foreign key constraints; it uses encoding (preferably run-length encoding, “RLE”)
    Example rows (a relational database stores them row by row; Vertica stores each column separately, sorted and encoded):
      id | user_id | charge_type | amount
      1  | 101     | sales fee   | 2.33
      2  | 101     | sales fee   | 1.22
      3  | 56      | listing fee | 0.20
      4  | 23      | listing fee | 0.20
      5  | 128     | sales fee   | 3.56
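A minimal sketch of the run-length encoding the slide mentions, using the `charge_type` column from the example above (the function is illustrative, not Vertica's implementation):

```python
# Run-length encoding (RLE) of a sorted column: a sorted, low-cardinality
# column collapses into (value, run_length) pairs, so a columnar engine can
# aggregate over runs instead of touching every row.

def rle_encode(column):
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([value, 1])  # start a new run
    return [tuple(run) for run in runs]


# The charge_type column from the example, stored sorted.
charge_type = ["listing fee", "listing fee", "sales fee", "sales fee", "sales fee"]
print(rle_encode(charge_type))  # -> [('listing fee', 2), ('sales fee', 3)]
```

Five values compress to two runs here; on a real fee table with millions of rows and a handful of charge types, the same column compresses by orders of magnitude, which is a large part of why columnar stores aggregate so fast.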
  • 20. Getting data to Vertica was a problem, but Etsy wrote a solution
    – There are no ETL (Extract, Transform, Load) tools for Vertica, except some support for moving data from HDFS
    – No ETL for getting data from relational databases over to Vertica
    – Etsy had the requirement that all data from the MySQL shards and PostgreSQL databases get into Vertica for it to be useful to the business analysts
    – Etsy created two tools, schlep and autoschlep, to accomplish ETL from relational databases, and we are going to open source them
  • 21. About schlep
    Schlep: Yiddish word meaning “to carry a heavy load a long distance”
    We built schlep into Vertica as a SQL function, so that it is easy for the analysts to use. Schlep is overloaded and has 5 variants to allow additional options. It is simple to use:
    > SELECT * FROM schlep(user, 'table_name');
    This will move the table from the BI PostgreSQL database into Vertica. It does the following:
    – Connects to the PostgreSQL BI database and obtains the DDL for the table
    – Maps the data types to Vertica types
    – Creates the table with correct permissions on Vertica
    – Copies the data over to Vertica by creating a psql COPY process and piping that into a vsql (Vertica) COPY process
    – Is very fast because Vertica does not check constraints by default when data is loaded
    – Schlep is a “one shot” static snapshot of the data; once copied to Vertica, there is no further update
    – Works with PostgreSQL right now; MySQL replication releases in early November
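The COPY-to-COPY pipe step can be sketched as shell command construction. The exact `psql`/`vsql` invocations and the `build_copy_pipeline` helper below are assumptions for illustration; schlep's real plumbing is not shown in the slides:

```python
# Hedged sketch of the copy step schlep performs: stream a table out of
# PostgreSQL with COPY ... TO STDOUT and pipe it straight into Vertica's vsql
# with COPY ... FROM STDIN, so no intermediate file ever hits disk.
# The connection arguments here are placeholders.

import shlex


def build_copy_pipeline(table, pg_dsn="bi_db", vertica_db="analytics"):
    """Build the shell pipeline that would stream `table` from Postgres to Vertica."""
    dump = "psql {} -c {}".format(
        pg_dsn, shlex.quote("COPY {} TO STDOUT".format(table))
    )
    load = "vsql -d {} -c {}".format(
        vertica_db, shlex.quote("COPY {} FROM STDIN".format(table))
    )
    return "{} | {}".format(dump, load)


print(build_copy_pipeline("charges"))
```

Streaming through a pipe is what makes the copy fast: PostgreSQL's text COPY format feeds Vertica's bulk loader directly, and Vertica skips constraint checking during load by default.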
  • 22. About autoschlep
    Autoschlep is a system that allows incremental replication (trickle load) of data from a source to Vertica.
    – Currently works with PostgreSQL; MySQL coming soon
    – Works by putting an “after” trigger on the source data table. Any CRUD operation (create, update, delete) is recorded by the trigger in a staging table on the source database
    – The autoschlep process is scheduled by cron, the autoschlep scheduler, or whatever scheduling system you choose
    – Autoschlep then uses schlep to move the data from the staging table on the source database into an identical staging table on Vertica. It then does a MERGE of that data into the target table
    – Etsy has used schlep and autoschlep to move billions of records to Vertica and keep it synchronized within 15 minutes of the source data
    Autoschlep is called this way:
    /autoschlep.py schema_name table_name primary_key
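The trigger, staging table, and merge steps can be simulated end to end with SQLite, which also supports AFTER triggers. All table names here are invented, and SQLite's `INSERT OR REPLACE` stands in for Vertica's MERGE; the real tool runs against PostgreSQL and Vertica:

```python
# Simulation of the autoschlep flow: an AFTER trigger copies every change into
# a staging table, and a periodic job merges staged rows into the target and
# clears the staging table. SQLite stands in for both source and destination.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE listings (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE listings_staging (id INTEGER, title TEXT);
    CREATE TABLE vertica_listings (id INTEGER PRIMARY KEY, title TEXT);

    -- The "after" trigger: record each change in the staging table.
    CREATE TRIGGER listings_audit AFTER INSERT ON listings
    BEGIN
        INSERT INTO listings_staging VALUES (NEW.id, NEW.title);
    END;
""")

db.execute("INSERT INTO listings VALUES (1, 'hand-knit scarf')")
db.execute("INSERT INTO listings VALUES (2, 'ceramic mug')")

# The scheduled job: merge staged rows into the target, then clear staging.
# (Vertica would use MERGE; INSERT OR REPLACE is the SQLite stand-in.)
db.execute("INSERT OR REPLACE INTO vertica_listings SELECT id, title FROM listings_staging")
db.execute("DELETE FROM listings_staging")

print(db.execute("SELECT COUNT(*) FROM vertica_listings").fetchone()[0])  # -> 2
```

Because only changed rows travel on each run, the destination stays within one scheduling interval of the source, which is how Etsy keeps Vertica within 15 minutes of production.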
  • 23. Where to Get Vertica and the Schlep Tools
    Vertica has a FREE version that is quite powerful. It is limited to one node and a maximum of 1 terabyte of data, but it can be very useful.