Hive and Hadoop Usage at Square

Modern Data Stack France

Hadoop and Hive at Square
Nicolas Thiébaud
nicothieb@
nicolas@squareup.com
Data Engineering at Square
July 2014

Square: Make commerce easy
Remove crappy POSes from the counter
Building the best register for small businesses. Started with card processing and bringing more value to merchants using the point of sale.

Merchant- and buyer-facing products
Square Register, Square Cash, Pickup, Feedback

Data products
Merchant Analytics, Capital

Data at Square

Internal Data
Produced on app servers (~200+ services), MySQL or PostgreSQL
Logging and tracing from apps and web to public endpoint
Example: payment data, user data, ledger entries

External Data
Payment processing partners ship flat files to us

Offline Data usage at Square
BI/Analysis/Reporting: ~200 MySQL users, ~100 Hadoop users
ML: Risk detection, recommendation
Apps: A/B testing, Commercial support, Capital

Data Architecture at Square: Kafka

Historical, most of our users still use this
App DB -> Analytical DB stripping out PII, cursoring, looking at binlog replication

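The App DB -> Analytical DB path can be sketched roughly as follows. This assumes "cursoring" means reading rows incrementally past a saved position (here, the last seen `id`), which the slides do not spell out. The `payments` table, its columns, and the `PII_COLUMNS` set are hypothetical, and an in-memory SQLite database stands in for the app DB.

```python
import sqlite3

# Hypothetical column names; the slides do not list which fields count as PII.
PII_COLUMNS = {"email", "card_number"}


def strip_pii(row):
    """Return a copy of the row without the columns flagged as PII."""
    return {k: v for k, v in row.items() if k not in PII_COLUMNS}


def read_batch(conn, cursor_id, limit=2):
    """Fetch rows whose id is greater than the saved cursor, oldest first.

    Returns the PII-free rows and the new cursor position.
    """
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        "SELECT * FROM payments WHERE id > ? ORDER BY id LIMIT ?",
        (cursor_id, limit),
    ).fetchall()
    cleaned = [strip_pii(dict(r)) for r in rows]
    new_cursor = rows[-1]["id"] if rows else cursor_id
    return cleaned, new_cursor


def demo():
    """Export a small hypothetical table in batches, advancing the cursor."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payments ("
        "id INTEGER PRIMARY KEY, amount INTEGER, email TEXT, card_number TEXT)"
    )
    conn.executemany(
        "INSERT INTO payments (amount, email, card_number) VALUES (?, ?, ?)",
        [
            (100, "a@example.com", "4111"),
            (250, "b@example.com", "4222"),
            (75, "c@example.com", "4333"),
        ],
    )
    cursor_id, exported = 0, []
    while True:
        batch, cursor_id = read_batch(conn, cursor_id)
        if not batch:
            break
        exported.extend(batch)
    return exported, cursor_id
```

Persisting the cursor between runs is what would let an export resume where it stopped; the sketch keeps it in a local variable only.
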
Hadoop: Kafka as a backbone
App DB -> Kafka using cursoring and PII stripping
App Server -> Kafka (e.g. tracing) in proto format
Feed consumption -> Kafka

Kafka is written to HDFS using offsets; dupes are written when the consumer restarts

Raw data is deduped and extracted from protos to RCFiles in daily batches. Everything is exposed in Hive
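A minimal sketch of the dedupe step, assuming each raw record landed in HDFS still carries its Kafka coordinates (topic, partition, offset). The record layout is hypothetical; the slides only say that offsets are used and that consumer restarts produce duplicates.

```python
def dedupe(records):
    """Keep the first copy of each (topic, partition, offset) triple.

    A restarted consumer re-reads from its last saved offset, so the same
    message can appear twice in the raw data with identical coordinates.
    """
    seen = set()
    unique = []
    for rec in records:
        key = (rec["topic"], rec["partition"], rec["offset"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(rec)
    return unique


# Hypothetical raw batch: offset 11 was written twice after a restart.
RAW_BATCH = [
    {"topic": "payments", "partition": 0, "offset": 10, "payload": "a"},
    {"topic": "payments", "partition": 0, "offset": 11, "payload": "b"},
    {"topic": "payments", "partition": 0, "offset": 11, "payload": "b"},
    {"topic": "payments", "partition": 1, "offset": 11, "payload": "c"},
]
```

Calling `dedupe(RAW_BATCH)` keeps three records: the repeated offset 11 on partition 0 is dropped, while offset 11 on partition 1 is a different message and stays.
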

Transitioning to Hive

Most datasets don’t fit in MySQL. Most queries cannot run anymore
Analysts broke down their jobs to run on single-day windows. The query sniper keeps hitting them.

MySQL no longer supported as source of truth for offline data. Tables are windowed
We keep revisiting the amount of data stored in MySQL

Everyone must migrate to Hive (users and apps)
MySQL Analytical DBs will now be an export location for data reduced in Hadoop

All datasets must be present in Hadoop
Even small ones :)
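Breaking a job into single-day windows, as the analysts did on MySQL, might look like the sketch below. The table name, the `created_at` column, and the query template are hypothetical; only the one-day windowing comes from the slides.

```python
from datetime import date, timedelta


def day_windows(start, end):
    """Yield (window_start, window_end) pairs, one per day, end exclusive."""
    current = start
    while current < end:
        nxt = current + timedelta(days=1)
        yield current, nxt
        current = nxt


def windowed_queries(table, start, end):
    """Build one query per day so that each stays small enough to finish."""
    template = (
        "SELECT COUNT(*) FROM {t} "
        "WHERE created_at >= '{a}' AND created_at < '{b}'"
    )
    return [
        template.format(t=table, a=a.isoformat(), b=b.isoformat())
        for a, b in day_windows(start, end)
    ]
```

For example, `windowed_queries("payments", date(2014, 7, 1), date(2014, 7, 4))` produces three queries, one each for July 1, 2 and 3.
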

Transitioning to Hive
Stability
Hive 10 + Hue 2.5 as starting point + many patches -> 2 restarts a day with small load
Decided to go to Hive 12 and patch the bugs affecting us in an internal build
Two major tasks: 10 -> 12 and building Hive internally

Reliability
Sentinel, data validation daemon
Conduit, Hive ETLs
Customer-defined SLAs

Education
Office hours, trainings, mailing list

Project Babar: Building a stable Hive 12
Patch open source Hive to address Square-specific issues
Set up integration tests in Kochiku, no performance tests
HiveServer only, no CLI. Staging and production envs
Push and pull changes to Apache JIRA

Build and deploy Hive artifacts
Makefile
metastore, hiveserver (staging and prod), cli tools (beeline), hivesandbox
package configuration

Misc
Hue 3.5
hive-udfs

Internal Hive Build
cdh5-0.12.0_5.0.1 branch + 9 commits
3 test fixes, 2 Square-specific changes (pom + ci)
DATAPLAT-436: Beeline should return non-zero on invalid statements
HIVE-5799: session/operation timeout for hiveserver2
HIVE-5707: Validate values for ConfVar
HIVE-7040: Allow TCP keep alive on Hive Server 2
(merged in cdh5-0.12.0_5.0.1) HIVE-6893: out of sequence error in HiveMetastore
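The DATAPLAT-436 change matters because automated jobs usually tell whether a step worked from its exit status. A rough sketch of such a caller follows. `run_step` is a hypothetical helper, not part of Hive or Beeline, and the example commands use the Python interpreter as a stand-in for a real client invocation.

```python
import subprocess
import sys


def run_step(cmd):
    """Run one pipeline step and raise if it exits with a non-zero status.

    If the client returned zero on an invalid statement, this check would
    pass silently and later steps would run on missing data.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"step failed with exit code {result.returncode}: {cmd}"
        )
    return result.stdout


# Stand-in commands: one that succeeds and one that exits with status 3.
OK_CMD = [sys.executable, "-c", "print('ok')"]
FAILING_CMD = [sys.executable, "-c", "import sys; sys.exit(3)"]
```

With these stand-ins, `run_step(OK_CMD)` returns the command's output and `run_step(FAILING_CMD)` raises `RuntimeError`.
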

Story of HIVE-7040 + HIVE-5799
HIVE-7040: Allow TCP keep alive on Hive Server 2
F5 stateful firewall kills open connections
HIVE-5799: session/operation timeout for hiveserver2
Beeline interrupt does not close sessions
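The two fixes address opposite ends of the same problem: keepalive probes stop a stateful firewall from treating a quiet connection as dead, and a server-side timeout reclaims sessions whose client went away without closing them. The sketch below illustrates both ideas in Python. It is not HiveServer2 code, and `SessionTable` and its method names are hypothetical.

```python
import socket
import time


def enable_keepalive(sock):
    """Turn on TCP keepalive so the OS sends probes on an idle connection."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock


class SessionTable:
    """Track last activity per session and expire the ones idle too long."""

    def __init__(self, idle_timeout_s, clock=time.monotonic):
        self._idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._last_seen = {}

    def touch(self, session_id):
        """Record activity for a session, creating it if needed."""
        self._last_seen[session_id] = self._clock()

    def expire_idle(self):
        """Remove and return the ids of sessions idle for the timeout or more."""
        now = self._clock()
        expired = [
            sid
            for sid, seen in self._last_seen.items()
            if now - seen >= self._idle_timeout_s
        ]
        for sid in expired:
            del self._last_seen[sid]
        return expired

    def active(self):
        """Return the ids of sessions still tracked."""
        return list(self._last_seen)
```

The clock is injectable so that expiry can be exercised without sleeping; a server would call `expire_idle()` periodically from a background thread.
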

Hive Ops trick:
./wait_for_hive_jobs && sudo sv restart /var/service/hiveserver
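The slides do not show what `wait_for_hive_jobs` does. A plausible reading is that it blocks until no Hive jobs are running, so the restart does not kill work in flight. The sketch below follows that assumption; `count_running` and `restart` are hypothetical callables supplied by the caller.

```python
import time


def wait_for_jobs(count_running, poll_interval_s=5.0, sleep=time.sleep):
    """Block until count_running() reports zero; return the number of polls."""
    polls = 0
    while True:
        polls += 1
        if count_running() == 0:
            return polls
        sleep(poll_interval_s)


def drain_then_restart(count_running, restart, poll_interval_s=5.0,
                       sleep=time.sleep):
    """Wait for running jobs to finish, then trigger the restart once."""
    polls = wait_for_jobs(count_running, poll_interval_s, sleep)
    restart()
    return polls
```

The `&&` in the shell one-liner plays the same role as the ordering here: the restart runs only after the wait step returns successfully.
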

Next Steps
Figure out the best way to contribute back patches
HIVE-668{3,4}: Beeline comments suck
HIVE-7200: Beeline output displays column heading even if --showHeader=false is set
HIVE-4924: Support JDBC query timeouts
HIVE-5232: Use async interface for jdbc

Hive HA
Shark
Tez?
