4. A LOT OF OUR
DATA IS IN MONGO
• MongoDB is a fantastic application
database
• uses BSON - like JSON, but has a
binary representation
• MongoDB is schemaless, but has
indexed queries and other
features that are nice for
applications
5. APPLICATION DBS
SUCK FOR ANALYSIS
• well, sometimes. relational
databases are OK
• MongoDB is awful (for this)
• no joins
• scans are painful
• no declarative query language
7. V1:
TSV + IMPALA
• threw together a Hadoop cluster
on the developer boxes
• script dumped models “nightly” to
TSV files in HDFS
• janky script output the schema
from your models
• query from Impala
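The nightly dump step above can be sketched roughly like this — a minimal Python version of that kind of janky script, with a hypothetical hand-maintained field list standing in for the real models (names and fields are invented for illustration):

```python
import csv
import io

# Hand-maintained field list -- the janky "schema from your models" part.
# Field names here are hypothetical.
SCHEMA = ["_id", "amount", "currency"]

def dump_tsv(docs, out):
    """Write dict-shaped documents (e.g. from a Mongo cursor) as TSV."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(SCHEMA)
    for doc in docs:
        # Missing fields become empty strings; unknown fields are dropped.
        writer.writerow(doc.get(field, "") for field in SCHEMA)

docs = [{"_id": "ch_1", "amount": 100, "currency": "usd"},
        {"_id": "ch_2", "amount": 250}]
buf = io.StringIO()
dump_tsv(docs, buf)
print(buf.getvalue())
```

Note the silent data loss: any field not in the hand-written list never makes it to HDFS, which is one reason this approach didn’t last.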
8. ASIDE: IMPALA IS
PRETTY COOL
• developed by Cloudera
• absurdly fast queries over HDFS
• SQL is great
• most of our questions are ad-hoc
9. A NICE
EXPERIMENT, BUT...
• schema translation is hard
• SLOW SLOW SLOW
• TSV is not a great format
• script never runs
• not production data
10. V2:
MONGO -> HBASE
• Impala can query HBase, I think?
• @nelhage wrote MoSQL - let’s do
the same thing, but put the data
in HBase!
• translating from one k/v store to
another is easier
15. THEN, QUERY IT
WITH IMPALA...UM
• wait, impala can’t actually query
HBase effectively
• 30-40x slower over the same
data
• limiting factor is HBase scan
speed, I think
16. LOST IN
TRANSLATION
• our schema problem is still there!
• BSON is typed, but HBase is just
strings
• nested hashes still don’t work
• lists???
• what is the canonical schema?
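A sketch of why the k/v translation is lossy — flattening a typed, nested BSON-style document into HBase-style string columns throws away types and forces awkward encodings for hashes and lists (the flattening scheme here is hypothetical, just one plausible way to do it):

```python
def flatten(doc, prefix=""):
    """Flatten a nested document into HBase-style (column, value) string
    pairs. Everything becomes a string, so type information is lost --
    exactly the problem described above."""
    pairs = {}
    for key, value in doc.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            pairs.update(flatten(value, column + "."))
        elif isinstance(value, list):
            # Lists have no natural k/v encoding; index suffixes are one hack.
            for i, item in enumerate(value):
                pairs[f"{column}.{i}"] = str(item)
        else:
            pairs[column] = str(value)
    return pairs

doc = {"amount": 100, "card": {"brand": "visa"}, "tags": ["a", "b"]}
print(flatten(doc))
# {'amount': '100', 'card.brand': 'visa', 'tags.0': 'a', 'tags.1': 'b'}
```

After flattening, the integer `100` and the string `"100"` are indistinguishable — there is no canonical schema to recover them from.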
18. V3:
PARQUET + THRIFT
• instead of storing k/v pairs, just
store the raw BSON blobs
• write your MR jobs against HBase
if you want up-to-date data
• also periodically dump out
Parquet files
• use thrift definitions to manage
schema
19. USING THRIFT AS
SCHEMA
• thrift is a nice way to define what
fields we expect to be in the
BSON
• in most cases, we can do the
translation automatically
• decode on the backend, instead
of during replication
• no information loss
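A thrift definition for a collection might look something like this (the struct and field names are hypothetical, not the actual schema from the talk):

```thrift
// Hypothetical definition for a Mongo "charges" collection.
// optional fields tolerate documents written before a field existed.
struct Charge {
  1: optional string _id,
  2: optional i64 amount,
  3: optional string currency,
  4: optional map<string, string> metadata,
}
```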
20. GENERATE THRIFT
DEFINITIONS?
• thrift still isn’t the canonical
schema for our application - that
exists in our ODM
• wrote a quick ruby script to
generate thrift definitions from
our application models
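The idea behind that generator script, sketched here in Python rather than Ruby, with a hypothetical type mapping — walk each model’s declared fields and emit a thrift struct:

```python
# Sketch of generating a thrift struct from an application model's
# field declarations (hypothetical; the talk's version was a quick
# Ruby script walking ODM classes).
TYPE_MAP = {str: "string", int: "i64", float: "double", bool: "bool"}

def to_thrift(name, fields):
    """fields: ordered mapping of field name -> Python type."""
    lines = [f"struct {name} {{"]
    for i, (field, py_type) in enumerate(fields.items(), start=1):
        lines.append(f"  {i}: optional {TYPE_MAP[py_type]} {field},")
    lines.append("}")
    return "\n".join(lines)

print(to_thrift("Charge", {"_id": str, "amount": int, "paid": bool}))
```

One subtlety such a script has to get right: thrift field numbers must stay stable across regenerations, or previously written data stops deserializing.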
22. IMPALA <3
PARQUET
• more glue can automatically
import parquet files into Impala
• Impala and parquet are designed
to work well with each other
• nested structs don’t work yet =(
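The import glue likely boils down to DDL along these lines — Impala can infer column types from an existing Parquet file and then query the whole directory (table and paths here are made up):

```sql
-- Hypothetical table/paths, shown only to illustrate the idea.
CREATE EXTERNAL TABLE charges
  LIKE PARQUET '/warehouse/charges/part-00000.parquet'
  STORED AS PARQUET
  LOCATION '/warehouse/charges/';
```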
23. SCALDING <3
PARQUET
• we use scalding for a lot of
MapReduce stuff
• added ParquetSource to scalding
to make this easy (source and
sink)
24. THIS WORKS FOR
ANY DATA
• use thrift to define an
intermediate or derived data
type, and you get, for free:
• serialization using parquet
• easy MR jobs with scalding
• ad-hoc querying with Impala