SlideShare ist ein Scribd-Unternehmen logo
1 von 17
Rebuilding from MongoDB for Scale on HBase
Robert Roland (@robdaemon)
Lead Software Engineer
rob@simplymeasured.com
http://www.simplymeasured.com
© 2013 Simply Measured, Inc
Who Are We
Social Media Analytics
Serving 25% of the Interbrand Top 100 Global Brands
Collecting data from Twitter, Facebook, Instagram, YouTube,
Google Analytics and more
Delivered in a marketer’s favorite format – Excel!
GeekWire’s 2013 Startup of the Year
2
© 2013 Simply Measured, Inc
Twitter Account Report, jetBlue Airlines
3
© 2013 Simply Measured, Inc
Complete Social Media Snapshot, Red Bull
4
© 2013 Simply Measured, Inc
Our Setup
Cloudera CDH 4.2.1 on Ubuntu 12.04
3 “control” nodes (HDFS name node, job tracker, HBase master)
11 “data” nodes (HDFS data nodes, task trackers, HBase region servers)
Bare-metal, using managed colo hosting
14 TB of data across 50 HBase tables (One table per data source -Twitter,
Facebook, plus secondary indexes and operations)
Our customers have tracked 1.5 billion Tweets so far, and growing
More than 5 million rows across 3,000 reports generated daily
5
© 2013 Simply Measured, Inc
Why did we start with MongoDB?
• Ease of development
• Lack of schema is a good and a bad thing
• Easy to use libraries in Ruby (mongomatic)
• Lower initial investment
• You can run MongoDB on one server in production (but you shouldn’t)
• Master/slave was easier to start with than the current sharding model
• Active, engaged community
• Really, really effective marketing masks MongoDB's
shortcomings…
6
© 2013 Simply Measured, Inc
Why leave MongoDB?
• 10 TB of data was too much for it
• Instability
• No one wants to restart mongos every two days
• Bugs
• Very public, very high profile failures
• Silent failure modes
• What do you mean, the index creation failed and you just attempted
to load an entire collection into RAM?
7
© 2013 Simply Measured, Inc
Why HBase, or, How is this better than Mongo?
• More linear scaling
• Add new nodes, run a major compaction
• Our data can easily be modeled as a sparse column store
• We value consistency
• Our users will tell us if we’re missing one Tweet out of 1 million
• Monitoring
• Metrics
• Stability
• Excellent vendor support
8
© 2013 Simply Measured, Inc
How did we migrate?
• Implemented the Strangler pattern
• Dual writes to Mongo and HBase
• Migrate older data a few customers at a time, a few data sources at a
time
• Dual report generation platform – enabled us to compare reports off
our MongoDB platform and our HBase platform
• Migrated existing Ruby 1.8 code to JRuby
• Direct access to the HBase cluster
• I’m a better Java developer than Ruby developer
• Great profiler tools
9
© 2013 Simply Measured, Inc
Challenges with HBase
• Out of the box configuration is not good enough
• GC Tuning
• Default file size of 256 mb is too small
• Compactions will eat you alive
• Be sure to enable the HDFS trashcan! (fs.trash.interval)
• Configurations can be difficult to manage
• Chef / Puppet if you want to roll your own
• Cloudera Manager
• Schema
• Lack of types means the HBase shell is harder to read
• No native secondary indexes
10
© 2013 Simply Measured, Inc
What’s been awesome
• Great user community on the mailing lists
• Source code is easy to follow and submit patches
• Stable, stable, stable
• Unless you configure something wrong, like ulimits or xcievers!
11
© 2013 Simply Measured, Inc
Google Analytics schema
• Row key
• salt|dataSourceId|dimension|…|date
• Columns
• CF: metadata
• dataSourceId – hash, identifies a data source mapping to a customer
• timezone – Google Analytics time zone for this data
• dimensions – delimited list of Google Analytics dimensions
• CF: data
• GA data key – Name of metric as defined by Google Analytics, Protocol
Buffer representing value
12
© 2013 Simply Measured, Inc
Schema evolution
• Enrichment via Klout, geolocation data, Bit.ly, etc. was stored
in a separate table, and “joined” at query time
• Enrichment is now a column family within each data source
• Started with Protocol Buffers, now writing as qualifiers and
columns
• Ability to use server-side filters, ease of use within HBase shell
• Protocol Buffers still exist in some cases (example coming up)
• Rekeyed several tables over time
• Dual write during migration
• Map/reduce jobs to migrate data
• More column families per table than were necessary
• Lots and lots of memstore pain.
13
© 2013 Simply Measured, Inc
Protocol Buffers
• Union-like Protocol Buffer for arbitrary key/value pairs
14
© 2013 Simply Measured, Inc
Tips and Tricks
• Don’t expect to get your row keys correct on the first try
• Look for hot spotting
• Does ordering matter during queries?
• Consider your backup strategy
• S3 seems to work for us
• Replication is also an option
• This is not an RDBMS. Don’t JOIN, denormalize!
15
© 2013 Simply Measured, Inc
What’s coming up
• Hive
• Easier querying
• Entirely standardized representation of our data
• HCatalog
• Expose our schema to other internal tooling
• A much larger cluster
• Even more data sources
16
Thank You
Robert Roland
@robdaemon
rob@simplymeasured.com
We’re hiring!
http://simplymeasured.com/about/careers/
Our Tech Blog: http://engineering.simplymeasured.com/
Our Open Source: http://simplymeasured.github.io/
More sample reports: http://simplymeasured.com/tour/sample-reports/

Weitere ähnliche Inhalte

Was ist angesagt?

Maximizing MongoDB Performance on AWS
Maximizing MongoDB Performance on AWSMaximizing MongoDB Performance on AWS
Maximizing MongoDB Performance on AWSMongoDB
 
FOSSASIA 2016 - 7 Tips to design web centric high-performance applications
FOSSASIA 2016 - 7 Tips to design web centric high-performance applicationsFOSSASIA 2016 - 7 Tips to design web centric high-performance applications
FOSSASIA 2016 - 7 Tips to design web centric high-performance applicationsAshnikbiz
 
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...Data Con LA
 
Building a Scalable and Modern Infrastructure at CARFAX
Building a Scalable and Modern Infrastructure at CARFAXBuilding a Scalable and Modern Infrastructure at CARFAX
Building a Scalable and Modern Infrastructure at CARFAXMongoDB
 
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...Data Con LA
 
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...Data Con LA
 
MongoDB Evenings Chicago - Find Your Way in MongoDB 3.2: Compass and Beyond
MongoDB Evenings Chicago - Find Your Way in MongoDB 3.2: Compass and BeyondMongoDB Evenings Chicago - Find Your Way in MongoDB 3.2: Compass and Beyond
MongoDB Evenings Chicago - Find Your Way in MongoDB 3.2: Compass and BeyondMongoDB
 
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDB
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDBHBaseCon 2015: Industrial Internet Case Study using HBase and TSDB
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDBHBaseCon
 
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...Cloudera, Inc.
 
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...✔ Eric David Benari, PMP
 
MongoDB vs Mysql. A devops point of view
MongoDB vs Mysql. A devops point of viewMongoDB vs Mysql. A devops point of view
MongoDB vs Mysql. A devops point of viewPierre Baillet
 
E commerce data migration in moving systems across data centres
E commerce data migration in moving systems across data centres E commerce data migration in moving systems across data centres
E commerce data migration in moving systems across data centres Regunath B
 
MongoDB 3.2 Feature Preview
MongoDB 3.2 Feature PreviewMongoDB 3.2 Feature Preview
MongoDB 3.2 Feature PreviewNorberto Leite
 
Prepare for Peak Holiday Season with MongoDB
Prepare for Peak Holiday Season with MongoDBPrepare for Peak Holiday Season with MongoDB
Prepare for Peak Holiday Season with MongoDBMongoDB
 
An Enterprise Architect's View of MongoDB
An Enterprise Architect's View of MongoDBAn Enterprise Architect's View of MongoDB
An Enterprise Architect's View of MongoDBMongoDB
 
Webinar: Simplifying the Database Experience with MongoDB Atlas
Webinar: Simplifying the Database Experience with MongoDB AtlasWebinar: Simplifying the Database Experience with MongoDB Atlas
Webinar: Simplifying the Database Experience with MongoDB AtlasMongoDB
 
Oss as a competitive advantage
Oss as a competitive advantageOss as a competitive advantage
Oss as a competitive advantageRegunath B
 
Introduction to MongoDB Enterprise
Introduction to MongoDB EnterpriseIntroduction to MongoDB Enterprise
Introduction to MongoDB EnterpriseMongoDB
 

Was ist angesagt? (20)

Hadoop and friends
Hadoop and friendsHadoop and friends
Hadoop and friends
 
Maximizing MongoDB Performance on AWS
Maximizing MongoDB Performance on AWSMaximizing MongoDB Performance on AWS
Maximizing MongoDB Performance on AWS
 
FOSSASIA 2016 - 7 Tips to design web centric high-performance applications
FOSSASIA 2016 - 7 Tips to design web centric high-performance applicationsFOSSASIA 2016 - 7 Tips to design web centric high-performance applications
FOSSASIA 2016 - 7 Tips to design web centric high-performance applications
 
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
 
Building a Scalable and Modern Infrastructure at CARFAX
Building a Scalable and Modern Infrastructure at CARFAXBuilding a Scalable and Modern Infrastructure at CARFAX
Building a Scalable and Modern Infrastructure at CARFAX
 
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
 
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
 
MongoDB Evenings Chicago - Find Your Way in MongoDB 3.2: Compass and Beyond
MongoDB Evenings Chicago - Find Your Way in MongoDB 3.2: Compass and BeyondMongoDB Evenings Chicago - Find Your Way in MongoDB 3.2: Compass and Beyond
MongoDB Evenings Chicago - Find Your Way in MongoDB 3.2: Compass and Beyond
 
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDB
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDBHBaseCon 2015: Industrial Internet Case Study using HBase and TSDB
HBaseCon 2015: Industrial Internet Case Study using HBase and TSDB
 
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...
Hadoop World 2011: Data Ingestion, Egression, and Preparation for Hadoop - Sa...
 
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
 
MongoDB vs Mysql. A devops point of view
MongoDB vs Mysql. A devops point of viewMongoDB vs Mysql. A devops point of view
MongoDB vs Mysql. A devops point of view
 
E commerce data migration in moving systems across data centres
E commerce data migration in moving systems across data centres E commerce data migration in moving systems across data centres
E commerce data migration in moving systems across data centres
 
MongoDB 3.2 Feature Preview
MongoDB 3.2 Feature PreviewMongoDB 3.2 Feature Preview
MongoDB 3.2 Feature Preview
 
Prepare for Peak Holiday Season with MongoDB
Prepare for Peak Holiday Season with MongoDBPrepare for Peak Holiday Season with MongoDB
Prepare for Peak Holiday Season with MongoDB
 
An Enterprise Architect's View of MongoDB
An Enterprise Architect's View of MongoDBAn Enterprise Architect's View of MongoDB
An Enterprise Architect's View of MongoDB
 
Webinar: Simplifying the Database Experience with MongoDB Atlas
Webinar: Simplifying the Database Experience with MongoDB AtlasWebinar: Simplifying the Database Experience with MongoDB Atlas
Webinar: Simplifying the Database Experience with MongoDB Atlas
 
Oss as a competitive advantage
Oss as a competitive advantageOss as a competitive advantage
Oss as a competitive advantage
 
Introduction to MongoDB Enterprise
Introduction to MongoDB EnterpriseIntroduction to MongoDB Enterprise
Introduction to MongoDB Enterprise
 
Practical Use of a NoSQL Database
Practical Use of a NoSQL DatabasePractical Use of a NoSQL Database
Practical Use of a NoSQL Database
 

Ähnlich wie Rebuilding from MongoDB for Scale on HBase

Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
Moving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from PivotalMoving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from PivotalVMware Tanzu Korea
 
MongoDB Breakfast Milan - Mainframe Offloading Strategies
MongoDB Breakfast Milan -  Mainframe Offloading StrategiesMongoDB Breakfast Milan -  Mainframe Offloading Strategies
MongoDB Breakfast Milan - Mainframe Offloading StrategiesMongoDB
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
HBaseCon 2013: ETL for Apache HBase
HBaseCon 2013: ETL for Apache HBaseHBaseCon 2013: ETL for Apache HBase
HBaseCon 2013: ETL for Apache HBaseCloudera, Inc.
 
HBaseCon 2013: Evolving a First-Generation Apache HBase Deployment to Second...
HBaseCon 2013:  Evolving a First-Generation Apache HBase Deployment to Second...HBaseCon 2013:  Evolving a First-Generation Apache HBase Deployment to Second...
HBaseCon 2013: Evolving a First-Generation Apache HBase Deployment to Second...Cloudera, Inc.
 
Stinger Initiative - Deep Dive
Stinger Initiative - Deep DiveStinger Initiative - Deep Dive
Stinger Initiative - Deep DiveHortonworks
 
Ameya Kanitkar: Using Hadoop and HBase to Personalize Web, Mobile and Email E...
Ameya Kanitkar: Using Hadoop and HBase to Personalize Web, Mobile and Email E...Ameya Kanitkar: Using Hadoop and HBase to Personalize Web, Mobile and Email E...
Ameya Kanitkar: Using Hadoop and HBase to Personalize Web, Mobile and Email E...WebExpo
 
MongoDB .local Houston 2019: Building an IoT Streaming Analytics Platform to ...
MongoDB .local Houston 2019: Building an IoT Streaming Analytics Platform to ...MongoDB .local Houston 2019: Building an IoT Streaming Analytics Platform to ...
MongoDB .local Houston 2019: Building an IoT Streaming Analytics Platform to ...MongoDB
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewAbhishek Roy
 
PostgreSQL as a Strategic Tool
PostgreSQL as a Strategic ToolPostgreSQL as a Strategic Tool
PostgreSQL as a Strategic ToolEDB
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Transform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataTransform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataAshnikbiz
 
Data & Analytics - Session 1 - Big Data Analytics
Data & Analytics - Session 1 -  Big Data AnalyticsData & Analytics - Session 1 -  Big Data Analytics
Data & Analytics - Session 1 - Big Data AnalyticsAmazon Web Services
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
 
Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which DataWorks Summit
 
Hadoop and Hive in Enterprises
Hadoop and Hive in EnterprisesHadoop and Hive in Enterprises
Hadoop and Hive in Enterprisesmarkgrover
 

Ähnlich wie Rebuilding from MongoDB for Scale on HBase (20)

Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Moving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from PivotalMoving data to the cloud BY CESAR ROJAS from Pivotal
Moving data to the cloud BY CESAR ROJAS from Pivotal
 
MongoDB Breakfast Milan - Mainframe Offloading Strategies
MongoDB Breakfast Milan -  Mainframe Offloading StrategiesMongoDB Breakfast Milan -  Mainframe Offloading Strategies
MongoDB Breakfast Milan - Mainframe Offloading Strategies
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
HBaseCon 2013: ETL for Apache HBase
HBaseCon 2013: ETL for Apache HBaseHBaseCon 2013: ETL for Apache HBase
HBaseCon 2013: ETL for Apache HBase
 
HBaseCon 2013: Evolving a First-Generation Apache HBase Deployment to Second...
HBaseCon 2013:  Evolving a First-Generation Apache HBase Deployment to Second...HBaseCon 2013:  Evolving a First-Generation Apache HBase Deployment to Second...
HBaseCon 2013: Evolving a First-Generation Apache HBase Deployment to Second...
 
Stinger Initiative - Deep Dive
Stinger Initiative - Deep DiveStinger Initiative - Deep Dive
Stinger Initiative - Deep Dive
 
Ameya Kanitkar: Using Hadoop and HBase to Personalize Web, Mobile and Email E...
Ameya Kanitkar: Using Hadoop and HBase to Personalize Web, Mobile and Email E...Ameya Kanitkar: Using Hadoop and HBase to Personalize Web, Mobile and Email E...
Ameya Kanitkar: Using Hadoop and HBase to Personalize Web, Mobile and Email E...
 
MongoDB .local Houston 2019: Building an IoT Streaming Analytics Platform to ...
MongoDB .local Houston 2019: Building an IoT Streaming Analytics Platform to ...MongoDB .local Houston 2019: Building an IoT Streaming Analytics Platform to ...
MongoDB .local Houston 2019: Building an IoT Streaming Analytics Platform to ...
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
 
Dataweek-Talk-2014
Dataweek-Talk-2014Dataweek-Talk-2014
Dataweek-Talk-2014
 
PostgreSQL as a Strategic Tool
PostgreSQL as a Strategic ToolPostgreSQL as a Strategic Tool
PostgreSQL as a Strategic Tool
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Transform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big DataTransform your DBMS to drive engagement innovation with Big Data
Transform your DBMS to drive engagement innovation with Big Data
 
Data & Analytics - Session 1 - Big Data Analytics
Data & Analytics - Session 1 -  Big Data AnalyticsData & Analytics - Session 1 -  Big Data Analytics
Data & Analytics - Session 1 - Big Data Analytics
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which
 
Hadoop and Hive in Enterprises
Hadoop and Hive in EnterprisesHadoop and Hive in Enterprises
Hadoop and Hive in Enterprises
 

Kürzlich hochgeladen

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 

Kürzlich hochgeladen (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 

Rebuilding from MongoDB for Scale on HBase

  • 1. Rebuilding from MongoDB for Scale on HBase Robert Roland (@robdaemon) Lead Software Engineer rob@simplymeasured.com http://www.simplymeasured.com
  • 2. © 2013 Simply Measured, Inc Who Are We Social Media Analytics Serving 25% of the Interbrand Top 100 Global Brands Collecting data from Twitter, Facebook, Instagram, YouTube, Google Analytics and more Delivered in a marketer’s favorite format – Excel! GeekWire’s 2013 Startup of the Year 2
  • 3. © 2013 Simply Measured, Inc Twitter Account Report, jetBlue Airlines 3
  • 4. © 2013 Simply Measured, Inc Complete Social Media Snapshot, Red Bull 4
  • 5. © 2013 Simply Measured, Inc Our Setup Cloudera CDH 4.2.1 on Ubuntu 12.04 3 “control” nodes (HDFS name node, job tracker, HBase master) 11 “data” nodes (HDFS data nodes, task trackers, HBase region servers) Bare-metal, using managed colo hosting 14 TB of data across 50 HBase tables (One table per data source -Twitter, Facebook, plus secondary indexes and operations) Our customers have tracked 1.5 billion Tweets so far, and growing More than 5 million rows across 3,000 reports generated daily 5
  • 6. © 2013 Simply Measured, Inc Why did we start with MongoDB? • Ease of development • Lack of schema is a good and a bad thing • Easy to use libraries in Ruby (mongomatic) • Lower initial investment • You can run MongoDB on one server in production (but you shouldn’t) • Master/slave was easier to start with than the current sharding model • Active, engaged community • Really, really effective marketing masks MongoDB's shortcomings… 6
  • 7. © 2013 Simply Measured, Inc Why leave MongoDB? • 10 TB of data was too much for it • Instability • No one wants to restart mongos every two days • Bugs • Very public, very high profile failures • Silent failure modes • What do you mean, the index creation failed and you just attempted to load an entire collection into RAM? 7
  • 8. © 2013 Simply Measured, Inc Why HBase, or, How is this better than Mongo? • More linear scaling • Add new nodes, run a major compaction • Our data can easily be modeled as a sparse column store • We value consistency • Our users will tell us if we’re missing one Tweet out of 1 million • Monitoring • Metrics • Stability • Excellent vendor support 8
  • 9. © 2013 Simply Measured, Inc How did we migrate? • Implemented the Strangler pattern • Dual writes to Mongo and HBase • Migrate older data a few customers at a time, a few data sources at a time • Dual report generation platform – enabled us to compare reports off our MongoDB platform and our HBase platform • Migrated existing Ruby 1.8 code to JRuby • Direct access to the HBase cluster • I’m a better Java developer than Ruby developer • Great profiler tools 9
  • 10. © 2013 Simply Measured, Inc Challenges with HBase • Out of the box configuration is not good enough • GC Tuning • Default file size of 256 mb is too small • Compactions will eat you alive • Be sure to enable the HDFS trashcan! (fs.trash.interval) • Configurations can be difficult to manage • Chef / Puppet if you want to roll your own • Cloudera Manager • Schema • Lack of types means the HBase shell is harder to read • No native secondary indexes 10
  • 11. © 2013 Simply Measured, Inc What’s been awesome • Great user community on the mailing lists • Source code is easy to follow and submit patches • Stable, stable, stable • Unless you configure something wrong, like ulimits or xcievers! 11
  • 12. © 2013 Simply Measured, Inc Google Analytics schema • Row key • salt|dataSourceId|dimension|…|date • Columns • CF: metadata • dataSourceId – hash, identifies a data source mapping to a customer • timezone – Google Analytics time zone for this data • dimensions – delimited list of Google Analytics dimensions • CF: data • GA data key – Name of metric as defined by Google Analytics, Protocol Buffer representing value 12
  • 13. © 2013 Simply Measured, Inc Schema evolution • Enrichment via Klout, geolocation data, Bit.ly, etc. was stored in a separate table, and “joined” at query time • Enrichment is now a column family within each data source • Started with Protocol Buffers, now writing as qualifiers and columns • Ability to use server-side filters, ease of use within HBase shell • Protocol Buffers still exist in some cases (example coming up) • Rekeyed several tables over time • Dual write during migration • Map/reduce jobs to migrate data • More column families per table than were necessary • Lots and lots of memstore pain. 13
  • 14. © 2013 Simply Measured, Inc Protocol Buffers • Union-like Protocol Buffer for arbitrary key/value pairs 14
  • 15. © 2013 Simply Measured, Inc Tips and Tricks • Don’t expect to get your row keys correct on the first try • Look for hot spotting • Does ordering matter during queries? • Consider your backup strategy • S3 seems to work for us • Replication is also an option • This is not an RDBMS. Don’t JOIN, denormalize! 15
  • 16. © 2013 Simply Measured, Inc What’s coming up • Hive • Easier querying • Entirely standardized representation of our data • HCatalog • Expose our schema to other internal tooling • A much larger cluster • Even more data sources 16
  • 17. Thank You Robert Roland @robdaemon rob@simplymeasured.com We’re hiring! http://simplymeasured.com/about/careers/ Our Tech Blog: http://engineering.simplymeasured.com/ Our Open Source: http://simplymeasured.github.io/ More sample reports: http://simplymeasured.com/tour/sample-reports/

Hinweis der Redaktion

  1. Introduce myself
  2. Twitter and Klout metrics, bit.ly expansion, geolocation
  3. Facebook, YouTube, Google+, Twitter, Instagram
  4. Mention recent upgrade to CDH4 from CDH3
  5. Mention initial rollout didn’t use Cloudera ManagerTalk about chef
  6. Mention clock skew
  7. Could have flattened into columns, use a Hungarian-style notation, etc
  8. Discuss our “enrichment” idea here with the denormalize part
  9. Mention Impala or other options