2. HBase at Factual
Serve an API for global location queries and accept live
writes of new supporting data
Batch updates: ingesting large amounts of new data,
pushing out new versions of the data (improvements in
algorithms for data cleaning, verification, clustering)
4. HBase Intro -- Data Model
● Column families for an HBase table are specified at table
creation time
● Arbitrary byte sequences for column qualifiers (unlimited
and created as data is written)
● Data is organized by column family and sorted by row key
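The data model above can be pictured as a sorted, multi-dimensional map: row key → column family → qualifier → value. A minimal sketch in Python (this is not the HBase API; the table, family, and qualifier names are made up):

```python
# Toy model of an HBase table: rows sorted by key; each row holds
# {column family: {qualifier: value}}. Families ("loc", "meta") are
# fixed at "table creation"; qualifiers are arbitrary byte strings
# created as data is written.
table = {}

def put(row, family, qualifier, value):
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value

put(b"place#0042", b"loc", b"lat", b"34.05")
put(b"place#0042", b"loc", b"lng", b"-118.25")
put(b"place#0042", b"meta", b"source:crawl-7", b"...")  # qualifier created on write

def scan(start, stop):
    # Rows come back in lexicographic key order, grouped by family.
    for row in sorted(k for k in table if start <= k < stop):
        yield row, table[row]

for row, families in scan(b"place#", b"place$"):
    print(row, sorted(families))
```

The nested-dict view is only an analogy, but it captures the two points above: families are a fixed top-level grouping, while qualifiers are open-ended.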
5. HBase Intro -- HFile Format
● Sorted lexicographically with secondary indices inline
with data
● Block size: a memory/I-O tradeoff. Choose based on
expected read access patterns (smaller blocks favor
random reads; larger blocks favor sequential scans)
● Compression: experiment with LZO, Snappy, and gzip
● Index Size: don’t make keys and column names longer
than needed.
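Because HFiles sort keys lexicographically as raw bytes, key encoding matters: variable-width numeric keys sort in surprising order, and fixed-width encodings fix this at the cost of extra key bytes (which, per the last point, inflate the index). A quick illustration with hypothetical keys:

```python
# Lexicographic byte ordering: b"10" sorts before b"2".
keys = [b"2", b"10", b"1"]
print(sorted(keys))  # [b'1', b'10', b'2'] -- not numeric order

# Fixed-width (zero-padded) encoding restores the expected order,
# but every extra byte is repeated in each cell and in the index.
padded = [b"%04d" % int(k) for k in keys]
print(sorted(padded))  # [b'0001', b'0002', b'0010']
```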
7. HBase Intro -- Locality
● Region servers write new data locally
● Compaction further promotes data locality
● Metrics are exposed at the region server level: “Block
Locality” in 1.0, “hdfsBlocksLocalityIndex” in 0.94
● Enable short circuit reads for additional benefits
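Short-circuit reads let a region server read local HDFS blocks directly from disk instead of streaming them through the DataNode. A typical configuration fragment looks like the following (the socket path is an example; adjust it for your install):

```xml
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
```

Note the benefit only materializes when locality is already good: short-circuit reads do nothing for blocks served from remote DataNodes.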
8. HBase Intro -- Consistency
● Single-row atomicity is guaranteed, even across column families
● checkAndPut -- single row, checks on the value of a
single column only
● mutateRowsWithLocks -- multi-row atomicity via a coprocessor
○ works only within a single region: needs clever row
key design and a matching region split policy
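Since mutateRowsWithLocks only spans one region, rows that must change together need keys that keep them contiguous, e.g. a shared entity prefix, combined with a split policy that never cuts inside a prefix. A sketch of the key idea (the prefix scheme and names are hypothetical):

```python
# Rows sharing an entity prefix sort next to each other under
# lexicographic ordering, so a prefix-aware split policy can keep
# them inside one region (and thus one mutateRowsWithLocks call).
def row_key(entity_id, suffix):
    return b"%s#%s" % (entity_id, suffix)

keys = sorted([
    row_key(b"user42", b"profile"),
    row_key(b"user42", b"index:email"),
    row_key(b"user7", b"profile"),
])
print(keys)

# All rows for one entity occupy a contiguous run of the sort order.
user42 = [k for k in keys if k.startswith(b"user42#")]
print(user42)
```

A split policy would then only choose split points at prefix boundaries, guaranteeing each entity's run of rows stays in a single region.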
10. HBase and Batch
● Better performance for large scale updates
● Quality analysis and metrics on all data before adopting
● Perform computations that are not possible or
prohibitively expensive in a live HBase setting
● Data is already on HDFS
12. HBase Snapshots
● Copy on write (HFile links)
● Per table
● Rolling snapshots
○ consistency is limited to HBase’s single-row guarantee
● Use cases: backup/recovery, export, mapreduce
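The copy-on-write idea can be mimicked with filesystem hard links: a snapshot records links to the immutable HFiles, so files written after the snapshot, and even deletion of the originals, leave the snapshot's view intact. A rough analogy in Python (this is not how HBase implements HFile links internally):

```python
import os
import tempfile

data_dir = tempfile.mkdtemp()  # stands in for the table's store files
snap_dir = tempfile.mkdtemp()  # stands in for the snapshot directory

# A "flush" produces an immutable file.
with open(os.path.join(data_dir, "hfile-1"), "w") as f:
    f.write("row1=a\n")

# Snapshot: link to the existing files instead of copying bytes.
for name in os.listdir(data_dir):
    os.link(os.path.join(data_dir, name), os.path.join(snap_dir, name))

# New writes after the snapshot create new files...
with open(os.path.join(data_dir, "hfile-2"), "w") as f:
    f.write("row2=b\n")

# ...and removing the original still leaves the snapshot's link valid.
os.remove(os.path.join(data_dir, "hfile-1"))
print(sorted(os.listdir(snap_dir)))  # ['hfile-1']
```

Because only links are created, taking the snapshot is cheap regardless of table size, which is what makes the backup/export use cases above practical.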
13. Snapshots and MapReduce
● Definitely use MapReduce over snapshots, if possible
(HBASE-8369)
○ before this feature, reading HFiles directly was
problematic because compaction could remove or
rewrite them underneath the job
○ Advantages: Job is faster and puts less pressure on
region servers
○ Caveat: Not reading live data
14. Locality and MapReduce
Tradeoff: we want to colocate computation with data, but
this causes resource contention with HBase
○ Don’t run MapReduce on HBase nodes?
○ Mitigated somewhat with YARN?
16. Bulkloading
● An additional path for ingesting data into HBase: create
HFiles directly, and HBase adopts the files
● Bulkload is atomic at the region level
○ single-row consistency (across column families) guaranteed
17. Locality after Bulkloading
● Look at current region locations, and try to produce new
HFiles on those nodes
● Compaction after bulkloading needs to be timed well,
but can eventually restore locality
● New data will not be in the block cache!
● HBASE-11195 can prioritize compaction in cases where
locality is low
● HBASE-8329 adds throttling of compaction speed
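"Producing new HFiles on those nodes" amounts to consulting the table's region boundaries and preferring the host that currently serves each key range when scheduling the writers. A simplified sketch (the region map and host names are made up; a real job would fetch them from the HBase client):

```python
import bisect

# Region start keys and the host currently serving each region
# (hypothetical; obtained from the cluster in practice).
region_starts = [b"", b"g", b"p"]
region_hosts = ["node-a", "node-b", "node-c"]

def preferred_host(key):
    # Find the region whose [start, next_start) range contains the key.
    i = bisect.bisect_right(region_starts, key) - 1
    return region_hosts[i]

print(preferred_host(b"cafe-123"))    # node-a
print(preferred_host(b"hotel-9"))     # node-b
print(preferred_host(b"pizzeria-4"))  # node-c
```

Writing each output partition on its preferred host means the adopted HFiles start out local to the serving region server, so less post-bulkload compaction is needed to reach good locality.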
19. Challenges
● Does bulkloading fit your data model?
○ Replay--do you need a catchup phase after data
ingestion?
● Consistency beyond row level (using a library to
manage a secondary index or other transactional
writes)?
● Maybe use MapReduce over live tables but throttle
requests? (HBASE-11598)
20. Summary
1. Ability to do bulk updates can be hugely important for
performance
2. HDFS integration and strong feature set make HBase a
good choice for batch processing
3. More features coming (this talk highlighted many
introduced in the new 1.0 release)