MapReduce over snapshots
HBASE-8369

Enis Soztutar
Enis [at] apache [dot] org
@enissoz

© Hortonworks Inc. 2011

Page 1
About Me
• In the Hadoop space since 2007
• Committer and PMC Member in Apache HBase and Hadoop
• Working at Hortonworks as member of Technical Staff
• Twitter: @enissoz

Architecting the Future of Big Data

Page 2
Snapshots
• Currently, a snapshot is a set of reference files together with some metadata
• A table's snapshot can contain
– Table descriptor
– List of regions
– References to files in the regions
– References to WALs for regionservers

• The current snapshot implementation is flush-based
– Forces flush to all regions, so that in-memory data is written to disk


Page 3
MR over Snapshots
• The idea is to do scans on the client side, bypassing region servers
• Use snapshots since they are immutable
• Similar to short-circuit HDFS reads
• TableSnapshotInputFormat works similarly to TableInputFormat

• TableMapReduceUtil methods to configure the job


Page 4
Deployment Options
HBase online
• Take snapshot while HBase is running
• Run MR job over the snapshot

HBase offline
• Take snapshot while HBase is running
• Export the snapshot to a different HDFS cluster using ExportSnapshot
• Run MR job over snapshot with or without HBase running


Page 5
TableSnapshotInputFormat
• Gets a Scan representing the query
• Restores the snapshot to a temporary directory
• For each region in the snapshot:
– Determine whether the region should be scanned (its key range falls between the scan start row and stop row)
– Create one split per region in the scan range (the number of splits is the number of map tasks)
– Each RecordReader opens the region (HRegion), as HRegionServer does
– An internal RegionScanner is used for running the scan
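
The region-selection step above can be sketched in plain Java. This is only an illustration of the range-overlap idea, not the code in TableSnapshotInputFormat: the class `SnapshotSplitSketch`, the type `RegionRange`, and the region names are hypothetical, and the sketch assumes the usual HBase conventions that a region covers [startKey, endKey), a scan covers [startRow, stopRow), and an empty key means "unbounded".

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: pick the regions of a snapshot that a scan has to read.
final class SnapshotSplitSketch {

    /** Key range of one region: [startKey, endKey). An empty key means "unbounded". */
    static final class RegionRange {
        final String name;
        final byte[] startKey;
        final byte[] endKey;

        RegionRange(String name, String startKey, String endKey) {
            this.name = name;
            this.startKey = bytes(startKey);
            this.endKey = bytes(endKey);
        }
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Unsigned lexicographic comparison of two row keys. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** True if the region's range intersects the scan range [scanStart, scanStop). */
    static boolean overlaps(RegionRange r, byte[] scanStart, byte[] scanStop) {
        boolean startsBeforeScanStop = scanStop.length == 0 || compare(r.startKey, scanStop) < 0;
        boolean endsAfterScanStart = r.endKey.length == 0 || compare(r.endKey, scanStart) > 0;
        return startsBeforeScanStop && endsAfterScanStart;
    }

    /** One split (here just the region name) per region inside the scan range. */
    static List<String> planSplits(List<RegionRange> regions, byte[] scanStart, byte[] scanStop) {
        List<String> splits = new ArrayList<>();
        for (RegionRange r : regions) {
            if (overlaps(r, scanStart, scanStop)) {
                splits.add(r.name);
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        List<RegionRange> regions = Arrays.asList(
                new RegionRange("r1", "", "g"),
                new RegionRange("r2", "g", "p"),
                new RegionRange("r3", "p", ""));
        System.out.println(planSplits(regions, bytes("h"), bytes("k"))); // prints [r2]
        System.out.println(planSplits(regions, bytes(""), bytes("")));   // prints [r1, r2, r3]
    }
}
```

A scan restricted to a narrow row range therefore produces fewer splits, and hence fewer map tasks, than a full table scan.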


Page 6
API


Page 7
Timeline
• Will (hopefully) be committed to trunk next week or so
• Interest in bringing this to 0.94 and 0.96 bases as well
• Will come in HDP-2.1, which will be based on the 0.96 line


Page 8
Security Aspects
• The HBase user owns the files in the filesystem
• Snapshot files are also owned by the HBase user
• The MapReduce job must be able to read the files in the snapshot plus the actual data files
• HDFS only has POSIX-like permissions based on user/group/other
– The user running the MR job has to be either the HBase user or have group permissions
– HDFS does not have ACLs, so there is no easy way to grant read access at the filesystem layer

• Idea: similar to the current short-circuit implementation, we can implement an FD (file descriptor) transfer
– The user will submit jobs under her own user credentials
– Ask HBase daemons to open the files, and pass a handle / token
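
The user/group/other constraint described above can be illustrated with a small, self-contained Java sketch. It is a simplified model of a POSIX-style read check, not HDFS or HBase code: the class `SnapshotReadAccessSketch`, the method `canRead`, and the user and group names are all hypothetical, and the mode string is only an example.

```java
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a POSIX-style read check (owner class, then group, then other).
final class SnapshotReadAccessSketch {

    static boolean canRead(String fileOwner, String fileGroup, Set<PosixFilePermission> perms,
                           String user, Set<String> userGroups) {
        if (user.equals(fileOwner)) {
            return perms.contains(PosixFilePermission.OWNER_READ);
        }
        if (userGroups.contains(fileGroup)) {
            return perms.contains(PosixFilePermission.GROUP_READ);
        }
        return perms.contains(PosixFilePermission.OTHERS_READ);
    }

    public static void main(String[] args) {
        // Example mode: readable by owner and group only.
        Set<PosixFilePermission> perms = PosixFilePermissions.fromString("rw-r-----");

        Set<String> analysts = new HashSet<>(Arrays.asList("analysts"));
        Set<String> analystsAndHbase = new HashSet<>(Arrays.asList("analysts", "hbase"));

        // The HBase user owns the file and can read it.
        System.out.println(canRead("hbase", "hbase", perms, "hbase", new HashSet<String>()));
        // A job submitted by another user outside the group cannot.
        System.out.println(canRead("hbase", "hbase", perms, "alice", analysts));
        // The same user can once she is a member of the file's group.
        System.out.println(canRead("hbase", "hbase", perms, "alice", analystsAndHbase));
    }
}
```

Under this model the only ways to let a job read snapshot files are to run it as the HBase user or to add the submitting user to the group, which is the limitation the FD-transfer idea is meant to avoid.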


Page 9
Performance
ScanTest:
• Scan: open a scanner, do full table scan
• SnapshotScan: open a client-side scanner, do full table scan
• ScanMR: parallel full table scan from MR
• SnapshotScanMR: do full table scan

• 8 region servers, 6 disks each
• HBase trunk
• Hadoop-2.2 (HDP-2.0.7.0-12)
• Load data with IntegrationTestBulkLoad
– Evenly distributed rows, created as bulk-loaded hfiles. 3 column families

• # store files per region varies: 3, 6, 9, and 12 (1, 2, 3, 4 files per store)
• Data sizes: 6.6G, 13.2G, 19.8G, 26.4G


Page 10
Scan speed


Page 11
API
• We do not want to limit snapshot scanning only to MapReduce
• Allow client side scanners over snapshot files


Page 12
ResultScanner is the main scan API


Page 13
API (caution: not final yet)


Page 14
To the future and beyond
• HBASE-8691 High-Throughput Streaming Scan API
• Can we bypass regionservers without taking snapshots?
• Bypass memstore data, or stream memstore data, but read directly from
hfiles
• Secure reading from snapshots
• Keep up with the updates at
– https://issues.apache.org/jira/browse/HBASE-8369


Page 15
Thanks
Questions?
Enis Söztutar
enis [ at ] apache [dot] org
@enissoz


Page 16
