HBase New Features
Richard Xu
rxu@hortonworks.com
Toronto Hadoop User Group
Today’s Agenda
• Background & Objectives
• Review HBase and NoSQL
• HBase HA
• HBase Off-Heap
• HBase on YARN
• HBase Security
• HBase 1.0
Page 2
Background & Objectives
• Have been working on HBase since 2011
• Add-on to the HBase talk by Adam Muise on Sep 17,
2013
Page 3
HBase HA
Timeline-Consistent High Availability for HBase
Page 4
Data Assignment in HBase Classic
Page 5
Data is range partitioned and each key belongs to exactly one RegionServer
[Diagram] Keys within an HBase table are divided among different RegionServers.
Data Assignment with HBase HA
Page 6
Each key has a primary RegionServer and a backup RegionServer
[Diagram] Keys within an HBase table are divided among different RegionServers.
Differences between Primary and Standby
• Primary:
–Handles both reads and writes.
–“Owns” the data and has the latest value.
• Standby:
–Handles only reads.
–Data may be stale to some degree.
–When data is read from Standby it is marked as potentially stale.
Page 7
HBase HA: Warm Standby RegionServers
Redundant RegionServers provide read availability with
near zero downtime during failures.
Page 8
[Diagram] A client reads from RS 1 (the primary), with RS 1* as one (or more) standby RegionServers; data is replicated via HDFS.
HBase HA Delivered in 2 Phases
Page 9
HBase HA Phase 1:
• Standby RegionServers.
• Primary RegionServers configured to flush every 5 minutes or less.
• Standbys serve reads in < 5 s; data at most 5 minutes stale.
• Write-Ahead Log per RegionServer.
HBase HA Phase 2:
• Active WAL tailing in standby RegionServers.
• Standbys serve reads in under 1 s; stale reads mostly eliminated.
• Faster recovery of failed RegionServers.
Note: HA covers read availability. Writes are still coordinated by primaries.
What is Timeline Consistency?
• All readers agree on the current value when it is read from the Primary.
• When reading from a Secondary, clients see all updates in the same order.
• Result:
–Eliminates different clients making decisions based on different data.
–Simplifies programming logic and complex corner cases versus eventual consistency.
–Lower latency than quorum-based strong consistency.
Page 10
Configuring HBase HA: Server Side
Page 11
<property>
<name>hbase.regionserver.storefile.refresh.period</name>
<value>0</value>
<description>
The period (in milliseconds) for refreshing the store files for the secondary
regions. 0 means this feature is disabled. Secondary regions see new files (from
flushes and compactions) from the primary once the secondary region refreshes its
list of files in the region (there is no notification mechanism), but too-frequent
refreshes might cause extra NameNode pressure.
</description>
</property>
<property>
<name>hbase.master.loadbalancer.class</name>
<value>org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer</value>
<description>
Only the StochasticLoadBalancer is supported when using region replicas.
</description>
</property>
Suggested value for refresh period = 300000 (300 seconds / 5 minutes)
Leads to max data staleness of about 5 minutes.
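Region replicas themselves are enabled per table rather than globally. A minimal HBase shell sketch (the table and column-family names are illustrative):

# Create a table with one secondary replica per region.
create 'usertable', 'cf', {REGION_REPLICATION => 2}

# Or enable replicas on an existing table; in these releases the table
# typically has to be disabled while the attribute is changed.
disable 'usertable'
alter 'usertable', {REGION_REPLICATION => 2}
enable 'usertable'

With REGION_REPLICATION => 2 each region gets a primary and one secondary, which the StochasticLoadBalancer tries to place on different RegionServers.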
Configuring HBase HA: Client Side
Page 12
<property>
<name>hbase.ipc.client.allowsInterrupt</name>
<value>true</value>
<description>
Whether to enable interruption of RPC threads at the client side. This is required
for region replicas with fallback RPCs to secondary regions.
</description>
</property>
<property>
<name>hbase.client.primaryCallTimeout.get</name>
<value>10000</value>
<description>
The timeout (in microseconds) before secondary fallback RPCs are submitted for
get requests with Consistency.TIMELINE to the secondary replicas of the regions.
Defaults to 10 ms. Setting this lower will increase the number of RPCs, but will
lower the p99 latencies.
</description>
</property>
Reaching out to secondary RegionServers is a per-request option.
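On the request side, timeline reads are opt-in per Get or Scan. A minimal HBase shell sketch (table and row key are illustrative); the Java client equivalent is Get.setConsistency(Consistency.TIMELINE), with Result.isStale() reporting whether a secondary answered:

# Default: strongly consistent read served by the primary replica.
get 'usertable', 'row1'

# Opt in to timeline consistency: fall back to a secondary replica if the
# primary does not respond within hbase.client.primaryCallTimeout.get.
get 'usertable', 'row1', {CONSISTENCY => 'TIMELINE'}
scan 'usertable', {CONSISTENCY => 'TIMELINE', LIMIT => 10}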
HBase Off-Heap
Low Latency Access to Big Data
Page 13
Off-Heap Support
• Using off-heap memory allows RegionServers to scale beyond the traditional 16GB heap barrier.
• Benefits:
–Eliminates latency hiccups related to garbage collection pauses.
–Makes it easier to run HBase on servers with large RAM.
–Certified up to 96GB off-heap memory in one RegionServer.
Page 14
[Diagram] RegionServer process memory: on-heap memory is managed by JVM garbage collection; off-heap memory is managed by HBase.
HBase Off-Heap Reduces Latency
Page 15
Latency Comparison: On-Heap versus Off-Heap
Point Lookups, 400GB Dataset, 75% of data in memory, 50 concurrent clients
Latency (ms)              Avg    Median  95%  99%  99.9%  Max
On-Heap                   4.485  3       12   19   29     610
Off-Heap (BucketCache)    4.458  3       11   18   27     134
Fast Access to Big Data with Off-Heap
Page 16
1.3 4.1
14.3
20.0
27.7
38.7
265.0
0.0
50.0
100.0
150.0
200.0
250.0
300.0
Median Average 95% 99% 99.90% 99.99% 99.999%
Latency Measures using Off-Heap
Point Lookups, 3TB Dataset, 100% of data in memory
Latency (ms) 50 concurrent clients
Throughput = 1095 reqs/s
More to come…
• SlabCache, BucketCache (HBASE-7404)
• HBASE-9535: network interface
• HBASE-10191: new read/write pipeline with end-to-end off-heap
Page 17
Hive over HBase Snapshots
Page 18
Analytics over HBase Snapshots
• What is it?
• Introduces the ability to run Hive queries over HBase snapshots.
• Why is this important?
• Hive can access the data directly from disk rather than over the network.
• More performant and less disruptive to other HBase clients.
• When to use it?
• Use this feature when you have full-table scans over all data in HBase.
• Not appropriate for analytics of small subsets of data in HBase.
• Note:
• Snapshot data may not be the latest.
• Tradeoff between performance and data freshness.
Hive over HBase Snapshot: About 2.5x Faster
Query                 Run  Workload  Snapshot Time (s)  Direct Time (s)  Speedup
count(*)              1    a         191.019            488.915          2.56x
count(*)              2    a         200.641            480.837          2.40x
Aggregate 1 field     1    a         214.452            499.304          2.33x
Aggregate 1 field     2    a         217.744            500.07           2.30x
Aggregate 9 fields    1    a         281.514            802.799          2.85x
Aggregate 9 fields    2    a         272.358            785.816          2.89x
Aggregate 1 with GBY  1    a         248.874            558.143          2.24x
Aggregate 1 with GBY  2    a         269.658            533.562          1.98x
count(*)              1    b         194.739            482.261          2.48x
count(*)              2    b         195.178            481.437          2.47x
Aggregate 1 field     1    b         220.325            498.956          2.26x
Aggregate 1 field     2    b         227.117            489.27           2.15x
Aggregate 9 fields    1    b         276.939            817.118          2.95x
Aggregate 9 fields    2    b         290.288            876.753          3.02x
Aggregate 1 with GBY  1    b         244.025            563.884          2.31x
Aggregate 1 with GBY  2    b         225.431            570.723          2.53x
count(*)              1    c         194.568            502.79           2.58x
count(*)              2    c         205.418            508.319          2.47x
Aggregate 1 field     1    c         209.709            531.39           2.53x
Aggregate 1 field     2    c         217.551            526.878          2.42x
Aggregate 9 fields    1    c         267.93             756.476          2.82x
Aggregate 9 fields    2    c         273.107            723.459          2.65x
Aggregate 1 with GBY  1    c         240.991            526.053          2.18x
Aggregate 1 with GBY  2    c         258.06             527.845          2.05x
Test Scenario:
• YCSB Data Load.
• 180 million rows.
• 20 node cluster, 6 disk/node, 10GB net.
• Query run while simultaneously
running a YCSB workload.
• Direct time = query via HBase API.
• Snapshot time = query by reading
snapshot.
• Query over snapshot ~2.5x faster.
Analytics over HBase Snapshot: Usage Patterns
• Co-located Analytics: Tez / MR jobs read snapshots on the same cluster that serves HBase clients.
• Operational Reporting: HBase cluster 1 replicates to HBase cluster 2, and Tez / MR jobs read snapshots on cluster 2. Better for strict SLAs.
Note: If using co-located analytics, consider tuning these values:
• hbase.client.retries.number
• hbase.rpc.timeout
• zookeeper.session.timeout
• zookeeper.recovery.retry
Analytics over HBase Snapshots: Example
# Create a snapshot in the HBase Shell or via API.
snapshot 'usertable', 'snapshot_2014_08_03'
# Refer to the same snapshot in the Hive shell.
set hive.hbase.snapshot.name=snapshot_2014_08_03;
set hive.hbase.snapshot.restoredir=/tmp/restore;
select count(*) from hbase_table;
# You can "unset hive.hbase.snapshot.name" to stop using the snapshot.
Note: Be sure to delete your snapshots after you’re done with them.
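For the cleanup step noted above, a minimal HBase shell sketch using the same snapshot name as the example:

# List existing snapshots, then remove the one created above.
list_snapshots
delete_snapshot 'snapshot_2014_08_03'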
HBase on YARN using Apache Slider
Page 23
Deploying HBase with Slider
• What is it?
• Deploy HBase into the Hadoop cluster using YARN.
Benefits:
• Simplified Deployment: No need to deploy HBase or its configuration to individual cluster nodes.
• Lifecycle Management: Start / stop / process management handled automatically.
• Multitenancy: Different users can run HBase clusters within one Hadoop cluster.
• Multiple Versions: Run different versions of HBase (e.g. 0.98 and 1.0) on the same cluster.
• Elasticity: Cluster size is a parameter and easily changed.
• Co-located Analytics: HBase resource usage is known to YARN, so nodes running HBase will not be used as heavily to satisfy MapReduce or Tez jobs.
Demo
Page 25
HBase Security
Page 26
HBase Cell Level Security
• Table/Column Family ACLs since 0.92
• HBASE-6222: per-KeyValue (cell-level) security since 0.98
• APIs stable as of 1.0
Page 27
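As a rough sketch of the per-cell model, here is the visibility-labels flavor of HBASE-6222 in the HBase shell (table, user, and label names are illustrative; this assumes the VisibilityController coprocessor is enabled on the cluster):

# Define labels and grant an authorization to a user.
add_labels ['secret', 'public']
set_auths 'bob', ['public']

# Write a cell that only readers holding the 'secret' label can see.
put 'usertable', 'row1', 'cf:q1', 'value1', {VISIBILITY => 'secret'}

# Read with explicit authorizations; 'public' alone will not return the cell above.
get 'usertable', 'row1', {AUTHORIZATIONS => ['public']}

Per-cell ACLs, the other half of HBASE-6222, work similarly by attaching an ACL to each mutation instead of a visibility expression.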
Security in Hadoop with HDP + XA Secure
Four pillars:
• Authentication: Who am I / prove it?
• Authorization: Restrict access to explicit data.
• Audit: Understand who did what.
• Data Protection: Encrypt data at rest & in motion.
HDP 2.1:
• Authentication: Kerberos in native Apache Hadoop; HTTP/REST API secured with Apache Knox Gateway.
• Authorization: MapReduce Access Control Lists; HDFS permissions and HDFS ACLs; Hive ATZ-NG; cell-level access control in Apache Accumulo.
• Audit: Audit logs in HDFS & MR.
• Data Protection: Wire encryption in Hadoop; orchestrated encryption with 3rd-party tools.
XA Secure (Centralized Security Administration):
• Authentication: As-is, works with current authentication methods.
• Authorization: RBAC for HDFS, Hive & HBase.
• Audit: Centralized audit reporting; policy and access history.
• Data Protection: Future roadmap; strategy to be finalized.
XA Secure Integration with Hadoop
[Architecture diagram] Enterprise users and legacy tools reach XA Secure through the XA Administration Portal and an Integration API; the XA Policy Server and XA Audit Server are backed by an RDBMS, HDFS, and Search. XA Agents are embedded in Hadoop components such as HDFS, HBase, and HiveServer2, with additional agents (for example Falcon and Storm) marked as future integrations (*), all running on YARN, the Data Operating System.
Simplified Workflow - HBase
Page 30
1. An admin sets policies for HBase table/cf/column in the XA Policy Manager.
2. Users access HBase: a data scientist runs a MapReduce job, IT users access HBase via the HBase shell, and user applications access HBase data using the Java API.
3. HBase authorizes the request with the XA Agent.
4. The HBase server provides data access to the users.
5. Audit logs are pushed to the audit database.
HBase 1.0 major changes
Page 31
Stability: Co-Locate Meta with Master
(HBASE-10569)
• Simplify and improve region assignment reliability
– Fewer components involved in updating “truth”. (ZK-less region assignment,
HBASE-11059)
• Master embeds a RegionServer
– Will host only system tables
– Baby step towards combining RS/Master into a single HBase daemon
• Backup masters unchanged
– Can be configured to host user tables while in standby
• Plumbing is all there, OFF by default
– Jira: HBASE-10569.
Availability: Region Replicas
• Multiple RegionServers host a Region
– One is primary, others are replicas
– Only primary accepts writes
• Client reads against primary only or any
– Results marked as appropriate
• Baby step towards quorum reads, writes
• Plumbing is all there, OFF by default
– Jira: HBASE-10070.
New and Noteworthy
• Client API cleanup: jira HBASE-10602
• Automatic tuning of global MemStore and BlockCache
sizes
• BucketCache easier to configure
• Compressed BlockCache
• Pluggable replication endpoint
• A Dockerfile to easily build and run HBase from source
…
Under the Covers
• Zookeeper abstractions
• Meta table used for assignment
• Cell-based read/write path
• Combining mvcc/seqid
• Sundry security, tags, labels improvements
Groundwork for 2.0
• More, Smaller Regions
– Millions, 1G or less (HBASE-11165)
– Less write amplification
– Splitting hbase:meta
• Performance
– More off-heap
– Less resource contention
– Faster region failover/recovery
– Multiple WALs
– QoS/Quotas/Multi-tenancy
• Rigging
– Faster, more intelligent assignment
– Procedure bus (HBASE-12439)
– Resumable, query-able operations
• Other possibilities
– Quorum/consensus reads, writes?
– Hydrabase, multi-DC consensus?
– Streaming RPCs?
– High level coprocessor API?
References
• Enis Soztutar: HBase Read High Availability Using Timeline-Consistent Region Replicas
• Nick Dimiduk: Apache HBase 1.0 Release
• …
Page 37


Editor's notes
1. Create a replicated table; insert some data into it; flush it; kill the primary; attempt a write (it fails); read a value.
2. Test performed using 6 AWS nodes (i2.8xlarge) + 5 client nodes (m2.4xlarge).
3. Test performed using 6 AWS nodes (i2.8xlarge) + 5 client nodes (m2.4xlarge).