Prepared by,
Vetri.V
WHAT IS HBASE?
 HBase is a database: the Hadoop database. It is indexed by rowkey, column key, and
timestamp.
 HBase stores structured and semistructured data naturally so you can load it with
tweets and parsed log files and a catalog of all your products right along with their
customer reviews.
 It can store unstructured data too, as long as it’s not too large.
 HBase is designed to run on a cluster of computers instead of a single computer. The
cluster can be built using commodity hardware; HBase scales horizontally as you
add more machines to the cluster.
 Each node in the cluster provides a bit of storage, a bit of cache, and a bit of
computation as well. This makes HBase incredibly flexible and forgiving. No node is
unique, so if one of those machines breaks down, you simply replace it with another.
 This adds up to a powerful, scalable approach to data that, until now, hasn’t been
commonly available to mere mortals.
HBASE DATA MODEL:
HBase data model - These six concepts form the foundation of HBase.
Table:
 HBase organizes data into tables. Table names are Strings and composed of
characters that are safe for use in a file system path.
Row:
 Within a table, data is stored according to its row. Rows are identified uniquely by
their rowkey. Rowkeys don’t have a data type and are always treated as a
byte[].
Column family:
 Data within a row is grouped by column family. Column families also impact the
physical arrangement of data stored in HBase.
 For this reason, they must be defined up front and aren’t easily modified. Every row
in a table has the same column families, although a row need not store data in all its
families. Column family names are Strings and composed of characters that are safe
for use in a file system path.
Column qualifier:
 Data within a column family is addressed via its column qualifier, or column. Column
qualifiers need not be specified in advance. Column qualifiers
need not be consistent between rows.
 Like rowkeys, column qualifiers don’t have a data type and are always treated as a
byte[].
Cell:
 A combination of rowkey, column family, and column qualifier uniquely identifies a
cell. The data stored in a cell is referred to as that cell’s value. Values
also don’t have a data type and are always treated as a byte[].
Version:
 Values within a cell are versioned. Versions are identified by their timestamp, a long.
When a version isn’t specified, the current timestamp is used as the
basis for the operation. The number of cell value versions retained by HBase is
configured via the column family. The default number of cell versions is three.
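The six concepts above compose into one logical structure: a sorted map keyed by (rowkey, family, qualifier, timestamp). A minimal Python sketch of that model, assuming a toy `ToyTable` class (this is illustrative only, not the HBase client API):

```python
# Hypothetical sketch of HBase's logical data model: a cell is identified by
# (rowkey, family, qualifier, timestamp), and only the newest N versions are kept.
from collections import defaultdict

class ToyTable:
    def __init__(self, max_versions=3):          # HBase's default is 3 versions
        self.max_versions = max_versions
        # rowkey -> (family, qualifier) -> list of (timestamp, value), newest first
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, rowkey: bytes, family: str, qualifier: bytes,
            value: bytes, ts: int):
        versions = self.rows[rowkey][(family, qualifier)]
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)   # newest first
        del versions[self.max_versions:]                    # drop excess versions

    def get(self, rowkey: bytes, family: str, qualifier: bytes):
        versions = self.rows[rowkey].get((family, qualifier), [])
        return versions[0][1] if versions else None         # newest value wins

t = ToyTable()
t.put(b"row1", "cf", b"msg", b"v1", ts=100)
t.put(b"row1", "cf", b"msg", b"v2", ts=200)
print(t.get(b"row1", "cf", b"msg"))   # -> b'v2'
```

Note how rowkeys, qualifiers, and values are all `bytes`, matching the byte[] treatment described above.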
HBase Architecture
HBase Tables and Regions
A table is made up of any number of regions.
A region is specified by its startKey and endKey.
 Empty table: (Table, NULL, NULL)
 Two-region table: (Table, NULL, “com.ABC.www”) and (Table, “com.ABC.www”,
NULL)
Each region may live on a different node and is made up of several HDFS files and blocks,
each of which is replicated by Hadoop.
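The (Table, startKey, endKey) scheme above means a rowkey is served by exactly one region. A small sketch, assuming the two-region example table and treating `None` as the NULL open end (illustrative, not HBase's actual META lookup):

```python
# Hypothetical sketch of region location: each region covers the half-open
# key range [startKey, endKey); None stands for the NULL open ends.
regions = [
    (None, b"com.ABC.www"),        # first region: NULL .. com.ABC.www
    (b"com.ABC.www", None),        # second region: com.ABC.www .. NULL
]

def find_region(rowkey: bytes):
    for start, end in regions:
        if (start is None or rowkey >= start) and (end is None or rowkey < end):
            return (start, end)
    return None

print(find_region(b"com.AAA.www"))   # -> (None, b'com.ABC.www')
print(find_region(b"com.XYZ.www"))   # -> (b'com.ABC.www', None)
```

Because tables are sorted lexicographically by rowkey, a range scan touches only the regions whose key ranges overlap the scan.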
HBase Tables:-
 Tables are sorted by Row in lexicographical order
 Table schema only defines its column families
 Each family consists of any number of columns
 Each column consists of any number of versions
 Columns only exist when inserted, NULLs are free
 Columns within a family are sorted and stored together
 Everything except table names are byte[]
 HBase table format: (Row, Family:Column, Timestamp) -> Value
HBase uses HDFS as its reliable storage layer. It handles checksums, replication, and failover.
HBase consists of:
 Java API, Gateway for REST, Thrift, Avro
 Master manages cluster
 RegionServers manage data
 ZooKeeper is used as the “neural network” and coordinates the cluster
Data is stored in memory and flushed to disk at regular intervals or based on size
 Small flushes are merged in the background to keep number of files small
 Reads check the memory store first, then disk-based files second
 Deletes are handled with “tombstone” markers
MemStores:-
After data is written to the WAL, the RegionServer saves KeyValues in the memory store
 Flushed to disk based on size (hbase.hregion.memstore.flush.size)
 Default size is 64 MB
 Uses snapshot mechanism to write flush to disk while still serving from it and
accepting new data at the same time
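The flush trigger can be sketched in a few lines of Python. This is a deliberately toy model (bytes instead of megabytes, a plain dict instead of a skip list); the names `memstore`, `store_files`, and `FLUSH_SIZE` are assumptions for illustration:

```python
# Hypothetical sketch of the MemStore flush trigger: writes accumulate in
# memory, and once the size threshold is reached the current contents are
# snapshotted into an immutable, sorted "store file".
FLUSH_SIZE = 64          # bytes here; the real default threshold is 64 MB

memstore = {}            # active in-memory store
store_files = []         # flushed, immutable files (sorted by key)

def flush():
    # snapshot: the current memstore becomes an immutable sorted file,
    # while a fresh memstore keeps accepting new writes
    store_files.append(dict(sorted(memstore.items())))
    memstore.clear()

def put(key: bytes, value: bytes):
    memstore[key] = value
    if sum(len(k) + len(v) for k, v in memstore.items()) >= FLUSH_SIZE:
        flush()

put(b"k1", b"x" * 30)    # stays in memory (32 bytes < 64)
put(b"k2", b"x" * 30)    # crosses the threshold -> flushed to a store file
print(len(store_files), len(memstore))   # -> 1 0
```

A read would consult `memstore` first and fall back to `store_files`, matching the read path described above.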
Compactions:-
Two types: Minor and Major Compactions
Minor Compactions
 Combine last “few” flushes
 Triggered by number of storage files
Major Compactions
 Rewrite all storage files
 Drop deleted data and those values exceeding TTL and/or number of versions
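A major compaction can be sketched as merging every store file and filtering out tombstoned and expired values. The `TOMBSTONE` marker and `(timestamp, value)` file layout here are illustrative assumptions, not HBase's on-disk format:

```python
# Hypothetical sketch of a major compaction: merge all store files into one,
# dropping tombstoned keys and values older than the column family's TTL.
TOMBSTONE = object()     # stands in for HBase's delete ("tombstone") marker

def major_compact(store_files, ttl_seconds, now):
    merged = {}
    for f in store_files:                 # files listed oldest first,
        merged.update(f)                  # so newer entries overwrite older ones
    return {
        k: (ts, v) for k, (ts, v) in merged.items()
        if v is not TOMBSTONE and now - ts <= ttl_seconds
    }

files = [
    {b"a": (100, b"old"), b"b": (100, b"keep")},
    {b"a": (200, TOMBSTONE)},             # delete marker shadows 'a'
]
print(major_compact(files, ttl_seconds=3600, now=500))
# -> {b'b': (100, b'keep')}
```

A minor compaction would merge only the last few files and keep tombstones; only a major compaction can actually drop deleted and expired data.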
Key Cardinality:-
The best performance is gained from using selective row keys:
 Time range bound reads can skip store files
 So can Bloom Filters
 Selecting column families reduces the amount of data to be scanned
Fold, Store, and Shift:-
All values are stored with their full coordinates, including: Row Key, Column Family, Column
Qualifier, and Timestamp
 Folds columns into “row per column”
 NULLs are cost free as nothing is stored
 Versions are multiple “rows” in folded table
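The "fold" can be shown directly: a logical row with several columns becomes several self-describing entries, and an absent column produces nothing at all. A sketch, with the `fold` helper being a hypothetical name:

```python
# Hypothetical sketch of "fold": every value carries its full coordinates,
# so one logical row becomes one entry per stored column. Absent columns
# simply have no entry, which is why NULLs cost nothing.
row = {"info:name": b"alice", "info:email": b"a@x.com"}   # logical row

def fold(rowkey: bytes, cells: dict, ts: int):
    result = []
    for family_qual, value in cells.items():
        family, _, qualifier = family_qual.partition(":")
        result.append((rowkey, family, qualifier, ts, value))
    return result

for kv in fold(b"user1", row, ts=1000):
    print(kv)        # (rowkey, family, qualifier, timestamp, value)
```

Multiple versions of the same cell would appear as additional folded entries differing only in timestamp.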
DDI:-
Stands for Denormalization, Duplication, and Intelligent Keys
Block Cache
Region Splits
HBase shell and commands
HBase Install
$ mkdir hbase-install
$ cd hbase-install
$ wget http://apache.claz.org/hbase/hbase-0.92.1/hbase-0.92.1.tar.gz
$ tar xvfz hbase-0.92.1.tar.gz
$HBASE_HOME/bin/start-hbase.sh
Configuration changes in HBase
 Go to hbase-env.sh
 Edit JAVA_HOME
 Next go to hbase-site.xml and edit the following:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://eattributes:54310/hbase</value>
<description>The directory shared by region servers.
Should be fully-qualified to include the filesystem to use.
E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR
</description>
</property>
<!--
<property>
<name>hbase.master</name>
<value>master:60000</value>
<description>The host and port that the HBase master runs at.
</description>
</property>
-->
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
<description>The mode the cluster will be in: false for standalone mode,
true for fully-distributed mode.</description>
</property>
</configuration>
Starting the HBase shell:
$ hbase shell
hbase(main):001:0> list
TABLE
0 row(s) in 0.5710 seconds
General HBase shell commands:
 Show cluster status. Can be 'summary', 'simple', or 'detailed'. The
default is 'summary'.
hbase> status
hbase> status 'simple'
hbase> status 'summary'
hbase> status 'detailed'
hbase> version
hbase> whoami
Tables Management commands:
Create a table
hbase(main):002:0> create 'mytable', 'cf'
hbase(main):003:0> list
TABLE
mytable
1 row(s) in 0.0080 seconds
WRITING DATA
hbase(main):004:0> put 'mytable', 'first', 'cf:message', 'hello HBase'
READING DATA
hbase(main):007:0> get 'mytable', 'first'
hbase(main):008:0> scan 'mytable'
Describe a table
hbase(main):003:0> describe 'users'
DESCRIPTION ENABLED
{NAME => 'users', FAMILIES => [{NAME => 'info',                    true
BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE',
VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536',
IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
1 row(s) in 0.0330 seconds
Disable:
hbase> disable 'users'
Disable_all:
Disable all tables matching the given regex:
hbase> disable_all 'users.*'
Is_disabled:
Verifies whether the named table is disabled:
hbase> is_disabled 'users'
Drop:
Drop the named table. The table must first be disabled:
hbase> drop 'users'
drop_all:
Drop all tables matching the given regex:
hbase> drop_all 'users.*'
Enable:
hbase> enable 'users'
enable_all:
hbase> enable_all 'users.*'
is_enabled:
hbase> is_enabled 'users'
exists:
hbase> exists 'users'
list:
hbase> list
hbase> list 'abc.*'
show_filters:
Show all the filters in hbase.
Count:
 Count the number of rows in a table. Return value is the number of rows.
This operation may take a LONG time (Run ‘$HADOOP_HOME/bin/hadoop jar
hbase.jar rowcount’ to run a counting mapreduce job).
 Current count is shown every 1000 rows by default. The count interval may be
optionally specified. Scan caching is enabled on count scans by default. The default
cache size is 10 rows. If your rows are small, you may want to increase this
parameter. Examples:
hbase> count 'users'
hbase> count 'users', INTERVAL => 100000
hbase> count 'users', CACHE => 1000
hbase> count 'users', INTERVAL => 10, CACHE => 1000
Put:
hbase> put 'users', 'r1', 'c1', 'value', ts1
Configurable block size
hbase(main):002:0> create 'mytable',{NAME => 'colfam1', BLOCKSIZE => '65536'}
Block cache:
 Some workloads don’t benefit from putting data into a read cache: for instance, if a
certain table or column family is only accessed by sequential scans, or
isn’t accessed much and you don’t care whether Gets or Scans take a little longer.
 By default, the block cache is enabled. You can disable it at table
creation or by altering the table:
hbase(main):002:0> create 'mytable', {NAME => 'colfam1', BLOCKCACHE =>
'false'}
Aggressive caching:
 You can choose some column families to have a higher priority in the block
cache (LRU cache).
 This comes in handy if you expect more random reads on one column
family compared to another. This configuration is also done at
table-instantiation time:
hbase(main):002:0> create 'mytable',
{NAME => 'colfam1', IN_MEMORY => 'true'}
The default value for the IN_MEMORY parameter is false.
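The priority behavior above sits on top of an LRU block cache: recently read blocks stay resident, and the least recently used block is evicted when the cache fills. A minimal sketch of that eviction policy, assuming a toy `LruBlockCache` class (illustrative only, not HBase's actual multi-priority implementation):

```python
# Hypothetical sketch of LRU block-cache eviction using an ordered dict.
from collections import OrderedDict

class LruBlockCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = OrderedDict()          # insertion order = recency order

    def get(self, block_id):
        if block_id not in self.blocks:
            return None                      # cache miss: read from disk
        self.blocks.move_to_end(block_id)    # mark as most recently used
        return self.blocks[block_id]

    def put(self, block_id, data):
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used

cache = LruBlockCache(capacity=2)
cache.put("b1", b"...")
cache.put("b2", b"...")
cache.get("b1")            # touch b1, so b2 is now least recently used
cache.put("b3", b"...")    # capacity exceeded: b2 is evicted
print(cache.get("b2"))     # -> None
```

IN_MEMORY in HBase effectively gives a family's blocks a higher-priority slot in this kind of cache; it does not pin data in memory unconditionally.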
Bloom filters:
hbase(main):007:0> create 'mytable',{NAME => 'colfam1', BLOOMFILTER =>
'ROWCOL'}
 The default value for the BLOOMFILTER parameter is NONE.
 A row-level bloom filter is enabled with ROW, and a qualifier-level bloom filter is
enabled with ROWCOL.
 The row-level bloom filter checks for the non-existence of the particular rowkey in
the block, and the qualifier-level bloom filter checks for the non-existence of the row
and column qualifier combination.
 The overhead of the ROWCOL bloom filter is higher than that of the ROW bloom
filter.
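The reason bloom filters let reads skip store files is that they can answer "definitely not present" from a compact bit array, with only false positives possible. A toy sketch of the idea, with `ToyBloom` and its parameters being hypothetical (HBase's actual implementation differs):

```python
# Hypothetical sketch of a bloom filter: k hash positions per key in a bit
# array. A zero bit at any position means the key was definitely never added.
import hashlib

class ToyBloom:
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, key: bytes):
        for i in range(self.hashes):
            h = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(h[:4], "big") % self.bits

    def add(self, key: bytes):
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.array[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = ToyBloom()
bf.add(b"row-42")
print(bf.might_contain(b"row-42"))      # -> True
print(bf.might_contain(b"row-999"))     # almost certainly False
```

This also shows why ROWCOL costs more than ROW: adding one entry per (row, qualifier) pair instead of one per row means many more bits set for the same data.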
TTL (Time To Live):
 You can set the TTL while creating the table like this:
hbase(main):002:0> create 'mytable', {NAME => 'colfam1', TTL => '18000'}
This command sets the TTL on the column family colfam1 as 18,000 seconds = 5
hours. Data in colfam1 that is older than 5 hours is deleted during the next major
compaction.
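The TTL arithmetic from the example above can be checked directly; the `expired` helper is a hypothetical name for the test a major compaction effectively applies per cell:

```python
# Sanity check of the TTL example: 18,000 seconds is 5 hours.
ttl_seconds = 18000
print(ttl_seconds / 3600)   # -> 5.0

# Hypothetical per-cell expiry test applied during a major compaction.
def expired(cell_ts: float, now: float, ttl: float) -> bool:
    return now - cell_ts > ttl

print(expired(cell_ts=0, now=18001, ttl=18000))   # -> True, older than 5 hours
print(expired(cell_ts=0, now=18000, ttl=18000))   # -> False, exactly at the limit
```

Note that expired cells are not removed immediately; they disappear when the next major compaction rewrites the store files.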
Compression
 You can enable compression on a column family when creating tables like this:
hbase(main):002:0> create 'mytable',
{NAME => 'colfam1', COMPRESSION => 'SNAPPY'}
Note that data is compressed only on disk. It’s kept uncompressed in memory
(Mem-Store or block cache) or while transferring over the network.
Cell versioning:
 Versions are also configurable at a column family level and can be specified at
the time of table instantiation:
hbase(main):002:0> create 'mytable', {NAME => 'colfam1', VERSIONS => 1}
hbase(main):002:0> create 'mytable',
{NAME => 'colfam1', VERSIONS => 1, TTL => '18000'}
hbase(main):002:0> create 'mytable', {NAME => 'colfam1', VERSIONS => 5,
MIN_VERSIONS => '1'}
Description of a table:
hbase(main):004:0> describe 'follows'
DESCRIPTION ENABLED
{NAME => 'follows', coprocessor$1 => 'file:///U                    true
sers/ndimiduk/repos/hbaseia/twitbase/target/twitbase-
1.0.0.jar|HBaseIA.TwitBase.coprocessors.FollowsObserver|1001|',
FAMILIES => [{NAME => 'f', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0',
VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0',
TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
BLOCKCACHE => 'true'}]}
1 row(s) in 0.0330 seconds
Tuning HBase:
hbase(main):003:0> help 'status'
SPLITTING TABLES:
hbase(main):019:0> split 'mytable', 'G'
Alter table
hbase(main):020:0> alter 't', NAME => 'f', VERSIONS => 1
TRUNCATING TABLES:
hbase(main):023:0> truncate 't'
Truncating 't' table (it may take a while):
- Disabling table...
- Dropping table...
- Creating table...
0 row(s) in 14.3190 seconds
THANK YOU…