Basic concepts for implementing Create, Read, Update, and Delete (CRUD) operations on HBase through the Java API.
3. Intro
A rowkey uniquely identifies each row in an HBase table, whereas the other
coordinates, such as column family, column qualifier, and timestamp, are used to
locate a piece of data within that row. The HBase API provides the following
operations to support CRUD:
• Put
• Get
• Delete
• Scan
• Increment
You can find the source code for this presentation on GitHub:
https://github.com/EugeneYushin/HBase-CRUD
4. Create
A new table is created in the ‘Enabled’ state. You can verify the table creation in Hue (Cloudera CDH 5.1.0) or in the HBase shell.
6. Insert
All table manipulations are performed through
HTableInterface. An HTable instance represents a
particular table in HBase.
The HTable class is not thread-safe, so concurrent
modifications through one instance are not safe.
Each thread in an application should therefore use
its own HTable instance. Multiple HTable instances
created with the same configuration reference can
share the same underlying HConnection instance.
The rowkey is the main point to consider when
designing the table structure. HBase stores rows
sorted by rowkey, so sequential keys tend to land in
the same region. Consider a compound rowkey that
starts with a hash (SHA-1 or MD5) and ends with a
reverse timestamp part.
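The compound rowkey idea above can be sketched in plain Java without any HBase dependency. This is a minimal sketch under assumptions: the class name CompoundRowKey, the helper names, and the exact layout (16 bytes of MD5 followed by 8 bytes of Long.MAX_VALUE minus the timestamp) are illustrative choices, not the scheme used in the presentation's repository.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: builds MD5(userId) followed by a reverse timestamp. */
final class CompoundRowKey {

    private CompoundRowKey() { }

    /** Returns 16 bytes of MD5(userId) plus 8 bytes of (Long.MAX_VALUE - timestampMillis). */
    static byte[] build(String userId, long timestampMillis) {
        byte[] hash = md5(userId.getBytes(StandardCharsets.UTF_8));
        // Assumed layout: newer timestamps yield smaller values, so they sort first.
        long reverseTs = Long.MAX_VALUE - timestampMillis;
        return ByteBuffer.allocate(hash.length + Long.BYTES)
                .put(hash)
                .putLong(reverseTs) // big-endian by default
                .array();
    }

    private static byte[] md5(byte[] input) {
        try {
            return MessageDigest.getInstance("MD5").digest(input);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Unsigned lexicographic byte comparison, the order in which HBase sorts rowkeys. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] older = build("mike", 1_000L);
        byte[] newer = build("mike", 2_000L);
        // For the same user, the newer event sorts before the older one.
        System.out.println(compare(newer, older) < 0); // prints "true"
    }
}
```

The hash prefix spreads different users across the key space, while the reverse timestamp keeps the most recent entries of one user at the start of that user's key range.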
7. Update
Data in HBase is versioned. By default the last 3 values written to a column are kept (HBase 0.96 and later keep only 1 by default).
Use the HColumnDescriptor.setMaxVersions(n) method to override this value.
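To make the versioning behaviour concrete, here is a toy in-memory model of a single cell that keeps at most n versions. It is not HBase code: VersionedCell and its methods are hypothetical names, and the model prunes old versions eagerly for simplicity, whereas HBase removes excess versions during compactions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

/** Toy model of one versioned cell; an illustration only, not part of the HBase API. */
final class VersionedCell {

    private final int maxVersions;
    // Sorted with the newest timestamp first.
    private final TreeMap<Long, String> versions =
            new TreeMap<>(Comparator.<Long>reverseOrder());

    VersionedCell(int maxVersions) {
        if (maxVersions < 1) {
            throw new IllegalArgumentException("maxVersions must be >= 1");
        }
        this.maxVersions = maxVersions;
    }

    /** An "update" adds a new version; the oldest versions beyond the limit are dropped. */
    void put(long timestamp, String value) {
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.pollLastEntry(); // last entry holds the oldest timestamp
        }
    }

    /** The value a plain read would see: the newest version, or null if empty. */
    String latest() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    /** All retained values, newest first. */
    List<String> all() {
        return new ArrayList<>(versions.values());
    }

    public static void main(String[] args) {
        VersionedCell mail = new VersionedCell(3);
        mail.put(1L, "a");
        mail.put(2L, "b");
        mail.put(3L, "c");
        mail.put(4L, "d");
        System.out.println(mail.all()); // prints "[d, c, b]"
    }
}
```

With a limit of 3, the fourth write pushes out the oldest value, and a read without an explicit version returns the newest one.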
9. Read – Table Scan
Table Scan...
PaulRK Paul paul01@mail.com
10. Read – Get Field
Get particular Field...
rowKey = MikeRK, user_name: Mike
rowKey = MikeRK, user_mail: mike@mail.com
11. Conclusions
• HTable is expensive
Creating HTable instances comes at a cost. It is a slow process, because the creation of each HTable instance involves scanning
the .META. table to check whether the table actually exists. Hence, it is not recommended that you create a new HTable
instance for each request when the number of concurrent requests is very high.
• Scan caching
A scan can be configured to retrieve a batch of rows in every RPC call it makes to HBase. This can be configured per scanner by calling
setCaching(int) on the Scan object. It can also be set in the hbase-site.xml configuration file using the hbase.client.scanner.caching
property.
• Increment
Increment Column Value (ICV). It is exposed both as the Increment command object, like the other operations, and as a method on HTableInterface. This
command allows you to change an integral value stored in an HBase cell without reading it back first. The data manipulation happens in HBase, not in your
client application, which makes it fast. It also avoids a possible race condition where some other client is interacting with the same cell.
• Filter
A filter is a predicate that executes in HBase instead of on the client. When you specify a Filter in your Scan, HBase uses it to determine whether a record
should be returned. This can avoid a lot of unnecessary data transfer. It also keeps the filtering on the server instead of placing that burden on the client. The
filter applied is anything implementing the org.apache.hadoop.hbase.filter.Filter interface. HBase provides a number of filters, but it’s easy to implement
your own.
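For the scan caching point above, the client-wide setting could look like the following fragment inside the configuration element of hbase-site.xml. The value 100 is only an illustrative assumption; a suitable number depends on row size and available client memory.

```xml
<property>
  <name>hbase.client.scanner.caching</name>
  <value>100</value>
</property>
```

A value set with setCaching(int) on an individual Scan object applies to that scanner only.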