Moderated by Lars Hofhansl (Salesforce), with Matteo Bertozzi (Cloudera), John Leach (Splice Machine), Maxim Lukiyanov (Microsoft), Matt Mullins (Facebook), and Carter Page (Google)
The future of HBase, via a variety of viewpoints.
21. Rich abstractions on top of HBase
Future big data customers need fully formed solutions:
A great graph database
A great IoT solution
A great geo solution
And so on...
Open source. Each with the scale of HBase.
And we want to help, with engineering time and code.
22. But there’s already a great _______ HBase solution that could use some love!
23. Have a great open source HBase integration that could use some Google engineering help? Please email me with ideas. (Really.)
carterp (at) google.com
34. “What is your biggest mistake as an engineer?
Not putting distributed transactions in BigTable. If you wanted to update more than one
row you had to roll your own transaction protocol. It wasn’t put in because it would have
complicated the system design. In retrospect lots of teams wanted that capability and
built their own with different degrees of success. We should have implemented
transactions in the core system. It would have been useful internally as well. Spanner
fixed this problem by adding transactions.”
- Jeff Dean, March 7th, 2016
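To make the quote concrete: with only single-row atomicity, every team had to hand-roll a multi-row commit protocol. Below is a toy sketch (not Bigtable or HBase code; all names are hypothetical) of a single-row-atomic store plus a naive optimistic multi-row update, showing why "roll your own" was fragile:

```python
# Toy illustration: a store with per-row versions where only single-row
# check-and-put is atomic, plus a hand-rolled optimistic multi-row
# "transaction". Real client-side protocols are far more involved.

class RowStore:
    """Each row carries a version; check_and_put is atomic per row only."""
    def __init__(self):
        self.rows = {}  # key -> (version, value)

    def get(self, key):
        return self.rows.get(key, (0, None))

    def check_and_put(self, key, expected_version, value):
        version, _ = self.get(key)
        if version != expected_version:
            return False            # someone else wrote in between
        self.rows[key] = (version + 1, value)
        return True

def multi_row_commit(store, updates):
    """Optimistically update several rows; roll back on any conflict.
    This is the kind of protocol each team had to reinvent."""
    snapshot = {k: store.get(k) for k in updates}
    applied = []
    for key, value in updates.items():
        if store.check_and_put(key, snapshot[key][0], value):
            applied.append(key)
        else:
            # Conflict: undo our partial writes (best effort, not isolated!)
            for k in applied:
                store.rows[k] = snapshot[k]
            return False
    return True

store = RowStore()
store.check_and_put("acct:a", 0, 100)
store.check_and_put("acct:b", 0, 50)
ok = multi_row_commit(store, {"acct:a": 70, "acct:b": 80})
```

Note the rollback is not isolated: a concurrent reader can observe the half-applied state, which is exactly the gap core transactions (and later Spanner) closed.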
35. John Leach
Founder & CTO
Call for Founders!!!
Be part of bringing Splice Machine to Open Source
Splice Machine, the first dual-engine RDBMS on HBase and Spark, is headed to open source, and we are looking for some key individuals to be founders to support the transition.
37. Current Storage Challenges
Lack of Transactions (see Dean Quote)
Single Write Optimized Store: Log Structured Merge Tree
Limited Metadata Facilities
Current Execution Challenges
OLTP: Limited/Rigid Concurrency Model
OLAP: Foggy Execution Model
Remote Client Scans (Slow)
Internal Scans via Coprocessor (In JVM)
Custom Rolled Data Flow Engine (Yikes)
Maintenance Operations
Do not talk about Fight Club (Compactions)
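The write-optimized store and the compaction pain the slide alludes to can be sketched in a few lines. This is an illustrative toy LSM (memtable, flushed sorted runs, merge-on-read, compaction), not HBase's actual implementation:

```python
# Minimal log-structured merge (LSM) sketch: writes land in an in-memory
# memtable, which is flushed to immutable sorted runs; reads probe newest
# data first; compaction merges runs back into one.

class ToyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []                    # list of dicts, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        if self.memtable:
            self.runs.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):   # newest run wins
            if key in run:
                return run[key]
        return None

    def compact(self):
        """Merge all runs into one -- the maintenance cost the slide
        jokes about ('do not talk about Fight Club')."""
        merged = {}
        for run in self.runs:             # older first, newer overwrites
            merged.update(run)
        self.runs = [dict(sorted(merged.items()))]

lsm = ToyLSM()
lsm.put("r1", "v1")
lsm.put("r2", "v2")    # triggers a flush
lsm.put("r1", "v1b")   # newer version shadows the flushed one
lsm.flush()
lsm.compact()
```

The read path checking runs newest-first is what makes uncompacted scans slow, and why compactions are unavoidable maintenance.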
38. Future Storage Approach (Code Named: Janus)
Typed Storage System
JSON as a first-class citizen
Serde based on Spark UnsafeRow
Hierarchical, Partition Aware Transactions
Partitions: Within and Across Data Centers
Write Optimized Store (Optional)
LSM Tree
Read Optimized Store (Optional)
Positional Delta Trees, Columnar
Full Metadata Facilities (https://datasketches.github.io/)
Theta Sketch, Quantiles, Frequent Items
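The "metadata facilities" here are streaming summaries. As a flavor of the Theta-style distinct-count sketches the slide references, below is a toy k-minimum-values (KMV) estimator; it is illustrative only, not the Apache DataSketches implementation:

```python
import hashlib

# Toy KMV distinct-count sketch: hash items into [0, 1), keep only the k
# smallest hashes, and estimate cardinality from how tightly they pack.

class KMVSketch:
    def __init__(self, k=64):
        self.k = k
        self.mins = set()   # k smallest hash values seen, in [0, 1)

    def _hash(self, item):
        h = hashlib.sha1(str(item).encode()).hexdigest()
        return int(h, 16) / 16**40   # 40 hex chars -> uniform in [0, 1)

    def update(self, item):
        self.mins.add(self._hash(item))
        if len(self.mins) > self.k:
            self.mins.remove(max(self.mins))

    def estimate(self):
        if len(self.mins) < self.k:
            return float(len(self.mins))   # exact in the small regime
        return (self.k - 1) / max(self.mins)

sk = KMVSketch(k=256)
for i in range(10000):
    sk.update(i % 1000)   # 1,000 distinct items, seen 10 times each
```

A few kilobytes of state per column is enough for a query planner to estimate distinct counts without scanning the data, which is the point of wiring sketches into the metadata layer.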
39. Future Execution Approach (Dual Engine)
All Execution Engines
Statistical Hooks (Sketching Algorithms)
OLAP Execution Engines
Spark, Flink, MapReduce, Impala, etc.
YARN, Fair Scheduling
Transactional Input/Output Formats
File System based with incremental memstore deltas
Columnar
Support Arrow, Calcite
Perform Compactions (yes, it works)
OLTP Execution Engines
Row Based Storage, Remote HBase Scans
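The "file system based with incremental memstore deltas" idea can be sketched as a snapshot read that scans immutable base files directly and overlays the small in-memory delta (recent puts and deletes) at read time. The structures below are hypothetical, not Splice Machine's actual format:

```python
# Toy snapshot scan: OLAP engines read the base files; transactional
# consistency comes from merging in the memstore delta on the fly.

TOMBSTONE = object()   # marks a key deleted since the files were written

def snapshot_scan(base_rows, delta):
    """base_rows: list of (key, value) sorted by key (the 'files').
    delta: dict of key -> value-or-TOMBSTONE (the 'memstore').
    Returns a consistent, merged, sorted view."""
    base_keys = {k for k, _ in base_rows}
    # Rows that exist only in the delta.
    extra = [(k, v) for k, v in delta.items()
             if k not in base_keys and v is not TOMBSTONE]
    merged = []
    for key, value in base_rows:
        if key in delta:
            if delta[key] is TOMBSTONE:
                continue            # deleted after the flush
            value = delta[key]      # updated after the flush
        merged.append((key, value))
    return sorted(merged + extra)

base = [("a", 1), ("b", 2), ("c", 3)]
delta = {"b": 20, "c": TOMBSTONE, "d": 4}
view = snapshot_scan(base, delta)
```

Because the base files never change, an OLAP engine can read them in parallel at file-system speed; only the small delta overlay needs coordination.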
42. Salesforce Single-SKU project
We used to have 30+ different SKUs
Now there is one SKU (almost) for all projects
1U servers, 10GbE everywhere, fat-tree network (little or no oversubscription)
Same SKU used by all projects
Very few exceptions: High storage SKU and high compute SKU
Vendor: varies
Allows us to order/repurpose in large quantities and then assign to projects
Compromise for individual projects, but cheaper overall
Fat-tree network -> location independence
44. HBase 2.0
We are trying to avoid another singularity (like 0.94 to 0.96)
(almost) Rolling upgradable from 1.x
Wire compatible with 1.x
Possible Features
HBASE-11425 - Off-heap read and write path
HBASE-13773 - Replication off ZooKeeper, and ReplicationAdmin with ACL support
HBASE-14070 - Hybrid Logical Clocks
HBASE-14123 - Backups
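The idea behind HBASE-14070 is that timestamps should track physical time while still capturing causality across servers. A minimal hybrid logical clock sketch (toy code, not HBase's implementation) looks like this:

```python
# Minimal hybrid logical clock (HLC): a (logical-time, counter) pair.
# The logical part tracks the physical clock; the counter breaks ties
# so that causally related events always order correctly.

class HLC:
    def __init__(self, physical_clock):
        self.pt = physical_clock   # callable returning physical time (int)
        self.l = 0                 # logical component (physical-time part)
        self.c = 0                 # counter for same-timestamp events

    def now(self):
        """Timestamp a local or send event."""
        wall = self.pt()
        if wall > self.l:
            self.l, self.c = wall, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def update(self, remote):
        """Merge a timestamp received from another node."""
        rl, rc = remote
        wall = self.pt()
        new_l = max(self.l, rl, wall)
        if new_l == self.l == rl:
            self.c = max(self.c, rc) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == rl:
            self.c = rc + 1
        else:
            self.c = 0
        self.l = new_l
        return (self.l, self.c)

# A frozen physical clock forces the counter to break ties.
clock = HLC(lambda: 100)
t1 = clock.now()             # (100, 0)
t2 = clock.now()             # (100, 1): same wall time, counter advances
t3 = clock.update((100, 5))  # remote tie: counter jumps past the remote's
```

Timestamps compare as plain tuples, so replication and MVCC logic can order events from different servers without perfectly synchronized clocks.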
52. The HBase Sweet Spot
1. Scales single clusters to 100s or 1,000s of commodity machines
2. Small scans (<100M rows) and Gets
3. Operations are harder, but amortized over large (>15 node) installs
4. Consistent, OpenSource, OnPrem, Cloud
5. A foundational, general purpose, low latency storage engine
There is no system that handles both analytical and OLTP workloads well
There is no replacement for HBase in this sweet spot
53. Future work
Using large RAM effectively
Off-heaping everything
In-memory compactions
Reducing lock contention to utilize all cores, with SSDs and 10GbE or faster networks
Scaling assignment manager
Spark integration for large scans, OLAP
Multi-tenancy
Sister projects such as Phoenix, for easy interfacing
Easier operations to ease on-boarding
Top image from HBaseCon website.
Bottom image is public domain: https://en.wikipedia.org/wiki/Democratic_National_Committee#/media/File:Chicago_delegation_to_the_January_8,_1912_meeting_of_the_Democratic_National_Committee.jpg
Images created by Google.
Open source results in a richer technology stack.
(Image free for commercial use https://pixabay.com/en/zebra-gnu-giraffe-africa-namibia-1170177/)