Moderated by Lars Hofhansl (Salesforce), with Matteo Bertozzi (Cloudera), John Leach (Splice Machine), Maxim Lukiyanov (Microsoft), Matt Mullins (Facebook), and Carter Page (Google)
The future of HBase, via a variety of viewpoints.
21. Rich abstractions on top of HBase
Future big data customers need fully formed solutions:
A great graph database
A great IoT solution
A great geo solution
And so on...
Open source. Each with the scale of HBase.
And we want to help, with engineering time and code.
22. But there’s already a great _______ HBase solution that could use some love!
23. Have a great open source HBase integration that could use some Google engineering help? Please email me with ideas. (Really.)
carterp (at) google.com
34. “What is your biggest mistake as an engineer?
Not putting distributed transactions in BigTable. If you wanted to update more than one
row you had to roll your own transaction protocol. It wasn’t put in because it would have
complicated the system design. In retrospect lots of teams wanted that capability and
built their own with different degrees of success. We should have implemented
transactions in the core system. It would have been useful internally as well. Spanner
fixed this problem by adding transactions.”
- Jeff Dean, March 7th, 2016
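To make the quote concrete: with only single-row atomicity, every team had to hand-roll a multi-row commit protocol. Below is a toy sketch (not Bigtable or HBase code; all names are hypothetical) of a single-row-atomic store plus a naive optimistic multi-row update, showing why "roll your own" was fragile:

```python
# Toy illustration: a store with per-row versions where only single-row
# check-and-put is atomic, plus a hand-rolled optimistic multi-row
# "transaction". Real client-side protocols are far more involved.

class RowStore:
    """Each row carries a version; check_and_put is atomic per row only."""
    def __init__(self):
        self.rows = {}  # key -> (version, value)

    def get(self, key):
        return self.rows.get(key, (0, None))

    def check_and_put(self, key, expected_version, value):
        version, _ = self.get(key)
        if version != expected_version:
            return False            # someone else wrote in between
        self.rows[key] = (version + 1, value)
        return True

def multi_row_commit(store, updates):
    """Optimistically update several rows; roll back on any conflict.
    This is the kind of protocol each team had to reinvent."""
    snapshot = {k: store.get(k) for k in updates}
    applied = []
    for key, value in updates.items():
        if store.check_and_put(key, snapshot[key][0], value):
            applied.append(key)
        else:
            # Conflict: undo our partial writes (best effort, not isolated!)
            for k in applied:
                store.rows[k] = snapshot[k]
            return False
    return True

store = RowStore()
store.check_and_put("acct:a", 0, 100)
store.check_and_put("acct:b", 0, 50)
ok = multi_row_commit(store, {"acct:a": 70, "acct:b": 80})
```

Note the rollback is not isolated: a concurrent reader can observe the half-applied state, which is exactly the gap core transactions (and later Spanner) closed.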
35. John Leach
Founder & CTO
Call for Founders!!!
Be part of bringing Splice Machine to Open Source
Splice Machine, the first dual-engine RDBMS on HBase and Spark, is headed to open source, and we are looking for some key individuals to be founders to support the transition.
37. Current Storage Challenges
Lack of Transactions (see Dean Quote)
Single Write Optimized Store: Log Structured Merge Tree
Limited Metadata Facilities
Current Execution Challenges
OLTP: Limited/Rigid Concurrency Model
OLAP: Foggy Execution Model
Remote Client Scans (Slow)
Internal Scans via Coprocessor (In JVM)
Custom Rolled Data Flow Engine (Yikes)
Maintenance Operations
Do not talk about Fight Club (Compactions)
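The write-optimized store and the compaction pain the slide alludes to can be sketched in a few lines. This is an illustrative toy LSM (memtable, flushed sorted runs, merge-on-read, compaction), not HBase's actual implementation:

```python
# Minimal log-structured merge (LSM) sketch: writes land in an in-memory
# memtable, which is flushed to immutable sorted runs; reads probe newest
# data first; compaction merges runs back into one.

class ToyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []                    # list of dicts, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        if self.memtable:
            self.runs.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):   # newest run wins
            if key in run:
                return run[key]
        return None

    def compact(self):
        """Merge all runs into one -- the maintenance cost the slide
        jokes about ('do not talk about Fight Club')."""
        merged = {}
        for run in self.runs:             # older first, newer overwrites
            merged.update(run)
        self.runs = [dict(sorted(merged.items()))]

lsm = ToyLSM()
lsm.put("r1", "v1")
lsm.put("r2", "v2")    # triggers a flush
lsm.put("r1", "v1b")   # newer version shadows the flushed one
lsm.flush()
lsm.compact()
```

The read path checking runs newest-first is what makes uncompacted scans slow, and why compactions are unavoidable maintenance.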
38. Future Storage Approach (Code Named: Janus)
Typed Storage System
JSON as a first-class citizen
Serde based on Spark UnsafeRow
Hierarchical, Partition Aware Transactions
Partitions: Within and Across Data Centers
Write Optimized Store (Optional)
LSM Tree
Read Optimized Store (Optional)
Positional Delta Trees, Columnar
Full Metadata Facilities (https://datasketches.github.io/)
Theta Sketch, Quantiles, Frequent Items
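The "metadata facilities" here are streaming summaries. As a flavor of the Theta-style distinct-count sketches the slide references, below is a toy k-minimum-values (KMV) estimator; it is illustrative only, not the Apache DataSketches implementation:

```python
import hashlib

# Toy KMV distinct-count sketch: hash items into [0, 1), keep only the k
# smallest hashes, and estimate cardinality from how tightly they pack.

class KMVSketch:
    def __init__(self, k=64):
        self.k = k
        self.mins = set()   # k smallest hash values seen, in [0, 1)

    def _hash(self, item):
        h = hashlib.sha1(str(item).encode()).hexdigest()
        return int(h, 16) / 16**40   # 40 hex chars -> uniform in [0, 1)

    def update(self, item):
        self.mins.add(self._hash(item))
        if len(self.mins) > self.k:
            self.mins.remove(max(self.mins))

    def estimate(self):
        if len(self.mins) < self.k:
            return float(len(self.mins))   # exact in the small regime
        return (self.k - 1) / max(self.mins)

sk = KMVSketch(k=256)
for i in range(10000):
    sk.update(i % 1000)   # 1,000 distinct items, seen 10 times each
```

A few kilobytes of state per column is enough for a query planner to estimate distinct counts without scanning the data, which is the point of wiring sketches into the metadata layer.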
39. Future Execution Approach (Dual Engine)
All Execution Engines
Statistical Hooks (Sketching Algorithms)
OLAP Execution Engines
Spark, Flink, MapReduce, Impala, etc.
YARN, Fair Scheduling
Transactional Input/Output Formats
File System based with incremental memstore deltas
Columnar
Support Arrow, Calcite
Perform Compactions (yes, it works)
OLTP Execution Engines
Row Based Storage, Remote HBase Scans
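The "file system based with incremental memstore deltas" idea can be sketched as a snapshot read that scans immutable base files directly and overlays the small in-memory delta (recent puts and deletes) at read time. The structures below are hypothetical, not Splice Machine's actual format:

```python
# Toy snapshot scan: OLAP engines read the base files; transactional
# consistency comes from merging in the memstore delta on the fly.

TOMBSTONE = object()   # marks a key deleted since the files were written

def snapshot_scan(base_rows, delta):
    """base_rows: list of (key, value) sorted by key (the 'files').
    delta: dict of key -> value-or-TOMBSTONE (the 'memstore').
    Returns a consistent, merged, sorted view."""
    base_keys = {k for k, _ in base_rows}
    # Rows that exist only in the delta.
    extra = [(k, v) for k, v in delta.items()
             if k not in base_keys and v is not TOMBSTONE]
    merged = []
    for key, value in base_rows:
        if key in delta:
            if delta[key] is TOMBSTONE:
                continue            # deleted after the flush
            value = delta[key]      # updated after the flush
        merged.append((key, value))
    return sorted(merged + extra)

base = [("a", 1), ("b", 2), ("c", 3)]
delta = {"b": 20, "c": TOMBSTONE, "d": 4}
view = snapshot_scan(base, delta)
```

Because the base files never change, an OLAP engine can read them in parallel at file-system speed; only the small delta overlay needs coordination.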
42. Salesforce Single-SKU project
We used to have 30+ different SKUs
Now there is one SKU (almost) for all projects
1U servers, 10GbE everywhere, fat-tree network (little or no oversubscription)
Same SKU used by all projects
Very few exceptions: High storage SKU and high compute SKU
Vendor: varies
Allows us to order/repurpose in large quantities and then assign to projects
Compromise for individual projects, but cheaper overall
Fat-tree network -> location independence
44. HBase 2.0
We are trying to avoid another singularity (like 0.94 to 0.96)
(almost) Rolling upgradable from 1.x
Wire compatible with 1.x
Possible Features
HBASE-11425 - Off-heap read and write path
HBASE-13773 - Replication off ZooKeeper, and ReplicationAdmin with ACL support
HBASE-14070 - Hybrid Logical Clocks
HBASE-14123 - Backups
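The idea behind HBASE-14070 is that timestamps should track physical time while still capturing causality across servers. A minimal hybrid logical clock sketch (toy code, not HBase's implementation) looks like this:

```python
# Minimal hybrid logical clock (HLC): a (logical-time, counter) pair.
# The logical part tracks the physical clock; the counter breaks ties
# so that causally related events always order correctly.

class HLC:
    def __init__(self, physical_clock):
        self.pt = physical_clock   # callable returning physical time (int)
        self.l = 0                 # logical component (physical-time part)
        self.c = 0                 # counter for same-timestamp events

    def now(self):
        """Timestamp a local or send event."""
        wall = self.pt()
        if wall > self.l:
            self.l, self.c = wall, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def update(self, remote):
        """Merge a timestamp received from another node."""
        rl, rc = remote
        wall = self.pt()
        new_l = max(self.l, rl, wall)
        if new_l == self.l == rl:
            self.c = max(self.c, rc) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == rl:
            self.c = rc + 1
        else:
            self.c = 0
        self.l = new_l
        return (self.l, self.c)

# A frozen physical clock forces the counter to break ties.
clock = HLC(lambda: 100)
t1 = clock.now()             # (100, 0)
t2 = clock.now()             # (100, 1): same wall time, counter advances
t3 = clock.update((100, 5))  # remote tie: counter jumps past the remote's
```

Timestamps compare as plain tuples, so replication and MVCC logic can order events from different servers without perfectly synchronized clocks.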
52. The HBase Sweet Spot
1. Scales single clusters to 100s or 1,000s of commodity machines
2. Small scans (<100M rows) and Gets
3. Operations are harder, but amortized over large (>15 node) installs
4. Consistent, OpenSource, OnPrem, Cloud
5. A foundational, general purpose, low latency storage engine
There is no system that handles both analytical and OLTP workloads well
There is no replacement for HBase in this sweet spot
53. Future work
Using large RAM effectively
Off-heaping everything
In-memory compactions
Reducing lock contention to utilize all cores, with SSDs and 10GbE or faster networks
Scaling assignment manager
Spark integration for large scans, OLAP
Multi-tenancy
Sister projects such as Phoenix, for easy interfacing
Easier operations to ease on-boarding
Top image from HBaseCon website.
Bottom image is public domain: https://en.wikipedia.org/wiki/Democratic_National_Committee#/media/File:Chicago_delegation_to_the_January_8,_1912_meeting_of_the_Democratic_National_Committee.jpg
Images created by Google.
Open source results in a richer technology stack.
(Image free for commercial use https://pixabay.com/en/zebra-gnu-giraffe-africa-namibia-1170177/)