2. How Is RDBMS on Hadoop Different From Traditional RDBMS and NoSQL?
NoSQL databases have no SQL interface, no joins, and no transactions spanning multiple rows and tables.
Existing database applications must be rewritten to use a NoSQL database.
Traditional RDBMSs cannot automatically scale out on commodity hardware and must be manually sharded across servers.
An RDBMS on Hadoop eliminates the cost and scaling issues of a traditional RDBMS while providing a SQL interface over a NoSQL database.
Existing applications can be migrated with little change.
3. Splice Machine Overview
Splice Machine is a SQL-on-Hadoop RDBMS.
Splice Machine provides database technology for real-time applications, including these features:
A. Standard ANSI SQL
B. Horizontal Scale Out
C. Real-Time Updates with Transactions
D. Massively Parallel Architecture
4. Splice Machine Becoming Real Time
Many companies are experiencing an explosion of data generated by applications, websites,
users, and devices such as smartphones.
Companies recognize that insights contained within this data can be a source of real competitive
advantage, compelling them to act quickly before those insights become obsolete.
However, traditional relational databases, NoSQL alternatives, and other SQL-on-Hadoop
solutions don't allow companies to collect, analyze, and react to massive amounts of data in
real-time.
5. Standard ANSI SQL-99
Splice Machine is an ANSI SQL-compliant database on Hadoop that lets companies leverage their
existing SQL-trained staff and SQL-based tools.
6. Horizontal Scale Out
HBase supports auto-sharding, which gives it massive scalability.
Traditional RDBMSs prefer to scale up, which is costly compared to commodity hardware.
With the help of HBase, Splice Machine scales out instead of up, providing massive scalability
across commodity hardware, even up to dozens of petabytes.
7. Real-Time Updates with Transactions
Splice Machine supports a SQL interface, so it can perform transactions across multiple rows and tables.
How can this happen in real time? Because HBase, a distributed database over Hadoop, allows
real-time read/write access using HBase co-processors rather than MapReduce (batch processing).
Transactional consistency is maintained through Multi-Version Concurrency Control (MVCC).
8. Massively Parallel Architecture
Splice Machine delivers massive parallelization by placing its parser, planner, and optimizer on
each HBase RegionServer (which hosts multiple Regions) and an executor on each HBase Region,
pushing computation down to each distributed data shard (HBase Region).
Splice Machine provides high performance through massively parallel processing, pushing
predicates, joins, aggregations, and complex queries down to the data shards.
For parallelized query execution, Splice Machine uses HBase co-processors for distributed
computation on data stored in the Hadoop Distributed File System (HDFS).
9. How Is Splice Machine Different From Other SQL on Hadoop?
Splice Machine is a fully operational database on Hadoop that supports:
A. Real-Time Updates
B. Transactions
C. Analytics
D. Rich SQL Support (ANSI SQL-99)
Other SQL-on-Hadoop solutions, such as Hortonworks Stinger, Apache Drill, and Cloudera Impala,
are query-analytics engines with limited SQL support, no transactions, and no real-time updates.
11. Proven Building Blocks: HBase/Hadoop and Derby
Splice Machine marries two proven technology stacks: Apache Derby and HBase/Hadoop.
A. Apache Derby: Java-Based, ANSI SQL Database
• Java-based
• ANSI SQL-99
• Lightweight 2.6 MB footprint
B. Apache HBase/HDFS
• Auto-sharding
• Data Replication
• Scalability to 100s of PB
• Real Time Updates
13. Splice Modifications to Derby

Component     Derby                                   Splice Machine
Store         Block-file based                        HBase
Index         B-Tree                                  Dense index in HBase
Concurrency   Lock based                              MVCC
Join Plan     Centralized hash and nested loop join   Distributed sort-merge, merge, nested loop, and broadcast join
14. How Do Derby and HBase Work Together in Splice Machine?
Splice Machine replaces Apache Derby's block-file-based storage engine with HBase.
Splice Machine uses the Apache Derby parser as-is and redesigns the planner, optimizer, and
executor so that they work well with, and take advantage of, distributed HBase computation.
This redesign enables the Splice Machine database to achieve massively parallel processing by
pushing computation down to each HBase Region on a RegionServer and using HBase co-processors
for data computation in HDFS.
A client sends a SQL query to the Apache Derby parser; the query then flows to the redesigned
planner, optimizer, and executor, which reside in the HBase Regions.
Because Apache Derby is Java-based, each RegionServer references local JAR files for the parser,
planner, and optimizer, and each Region on a RegionServer references a local JAR file for the executor.
15. Splice SQL Processing
Splice Machine does not redesign the Apache Derby parser; SQL processing works as it does in Derby.
PreparedStatement ps = conn.prepareStatement("SELECT * FROM T WHERE ID = ?");
1. Look up the statement in the cache using an exact text match.
• If it is found, skip the five steps below.
• Otherwise, perform them.
2. Parse with a JavaCC-generated parser (Java Compiler Compiler) and convert the statement into an abstract syntax tree.
3. Bind all tables associated with the query.
4. Optimize the plan based on I/O cost, communication cost, disk usage, and feasible join strategies.
5. Generate code to represent the statement plan.
6. Load the generated class and create an instance to represent the query's state for that connection.
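Step 1 above caches compiled plans by exact statement text so repeated queries skip parsing, binding, and optimization. A minimal sketch of that idea, assuming an illustrative PlanCache class (the names and the string "plan" stand in for Splice Machine's real internals):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a statement cache keyed by exact SQL text.
// PlanCache, getPlan, and compile are illustrative names, not Splice internals.
class PlanCache {
    private final Map<String, String> cache = new HashMap<>();
    private int compilations = 0;

    // Returns a cached plan, compiling only on a cache miss (step 1).
    String getPlan(String sql) {
        return cache.computeIfAbsent(sql, this::compile);
    }

    // Stands in for steps 2-6: parse, bind, optimize, codegen, load.
    private String compile(String sql) {
        compilations++;
        return "plan:" + sql.hashCode();
    }

    int compilations() { return compilations; }
}
```

Because the lookup is a pure text match, `SELECT * FROM T WHERE ID=?` with a parameter marker is cached once and reused for every bound value, while even a whitespace change would miss the cache.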
16. Distributed, Parallelized Query Execution
Parallel computation across the cluster.
Computation moves to the data shards.
HBase co-processors are utilized.
No MapReduce.
Queries use a special operator, the "exchange operator," for parallelism.
17. HBase Co-Processors Versus MapReduce for Distributed Computation on Data Stored in HDFS
HBase accesses HDFS directly while maintaining its own metadata, so it can quickly find a single
record in HDFS files.
MapReduce is designed for batch data access and is therefore not appropriate for real-time
data access.
MapReduce starts a Java Virtual Machine for each query, which can take up to 30 seconds even to
retrieve a single record from HDFS files.
Without metadata, MapReduce scans all of the data, even if a query needs to access only a few
records.
HBase co-processors run on each RegionServer, and each Region holds a reference to a co-processor.
Co-processors provide Region life-cycle management through open, close, split, flush, and
compact operations.
18. HBase: Proven Scale-Out
Auto-sharding.
Scales with commodity hardware.
Cost-effective from GBs to PBs.
High availability through replication.
19. Support of Secondary Index
Often data is organized along one dimension for fast updating (such as a customer number) but
later must be looked up by other dimensions (such as zip code). Secondary indexes enable
databases to lookup data across many dimensions efficiently.
Splice Machine uses an HBase table to store each index, along with any required data.
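Storing an index in its own HBase table means a lookup by zip code becomes a short range scan over sorted index keys that point back to base-table rows. A minimal sketch of that layout, assuming an illustrative ZipIndex class with a sorted map standing in for the HBase index table (the `\u0000` separator is an assumption for illustration, not Splice Machine's actual key encoding):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hedged sketch: a secondary index modeled as a separate sorted table,
// mirroring the idea of an index stored in its own HBase table.
class ZipIndex {
    // Index key = indexed value + separator + base-table row key.
    private final TreeMap<String, String> index = new TreeMap<>();

    void put(String zip, String customerRowKey) {
        index.put(zip + "\u0000" + customerRowKey, customerRowKey);
    }

    // A range scan over the key prefix finds every row for a zip code
    // without touching rows organized by customer number.
    List<String> lookup(String zip) {
        return new ArrayList<>(
            index.subMap(zip + "\u0000", zip + "\u0001").values());
    }
}
```

The base table stays organized by customer number for fast updates, while the index table orders the same rows by zip code for fast reads along that second dimension.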
20. Splice Transaction
Splice Machine is a fully transactional database. This allows you to perform actions such as commit
and rollback, which means that in a transactional context, the database does not make changes
visible to others until a commit has been issued.
Here is a simple example. Enter the following commands to see commit and rollback in action:
splice> create table a (i int);
splice> autocommit off; -- puts current shell into a transactional context
splice> insert into a values 1,2,3; -- inserted but not visible to others
splice> commit; -- now committed to the database
splice> select * from a;
splice> insert into a values 4,5;
splice> rollback; -- 4 and 5 rolled back
splice> select * from a; ...
21. Snapshot Isolation In Transaction
Snapshot isolation is a guarantee that all reads made in a transaction will see a consistent snapshot
of the database (in practice it reads the last committed values that existed at the time it started), and
the transaction itself will successfully commit only if no updates it has made conflict with any
concurrent updates made since that snapshot. Such a write-write conflict will cause the transaction to
abort.
Snapshot isolation is implemented using multiversion concurrency control (MVCC).
• MVCC is a common way to increase concurrency and performance by generating a new version of a database
object each time the object is written, and allowing transactions to read several of the last relevant
committed versions of each object.
In a write skew anomaly, two transactions (T1 and T2) concurrently read an overlapping data set
(e.g. values V1 and V2), concurrently make disjoint updates (e.g. T1 updates V1, T2 updates V2), and
finally concurrently commit, neither having seen the update performed by the other. Were the system
serializable, such an anomaly would be impossible, as either T1 or T2 would have to occur "first", and
be visible to the other. In contrast, snapshot isolation permits write skew anomalies.
23. Splice Machine Supports Distributed Transactions
Splice Machine has added an asynchronous write pipeline to HBase.
Splice Machine also has nested sub-transactions to ensure that a Region-level failure does not force a
restart of the whole transaction.
• Example: suppose a 10 TB update transaction acts as a single parent transaction. When it is divided
among the shards, it becomes a nested sub-transaction for each shard, so a failure at the Region level
typically restarts only a few GB instead of 10 TB.
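The scope-of-retry idea in the example can be sketched as follows. This is purely illustrative (the class, the shard count, and the single-retry policy are assumptions, not Splice Machine's recovery protocol): a failure is retried at the shard level, and the parent commits once every child has committed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hedged sketch: a parent transaction split into one nested child per shard.
// Only the failing child is retried; the parent is never restarted.
class ShardedUpdate {
    static List<String> run(int shardCount, IntPredicate failsFirstTry) {
        List<String> log = new ArrayList<>();
        for (int shard = 0; shard < shardCount; shard++) {
            if (failsFirstTry.test(shard)) {
                // Region-level failure: redo only this shard's work.
                log.add("shard " + shard + ": retry");
            }
            log.add("shard " + shard + ": committed");
        }
        log.add("parent: committed");
        return log;
    }
}
```

With 10 TB spread over many shards, the redone work on a single failure is one shard's slice, not the full transaction.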
24. Splice Machine Efficiency
Can it efficiently handle sparse data?
• In many large data sets, each attribute or column may be sparsely populated. In traditional databases,
an empty value must still be stored as a null, which consumes storage. Modern databases should not
require nulls for empty values.
Can you add a column without table scans?
• Data requirements change frequently and often require schema changes. Adding a column should not
require full table scans.
25. Splice Machine Performance
Does it support secondary indexes?
• Often data is organized along one dimension for fast updating (such as a customer number) but later
must be looked up by other dimensions (such as zip code). Secondary indexes enable databases to
lookup data across many dimensions efficiently.
Does it provide multiple join strategies?
• Joins combine data from multiple tables. With a distributed infrastructure like Hadoop that handles very
large data sets, multiple join strategies such as nested loop, sort-merge, and broadcast joins are needed
to ensure fast join performance.
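Two of the join strategies named above can be contrasted in a few lines: a nested-loop join compares every pair of rows, while a broadcast-style hash join builds a hash table from the smaller side and probes it with the larger side. A minimal sketch over integer keys (sort-merge is omitted for brevity; the names are illustrative, not Splice Machine's operators):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative equi-join strategies over integer join keys.
class Joins {
    // Nested-loop join: O(|left| * |right|) comparisons.
    static List<Integer> nestedLoop(int[] left, int[] right) {
        List<Integer> out = new ArrayList<>();
        for (int l : left)
            for (int r : right)
                if (l == r) out.add(l);
        return out;
    }

    // Hash join: build a hash table from the smaller (broadcast) side,
    // then probe it once per row of the larger side.
    static List<Integer> hashJoin(int[] small, int[] large) {
        Set<Integer> build = new HashSet<>();
        for (int k : small) build.add(k);
        List<Integer> out = new ArrayList<>();
        for (int k : large)
            if (build.contains(k)) out.add(k);
        return out;
    }
}
```

Which strategy wins depends on input sizes and data placement, which is exactly why the next question, a cost-based optimizer, matters.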
Is there a cost-based optimizer?
• Performance on large data sets depends greatly on choosing the right execution strategy. Simple rule-
based optimizers are not enough. Cost-based optimizers, which estimate the actual cost of executing a
query, are critical to optimal query performance.
26. Splice Machine Features in an Upcoming Release
In many applications, certain attributes on a record may be visible to one user, but not to
another. For instance in an HR application, a CEO may get to see the salary field, while most
employees would not. Many applications control data access directly, but column level security
is an advanced database feature that enables the database to control which fields a user can
view. Splice Machine will be adding this feature in an upcoming release.
Editor's note
The asynchronous write pipeline allows maximum write parallelization across HBase nodes.