Splice Machine
SQL ON HADOOP RELATIONAL DATABASE MANAGEMENT SYSTEM
How Is an RDBMS on Hadoop Different From a Traditional RDBMS and NoSQL?
NoSQL databases have no SQL interface: no joins and no transactions across multiple rows and tables. Existing database applications must be rewritten to use a NoSQL database.
A traditional RDBMS cannot automatically scale out on commodity hardware and must be manually sharded across servers.
A Hadoop RDBMS eliminates the cost and scaling issues of a traditional RDBMS while supporting a SQL interface over a NoSQL database, so existing applications can be migrated with little change.
Splice Machine Overview
Splice Machine is a SQL-on-Hadoop RDBMS.
Splice Machine provides real-time database technology with these features:
A. Standard ANSI SQL
B. Horizontal Scale-Out
C. Real-Time Updates with Transactions
D. Massively Parallel Architecture
Splice Machine Becoming Real Time
Many companies are experiencing an explosion of data generated by applications, websites,
users, and devices such as smartphones.
Companies recognize that the insights contained in this data can be a source of real competitive
advantage, compelling them to act quickly before those insights become obsolete.
However, traditional relational databases, NoSQL alternatives, and other SQL-on-Hadoop
solutions do not allow companies to collect, analyze, and react to massive amounts of data in
real time.
Standard ANSI SQL-99
Splice Machine is an ANSI SQL-compliant database on Hadoop that lets companies leverage their
existing SQL-trained staff and SQL-based tools.
Horizontal Scale Out
HBase supports auto-sharding, which gives it massive scalability.
A traditional RDBMS must instead scale up, which is costly compared to commodity hardware.
With the help of HBase, Splice Machine scales out rather than up, providing massive scalability across commodity hardware, even up to dozens of petabytes.
Real-Time Updates with Transactions
Splice Machine supports a SQL interface, so it can perform transactions on multiple rows and tables.
How can this happen in real time? Because HBase, a distributed database over Hadoop, allows real-time read/write access through HBase co-processors rather than MapReduce (batch processing).
Transactional consistency is maintained through multi-version concurrency control (MVCC).
Massively Parallel Architecture
Splice Machine delivers massive parallelization by placing its parser, planner, and optimizer on
each HBase RegionServer (which hosts multiple regions) and an executor on each HBase region,
pushing computation down to each distributed data shard (HBase region).
Splice Machine provides high performance through massively parallel processing, pushing
predicates, joins, aggregations, and complex queries down to the data shards.
For parallelized query execution, Splice Machine uses HBase co-processors for distributed
computation on data stored in the Hadoop Distributed File System (HDFS).
How Is Splice Machine Different From Other SQL on Hadoop?
Splice Machine is a fully operational database on Hadoop that supports:
A. Real-Time Updates
B. Transactions
C. Analytics
D. Rich SQL Support (ANSI SQL-99)
Other SQL-on-Hadoop solutions, such as Hortonworks Stinger, Apache Drill, and Cloudera Impala, are
query/analytics engines with limited SQL support, no transactions, and no real-time updates.
Splice Machine Architecture
Proven Building Blocks: HBase/Hadoop and Derby
Splice Machine marries two proven technology
stacks: Apache Derby and HBase/Hadoop.
A. Apache Derby: Java-based, ANSI SQL database
• Java based
• ANSI SQL-99
• Lightweight 2.6 MB footprint
B. Apache HBase/HDFS
• Auto-sharding
• Data replication
• Scalability to 100s of PB
• Real-time updates
Apache Derby
• 100% Java, ANSI SQL RDBMS (client and embedded)
• Java stored procedures
• Full transaction isolation
• 2.6 MB footprint
• Custom functions
• Authentication and authorization
• Lock-based concurrency
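To make "client and embedded" concrete, here is a minimal sketch of embedded Derby use over JDBC; the database name, table, and data are hypothetical, and the Derby JAR is assumed to be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class EmbeddedDerbyDemo {
    public static void main(String[] args) throws Exception {
        // Embedded mode: the Derby engine runs inside this JVM.
        // "demoDB" is a hypothetical database; create=true creates it on first use.
        Connection conn = DriverManager.getConnection("jdbc:derby:demoDB;create=true");
        try (Statement st = conn.createStatement()) {
            st.executeUpdate("CREATE TABLE t (id INT, name VARCHAR(32))");
            st.executeUpdate("INSERT INTO t VALUES (1, 'a'), (2, 'b')");
            try (ResultSet rs = st.executeQuery("SELECT id, name FROM t")) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + " " + rs.getString("name"));
                }
            }
        }
        conn.close();
    }
}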
Splice Modifications to Derby

Derby Component   Derby                            Splice Machine
Store             Block file based                 HBase
Index             B-tree                           Dense index in HBase
Concurrency       Lock based                       MVCC
Join plan         Centralized: hash, nested loop   Distributed: sort merge, merge, nested loop, broadcast
How Do Derby and HBase Work Together in Splice Machine?
Splice Machine replaces Apache Derby's block-file-based storage engine with HBase.
Splice Machine keeps the Apache Derby parser and redesigns the planner, optimizer, and executor so that they work well with, and take advantage of, distributed HBase computation.
This redesign enables the Splice Machine database to achieve massively parallel processing by pushing computation down to each HBase region on its RegionServer and using HBase co-processors for data computation in HDFS.
A client sends a SQL query to the Apache Derby parser; the query then flows to the redesigned planner, optimizer, and executor, which reside in the HBase regions.
Because Apache Derby is Java based, each RegionServer references local JAR files for the parser, planner, and optimizer, and each region on a RegionServer references a local JAR file for the executor.
Splice SQL Processing
The parser is the same as Apache Derby's; Splice does not redesign it. Consider preparing a statement:
PreparedStatement ps = conn.prepareStatement("SELECT * FROM T WHERE ID = ?");
Statement processing then proceeds as follows:
1. Look up the statement in the cache using a text match.
• If it is found, skip the five steps below.
• Otherwise, perform them:
2. Parse with a JavaCC-generated parser (Java Compiler Compiler) into an abstract syntax tree.
3. Bind all tables associated with the query.
4. Optimize the plan based on I/O cost, communication cost, disk usage, and feasible join strategies.
5. Generate code to represent the statement plan.
6. Load the class and create an instance to represent the connection's state for the query.
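As a sketch of the client side of this pipeline, the JDBC snippet below prepares and executes the statement above; the connection URL, credentials, and the contents of table T are assumptions for illustration (Splice Machine exposes a Derby-derived JDBC client driver):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SpliceQueryDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection URL; 1527 is the usual Derby client port.
        Connection conn = DriverManager.getConnection(
                "jdbc:splice://localhost:1527/splicedb", "user", "password");
        // Preparation triggers the cache lookup / parse / bind / optimize /
        // code-generation steps described above (or reuses a cached plan).
        PreparedStatement ps = conn.prepareStatement("SELECT * FROM T WHERE ID = ?");
        ps.setInt(1, 42); // bind the parameter
        try (ResultSet rs = ps.executeQuery()) { // execution runs distributed, region-side
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
        conn.close();
    }
}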
Distributed, Parallelized Query Execution
• Parallel computation across the cluster
• Move computation to the data shard
• Utilize HBase co-processors
• No MapReduce
• Queries use a special "exchange operator" for parallelism
HBase Co-Processors Versus MapReduce for Distributed Computation on Data Stored in HDFS
HBase accesses HDFS directly while maintaining its own metadata, so it can quickly find a single record in HDFS files.
MapReduce is designed for batch data access and is therefore not appropriate for real-time data access.
MapReduce starts a Java Virtual Machine for each query, which can take up to 30 seconds even to retrieve a single record from HDFS files.
Without metadata, MapReduce scans all the data, even if a query needs to access only a few records.
HBase co-processors run on each RegionServer, and each region holds a reference to a co-processor.
Co-processors provide region life-cycle management through open, close, split, flush, and compact operations.
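For readers who have not written one, a minimal observer co-processor sketch follows, assuming the HBase 0.98-era API (BaseRegionObserver; HBase 2.x replaced it with the default-method RegionObserver interface). The class and its counting logic are illustrative only, not Splice Machine's actual co-processors:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

// Loaded into each RegionServer and referenced by its regions, so this code
// runs next to the data shard -- no MapReduce job, no extra JVM startup.
public class RowCountingObserver extends BaseRegionObserver {
    private long puts = 0;

    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                        Put put, WALEdit edit, Durability durability)
            throws IOException {
        puts++; // illustrative: count writes landing in this region
    }
}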
HBase: Proven to Scale Out
• Auto-sharding
• Scales with commodity hardware
• Cost effective from GBs to PBs
• High availability through replication
Support for Secondary Indexes
Often data is organized along one dimension for fast updating (such as a customer number) but
later must be looked up by other dimensions (such as zip code). Secondary indexes enable
databases to look up data across many dimensions efficiently.
Splice Machine uses HBase tables to store each index along with any required data.
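As a concrete sketch in the shell (table, column, and index names are hypothetical; CREATE INDEX follows the standard Derby/ANSI form):

splice> create table customers (customer_no int, zip_code varchar(10));
splice> create index idx_customers_zip on customers (zip_code);
splice> select customer_no from customers where zip_code = '94105'; -- can be served by the secondary index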
Splice Transaction
Splice Machine is a fully transactional database. This allows you to perform actions such as commit
and rollback, which means that in a transactional context, the database does not make changes
visible to others until a commit has been issued.
Here is a simple example. Enter the following commands to see commit and rollback in action:
splice> create table a (i int);
splice> autocommit off; -- puts current shell into a transactional context
splice> insert into a values 1,2,3; -- inserted but not visible to others
splice> commit; -- now committed to the database
splice> select * from a;
splice> insert into a values 4,5;
splice> rollback; -- 4 and 5 rolled back
splice> select * from a; ...
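The same semantics are available from application code; here is a minimal JDBC sketch against the table above (connection URL and credentials are assumptions, as before):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SpliceTxnDemo {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:splice://localhost:1527/splicedb", "user", "password");
        conn.setAutoCommit(false); // enter a transactional context
        try (Statement st = conn.createStatement()) {
            st.executeUpdate("INSERT INTO a VALUES 1, 2, 3");
            conn.commit();   // 1, 2, 3 are now visible to other transactions
            st.executeUpdate("INSERT INTO a VALUES 4, 5");
            conn.rollback(); // 4 and 5 are discarded
        }
        conn.close();
    }
}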
Snapshot Isolation In Transaction
Snapshot isolation is a guarantee that all reads made in a transaction will see a consistent snapshot
of the database (in practice it reads the last committed values that existed at the time it started), and
the transaction itself will successfully commit only if no updates it has made conflict with any
concurrent updates made since that snapshot. Such a write-write conflict will cause the transaction to
abort.
Snapshot isolation is implemented using multi-version concurrency control (MVCC).
• MVCC is a common way to increase concurrency and performance: each write generates a new version of a database object, and a transaction's reads are served from the most recent relevant committed versions of the objects it touches.
In a write skew anomaly, two transactions (T1 and T2) concurrently read an overlapping data set
(e.g. values V1 and V2), concurrently make disjoint updates (e.g. T1 updates V1, T2 updates V2), and
finally concurrently commit, neither having seen the update performed by the other. Were the system
serializable, such an anomaly would be impossible, as either T1 or T2 would have to occur "first", and
be visible to the other. In contrast, snapshot isolation permits write skew anomalies.
Example of Snapshot Isolation
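The original slide showed this as a diagram; the hedged shell sketch below reconstructs a classic write-skew scenario (table, columns, and values are hypothetical):

-- Invariant the application wants: v1 + v2 >= 1; initially v1 = v2 = 1.
-- Session 1 (autocommit off):
splice> select v1, v2 from t;  -- sees snapshot (1, 1)
splice> update t set v1 = 0;   -- writes only v1
-- Session 2 (autocommit off), running concurrently:
splice> select v1, v2 from t;  -- also sees snapshot (1, 1)
splice> update t set v2 = 0;   -- writes only v2, disjoint from session 1
-- Both commit; the write sets do not overlap, so no write-write conflict is detected:
splice> commit;  -- session 1
splice> commit;  -- session 2
-- Final state is (0, 0), violating v1 + v2 >= 1: a write skew anomaly.
-- A serializable system would have aborted one of the two transactions.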
Splice Machine Supports Distributed Transactions
Splice Machine has added an asynchronous write pipeline to HBase, which allows maximum write parallelization across HBase nodes.
Splice Machine also has nested sub-transactions to ensure that a region-level failure does not force a restart of the whole transaction.
• Example: suppose there is a 10 TB update transaction. It acts as a single parent transaction, and when it is divided among the shards it becomes a nested sub-transaction for each shard, so a failure at the region level typically restarts only a few GB instead of 10 TB.
Splice Machine Efficiency
Can it efficiently handle sparse data?
• In many large data sets, each attribute or column may be sparsely populated. In traditional databases,
an empty value must still be stored as a null, which consumes storage. Modern databases should not
require nulls for empty values.
Can you add a column without table scans?
• Data requirements change frequently and often require schema changes. Adding a column should not
require full table scans.
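A sketch of such a schema change in the shell (table and column are hypothetical); in a sparse, HBase-backed store this can be a metadata-only operation:

splice> alter table customers add column loyalty_tier varchar(16);
-- existing rows simply store nothing for the new column: no nulls are
-- materialized and no full table scan is required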
Splice Machine Performance
Does it support secondary indexes?
• Often data is organized along one dimension for fast updating (such as a customer number) but later
must be looked up by other dimensions (such as zip code). Secondary indexes enable databases to
look up data across many dimensions efficiently.
Does it provide multiple join strategies?
• Joins combine data from multiple tables. With a distributed infrastructure like Hadoop that handles very
large data sets, multiple join strategies such as nested loop, sort-merge, and broadcast joins are needed
to ensure fast join performance.
Is there a cost-based optimizer?
• Performance on large data sets depends heavily on choosing the right execution strategy. Simple
rule-based optimizers are not enough; cost-based optimizers, which estimate the actual cost of
executing a query, are critical to optimal query performance.
Splice Machine Feature in an Upcoming Release
In many applications, certain attributes on a record may be visible to one user but not to
another. For instance, in an HR application, a CEO may see the salary field while most
employees would not. Many applications control data access directly, but column-level security
is an advanced database feature that enables the database itself to control which fields a user can
view. Splice Machine will be adding this feature in an upcoming release.