2. Solved how to scale transactions to very large scale (e.g., 100 million update transactions per second) in a fully seamless way
Breakthrough result of 15+ years of research by a tenacious team
Ultra-Scalable Transactions
3. Scalability & Performance
Scale out linearly from 1 to 100s of nodes
Full SQL
Simple and powerful queries
Full ACID
Always consistent
Patented Ultra-Scalable Transactional Management
Scale to millions of update transactions per second
Ultra-Efficient Storage Engine
Runs efficiently on today's multi-core and NUMA hardware
4. Evaluation of the transactional manager only (no data manager, no logging)
16 nodes with 12 cores and 128 GB each
2.35 million transactions per second
Scalability
7. Analytical Queries over Operational Data
Current Landscape at Enterprises:
× ETL costs represent 80% of the cost of business analytics
× Analytical queries run on obsolete data
(Figure: data is copied from the Operational Database (OLTP) to the Data Warehouse (OLAP) through an ETL copy process)
Solution: Blending Two Worlds
Real-Time Analytics: Analytical Queries over Operational Data
LeanXcale: OLTP + OLAP
8. Dealing with the Polyglot World
× Data silos: lack of queries across SQL & NoSQL data stores
Solution: Queries across SQL & NoSQL
Queries across SQL & NoSQL data stores:
• SQL, Neo4j, MongoDB, HBase
• Subqueries written in the native query language/API: full power of the underlying data stores.
• Result sets of subqueries exposed as temporary SQL tables.
• Integration query written in simple SQL.
10. The transactional management provides ultra-scalability
Fully transparent:
• No sharding.
• No a priori knowledge required about the rows to be accessed.
• Syntactically: no changes required in the application.
• Semantically: behavior equivalent to a centralized system.
Provides Snapshot Isolation
(the isolation level provided by Oracle when set to "Serializable" isolation)
Transactional Processing
12. (Figure: transaction timeline from start to end, with reads at the start and writes at the end)
Snapshot isolation splits atomicity into two points: one at the beginning of the transaction, where all reads happen, and one at the end of the transaction, where all writes happen (see the sketch below).
Serializability provides a fully atomic view of a transaction: reads and writes happen atomically at a single point in time.
Snapshot Isolation vs. Serializability
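To make the difference concrete, the following minimal Java sketch (purely illustrative, not LeanXcale code) reproduces the classic write-skew anomaly: two transactions read the same snapshot and write disjoint rows, so snapshot isolation lets both commit, whereas a serializable execution would abort or reorder one of them.

    import java.util.HashMap;
    import java.util.Map;

    // Minimal write-skew illustration (hypothetical example, not LeanXcale code).
    // Invariant the application tries to keep: balance(a) + balance(b) >= 0.
    public class WriteSkewDemo {
        public static void main(String[] args) {
            Map<String, Integer> committed = new HashMap<>();
            committed.put("a", 100);
            committed.put("b", 100);

            // Both transactions take the same start-TS snapshot.
            Map<String, Integer> snapshotT1 = new HashMap<>(committed);
            Map<String, Integer> snapshotT2 = new HashMap<>(committed);

            // T1 checks the invariant on its snapshot and withdraws 150 from "a".
            Map<String, Integer> writesT1 = new HashMap<>();
            if (snapshotT1.get("a") + snapshotT1.get("b") - 150 >= 0) {
                writesT1.put("a", snapshotT1.get("a") - 150);
            }
            // T2 checks the invariant on the same snapshot and withdraws 150 from "b".
            Map<String, Integer> writesT2 = new HashMap<>();
            if (snapshotT2.get("a") + snapshotT2.get("b") - 150 >= 0) {
                writesT2.put("b", snapshotT2.get("b") - 150);
            }

            // Snapshot isolation only rejects write-write conflicts; the writesets
            // are disjoint, so both transactions commit.
            committed.putAll(writesT1);
            committed.putAll(writesT2);

            // The invariant is now violated (-50 + -50 < 0); a serializable scheduler
            // would have aborted or reordered one of the two transactions.
            System.out.println("a=" + committed.get("a") + " b=" + committed.get("b"));
        }
    }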
17. Separation of commit from the visibility of committed data
Proactive pre-assignment of commit timestamps to committing transactions
Transactions can commit in parallel because:
• They do not conflict
• They already have their commit timestamp assigned, which determines their serialization order
• Visibility is regulated separately to guarantee that only fully consistent states are read
Detection and resolution of conflicts before commit
Main Principles
18. Transactional Life Cycle: Start
(Figure: the local transaction manager requests the start timestamp from the Snapshot Server)
The local transaction manager gets the "start TS" from the Snapshot Server.
19. Transactional Life Cycle: Execution
(Figure: the local transaction manager runs the transaction on the start-TS snapshot; writes are checked by the Conflict Manager)
The transaction will read the state as of "start TS".
Write-write conflicts are detected by conflict managers on the fly.
20. Transactional Life Cycle: Commit
(Figure: get start TS, run on the start-TS snapshot, then commit)
The local transaction manager orchestrates the commit.
21. Transactional Life Cycle: Commit
(Figure: to commit, the local transaction manager gets the commit TS from the Commit Sequencer, logs the writeset at the Logger, makes the updates public in the Data Store, and reports to the Snapshot Server; sketched below)
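The following is a minimal sketch of the commit steps named in the figure; every class and method name is invented for illustration and is not the actual LeanXcale API. The local transaction manager obtains a commit timestamp, hands the writeset to a logger, makes the updates public in the data store, and reports the used timestamp to the snapshot server.

    import java.util.Map;
    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical sketch of the commit path shown in the figure above; all names are
    // illustrative, not the actual LeanXcale API.
    interface CommitSequencer { long nextCommitTs(); }
    interface WriteLogger { void log(long commitTs, Map<String, byte[]> writeset); }
    interface DataStore { void publish(long commitTs, Map<String, byte[]> writeset); }
    interface SnapshotServer { void reportDurableAndReadable(long commitTs); }

    class LocalTxnManager {
        private final CommitSequencer sequencer;
        private final WriteLogger logger;
        private final DataStore dataStore;
        private final SnapshotServer snapshotServer;

        LocalTxnManager(CommitSequencer s, WriteLogger l, DataStore d, SnapshotServer ss) {
            this.sequencer = s; this.logger = l; this.dataStore = d; this.snapshotServer = ss;
        }

        // Orchestrates the commit of a transaction whose write-write conflicts were already checked.
        long commit(Map<String, byte[]> writeset) {
            long commitTs = sequencer.nextCommitTs();            // 1. get the commit TS
            logger.log(commitTs, writeset);                       // 2. make the writeset durable
            dataStore.publish(commitTs, writeset);                // 3. make the updates public
            snapshotServer.reportDurableAndReadable(commitTs);    // 4. report to the snapshot server
            return commitTs;                                      // visibility is advanced separately
        }

        public static void main(String[] args) {
            AtomicLong ts = new AtomicLong(100);
            LocalTxnManager ltm = new LocalTxnManager(
                    ts::incrementAndGet,
                    (commitTs, ws) -> System.out.println("logged " + ws.keySet() + " @ " + commitTs),
                    (commitTs, ws) -> System.out.println("published @ " + commitTs),
                    commitTs -> System.out.println("reported " + commitTs + " to the snapshot server"));
            ltm.commit(Map.of("row-1", new byte[] { 1 }));
        }
    }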
22. Snapshot Server
(Figure: the Snapshot Server gets reports of durable & readable TSs and of discarded TSs, and keeps track of and reports the most recent consistent TS)
The Snapshot Server keeps track of the most recent snapshot that is consistent:
• Its TS is such that there is no earlier commit TS that is neither durable and readable nor discarded
• That is, it keeps the longest prefix of used/discarded TSs such that there are no gaps
In this way, transactions can commit in parallel while consistency is preserved.
Transactional Life Cycle: Commit
23. (Example over time; see the sketch below)
Sequence of timestamps received by the Snapshot Server: 11, 15, 12, 14, 13
Evolution of the current snapshot at the Snapshot Server: 11, 11, 12, 12, 15
Transactional Life Cycle: Commit
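A minimal sketch (illustrative names, not the product's code) of how a snapshot server could track the longest gap-free prefix of reported timestamps; fed the sequence 11, 15, 12, 14, 13 from the slide above, it reports the snapshots 11, 11, 12, 12, 15.

    import java.util.TreeSet;

    // Illustrative sketch of the snapshot server's bookkeeping: the current snapshot
    // is the highest TS such that every TS up to it has been reported as durable &
    // readable (or discarded), i.e. the longest prefix without gaps.
    public class SnapshotTracker {
        private long currentSnapshot;                            // highest gap-free TS so far
        private final TreeSet<Long> pending = new TreeSet<>();   // reported TSs above the prefix

        public SnapshotTracker(long initialSnapshot) {
            this.currentSnapshot = initialSnapshot;
        }

        // Called when a commit TS becomes durable & readable, or is discarded.
        public synchronized long report(long ts) {
            pending.add(ts);
            // Extend the gap-free prefix as far as possible.
            while (pending.contains(currentSnapshot + 1)) {
                pending.remove(currentSnapshot + 1);
                currentSnapshot++;
            }
            return currentSnapshot;
        }

        public static void main(String[] args) {
            SnapshotTracker tracker = new SnapshotTracker(10); // TSs up to 10 already consistent
            for (long ts : new long[] { 11, 15, 12, 14, 13 }) {
                System.out.println("received " + ts + " -> snapshot " + tracker.report(ts));
            }
            // Prints the snapshots 11, 11, 12, 12, 15, as in the example above.
        }
    }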
24. There can be as many conflict managers as needed; they scale in the same way as hash-based key-value data stores.
Because concurrency control is done at the conflict managers, which are far fewer than the data managers, batching is much more effective.
With TPC-C, the ratio of query engine/region server nodes to concurrency-management nodes is 20 to 1 (resulting in 20 times more effective batching).
Each conflict manager takes care of a set of keys (see the sketch below).
Conflict Managers
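A minimal sketch, with invented names, of how keys could be hash-partitioned across conflict managers and how a conflict manager could detect write-write conflicts on the fly; in this simplified version the first writer of a key wins and a concurrent second writer is rejected.

    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative sketch (invented names, not the real implementation): keys are
    // hash-partitioned across conflict managers, and each conflict manager detects
    // write-write conflicts among concurrent transactions for its own set of keys.
    class ConflictManager {
        // key -> transaction currently allowed to write it
        private final ConcurrentHashMap<String, Long> writers = new ConcurrentHashMap<>();

        // Returns true if txnId may write the key, false on a write-write conflict.
        boolean tryWrite(String key, long txnId) {
            Long owner = writers.putIfAbsent(key, txnId);
            return owner == null || owner == txnId;
        }

        // Releases the keys of a transaction once it commits or aborts.
        void release(long txnId) {
            writers.values().removeIf(owner -> owner == txnId);
        }
    }

    class ConflictManagerRouter {
        private final ConflictManager[] managers;

        ConflictManagerRouter(int numManagers) {
            managers = new ConflictManager[numManagers];
            for (int i = 0; i < numManagers; i++) managers[i] = new ConflictManager();
        }

        // Routes a key to its conflict manager, like a hash-based key-value store.
        ConflictManager managerFor(String key) {
            return managers[Math.floorMod(key.hashCode(), managers.length)];
        }

        public static void main(String[] args) {
            ConflictManagerRouter router = new ConflictManagerRouter(4);
            ConflictManager cm = router.managerFor("row-42");
            System.out.println(cm.tryWrite("row-42", 1L));  // true: txn 1 may write the row
            System.out.println(cm.tryWrite("row-42", 2L));  // false: write-write conflict with txn 1
            cm.release(1L);
        }
    }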
25. Each logger takes care of a fraction of the log records.
Loggers log in parallel and are uncoordinated.
There can be as many loggers as needed to provide the IO bandwidth required to log the rate of updates.
Loggers can be replicated. If so, durability can be configured as:
• In the memory of a majority of logger replicas (replicated-memory durability)
• In the persistent storage of one logger replica (1-safe durability)
• In the persistent storage of a majority of logger replicas (n-safe durability)
The client gets the commit reply only after the writeset is durable with respect to the configured durability (see the sketch below).
Loggers
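A small sketch, under assumed names and semantics, of how the commit reply could be gated on the configured durability level of a replicated logger.

    // Illustrative sketch of the three durability configurations listed above.
    enum Durability { REPLICATED_MEMORY, ONE_SAFE, N_SAFE }

    class DurabilityPolicy {
        private final Durability mode;
        private final int replicas; // number of logger replicas

        DurabilityPolicy(Durability mode, int replicas) {
            this.mode = mode;
            this.replicas = replicas;
        }

        // The commit reply can be sent to the client once the writeset is durable
        // with respect to the configured durability.
        //   inMemoryAcks: replicas holding the log record in memory
        //   onDiskAcks:   replicas that persisted the log record
        boolean isDurable(int inMemoryAcks, int onDiskAcks) {
            int majority = replicas / 2 + 1;
            switch (mode) {
                case REPLICATED_MEMORY: return inMemoryAcks >= majority; // memory of a majority
                case ONE_SAFE:          return onDiskAcks >= 1;          // persistent on one replica
                case N_SAFE:            return onDiskAcks >= majority;   // persistent on a majority
                default:                return false;
            }
        }

        public static void main(String[] args) {
            DurabilityPolicy policy = new DurabilityPolicy(Durability.N_SAFE, 3);
            System.out.println(policy.isDurable(3, 1)); // false: only 1 of 3 replicas persisted
            System.out.println(policy.isDurable(3, 2)); // true: a majority (2 of 3) persisted
        }
    }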
26. The approach described so far is the original, reactive approach.
It results in multiple messages per update transaction.
The adopted approach is proactive (see the sketch below):
• The local transaction managers periodically report the number of committed update transactions per second
• The commit sequencer distributes batches of commit timestamps to the local transaction managers
• The snapshot server periodically receives batches of timestamps (both used and discarded) from the local transaction managers
• The snapshot server periodically reports the most recent consistent snapshot to the local transaction managers
Increasing efficiency
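A minimal sketch (invented names) of the proactive idea: the commit sequencer hands out batches of commit timestamps sized to the reported commit rate, and each local transaction manager consumes timestamps from its local batch instead of asking once per transaction.

    import java.util.concurrent.atomic.AtomicLong;

    // Illustrative sketch of proactive timestamp batching (names are not the real API).
    class BatchingCommitSequencer {
        private final AtomicLong next = new AtomicLong(1);

        // Hands out a contiguous batch [start, start + size) of commit timestamps.
        long[] nextBatch(int size) {
            long start = next.getAndAdd(size);
            return new long[] { start, start + size };
        }
    }

    // Timestamps of a batch that end up unused would later be reported to the
    // snapshot server as discarded, so that its gap-free prefix keeps advancing.
    class BatchingLocalTxnManager {
        private final BatchingCommitSequencer sequencer;
        private long nextTs;                 // next commit TS to hand out locally
        private long batchEnd;               // exclusive end of the current batch
        private int committedPerPeriod;      // reported periodically to size the next batch

        BatchingLocalTxnManager(BatchingCommitSequencer sequencer) {
            this.sequencer = sequencer;
        }

        // Returns a commit TS from the local batch, fetching a new batch when exhausted.
        synchronized long commitTs() {
            if (nextTs >= batchEnd) {
                int size = Math.max(1, committedPerPeriod); // sized from the reported commit rate
                long[] batch = sequencer.nextBatch(size);
                nextTs = batch[0];
                batchEnd = batch[1];
            }
            committedPerPeriod++;
            return nextTs++;
        }

        public static void main(String[] args) {
            BatchingLocalTxnManager ltm = new BatchingLocalTxnManager(new BatchingCommitSequencer());
            for (int i = 0; i < 5; i++) System.out.println("commit TS " + ltm.commitTs());
        }
    }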
27. SQL processing is performed at the SQL engine tier.
A SQL engine instance:
• Transforms SQL code into a query plan
• Optimizes the query plan according to the collected statistics (e.g., cardinality of keys)
• Orchestrates the query plan execution on top of the distributed data store
• Returns the result of the SQL execution to the client (see the JDBC sketch below)
• Keeps the statistics in the data store up to date
The SQL engine has been obtained by forking the Apache Derby query engine (same SQL dialect as DB2).
The scan operators have been modified to access KiVi instead of local storage.
The metadata is stored in KiVi instead of local storage.
Increasing efficiency
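Since the deck mentions a standard JDBC driver, the following client-side sketch shows full SQL with ACID transactions over JDBC; the driver URL, port and credentials are placeholders assumed for illustration, not the documented ones.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // Hypothetical JDBC usage sketch: the URL, port and credentials are placeholders.
    public class JdbcExample {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:leanxcale://host:1529/mydb"; // assumed URL format
            try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
                conn.setAutoCommit(false);
                try (PreparedStatement insert =
                         conn.prepareStatement("INSERT INTO Store (id, location) VALUES (?, ?)")) {
                    insert.setInt(1, 1);
                    insert.setString(2, "Rome");
                    insert.executeUpdate();
                }
                conn.commit(); // full ACID semantics over the standard JDBC API

                try (PreparedStatement query =
                         conn.prepareStatement("SELECT id, location FROM Store WHERE location = ?")) {
                    query.setString(1, "Rome");
                    try (ResultSet rs = query.executeQuery()) {
                        while (rs.next()) {
                            System.out.println(rs.getInt("id") + " " + rs.getString("location"));
                        }
                    }
                }
            }
        }
    }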
29. (Figure: query plan tree with joins on s_id = id and id = w_id, a selection σ location = 'Rome' and color = 'red', and scans over the Store, Catalog and Widget tables)
SQL is translated into a query plan represented as a tree of algebraic operators. Algebraic operators are written in Java plus bytecode.
At the leaves of the query plan there are scan operators that have predicate filtering, aggregation, grouping and sorting capabilities. They have been rewritten to access KiVi instead of local storage. They make it possible to push down all algebraic operators below a join.
SELECT s.id, s.location
FROM Store s
INNER JOIN Catalog c ON s.id = c.s_id
INNER JOIN Widget w ON c.w_id = w.id
WHERE s.location = 'Rome' AND w.color = 'red'
Query Engine
30. Selection Push Down
(Figure: query plan with joins on cat_id = id and inv_id = id; the selections location = 'Rome' (Store) and color = 'red' are pushed down below the joins to the scans of Store, Inventory and Catalog running on Data Engine Instance 1 and Data Engine Instance 2, under a single Query Engine Instance)
SELECT *
FROM Store s, Inventory i, Catalog c
WHERE i.cat_id = c.id
AND s.inv_id = i.id
AND s.location = 'Rome'
AND c.color = 'red'
Selections are pushed down.
31. Aggregation Push Down
(Figure: without push-down, the global aggregation sum(units) runs at the Query Engine Instance over the Inventory fragments stored at Data Engine Instance 1 and Data Engine Instance 2)
SELECT sum(i.units) FROM inventory i
All values travel from the data engine instances to the query engine.
32. Aggregation Push Down
(Figure: with push-down, a local aggregation sum(units) runs at each data engine instance over its Inventory fragment, and the global aggregation combines the per-instance results at the Query Engine Instance)
SELECT sum(i.units) FROM inventory i
A single value travels from each data engine instance to the query engine (see the sketch below).
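A minimal sketch of the push-down idea from the two figures: each data engine instance computes a local sum over its Inventory fragment and only the partial results are combined at the query engine (all names are illustrative).

    import java.util.Arrays;
    import java.util.List;

    // Illustrative sketch of aggregation push-down for "SELECT sum(i.units) FROM inventory i".
    public class AggregationPushDown {
        // Runs at each data engine instance: local aggregation over its fragment.
        static long localSum(List<Long> unitsFragment) {
            return unitsFragment.stream().mapToLong(Long::longValue).sum();
        }

        // Runs at the query engine instance: global aggregation of the partial sums.
        static long globalSum(List<Long> partialSums) {
            return partialSums.stream().mapToLong(Long::longValue).sum();
        }

        public static void main(String[] args) {
            List<Long> instance1 = Arrays.asList(10L, 20L, 30L); // Inventory fragment 1
            List<Long> instance2 = Arrays.asList(5L, 15L);       // Inventory fragment 2
            // Only one value per instance travels to the query engine.
            long total = globalSum(Arrays.asList(localSum(instance1), localSum(instance2)));
            System.out.println("sum(units) = " + total); // 80
        }
    }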
34. LeanXcale's Distributed Storage Engine
• KiVi is a full-ACID and highly efficient relational key-value data store.
• Unlike existing key-value data stores, it has a schema.
• It implements a novel data structure that combines the advantages of B+trees for range queries with those of LSM-trees for random updates and inserts.
35. Disruptive Innovations
Ultra-Scalable Operational Database:
• Scales from 1 to 100s of nodes and to millions of transactions per second
• Full ACID & full SQL
• Standard JDBC driver
Analytical Queries over Operational Data:
• Distributed data warehouse working over operational data
• Real-time analytical queries
Ultra-Efficient Storage Engine:
• Designed to work efficiently on multi-core and many-core hardware
• Ultra NUMA-efficient
36. KiVi has a radically new architecture:
• The NUMA architecture is exploited with a shared-nothing approach.
• A very high level of efficiency is achieved by avoiding multi-threading and its associated costs (context switches, thread synchronization).
• There are no NUMA remote accesses.
KiVi exploits the vectorial and SIMD capabilities of current commodity server hardware, making it possible to process tens of items with a single instruction.
KiVi is also columnar, obtaining columnar acceleration for analytical queries.
KiVi Efficiency
37. Disruptive Innovations
Online Aggregations:
• Removes hotspots by computing aggregates within a transaction without conflicts.
• Aggregate analytical queries become costless single-row queries.
Continuous Dynamic Load Balancing:
• Multi-resource dynamic load balancing.
• Minimizes the footprint for any workload.
• Maximizes performance.
Non-Intrusive Elasticity:
• Grows and shrinks the cluster size as needed to process the incoming load.
• Minimizes operational costs by reducing HW resources to actual needs (in the cloud or on premises).
38. Another key feature of KiVi is its ability to perform dynamic reconfiguration actions without stopping processing, i.e., a data region can be moved while transactions are updating and reading the data.
This is possible thanks to the properties of the transaction manager, which make it possible to move a data region in any partial state and re-apply updates at the target node with idempotence guarantees, so each update is applied exactly once (see the sketch below).
KiVi Dynamic Reconfiguration
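A minimal sketch, with assumed names, of the idempotence idea: the target node remembers the last commit timestamp applied per key, so updates re-sent during a region move take effect exactly once.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch of idempotent update application while a region is moved:
    // an update is applied only if its commit TS is newer than the last one applied
    // to that key, so replayed updates have no effect.
    class RegionReplica {
        private final Map<String, byte[]> data = new HashMap<>();
        private final Map<String, Long> lastAppliedTs = new HashMap<>();

        // Applies an update exactly once, even if it is delivered several times.
        synchronized boolean apply(String key, byte[] value, long commitTs) {
            long previous = lastAppliedTs.getOrDefault(key, Long.MIN_VALUE);
            if (commitTs <= previous) {
                return false; // already applied (or an older version): ignore the replay
            }
            data.put(key, value);
            lastAppliedTs.put(key, commitTs);
            return true;
        }

        public static void main(String[] args) {
            RegionReplica target = new RegionReplica();
            byte[] v = "value".getBytes();
            System.out.println(target.apply("k1", v, 100)); // true: applied
            System.out.println(target.apply("k1", v, 100)); // false: replay is ignored
        }
    }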
39. The problem of multi-resource load balancing is NP-hard.
We have conceived a greedy algorithm that computes solutions close to the optimum in an affordable time.
It is a novel multi-resource algorithm that considers all resources in proportion to their scarcity (a sketch follows).
KiVi Dynamic Load Balancing
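A minimal greedy, scarcity-weighted balancing step is sketched below; the scoring and all names are assumptions for illustration, since the deck does not spell out the actual algorithm.

    // Illustrative greedy sketch (not the actual LeanXcale algorithm): score each node
    // by its resource usage weighted by resource scarcity (scarcer resources weigh more),
    // then move one region from the highest-scored node to the lowest-scored one.
    public class GreedyBalancer {
        // usage[node][resource] in [0,1]; e.g. resource 0 = CPU, 1 = memory, 2 = IO.
        static int[] pickMove(double[][] usage) {
            int nodes = usage.length, resources = usage[0].length;

            // Scarcity weight per resource: the higher its average utilization, the scarcer it is.
            double[] weight = new double[resources];
            for (int r = 0; r < resources; r++) {
                double avg = 0;
                for (double[] node : usage) avg += node[r];
                weight[r] = avg / nodes;
            }

            // Score each node by its scarcity-weighted usage.
            double[] score = new double[nodes];
            int hottest = 0, coolest = 0;
            for (int n = 0; n < nodes; n++) {
                for (int r = 0; r < resources; r++) score[n] += weight[r] * usage[n][r];
                if (score[n] > score[hottest]) hottest = n;
                if (score[n] < score[coolest]) coolest = n;
            }
            // Greedy step: move one region from the hottest node to the coolest node.
            return new int[] { hottest, coolest };
        }

        public static void main(String[] args) {
            double[][] usage = { { 0.9, 0.4, 0.3 }, { 0.2, 0.3, 0.1 }, { 0.5, 0.8, 0.4 } };
            int[] move = pickMove(usage);
            System.out.println("move a region from node " + move[0] + " to node " + move[1]);
        }
    }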
40. When the average utilization of a resource (e.g., CPU) rises above a predefined threshold, a new node is provisioned and the load balancing algorithm then takes care of moving regions to balance the load across the set of nodes.
Similarly, when the average utilization is low enough that a node can be decommissioned while keeping the average utilization below a predefined threshold, the load of one of the nodes is distributed to the rest of the nodes (again using the load balancing algorithm) and that node is decommissioned (see the sketch below).
KiVi Elasticity
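A small sketch of the scale-out / scale-in rules described above; the thresholds and names are chosen only for illustration.

    // Illustrative sketch of the scale-out / scale-in decision described above.
    class ElasticityController {
        private final double scaleOutThreshold;  // e.g. 0.80: provision a node above this
        private final double scaleInThreshold;   // e.g. 0.60: decommission only if we stay below this

        ElasticityController(double scaleOutThreshold, double scaleInThreshold) {
            this.scaleOutThreshold = scaleOutThreshold;
            this.scaleInThreshold = scaleInThreshold;
        }

        // avgUtilization: average utilization of the bottleneck resource (e.g. CPU)
        // nodes: current cluster size
        // returns +1 to provision a node, -1 to decommission one, 0 to keep the cluster as is
        int decide(double avgUtilization, int nodes) {
            if (avgUtilization > scaleOutThreshold) {
                return +1; // provision a node; the load balancer then spreads regions onto it
            }
            if (nodes > 1) {
                // Removing one node spreads its load over the remaining nodes.
                double afterRemoval = avgUtilization * nodes / (nodes - 1);
                if (afterRemoval < scaleInThreshold) {
                    return -1; // move the node's regions away and decommission it
                }
            }
            return 0;
        }

        public static void main(String[] args) {
            ElasticityController controller = new ElasticityController(0.80, 0.60);
            System.out.println(controller.decide(0.85, 4)); //  1: provision a node
            System.out.println(controller.decide(0.40, 4)); // -1: 0.40 * 4 / 3 = 0.53 stays below 0.60
            System.out.println(controller.decide(0.70, 4)); //  0: keep the cluster size
        }
    }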
41. Disruptive Innovations
Efficient for Range Queries & Random Updates:
• As efficient as B+trees for range queries (used by relational DBs).
• As efficient as LSM-trees for random updates/inserts (used by key-value data stores).
Costless Multi-Version Concurrency Control:
• Novel MVCC with almost zero overhead.
• Avoids stop-the-world cleaning of obsolete versions.
• Avoids resource waste for multi-versioning.
Efficient High Availability:
• Active-active (synchronous) replication without contention and without synchronization overhead.
42. It provides active-active (multi-master) replication without hampering performance:
• Novel replication approach that avoids the redundancy of attaining atomicity at both the transactional level and the replication level.
It will provide geo-replication without any penalty in throughput:
• Novel geo-replication algorithm that streams the logs in parallel to the backup data center.
• The backup data center can run read-only transactions in a fully local way.
• The backup data center can run update transactions remotely at the primary data center.
KiVi High Availability
43. Acknowledgements
LeanXcale R&D has been and is partially supported by different funding sources, including the European Union's FP7 and H2020 research and innovation programmes under grants BigDataStack (779747), CloudDBAppliance (732051), Vineyard (687628), CoherentPaaS (611068), LeanBigData (619606), CrowdHealth (727560), Cybele (825355), and Infinitech.
LeanXcale has also been partially funded by the Spanish CDTI and the Spanish Ministry of Economy and Competitiveness in the NEOTEC program under grant SNEO-20151285.