This document discusses the challenges of delivering Hybrid Transactional and Analytical Processing (HTAP) workloads from a single database. Some key challenges include:
- Having a single query engine that can efficiently support both transactional and analytical workloads with different data structures, statistics needs, query types, etc.
- Supporting multiple different storage engines for things like transactions, analytics, and mixed workloads.
- Accommodating different data models optimized for transactions versus analytics within a single data model.
- Providing enterprise capabilities like high availability, security, and manageability for hybrid transactional and analytical systems.
In Search of Database Nirvana: Challenges of Delivering HTAP
1. In search of database nirvana
The challenges of delivering Hybrid Transactional and Analytical Processing
Rohit Jain, CTO
rohit.jain@esgyn.com
(C) Copyright 2015 Esgyn Corporation Esgyn Confidential
2. Agenda
The swinging database pendulum
Hybrid Transaction and Analytical Processing (HTAP) Workloads
Query versus storage engines
The challenges of HTAP
◦ Single query engine for all workloads
◦ Supporting multiple storage engines
◦ Same data model for all workloads
◦ Enterprise-caliber capabilities
Conclusion
3. The swinging database pendulum
RDBMS NoSQL
• TCO
• Elastic scalability
• High performance
• Semi-structured & unstructured data
• Parallelization of user code
• Schema flexibility
• Modest needs
Polyglot programming & persistence
• graph database
• document stores
• text search
• column stores
• key value stores
• wide column stores
• Too many languages, interfaces, APIs, & data structures
• Too much of gluing technologies together
• Compatibility between different versions
• No end-to-end view of workload performance
• Support contracts with multiple vendors
• Too many skills required to develop and manage
• Too much data movement
• No single solution for varied interfaces & use cases
SQL
• Skills prevalent
• Existing tools & applications
• Transaction support useful
• More efficient when joins needed
• Easier than coding M/R
• Merit in rigor of pre-defining columns
• Uniform metadata across applications
4. Hybrid Transaction and Analytical Processing (HTAP) Workloads
OLTP
• Mostly transactional
• Sub-second response
• Customer experience
• Large update volume
• High concurrency
• Scales linearly
• Normalized data model
• Custom applications or 3rd party solutions
• Mostly SMP; MPP for web-scale
• Keyed updates/queries
ODS
• Can be transactional
• Sub-second to seconds
• Customer experience or business internal
• Batch to streaming feeds from OLTP
• Low update volume
• Low concurrency if internal, high otherwise
• Near linear scale
• Historical data
• Normalized data model
• Custom apps / 3rd party
• Keyed queries
BI
• Non-transactional
• Seconds to minutes
• Business internal
• Batch to streaming feeds from OLTP/ODS
• No direct updates
• Low to high concurrency
• Less linear in scale
• Historical data
• Dimension data model
• BI tools – reporting & dashboards
• Ad hoc & scheduled queries and large extracts
Analytics
• Non-transactional
• Minutes to hours
• Business internal
• Batch/aggregates from BI
• No direct updates
• Low concurrency
• Complex queries, non-linear scale
• Historical & big data
• Columnar store
• Analytics in database
• Analytical tools
• Ad hoc queries
Essential to operate the business
To improve performance of the company
5. Query versus storage engines
Hadoop Cluster
Switch Switch
Operational Business Intelligence Analytics
Query Engine
• Allow clients to connect & submit queries
• Distribute connections across cluster
• Compile query
• Execute query
• Return results of query to client
Storage Engine
• Storage structure
• Partitioning
• Automatic data repartitioning
• Select columns
• Select rows based on predicates
• Caching writes and reads
• Clustering by key
• Fast access paths or filtering
• Transactional support
• Replication
• Compression & Encryption
• Mixed workload support
• Bulk data ingest/extract
• Indexing
• Colocation or node locality
• Data Governance
• Security
• Disaster recovery
• Backup, Archive, Restore
• Multi-temperature data support
In-memory
Single Query Engine
6. The challenges of HTAP
Single query engine for all workloads
Data structure – key support, clustering, partitioning
Statistics
Predicates on non-leading or non-key columns
Indexes and materialized views
Degree of parallelism
Reducing the search space
Join type
Data flow and access
Mixed Workload
Feature support
80 minutes 2 minutes
Equal-height histograms
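The slide names equal-height histograms as the statistics structure. As a rough illustration only (not the optimizer's actual implementation), the sketch below builds bucket boundaries so that each bucket holds about the same number of rows, then uses them to estimate the selectivity of a `value <= x` predicate. The function names and the sample data are assumptions made for this example.

```python
from bisect import bisect_right


def equal_height_histogram(values, num_buckets):
    """Return bucket upper bounds so each bucket holds roughly the same row count."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return []
    bounds = []
    for b in range(1, num_buckets + 1):
        # Index of the last row that falls into bucket b.
        idx = min(n - 1, (b * n) // num_buckets - 1)
        bounds.append(ordered[max(idx, 0)])
    return bounds


def estimate_selectivity_le(bounds, x):
    """Estimate the fraction of rows with value <= x, at bucket granularity."""
    if not bounds:
        return 0.0
    # Every bucket whose upper bound is <= x contributes 1/len(bounds) of the rows.
    full_buckets = bisect_right(bounds, x)
    return full_buckets / len(bounds)


# Hypothetical skewed column: 100 rows, half of them equal to 1.
skewed = [1] * 50 + list(range(2, 52))
bounds = equal_height_histogram(skewed, 4)
print(bounds)                               # [1, 1, 26, 51]
print(estimate_selectivity_le(bounds, 1))   # 0.5
```

Because the buckets follow the data rather than the value range, the heavily repeated value 1 occupies two whole buckets, so the skew is visible to the estimate.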
7. The challenges of HTAP
Single query engine for all workloads
Week        Item  Store  …
01/07/2016  1     1      …
01/07/2016  1     3      …
01/07/2016  1     5      …
01/07/2016  2     34     …
01/07/2016  3     13     …
01/07/2016  3     3      …
01/07/2016  4     2      …
01/07/2016  4     4      …
01/14/2016  1     2      …
01/14/2016  1     4      …
01/14/2016  1     5      …
01/14/2016  1     35     …
01/14/2016  3     1      …
01/14/2016  3     20     …
Where is item = 1, Stores 2 through 5?
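The question on this slide has predicates on Item and Store but none on Week, the leading key column. One way to answer it without a full scan is to jump from one distinct Week value to the next and seek within each. The sketch below is an illustration under assumptions: the rows are the ones shown on the slide, the clustering key is taken to be (Week, Item, Store), dates are rewritten as ISO strings so that tuple order matches date order, and `skip_scan` is a hypothetical name.

```python
from bisect import bisect_left, bisect_right

# Rows from the slide, sorted on the assumed composite key (week, item, store).
ROWS = sorted([
    ("2016-01-07", 1, 1), ("2016-01-07", 1, 3), ("2016-01-07", 1, 5),
    ("2016-01-07", 2, 34), ("2016-01-07", 3, 13), ("2016-01-07", 3, 3),
    ("2016-01-07", 4, 2), ("2016-01-07", 4, 4),
    ("2016-01-14", 1, 2), ("2016-01-14", 1, 4), ("2016-01-14", 1, 5),
    ("2016-01-14", 1, 35), ("2016-01-14", 3, 1), ("2016-01-14", 3, 20),
])


def skip_scan(rows, item, store_lo, store_hi):
    """Find rows for one item and a store range with no predicate on week."""
    matches = []
    pos = 0
    while pos < len(rows):
        week = rows[pos][0]
        # Seek directly to the first qualifying key within this week.
        start = bisect_left(rows, (week, item, store_lo), lo=pos)
        end = bisect_right(rows, (week, item, store_hi), lo=start)
        matches.extend(rows[start:end])
        # Skip the rest of this week: (week, inf) sorts after all of its rows.
        pos = bisect_right(rows, (week, float("inf")), lo=end)
    return matches


for row in skip_scan(ROWS, item=1, store_lo=2, store_hi=5):
    print(row)
```

Each week costs a few binary searches instead of a read of every row, which is the benefit of handling predicates on non-leading key columns this way when the leading column has few distinct values.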
8. The challenges of HTAP
Single query engine for all workloads
Serial vs parallel plans
[Figure: Client Application; Ethernet; Node 1, Node 2, … Node n; Master and multi-fragment Master processes; ESPs; HBase Regions 1 to 5 with Filters and Coprocessors; HDFS]
9. The challenges of HTAP
Single query engine for all workloads
[Figure: queries Qry1 to Qry7]
10. The challenges of HTAP
Single query engine for all workloads
Adaptive and parallel joins
• Nested join
• Probe cache for nested join
• Merge join
• Matching partition join
• Repartitioned hash join
• Replication by broadcast hash join
• Inner / outer child broadcast
• Dimensional schema star join
• Inner join
• Left Join
• Right Join
• Full Outer Join
• Self join
Cost premiums for nested joins or serial plans
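Of the join strategies listed on this slide, the repartitioned hash join is perhaps the easiest to sketch. The example below is a single-process illustration, not how any particular engine implements it: both inputs are split by a hash of the join key so that matching rows land in the same partition, and each pair of partitions is then joined independently. All names and the sample rows are assumptions.

```python
from collections import defaultdict


def repartition(rows, key_index, num_partitions):
    """Route each row to a partition chosen by hashing its join key."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key_index]) % num_partitions].append(row)
    return partitions


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """In-memory hash join on one partition: build a table, then probe it."""
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)
    return [b + p for p in probe_rows for b in table.get(p[probe_key], [])]


def repartitioned_hash_join(left, right, left_key, right_key, num_partitions=4):
    left_parts = repartition(left, left_key, num_partitions)
    right_parts = repartition(right, right_key, num_partitions)
    result = []
    # Equal keys hash to the same partition number, so each pair of partitions
    # can be joined on its own (on a cluster, by a separate process per pair).
    for lp, rp in zip(left_parts, right_parts):
        result.extend(hash_join(lp, rp, left_key, right_key))
    return result


# Hypothetical data: (item_id, name) joined to (item_id, quantity).
items = [(1, "widget"), (2, "gadget")]
sales = [(1, 10), (1, 20), (3, 5)]
print(sorted(repartitioned_hash_join(items, sales, 0, 0)))
```

A broadcast hash join would instead copy the smaller input to every partition of the larger one, which avoids moving the large table at the cost of replicating the small one.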
11. The challenges of HTAP
Single query engine for all workloads
[Figure labels: Compute Cost; Execution Environment; Physical Properties; Estimates Confidence; Cardinality, Distribution, Correlation; Sensitivity to Estimates; Evaluate Risk; Risk Adjustment; Benefit; Risk]
Risk Premiums
• Nested join 20%
• Merge join 10%
• Serial plan 5%
?
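To make the risk premiums concrete, here is a minimal sketch of how an optimizer might inflate the estimated cost of plans that use riskier operators before comparing them. The percentages come from the slide; the way they are combined (multiplicatively) and the names `risk_adjusted_cost` and `choose_plan` are assumptions for illustration, not the product's documented behaviour.

```python
# Premiums from the slide, expressed as fractions of the estimated cost.
RISK_PREMIUM = {"nested_join": 0.20, "merge_join": 0.10, "serial_plan": 0.05}


def risk_adjusted_cost(estimated_cost, traits):
    """Inflate a plan's estimated cost by the premium of each risky trait it uses."""
    cost = estimated_cost
    for trait in traits:
        # Assumption: premiums compound multiplicatively; unknown traits add nothing.
        cost *= 1.0 + RISK_PREMIUM.get(trait, 0.0)
    return cost


def choose_plan(candidates):
    """Pick the (name, estimated_cost, traits) candidate with the lowest adjusted cost."""
    return min(candidates, key=lambda c: risk_adjusted_cost(c[1], c[2]))


# Hypothetical candidates: the nested join looks cheaper before adjustment.
candidates = [
    ("nested join plan", 100.0, ["nested_join"]),
    ("hash join plan", 110.0, []),
]
print(choose_plan(candidates)[0])  # hash join plan
```

The nested join plan is cheaper on paper (100 versus 110), but after its 20% premium it costs 120, so the plan that is less sensitive to a bad cardinality estimate wins.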
12. The challenges of HTAP
Single query engine for all workloads
Mixed Workload
• Priority / SLA based execution
• Allocation of resources by service level
• Decrease priority with usage increase
• Anti-starvation / switch between queries based on priority
[Figure: Query Low, Query Medium, Query High; a Queue holding Low, Medium, and High entries in front of each of HBase Region 1, Region 3, … Region 5; a Memstore per HBase region]
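Two of the mixed-workload bullets on this slide, decreasing priority as usage increases and switching between queries to avoid starvation, can be sketched with a priority queue. This is a toy single-threaded model under assumptions: every query runs in fixed time slices, and `PriorityScheduler` and `usage_penalty` are hypothetical names.

```python
import heapq
from itertools import count


class PriorityScheduler:
    def __init__(self, usage_penalty=1.0):
        self._heap = []
        self._tie = count()          # keeps submission order among equal priorities
        self.usage_penalty = usage_penalty

    def submit(self, name, priority, work_units):
        # heapq is a min-heap, so store the negated priority.
        heapq.heappush(self._heap, (-priority, next(self._tie), name, work_units))

    def run(self):
        """Run one time slice at a time; return the order in which slices were granted."""
        order = []
        while self._heap:
            neg_priority, _, name, remaining = heapq.heappop(self._heap)
            order.append(name)
            remaining -= 1
            if remaining > 0:
                # Lower the priority as usage grows, so a long-running query
                # eventually yields to queries that started at a lower priority.
                heapq.heappush(
                    self._heap,
                    (neg_priority + self.usage_penalty, next(self._tie), name, remaining),
                )
        return order


scheduler = PriorityScheduler()
scheduler.submit("analytics", priority=3, work_units=3)
scheduler.submit("report", priority=2, work_units=2)
print(scheduler.run())
# ['analytics', 'report', 'analytics', 'report', 'analytics']
```

Without the usage penalty, the higher-priority query would take all three slices before the other ran at all, which is the starvation the slide warns about.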
13. The challenges of HTAP
Supporting multiple storage engines
Statistics
Key structure
Partitioning
Data type support
Projection and selection
Extensibility
Security enforcement
Transaction Management
Metadata support
Performance, scale, and concurrency considerations
Error handling
Other operational aspects
Single-Master Multiple-Masters
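One item in the list above, projection and selection, shows why supporting several storage engines is hard: a query engine has to know what each engine can do for itself. The sketch below is purely hypothetical (it is not the API of HBase or of any other product). It models one engine that can apply a predicate itself and one that cannot, and shows how many rows each has to hand back to the query engine.

```python
class StorageEngine:
    """Hypothetical storage engine wrapper, used only for this illustration."""

    def __init__(self, rows, supports_pushdown):
        self.rows = rows
        self.supports_pushdown = supports_pushdown
        self.rows_shipped = 0  # rows handed back to the query engine

    def scan(self, columns, predicate=None):
        out = [
            {c: row[c] for c in columns}
            for row in self.rows
            if predicate is None or predicate(row)
        ]
        self.rows_shipped += len(out)
        return out


def run_query(engine, columns, predicate, predicate_columns):
    if engine.supports_pushdown:
        # Selection and projection both happen inside the storage engine.
        return engine.scan(columns, predicate)
    # Otherwise fetch every row, including the predicate's columns, and
    # filter and project in the query engine.
    needed = list(dict.fromkeys(list(columns) + list(predicate_columns)))
    fetched = engine.scan(needed)
    return [{c: r[c] for c in columns} for r in fetched if predicate(r)]


data = [
    {"item": 1, "store": 3, "qty": 7},
    {"item": 2, "store": 34, "qty": 1},
    {"item": 1, "store": 5, "qty": 2},
]
with_pushdown = StorageEngine(data, supports_pushdown=True)
without_pushdown = StorageEngine(data, supports_pushdown=False)


def is_item_one(row):
    return row["item"] == 1


print(run_query(with_pushdown, ["store"], is_item_one, ["item"]), with_pushdown.rows_shipped)
print(run_query(without_pushdown, ["store"], is_item_one, ["item"]), without_pushdown.rows_shipped)
```

Both calls return the same result, but the engine without pushdown ships all three rows instead of two, a gap that grows with table size.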
14. The challenges of HTAP
Same data model for all workloads
Normal form
• 1NF
• 2NF
• 3NF
• BCNF
• 4NF
• 5NF
• 6NF
Star Schema
Snowflake Schema
Normal Form
Query engine integration with storage engine(s) to support all these data models
15. The challenges of HTAP
Same data model for all workloads
NoSQL Data Models
“NoSQL Data Modeling Techniques”
by Ilya Katsov
Highly Scalable Blog
… and these!
16. The challenges of HTAP
Enterprise-caliber capabilities
High Availability
Security
Manageability
• Percentage of uptime: 99.99% (52.56 minutes of downtime per year) to 99.999% (5.26 minutes of downtime per year)
• Online operations (data available for reads and writes)
o Upgrading the OS
o Upgrading the file system
o Upgrading the storage engine
o Upgrading the query engine
o Redistributing data to accommodate node and/or disk expansions and contractions
o Changing table definitions, e.g. data type changes, and adding, dropping, renaming columns
o Creating/dropping secondary indexes
o Full and incremental backups
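The downtime figures quoted for 99.99% and 99.999% uptime follow from simple arithmetic over a 365-day year. The snippet below reproduces them; the function name is an assumption.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(uptime_percent):
    """Minutes of allowed downtime per year for a given uptime percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


for target in (99.99, 99.999):
    print(f"{target}% uptime allows {downtime_minutes_per_year(target):.2f} minutes of downtime per year")
```

Every planned operation in the list above (upgrades, redistribution, schema changes, backups) has to fit inside that budget or run online, which is why online operations matter so much at these availability levels.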
17. The challenges of HTAP
Enterprise-caliber capabilities
Schema Management Performance Management Monitoring Security Management BAR Management
Object Management Performance Monitoring Database Monitor User Management Backup Analysis
Graphical Object Editor Live Performance Monitoring Event Monitoring Role Management Recovery
Cross-Platform Schema Knowledge Data Repository Live Event Monitoring Account Migration Log Backup
Bottleneck Analysis Threshold Alerts Audit Report Backup Reports
SQL Management Job/Workload Analysis Health Index Alarm Archival
Query Builder Job/Workload Wizard Live Health Monitoring
Visual Difference Tool Job/Workload Management Response Times Maintenance Configuration Management
Data Management Live Job/Workload Monitoring
Alert Center Repository Aging OS Provisioning
Data Migration OS Analysis Remote Monitoring Automated Maintenance Cluster Provisioning
SQL Profiler Capacity Capture Central Monitoring Instance Provisioning
Automated Import Capacity Trending Hardware Inventory Change Management Cloud Provisioning
Visual Explain Plans Capacity Forecast Hardware Monitoring Schema Capture Configuration Editor
Session Management Space Management Schema Compare and Synch
Lock Management Reorganization Management Troubleshooting Notifications
Process Management Query Cost Simulation Health Analysis Schema Rotation
Consistency Checks Historical Reports Problem Correlation Collaboration
Online Schema Evolution Bottleneck Tuning Automated Actions Virtual Changes
Built-In Automation Access Path Analysis
18. The challenges of HTAP
Enterprise-caliber capabilities
• Operational performance by transactions per second
• Analytical performance by query
• Overhead of gathering metrics on operational and analytical workloads
• Configurable statistics collection
• Workload management by Service Level Objectives
o Based on priority and/or resource allocation
o High priority operational workloads vs analytical workloads
• End-to-end visibility of transaction and query metrics
• Metric breakdown to the level of individual query operations
• Metrics for table access across workloads down to the partition level
• Skew or bottlenecks
• Integration with YARN
19. Conclusion
Pre-register for full O’Reilly report:
http://www.oreilly.com/go/dbnirvana
It ain’t easy!!
Very few products can even come close
Any guesses?