Watch the live presentation on-demand here: https://goo.gl/6RsqrA
When processing very large amounts of data at the speed of thought, performance questions rear their ugly head. Logical data warehouse architectures rival conventional data warehouses in speed while reducing the need to extract, transform, and load the data.
Watch this Denodo DataFest 2017 session to discover:
• The perks of a logical data warehouse vs. a physical data warehouse.
• Common myths about logical data warehouse performance, challenged.
• Denodo's dynamic query optimizer tool.
What is a Logical Data Warehouse?
A logical data warehouse is a data system that follows the ideas
of traditional EDW (star or snowflake schemas) and includes, in
addition to one (or more) core DWs, data from external sources.
The main objectives are to improve decision making and/or cost
reduction.
A Common Assumption at Acme Corp
Data Virtualization solutions will be much slower than a
persisted approach via ETL
1. There is a large amount of data moved through the
network for each query
2. Network transfer is slow
…but is this really true?
Challenging the Myths of Virtual Performance
Not as much data is moved as you may think!
▪ Complex queries can be solved while transferring moderate data volumes when the
right techniques are applied
▪ Operational queries
▪ Predicate delegation produces small result sets
▪ Logical Data Warehouse and Big Data
▪ Denodo uses characteristics of underlying star schemas to apply query rewriting rules
that maximize delegation to specialized sources (especially heavy GROUP BY) and
minimize data movement
▪ Current networks are almost as fast as reading from disk
▪ 10 Gb and 100 Gb Ethernet are commodity hardware
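The predicate-delegation point above can be sketched in a few lines. This is a minimal, illustrative simulation (the table, data, and connection are hypothetical, with an in-memory SQLite database standing in for a remote source), contrasting pulling everything across the network with pushing the WHERE clause down to the source:

```python
import sqlite3

# Hypothetical "source" database standing in for a remote system.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, "EMEA" if i % 100 == 0 else "AMER", i * 1.0) for i in range(100_000)],
)

# Naive federation: pull every row, then filter in the virtual layer.
all_rows = source.execute("SELECT id, region, amount FROM orders").fetchall()
naive_transfer = len(all_rows)
emea = [r for r in all_rows if r[1] == "EMEA"]

# Predicate delegation: push the WHERE clause down to the source,
# so only the matching rows cross the "network".
delegated = source.execute(
    "SELECT id, region, amount FROM orders WHERE region = 'EMEA'"
).fetchall()

print(naive_transfer, len(delegated))  # 100000 vs 1000
```

Both approaches return the same rows, but delegation transfers 1,000 rows instead of 100,000.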
Performance Comparison
Logical Data Warehouse vs. Physical Data Warehouse
Denodo has done extensive testing using queries from the standard benchmarking test TPC-DS* and the
following scenario.
• Compares the performance of a federated approach in Denodo with an MPP system where all the
data has been replicated via ETL.
The test scenario uses a star schema with a Sales Facts table (290 M rows), a Customer dimension (2 M rows), and an Items dimension (400 K rows), identical on both sides of the comparison.
* TPC-DS is the de-facto industry standard benchmark for measuring the performance of decision support
solutions including, but not limited to, Big Data systems.
Performance Comparison
Logical Data Warehouse vs. Physical Data Warehouse
| Query Description | Returned Rows | Time, Netezza | Time, Denodo (Federated Oracle, Netezza & SQL Server) | Optimization Technique (automatically selected) |
|---|---|---|---|---|
| Total sales by customer | 1.99 M | 20.9 sec. | 21.4 sec. | Full aggregation push-down |
| Total sales by customer and year between 2000 and 2004 | 5.51 M | 52.3 sec. | 59.0 sec. | Full aggregation push-down |
| Total sales by item brand | 31.35 K | 4.7 sec. | 5.0 sec. | Partial aggregation push-down |
| Total sales by item where sale price less than current list price | 17.05 K | 3.5 sec. | 5.2 sec. | On-the-fly data movement |
Performance and Optimizations in Denodo
Comparing optimizations in DV vs ETL
Although Data Virtualization is a data integration platform, architecturally
speaking it is more similar to an RDBMS
Uses relational logic
Metadata is equivalent to that of a database
Enables ad hoc querying
Key difference between ETL engines and DV:
ETL engines are optimized for static bulk movements
Fixed data flows
Data virtualization is optimized for queries
Dynamic execution plan per query
Denodo's performance architecture resembles that of an RDBMS
Performance and Optimizations in Denodo
Focused on 3 core concepts
Dynamic Multi-Source Query Execution Plans
Leverages processing power & architecture of data sources
Dynamic to support ad hoc queries
Uses statistics for cost-based query plans
Selective Materialization
Intelligent caching of only the most relevant and frequently used information
Optimized Resource Management
Smart allocation of resources to handle high concurrency
Throttling to control and mitigate source impact
Resource plans based on rules
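The throttling idea above can be sketched with a per-source concurrency gate. This is an illustrative sketch, not Denodo's implementation; the source name, limit, and bookkeeping variables are hypothetical:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical throttle: at most N queries in flight per data source,
# mirroring "throttling to control and mitigate source impact".
MAX_CONCURRENT_PER_SOURCE = 4
_source_gates = {"oracle_dw": threading.Semaphore(MAX_CONCURRENT_PER_SOURCE)}

peak = 0        # highest number of simultaneous queries observed
in_flight = 0
lock = threading.Lock()

def run_query(source_name, query):
    global peak, in_flight
    with _source_gates[source_name]:      # blocks while the source is saturated
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        # ... execute the query against the source here ...
        with lock:
            in_flight -= 1
        return f"result of {query}"

# Even with 16 workers issuing 32 queries, the source never sees more
# than MAX_CONCURRENT_PER_SOURCE at once.
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(lambda i: run_query("oracle_dw", f"q{i}"), range(32)))

print(peak)
```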
How the Dynamic Query Optimizer Works
Step by Step
Metadata Query Tree
• Maps query entities (tables, fields) to actual metadata
• Retrieves execution capabilities and restrictions for views involved in the query
Static Optimizer
• SQL rewriting rules (removal of redundant filters, tree pruning, join reordering, transformation push-up, star-schema rewritings, etc.)
• Query delegation
• Data movement query plans
Cost-Based Optimizer
• Picks optimal JOIN methods and orders based on data distribution statistics, indexes, transfer rates, etc.
Physical Execution Plan
• Creates the calls to the underlying systems in their corresponding protocols and dialects (SQL, MDX, WS calls, etc.)
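One static rewriting rule, join reordering by source, can be sketched as follows. This is a simplified illustration (table and source names are made up), showing how grouping JOIN branches that live in the same system lets each group be delegated as a single query:

```python
from itertools import groupby

# Hypothetical join branches: (table, source system it lives in).
tables = [
    ("sales", "netezza"), ("customer", "oracle"),
    ("items", "netezza"), ("stores", "oracle"),
]

def group_by_source(join_branches):
    """Cluster branches by source so each cluster is one delegated query."""
    ordered = sorted(join_branches, key=lambda t: t[1])  # stable sort by source
    return {src: [name for name, _ in grp]
            for src, grp in groupby(ordered, key=lambda t: t[1])}

plan = group_by_source(tables)
print(plan)  # {'netezza': ['sales', 'items'], 'oracle': ['customer', 'stores']}
```

Instead of four separate table scans joined in the virtualization layer, the optimizer issues one join per source and only combines the two intermediate results.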
How the Dynamic Query Optimizer Works
Key Optimizations for Logical Data Warehouse Scenarios
Automatic JOIN reordering
▪ Groups branches that go to the same source to maximize query delegation and reduce processing in the DV layer
▪ End users don’t need to worry about the optimal “pairing” of the tables
The Partial Aggregation push-down optimization is key in LDW scenarios. Based on PK-FK restrictions,
it pushes the aggregation (on the PKs) down to the DW
▪ Leverages the processing power of the DW, which is optimized for these aggregations
▪ Significantly reduces the data transferred through the network (e.g. from 1 B rows to 10 K)
The Cost-based Optimizer picks the right JOIN strategies based on estimations on data volumes,
existence of indexes, transfer rates, etc.
▪ Denodo estimates costs differently for parallel databases (Vertica, Netezza, Teradata) than for regular
databases, to take into account how those systems operate (distributed data, parallel processing,
different aggregation techniques, etc.)
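The partial aggregation push-down above can be illustrated with toy data (the tables and values here are made up, not the TPC-DS schema). The fact table is aggregated by its foreign key at the source, so only one row per key crosses the network before the join with the dimension:

```python
from collections import defaultdict

# Tiny in-memory stand-ins for a fact table and a dimension.
sales = [  # fact rows: (customer_fk, amount)
    (1, 10.0), (1, 5.0), (2, 7.0), (2, 3.0), (3, 1.0),
]
customers = {1: "Ann", 2: "Bob", 3: "Cat"}  # PK -> name

# Step 1 (pushed down to the DW): aggregate the fact table by the FK,
# collapsing many fact rows into one row per key before transfer.
partial = defaultdict(float)
for fk, amount in sales:
    partial[fk] += amount

# Step 2 (in the virtualization layer): join the small aggregate with
# the dimension and finish the GROUP BY on the PK attributes.
totals = {customers[fk]: total for fk, total in partial.items()}

print(totals)  # {'Ann': 15.0, 'Bob': 10.0, 'Cat': 1.0}
```

The result is identical to joining first and grouping later, but only three aggregated rows travel instead of five fact rows; at real fact-table scale that is the "1 B rows to 10 K" reduction.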
How the Dynamic Query Optimizer Works
Other relevant optimization techniques for LDW and Big Data
Automatic Data Movement
▪ Creation of temp tables in one of the systems to enable complete delegation
▪ Only considered as an option if the target source has the “data movement” option enabled
▪ Use of native bulk load APIs for better performance
Execution Alternatives
▪ If a view exists in more than one system, Denodo can decide at execution time which one to use
▪ The goal is to maximize query delegation depending on the other tables involved in the query
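The data-movement technique above can be sketched with two in-memory SQLite databases standing in for two sources (tables and data are illustrative). The small dimension is copied into a temp table on the source that holds the fact table, so the entire JOIN + GROUP BY can be delegated there:

```python
import sqlite3

# Two independent "sources" (in-memory SQLite databases as stand-ins).
big = sqlite3.connect(":memory:")    # holds the large fact table
small = sqlite3.connect(":memory:")  # holds a small dimension

big.execute("CREATE TABLE sales (item_id INTEGER, amount REAL)")
big.executemany("INSERT INTO sales VALUES (?, ?)",
                [(i % 3, float(i)) for i in range(9)])
small.execute("CREATE TABLE items (item_id INTEGER, brand TEXT)")
small.executemany("INSERT INTO items VALUES (?, ?)",
                  [(0, "acme"), (1, "bolt"), (2, "cog")])

# Data movement: copy the small table into a temp table on the big
# source, enabling complete delegation of the join and aggregation.
big.execute("CREATE TEMP TABLE items_tmp (item_id INTEGER, brand TEXT)")
big.executemany("INSERT INTO items_tmp VALUES (?, ?)",
                small.execute("SELECT item_id, brand FROM items").fetchall())

rows = big.execute(
    "SELECT brand, SUM(amount) FROM sales "
    "JOIN items_tmp USING (item_id) GROUP BY brand ORDER BY brand"
).fetchall()
print(rows)
```

Only the three dimension rows are moved; the nine fact rows never leave their source, which is the trade that pays off when the fact table has hundreds of millions of rows.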
How the Dynamic Query Optimizer Works
Other relevant optimization techniques for LDW and Big Data
Optimizations for Virtual Partitioning
Eliminates unnecessary queries and processing based on a pre-execution analysis of
the views and the queries
▪ Pruning of unnecessary JOIN branches
▪ Pruning of unnecessary UNION branches
▪ Push down of JOIN under UNION views
▪ Automatic Data movement for partition scenarios
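The UNION-branch pruning above can be sketched as a pre-execution check of each partition's predicate against the query filter. This is a simplified illustration with made-up partition names and a year-based partitioning scheme:

```python
# Hypothetical virtual partitioning: sales split by year range across
# systems and exposed as a UNION of three branches.
partitions = [
    {"name": "sales_2000_2004", "years": range(2000, 2005)},
    {"name": "sales_2005_2009", "years": range(2005, 2010)},
    {"name": "sales_2010_2014", "years": range(2010, 2015)},
]

def prune(parts, year):
    """Keep only UNION branches whose partition predicate can match the filter."""
    return [p["name"] for p in parts if year in p["years"]]

print(prune(partitions, 2003))  # ['sales_2000_2004'] -- other branches never queried
```

A query filtering on year = 2003 touches a single source; the other two branches are eliminated before execution, not merely returned empty.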
Caching
Real time vs. caching
Sometimes, real-time access and federation are not a good fit:
▪ Sources are slow (e.g. text files, cloud apps like Salesforce.com)
▪ A lot of data processing is needed (e.g. complex combinations, transformations, matching,
cleansing, etc.)
▪ Access is limited, or impact on the sources must be mitigated
For these scenarios, Denodo can replicate just the relevant data in the cache
Caching
Overview
Denodo’s cache system is based on an external relational database
▪ Traditional (Oracle, SQL Server, DB2, MySQL, etc.)
▪ MPP (Teradata, Netezza, Vertica, Redshift, etc.)
▪ In-memory storage (Oracle TimesTen, SAP HANA)
Works at the view level
▪ Allows hybrid (real-time / cached) access within an execution tree
Cache Control (population / maintenance)
▪ Manually – user initiated at any time
▪ Time based - using the TTL or the Denodo Scheduler
▪ Event based - e.g. using JMS messages triggered in the DB
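The time-based and event-based cache control above can be sketched as follows. This is a minimal illustration of the idea, not Denodo's API; the class, the view name, and the loader function are all hypothetical:

```python
import time

class ViewCache:
    """Minimal view-level cache with TTL expiry and event-based invalidation."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # view name -> (timestamp, rows)

    def get(self, view, loader):
        entry = self.store.get(view)
        if entry and time.monotonic() - entry[0] < self.ttl:
            return entry[1]                       # fresh: serve from cache
        rows = loader()                           # stale or missing: hit the source
        self.store[view] = (time.monotonic(), rows)
        return rows

    def invalidate(self, view):
        """Event-based maintenance, e.g. triggered by a JMS message from the DB."""
        self.store.pop(view, None)

calls = 0
def load_slow_source():
    global calls
    calls += 1          # counts how often the slow source is actually queried
    return [("row", calls)]

cache = ViewCache(ttl_seconds=60)
first = cache.get("sales_view", load_slow_source)
second = cache.get("sales_view", load_slow_source)  # served from cache
print(calls)  # 1 -- the source was only queried once
```

Within the TTL the slow source is hit once; `invalidate` models the event-based path, forcing a reload on the next access.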