Watch the full session: Denodo DataFest 2016 sessions: https://goo.gl/y2e1Ik
For big data analytics, can a logical data architecture achieve performance similar to that of a physical architecture (i.e., one where all the data has been previously replicated to the same location)?
In this session, you will learn:
• Ways to exploit logical data architectures for big data analytics
• How to attain outstanding performance for very large data sets (billions of rows) through the “move processing to the data” paradigm
• Design patterns from customer implementations.
This session is part of the Denodo DataFest 2016 event. You can also watch more Denodo DataFest sessions on demand here: https://goo.gl/VXb6M6
Denodo DataFest 2016: Logical Data Lakes: Performance Considerations
1. October 18, 2016, San Francisco Bay Area, CA
#DenodoDataFest
RAPID, AGILE DATA STRATEGIES
For Accelerating Analytics, Cloud, and Big Data Initiatives.
3. Agenda
1. Logical Architectures for Analytics: The Role of Data Virtualization
2. Example Scenario: The Numbers
3. Performance: The “Move Processing to the Data” Paradigm
4. Performance: Cost-based Optimization in Logical Architectures
6. Logical Data Lake
[Architecture diagram. Labels shown:
- Consumption: Real-Time Decision Management, Alerts, Scorecards, Dashboards, Reporting, Data Discovery, Self-Service Search, Predictive Analytics, Statistical Analytics (R), Text Analytics, Data Mining
- Sources: Big Data (Sensor Data, Machine Data (Logs), Social Data, Clickstream Data, Internet Data, Image and Video, Enterprise Content (Unstructured)), Traditional Enterprise Data (Enterprise Applications), Cloud (Cloud Applications)
- Stores: Data Warehouse, NoSQL, EDW, In-Memory (SAP Hana, …), Analytical Appliances, Cloud DW (Redshift, ..), ODS, Big Data
- Ingestion and access: ETL, CDC, Sqoop, (Flume, Kafka, …), Real-Time Data Access (On-Demand / Streaming), Batch
- Hadoop: YARN / Workload Management, HDFS, Hive, Spark, Drill, Impala, Storm, HBase, Solr, Hunk, Tez, MapRed.; DW, Streams, NoSQL, Search, SQL
- Cross-cutting: Metadata Management, Data Governance, Data Security]
9. Logical Data Lake
[Same architecture diagram as slide 6.]
10. Logical Data Lake: Data Virtualization
[Same architecture diagram as slide 6, with a Data Virtualization layer added. Layer labels: Data Abstraction, Data Transformation, Data Federation, Data Caching, Data Services, Data Search & Discovery, Governance, Security, Optimization.]
11. The Role of Data Virtualization
Denodo Data Virtualization is the only option that meets all of the following:
• Combines data from several systems and publishes it in the desired format with a few clicks
12. Example: Distributed Report
Demo Scenario of LDW
• Report: Total Sales by Customer
• Sales Data (TPC-DS) (280M)
• Customer (TPC-DS) (2M)
• Systems shown: MDM, SQL on Hadoop
13. The Role of Data Virtualization
Denodo Data Virtualization is the only option that meets all of the following:
• Combines data from several systems and publishes it in the desired format with a few clicks
• Abstracts applications from changes in the underlying infrastructure
• Exposes different logical views over the same data
• Provides a single entry point to apply Security and Governance
15. Common LDW Patterns: DW Historical Offloading (Cold Data Storage)
[Diagram: a sales fact table with Time, Product, and Retailer dimensions. Current Sales are held in the EDW; Historical Sales are held in Hadoop.]
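The offloading pattern can be illustrated with a small sketch. It is not Denodo's implementation: two in-memory SQLite databases stand in for the EDW and Hadoop, and the cutoff date, table name, and function names are hypothetical. It shows one way a logical view over both stores might send a date-bounded query only to the source that can contain matching rows.

```python
import sqlite3
from datetime import date

# Hypothetical boundary: sales before this date were offloaded to cold storage.
CUTOFF = date(2016, 1, 1)

def make_source(rows):
    """Create an in-memory database standing in for one physical system."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

edw = make_source([("2016-03-01", 10.0), ("2016-07-15", 20.0)])      # current sales
hadoop = make_source([("2014-05-02", 5.0), ("2015-11-30", 7.5)])     # historical sales

def total_sales(start, end):
    """Sum sales in [start, end], skipping sources that cannot hold matching rows."""
    sources = []
    if start < CUTOFF:
        sources.append(("hadoop", hadoop))
    if end >= CUTOFF:
        sources.append(("edw", edw))
    total = 0.0
    for _, conn in sources:
        # ISO-8601 date strings sort chronologically, so BETWEEN works on text.
        (partial,) = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM sales "
            "WHERE sale_date BETWEEN ? AND ?",
            (start.isoformat(), end.isoformat()),
        ).fetchone()
        total += partial
    return total, [name for name, _ in sources]
```

A query for 2016 only touches the stand-in EDW, while a query spanning 2014 to 2016 combines partial sums from both sources.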
16. Common LDW Patterns: DW + Cloud Dimensional Data
[Diagram: a sales fact table with Time, Product, and Customer dimensions. Labels shown: EDW, CRM, SFDC Customer.]
17. Common LDW Patterns: Data Warehouse Federation
[Diagram: two EDWs. Labels shown: Time Dimension, Fact table (sales), Customer Dimension, Region, EDW; Fact table (Fidelity), Customer Dimension, City, EDW.]
19. Performance Comparison – Logical Data Warehouse vs. Physical Data Warehouse
Denodo 6.0 Architecture
Denodo has done extensive testing using queries from the standard TPC-DS* benchmark in the following scenario, which compares the performance of a federated approach in Denodo with an MPP system where all the data has been replicated via ETL.
Both setups (federated vs. replicated) use the same tables:
• Customer Dim.: 2 M rows
• Sales Facts: 290 M rows
• Items Dim.: 400 K rows
* TPC-DS is the de facto industry standard benchmark for measuring the performance of decision support solutions including, but not limited to, Big Data systems.
20. Performance Comparison – Logical Data Warehouse vs. Physical Data Warehouse
Denodo 6.0 Architecture
Query Description | Returned Rows | Time Netezza | Time Denodo (Federated Oracle, Netezza & SQL Server) | Optimization Technique (automatically selected)
Total sales by customer | 1.99 M | 20.9 sec. | 21.4 sec. | Full aggregation push-down
Total sales by customer and year between 2000 and 2004 | 5.51 M | 52.3 sec. | 59.0 sec. | Full aggregation push-down
Total sales by item brand | 31.35 K | 4.7 sec. | 5.0 sec. | Partial aggregation push-down
Total sales by item where sale price less than current list price | 17.05 K | 3.5 sec. | 5.2 sec. | On the fly data movement
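To put the timings reported on this slide side by side, the relative overhead of the federated run over the Netezza run can be computed directly. The short script below is only arithmetic on the published numbers; the function and variable names are illustrative.

```python
# (query, Netezza seconds, Denodo federated seconds) as reported on the slide.
timings = [
    ("Total sales by customer", 20.9, 21.4),
    ("Total sales by customer and year between 2000 and 2004", 52.3, 59.0),
    ("Total sales by item brand", 4.7, 5.0),
    ("Total sales by item where sale price less than current list price", 3.5, 5.2),
]

def overhead_pct(physical_s, logical_s):
    """Extra time of the logical run, as a percentage of the physical run."""
    return (logical_s - physical_s) / physical_s * 100.0

overheads = {
    query: round(overhead_pct(physical_s, logical_s), 1)
    for query, physical_s, logical_s in timings
}

for query, pct in overheads.items():
    print(f"{query}: +{pct}%")
```

On these figures the overhead ranges from roughly 2% for the first query to roughly 49% for the last, which in absolute terms is 0.5 and 1.7 seconds respectively.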
22. Move Processing to the Data
Process the data where it resides
• Process the data locally where it resides
• DV System combines partial results
• Minimizes network traffic
• Leverages specialized data sources
23. Move Processing to the Data: Example 1
Obtain Total Sales By Product (Naive Strategy)
Naive Strategy: 350M rows moved through the network
24. Move Processing to the Data: Example 1
Obtain Total Sales By Product (Move Processing to the Data)
Denodo Strategy: 30k rows moved through the network
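The difference between the two strategies for Example 1 can be reproduced at toy scale. The sketch below is an illustration, not Denodo's engine: two in-memory SQLite databases play the role of the remote sources, the table sizes are scaled-down hypothetical stand-ins for the slide's figures, and "rows moved" is simply the number of rows each strategy fetches from the sources.

```python
import random
import sqlite3

random.seed(0)
N_PRODUCTS, N_SALES = 30, 3500  # hypothetical, scaled-down sizes

# Stand-in for the data warehouse holding the sales facts.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales (product_id INTEGER, amount INTEGER)")
dw.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(random.randrange(N_PRODUCTS), random.randint(1, 100)) for _ in range(N_SALES)],
)

# Stand-in for a separate system holding the product dimension.
products_db = sqlite3.connect(":memory:")
products_db.execute("CREATE TABLE product (product_id INTEGER, name TEXT)")
products_db.executemany(
    "INSERT INTO product VALUES (?, ?)",
    [(i, f"product-{i}") for i in range(N_PRODUCTS)],
)

def naive():
    """Pull every sales row, then join and aggregate in the virtualization layer."""
    sales = dw.execute("SELECT product_id, amount FROM sales").fetchall()
    names = dict(products_db.execute("SELECT product_id, name FROM product"))
    totals = {}
    for product_id, amount in sales:
        name = names[product_id]
        totals[name] = totals.get(name, 0) + amount
    return totals, len(sales) + len(names)

def pushdown():
    """Let the warehouse aggregate by product; only one row per product is moved."""
    sums = dw.execute(
        "SELECT product_id, SUM(amount) FROM sales GROUP BY product_id"
    ).fetchall()
    names = dict(products_db.execute("SELECT product_id, name FROM product"))
    totals = {names[product_id]: total for product_id, total in sums}
    return totals, len(sums) + len(names)

naive_totals, naive_rows = naive()
pushed_totals, pushed_rows = pushdown()
```

Both functions return the same totals, but the naive version moves 3,530 rows while the push-down version moves at most 60, because the GROUP BY runs where the sales data lives.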
25. Move Processing to the Data: Example 1 (Alternative)
Obtain Total Sales By Product (Dimension Table Replicated)
Denodo Strategy (Dimension Table Replicated): 20k rows moved through the network
26. Move Processing to the Data: Example 2
Maximum Sales Discount By Product in the last year
Sales Discount: list_price (Product) – sale_price (Sales)
Execution Strategy:
Full aggregation pushdown not possible
Two possible techniques:
- On-the-fly data movement
- Partial aggregation pushdown
27. Move Processing to the Data: Example 2
Maximum Sales Discount By Product in the last year: On-the-fly Data Movement
Move Products data to a temp table in the DW: 20K rows moved through the network + 10K rows inserted in the DW
Execute full query on the DW: 10k rows through the network
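On-the-fly data movement can be sketched with two in-memory SQLite databases standing in for the Products DB and the DW. This is an illustration under assumptions, not Denodo's mechanism: the table names, the `sale_year` column used for "last year", and the sample rows are all hypothetical.

```python
import sqlite3

# Stand-in for the Products database (smaller table).
products_db = sqlite3.connect(":memory:")
products_db.execute("CREATE TABLE product (product_id INTEGER, list_price INTEGER)")
products_db.executemany("INSERT INTO product VALUES (?, ?)", [(1, 100), (2, 50)])

# Stand-in for the data warehouse (large fact table).
dw = sqlite3.connect(":memory:")
dw.execute(
    "CREATE TABLE sales (product_id INTEGER, sale_price INTEGER, sale_year INTEGER)"
)
dw.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 90, 2015), (1, 70, 2015), (1, 10, 2014), (2, 45, 2015), (2, 50, 2015)],
)

def max_discount_on_the_fly(year):
    """Copy the product table into the DW, then run the whole query there."""
    # Step 1: move the smaller table to a temporary table in the warehouse.
    rows = products_db.execute("SELECT product_id, list_price FROM product").fetchall()
    dw.execute("CREATE TEMP TABLE tmp_product (product_id INTEGER, list_price INTEGER)")
    dw.executemany("INSERT INTO tmp_product VALUES (?, ?)", rows)
    try:
        # Step 2: the warehouse performs both the join and the aggregation.
        result = dw.execute(
            """
            SELECT p.product_id, MAX(p.list_price - s.sale_price)
            FROM sales AS s
            JOIN tmp_product AS p ON p.product_id = s.product_id
            WHERE s.sale_year = ?
            GROUP BY p.product_id
            ORDER BY p.product_id
            """,
            (year,),
        ).fetchall()
    finally:
        dw.execute("DROP TABLE tmp_product")
    return result, len(rows)

discounts, rows_moved = max_discount_on_the_fly(2015)
```

Only the product rows and the final per-product result cross the network; the sales rows never leave the warehouse.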
28. Move Processing to the Data: Example 2
Maximum Sales Discount By Product in the last year: Partial Aggregation Pushdown
Products DB: 10K rows through the network
Data Warehouse: #rows through the network = 10K * average #sale_prices_per_product
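Partial aggregation pushdown can be sketched the same way. In this hedged illustration the warehouse groups sales by product and sale price, so it returns one row per distinct (product, sale price) pair, and the virtualization layer finishes the computation after joining with the product list prices. The SQLite stand-ins, the `sale_year` column, and the sample rows are hypothetical.

```python
import sqlite3

# Stand-in for the Products database.
products_db = sqlite3.connect(":memory:")
products_db.execute("CREATE TABLE product (product_id INTEGER, list_price INTEGER)")
products_db.executemany("INSERT INTO product VALUES (?, ?)", [(1, 100), (2, 50)])

# Stand-in for the data warehouse; repeated sale prices collapse when grouped.
dw = sqlite3.connect(":memory:")
dw.execute(
    "CREATE TABLE sales (product_id INTEGER, sale_price INTEGER, sale_year INTEGER)"
)
dw.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 90, 2015), (1, 90, 2015), (1, 70, 2015), (1, 10, 2014),
     (2, 45, 2015), (2, 45, 2015)],
)

def max_discount_partial_pushdown(year):
    """Push a GROUP BY (product, sale_price) to the DW; finish the MAX locally."""
    list_price = dict(
        products_db.execute("SELECT product_id, list_price FROM product")
    )
    # Partial aggregation: one row per distinct sale price of each product.
    partial = dw.execute(
        "SELECT product_id, sale_price FROM sales "
        "WHERE sale_year = ? GROUP BY product_id, sale_price",
        (year,),
    ).fetchall()
    best = {}
    for product_id, sale_price in partial:
        discount = list_price[product_id] - sale_price
        best[product_id] = max(best.get(product_id, discount), discount)
    return sorted(best.items()), len(list_price) + len(partial)

discounts, rows_moved = max_discount_partial_pushdown(2015)
```

Here five rows matching the year collapse to three distinct (product, sale price) pairs, which mirrors the slide's estimate of products times average sale prices per product.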
30. How to Choose the Best Execution Plan?
Cost-Based Optimization in Data Virtualization
Must take into account:
• Data statistics to estimate size of intermediate result sets
• Data Source Indexes (and other physical structures)
• Execution Model of data sources: e.g. Parallel Databases vs. Hadoop clusters vs. Relational Databases
• Features of data sources (e.g. number of processing cores in parallel database or Hadoop Cluster)
• Data Transfer rate
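A deliberately simplified cost model shows how such factors might combine into a plan choice. This is a toy sketch, not Denodo's optimizer: the cost formula, the per-source processing times, and the transfer rate are hypothetical assumptions, and only the two row counts are taken from Example 1.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    name: str
    rows_transferred: int   # estimated from data statistics
    source_seconds: float   # hypothetical processing time inside the sources

def estimated_cost(plan, rows_per_second):
    """Estimated elapsed seconds: source processing plus network transfer."""
    return plan.source_seconds + plan.rows_transferred / rows_per_second

def choose_plan(plans, rows_per_second):
    """Pick the candidate plan with the lowest estimated cost."""
    return min(plans, key=lambda plan: estimated_cost(plan, rows_per_second))

candidates = [
    # Row counts from Example 1; the seconds are made-up placeholders.
    Plan("naive", rows_transferred=350_000_000, source_seconds=5.0),
    Plan("aggregation push-down", rows_transferred=30_000, source_seconds=20.0),
]

# Hypothetical data transfer rate of one million rows per second.
best = choose_plan(candidates, rows_per_second=1_000_000)
```

With these assumed inputs the push-down plan wins by a wide margin (about 20 seconds against 355), since the transfer term dominates the naive plan. A real optimizer would also weigh indexes, the execution model of each source, and available parallelism.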
31. Q&A
Find more details at: datavirtualization.blog
http://www.datavirtualizationblog.com/myths-in-data-virtualization-performance/