Watch full webinar here: https://bit.ly/3aePFcF
Historically, data lakes were created as centralized physical data storage platforms where data scientists could analyze data. Lately, however, the explosion of big data, data privacy rules, and departmental restrictions, among other factors, have made the centralized repository approach less feasible. In this webinar, we discuss why decentralized, multi-purpose data lakes are the future of data analysis for a broad range of business users.
Attend this session to learn:
- The restrictions of physical, single-purpose data lakes
- How to build a logical, multi-purpose data lake for business users
- The newer use cases that make multi-purpose data lakes a necessity
2. Logical data lakes: From single purpose to
multipurpose data lakes
Chris Day
Director, APAC Sales Engineering, Denodo
Sushant Kumar
Product Marketing Manager, Denodo
5. Data Lakes
A data lake is a storage repository that holds a vast amount of raw data in its native format. The data structure and requirements are not defined until the data is needed.
The current needs for sophisticated data-driven intelligence and data science favored this concept for its simplicity and power.
Hadoop and its ecosystem provided the foundation that data lakes required: vast storage and processing muscle.
It also favored the concept of ELT over ETL: load the data first, transform it later (maybe).
6. Data Lakes – A Data Scientist’s Playground
The early data scientists saw Hadoop as their personal supercomputer.
Hadoop-based data lakes helped democratize access to state-of-the-art supercomputing with off-the-shelf hardware (and later the cloud).
The industry push for BI made Hadoop-based solutions the standard for bringing modern analytics to any corporation.
7. Data Lakes – Not a Perfect World
Physical Nature
• Based on replication: data lakes require data to be copied to their physical storage
• Replication extends development cycles and increases costs
• Not all data is suitable for replication:
  • Real-time needs: cloud and SaaS APIs
  • Large volumes: the existing EDW
  • Laws and restrictions
Single Purpose
• Usage of the data lake is often monopolized by data scientists
• A new data silo: no clear path to share insights with business users
• Lacks the governance, security, and quality that business users are used to (e.g., in the EDW)
8. The Rise of Logical Architectures
The Evolution of Analytical Architectures
Source: Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs, Gartner, April 2018
9. Multi-purpose data lakes are data delivery environments developed to support a broad range of users, from traditional self-service BI users (e.g., finance, marketing, human resources, transport) to sophisticated data scientists.
Multi-purpose data lakes allow a broader and deeper use of the data lake investment without minimizing the potential value for data science and without making it an inflexible environment.
Rick van der Lans, R20 Consultancy
10. The Multipurpose Data Lake with Data Virtualization
Logical Nature
• Replication is an option, not a necessity
• Broader data access, shorter development times, better insights
• Tight integration with big data systems; fast execution with large data volumes
Multi-purpose
• Curated access for non-technical users
• Better governance and access control
• Better ROI on the data lake investment
11. The Multipurpose Data Lake with Data Virtualization
“A multi-purpose data lake can become an organization’s universal data delivery system”
Architecting the Multi-Purpose Data Lake with Data Virtualization, Rick van der Lans, April 2018
12. The Virtual Data Lake – Access to All Data Sources
Single access to all data assets, internal and external:
▪ Physical data lake (usually based on SQL-on-Hadoop systems)
▪ Other databases (EDW, ODS, applications, etc.)
▪ SaaS APIs (Salesforce, Google, social media, etc.)
▪ Files (local, S3, Azure, etc.)
13. The Virtual Data Lake – Ingesting and Caching
The physical data lake can also be used as Denodo’s cache.
This allows any data accessible by Denodo to be quickly loaded into the Hadoop cluster.
Caching becomes an alternative to ingestion ELT processes, one that preserves lineage and governance.
The load process is based on a direct load to HDFS:
1. Creation of the target table in the cache system
2. Generation of Parquet files (in chunks) with Snappy compression on the local machine
3. Parallel upload of the Parquet files to HDFS
14. The Virtual Data Lake – Using the Lake’s Processing Engine
The Denodo optimizer provides native integration with MPP systems to deliver one extra key capability: query acceleration.
Denodo can move processing to the MPP on demand during the execution of a query:
• Parallel power for calculations in the virtual layer
• Avoids slow on-disk processing when processing buffers don’t fit into Denodo’s memory (swapped data)
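Conceptually, moving processing to the MPP means rewriting part of the execution tree as SQL that runs inside the MPP engine instead of in the virtual layer's memory. A minimal sketch of that idea (the function and SQL shape are illustrative assumptions, not Denodo's actual rewrite or API):

```python
# Illustrative sketch: offloading an aggregation to an MPP engine
# by generating SQL against a table already in the MPP. The function
# and SQL shape are assumptions for illustration, not Denodo's API.
def offload_aggregation(mpp_table, group_cols, measure):
    """Build the SQL a federating engine could send to the MPP to
    run a group-by there instead of in the virtual layer."""
    cols = ", ".join(group_cols)
    return (
        f"SELECT {cols}, SUM({measure}) AS total "
        f"FROM {mpp_table} GROUP BY {cols}"
    )
```

The virtual layer then reads back only the aggregated result, avoiding swapping large intermediate buffers to disk.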
15. The Virtual Data Lake – Putting the Pieces Together
Example query: Current Sales (68 M rows) and Historical Sales (220 M rows) are aggregated by customer ID, joined with Customer (2 M rows, cached), and grouped by ZIP to produce sales by customer (~2 M rows).
1. Partial aggregation push-down: maximizes source processing and dramatically reduces network traffic
2. Integration with the cost-based optimizer: based on data volume estimates and the cost of these particular operations, the CBO can decide to move all or part of the execution tree to the MPP
3. On-demand data transfer: Denodo automatically generates and uploads Parquet files
4. Integration with local data: the engine detects when data is cached or comes from a local table already in the MPP
5. Fast parallel execution: support for Spark, Presto, and Impala for fast analytical processing on inexpensive Hadoop-based solutions

System     Execution Time   Optimization Techniques
Others     ~10 min          Simple federation
No MPP     43 sec           Aggregation push-down
With MPP   11 sec           Aggregation push-down + MPP integration (Impala, 8 nodes)
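The partial aggregation push-down in step 1 can be demonstrated in miniature: each source pre-aggregates its own rows, and the virtual layer only merges the much smaller partial results. A self-contained sketch with made-up data (the table names and values are illustrative, not from the slide's benchmark):

```python
# Minimal demonstration of partial aggregation push-down: each source
# computes partial sums per key, and the federating layer merges them,
# yielding the same result as aggregating all raw rows centrally
# while transferring far fewer rows over the network.
from collections import Counter

def aggregate(rows):
    """Full aggregation: sum of amount per customer."""
    totals = Counter()
    for customer, amount in rows:
        totals[customer] += amount
    return totals

def merge_partials(partials):
    """Merge per-source partial aggregates in the virtual layer."""
    merged = Counter()
    for p in partials:
        merged.update(p)  # Counter.update adds counts together
    return merged

current_sales = [("c1", 10), ("c2", 5), ("c1", 7)]
hist_sales = [("c1", 3), ("c3", 8), ("c2", 2)]

# Push the group-by down to each source, then merge the partials.
pushed_down = merge_partials([aggregate(current_sales), aggregate(hist_sales)])

# Same answer as shipping every raw row to the federating layer.
centralized = aggregate(current_sales + hist_sales)
```

The equivalence holds because SUM is decomposable; the same trick works for COUNT, MIN, and MAX, and for AVG when carried as a (sum, count) pair.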
16. The Virtual Data Lake – Conclusions
▪ A virtual data lake improves decision making and shortens development cycles
• Surfaces all company data from multiple repositories without the need to replicate all of it into the lake
• Eliminates data silos: allows on-demand combination of data from multiple sources
▪ A virtual data lake broadens adoption of the lake and improves its ROI
• Improves governance and metadata management to avoid “data swamps”
• Allows controlled access to the lake for non-technical users
▪ A virtual data lake offers performance for the big data world
• Leverages the processing power of the existing cluster, controlled by Denodo’s optimizer
17. Customer Story – Large Heavy Equipment Manufacturer
Challenges
• Competition from a low-cost vendor
• Lower the price, affecting margins?
• Or maintain the higher price, but differentiate in other ways?
18. Large Heavy Equipment Manufacturer
Self-service / Predictive Analytics – IoT Integration
Benefits
• Improved asset performance and proactive maintenance
• Increased revenue from the sale of services and parts
• Reduced warranty costs from parts failures
19. Gartner, Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs, May 2018
“When designed properly, DV can speed data integration, lower data latency, offer flexibility and reuse, and reduce data sprawl across dispersed data sources.
Due to its many benefits, DV is often the first step for organizations evolving a traditional, repository-style data warehouse into a Logical Architecture”
21. Key Takeaways
FIRST Takeaway: Hadoop-based data lakes are the standard approach to modern analytics within most organizations
SECOND Takeaway: Physical data lakes introduce many complexities (replication, synchronization, governance, etc.) that restrict their use
THIRD Takeaway: Logical data lakes allow users to access data from all sources, internal and external, growing the value of the data lake approach
FOURTH Takeaway: Data virtualization creates ‘multipurpose’ data lakes for all kinds of users: data scientists and business users
FIFTH Takeaway: Data virtualization introduces governance and access controls to the data lake without impeding the ‘power users’
23. Next Steps
Access the Denodo Platform in the Cloud!
Take a Test Drive today!
https://bit.ly/2AouQLQ
GET STARTED TODAY
24. Next session
Data Virtualization-enabled Data Fabric: Operationalize the Data Lake
Sushant Kumar
Product Marketing Manager, Denodo
Chris Day
Director, APAC Sales Engineering, Denodo