Watch full webinar here: https://bit.ly/3oah4ng
Gartner recently described data virtualization as a cornerstone of data integration architectures.
Discover:
- The benefits of a data virtualization platform
- The growing range of use cases: Lakehouse, Data Science, Big Data, Data Services & IoT
- Building a unified view of your data assets without compromising on performance
- Building an agile data integration architecture: on-premises, in the cloud, or hybrid
Data Virtualization: Challenges, Use Cases & Benefits
1. Data Virtualization
Adonis Harrouk
Keyrus: Data Engineer Big Data & Cloud
Vincent Fages-Gouyou,
Denodo: Director of Product Management EMEA
8 October 2020
Challenges, Use Cases & Benefits
3. 3
Keyrus
International player in digital/data technologies and performance
management consultancy
MANAGEMENT & TRANSFORMATION
Developing agility and accelerating
the use of digital
INSIGHT INTO VALUE
▪ Helping enterprises in data, digital and
management since 1996
▪ Innovation & research is at the heart of
the enterprise
▪ Worldwide presence in 20 countries on 4
continents
DATA INTELLIGENCE
Master the data and extract value from it to deliver
information and enhance overall
performance
DIGITAL EXPERIENCE
Develop the digital experience in an
ever-growing digital society
4. 4
Accessing data from various sources
Data Sources: Structured data, Unstructured data
Ingestion
Data Access Data Insight
ETL/Batch/Streaming…
Data Warehouse Data Lake
Data Visualization
Reporting
Data Science
?
6. 6
Denodo
The Leader in Data Virtualization
DENODO OFFICES, CUSTOMERS, PARTNERS
Palo Alto, CA.
Global presence throughout North America,
EMEA, APAC, and Latin America.
LEADERSHIP
▪ Longest continuous focus on data
virtualization – since 1999
▪ Leader in 2018 Forrester Wave – Big
Data Fabric
▪ Winner of numerous awards
CUSTOMERS
~850 customers, including many F500 and
G2000 companies across every major industry
have gained significant business agility and ROI.
FINANCIALS
Backed by $4B+ private equity firm (HGGC).
50+% annual growth; Profitable.
8. 8
Rising complexity of data
Rising complexity of data
▪ Eclectic mix of old and new data; every structure imaginable
▪ Generated and integrated, from batch to real time
▪ Traditional data from enterprise apps, web, third-parties
▪ New sources of data from machines, social media, IoT
Rising complexity of data management solutions
▪ Mix of home grown, vendor built, open source
▪ Multi-platform architectures; distributed and heterogeneous; on
premises or cloud; from relational to Hadoop
▪ Hybrid and diverse in the extreme.
9. 9
Ready Access to Critical Information to Support Business Processes
The Business Need
Marketing, Sales, Executive, Support
Customers
Invoices Products
Service
Usage
Access to complete information: business
entities and pre-integrated views
Access to related information: discovery
and self service
Access in real-time from different apps and
devices
10. 10
Manually access different systems
Not productive – slows down
response times
IT responds with point-to-point data
integration and replication
Data Is Siloed Across Disparate Systems
The Challenge
Marketing, Sales, Executive, Support
Database
Apps
Warehouse Cloud
Big Data
Documents, Apps, NoSQL
13. 13
Analytics Needs Data
Input data for a data science project may come in a variety of systems
and formats:
• Files (CSV, logs, Parquet)
• Relational databases (EDW, operational systems)
• NoSQL systems (key-value pairs, document stores, time series, etc.)
• SaaS APIs (Salesforce, Marketo, ServiceNow, Facebook, Twitter, etc.)
Data is spread everywhere (on premises, in the cloud, SaaS, IoT, etc.)
In addition, the Big Data community has embraced data science as
one of its pillars, for example with Spark and SparkML, and with architectural
patterns like the Data Lake
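As a rough illustration of the problem this slide describes, the sketch below combines two differently shaped sources at query time instead of copying them into one store. Everything in it is an assumption made for the example: the CSV content, the `orders` table, and the `customer_view` helper are hypothetical and are not taken from the webinar or from any Denodo API.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a CSV export (for instance from a SaaS tool), held in memory.
CSV_DATA = "customer_id,channel\n1,web\n2,store\n"

# Hypothetical source 2: a relational system, simulated with in-memory SQLite.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
db.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5)],
)


def customer_view():
    """Return one combined record per customer, resolved when the view is read."""
    # Read the file-based source.
    channels = {
        int(row["customer_id"]): row["channel"]
        for row in csv.DictReader(io.StringIO(CSV_DATA))
    }
    # Let the relational source do its own aggregation.
    totals = db.execute(
        "SELECT customer_id, SUM(amount) FROM orders "
        "GROUP BY customer_id ORDER BY customer_id"
    ).fetchall()
    # Merge both sources into a single logical record per customer.
    return [
        {"customer_id": cid, "channel": channels.get(cid), "total": total}
        for cid, total in totals
    ]


view = customer_view()
print(view)
```

The consumer only sees `customer_view()`; where each attribute physically lives stays hidden behind it, which is the abstraction idea in miniature.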
16. 16
Gartner – Logical Data Warehouse
“Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs”. Henry Cook, Gartner April 2018
DATA VIRTUALIZATION
17. 17
Visualization, ML / AI, Data Science, Data Quality
Agile Data Hub Architecture
Data Sources
Data Warehouse
noSQL
RDBMS
18. 18
Visualization, ML / AI, Data Science, Data Quality
Agile Data Hub Architecture
Governance, Metadata Management, Data Mart
Security
Data Access
Data Virtualization Data Services
Data Sources
Data Warehouse
noSQL
RDBMS
19. 19
Agile Data Hub Architecture
Consumers
Visualization, ML / AI, Data Science, Data Quality
Governance, Metadata Management, Data Mart
Security
Data Access
Data Virtualization Data Services
Data Sources
Data Warehouse
noSQL
RDBMS
Federation
Transformation
Abstraction
Data Service
Dynamic Query Optimization
Cost Based Optimizer
Query Rewriting
Caching
MPP
Security & Governance
Lifecycle Management
Data Catalog
Discover
Collaborate
Query
Categorize
21. 21
Denodo’s Massive Parallel Processing
1. Partial Aggregation push down
Maximizes source processing
Dramatically reduces network traffic
2. Integrated with Cost Based Optimizer
Based on data volume estimation and the cost of these particular operations,
the CBO can decide to move all or part of the execution tree to the MPP
3. On-demand data transfer
Denodo automatically generates and uploads Parquet files in parallel
4. Integration with local and pre-cached data
The engine detects when data is cached or comes from a local table already in the MPP
5. Fast parallel execution
Support for Spark, Presto and Impala for fast analytical processing in
inexpensive Hadoop-based solutions
Diagram labels: Current Sales (290 M rows); Hist. Sales (300 M rows); group by customer ID; 3M rows (sales by customer); Customer (3 M rows); join; group by name
SELECT
c.c_first_name, c.c_last_name,
SUM(ss.ss_quantity),
AVG(ss.ss_sales_price)
FROM
(SELECT * FROM current_store_sales UNION ALL
SELECT * FROM historic_store_sales) ss
JOIN sqls_customer c
ON ss.ss_customer_sk = c.c_customer_sk
GROUP BY c.c_first_name, c.c_last_name
§ Customer: 3 million
§ Current sales: 290 million
§ Historic sales: 300 million
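The partial aggregation push-down named on this slide can be sketched on toy data. The example below is an illustration only, not Denodo's actual rewrite: it runs the slide's query as written, then a hand-rewritten variant in which each sales table is first aggregated per customer, so that at most one row per customer leaves each source. The sample rows are invented, and in-memory SQLite stands in for the remote sources. Note that `AVG` cannot be pushed down directly; it is carried as a partial `SUM` and `COUNT`.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE sqls_customer (c_customer_sk INTEGER, c_first_name TEXT, c_last_name TEXT);
CREATE TABLE current_store_sales (ss_customer_sk INTEGER, ss_quantity INTEGER, ss_sales_price REAL);
CREATE TABLE historic_store_sales (ss_customer_sk INTEGER, ss_quantity INTEGER, ss_sales_price REAL);
INSERT INTO sqls_customer VALUES (1, 'Ada', 'Lovelace'), (2, 'Alan', 'Turing');
INSERT INTO current_store_sales VALUES (1, 2, 10.0), (2, 1, 4.0);
INSERT INTO historic_store_sales VALUES (1, 3, 20.0), (1, 1, 30.0), (2, 5, 8.0);
""")

# The query from the slide (ORDER BY added only to make the output deterministic).
NAIVE = """
SELECT c.c_first_name, c.c_last_name, SUM(ss.ss_quantity), AVG(ss.ss_sales_price)
FROM (SELECT * FROM current_store_sales UNION ALL
      SELECT * FROM historic_store_sales) ss
JOIN sqls_customer c ON ss.ss_customer_sk = c.c_customer_sk
GROUP BY c.c_first_name, c.c_last_name
ORDER BY c.c_last_name
"""

# Hand-written equivalent: aggregate per customer inside each source first,
# then join and combine the partial results.
PUSHED_DOWN = """
SELECT c.c_first_name, c.c_last_name,
       SUM(p.qty), SUM(p.price_sum) / SUM(p.n)
FROM (SELECT ss_customer_sk, SUM(ss_quantity) AS qty,
             SUM(ss_sales_price) AS price_sum, COUNT(ss_sales_price) AS n
      FROM current_store_sales GROUP BY ss_customer_sk
      UNION ALL
      SELECT ss_customer_sk, SUM(ss_quantity), SUM(ss_sales_price), COUNT(ss_sales_price)
      FROM historic_store_sales GROUP BY ss_customer_sk) p
JOIN sqls_customer c ON p.ss_customer_sk = c.c_customer_sk
GROUP BY c.c_first_name, c.c_last_name
ORDER BY c.c_last_name
"""

naive_rows = db.execute(NAIVE).fetchall()
pushed_rows = db.execute(PUSHED_DOWN).fetchall()
print(naive_rows == pushed_rows)
```

Both queries return the same rows; the difference is how many rows each sales source has to ship before the join.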
22. 22
Denodo’s Massive Parallel Processing
System | Execution Time | #Rows through network | Optimization Techniques
Other federation systems | ~ 10 min | 593M | Simple federation
Hadoop/MPP systems | ~ 4 min | 293M | MPP Only
Denodo (No MPP) | 43 sec | 6M | Aggregation push-down
Denodo (With MPP) | 11 sec | 6M | Aggregation push-down + MPP integration (8 nodes)
SELECT
c.c_first_name, c.c_last_name,
SUM(ss.ss_quantity),
AVG(ss.ss_sales_price)
FROM
(SELECT * FROM current_store_sales UNION ALL
SELECT * FROM historic_store_sales) ss
JOIN sqls_customer c
ON ss.ss_customer_sk = c.c_customer_sk
GROUP BY c.c_first_name, c.c_last_name
Comparing execution times of the same queries with Denodo and other federation systems.
[Bar chart: Execution Time (seconds), smaller is better, for Denodo (With MPP), Denodo (No MPP), Hadoop/MPP systems, and Federation systems]
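The ratios implied by the comparison table can be worked out directly. The sketch below only restates the table's figures; the one assumption is that the approximate times "~ 10 min" and "~ 4 min" are taken as 600 s and 240 s, so the resulting ratios are approximate as well.

```python
# Figures copied from the comparison table: (execution time in seconds,
# rows through the network in millions). "~ 10 min" and "~ 4 min" are
# approximations, taken here as 600 s and 240 s.
results = {
    "Other federation systems": (600, 593),
    "Hadoop/MPP systems": (240, 293),
    "Denodo (No MPP)": (43, 6),
    "Denodo (With MPP)": (11, 6),
}

baseline_seconds, baseline_rows = results["Other federation systems"]

# How many times faster each system is than simple federation.
speedups = {
    name: round(baseline_seconds / seconds, 1)
    for name, (seconds, _) in results.items()
}

# How many times fewer rows cross the network than with simple federation.
row_reduction = {
    name: round(baseline_rows / rows, 1)
    for name, (_, rows) in results.items()
}

# Simple federation moves every row of all three tables:
# customer + current sales + historic sales, in millions.
total_rows_millions = 3 + 290 + 300

for name in results:
    print(f"{name}: {speedups[name]}x faster, {row_reduction[name]}x fewer rows")
```

On these figures, aggregation push-down alone accounts for most of the gain (roughly 14x), and MPP integration brings the total to roughly 55x over simple federation.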
23. 23
Smart Caching for Analytics
Denodo 8 will enable the persistence of summaries to accelerate the
execution of analytical queries
§ Common joins, aggregations and filters can be precomputed (in the cache or in a
data source) and used as starting points to accelerate queries
§ Key for LDW self-service initiatives where user-driven exploration is a must
Similar to the concept of aggregation awareness used by reporting
tools (BO, MSTR) and OLAP engines
§ Integrated with Denodo’s engine query rewriting rules and CBO to provide
features not available from any other vendor for LDW scenarios
§ Denodo can provide this acceleration technology for all data sources and all
consumers
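The idea of using a precomputed summary as a starting point can be illustrated with a small sketch. This is a generic picture of aggregation awareness, not Denodo's implementation: the `sales` and `sales_summary` tables are hypothetical, and the rewrite from the base table to the summary is done by hand here, whereas the slide describes the engine doing it through its query rewriting rules and CBO.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE sales (customer_id INTEGER, month TEXT, quantity INTEGER);
INSERT INTO sales VALUES
  (1, '2020-01', 2), (1, '2020-01', 3), (1, '2020-02', 1),
  (2, '2020-01', 4), (2, '2020-02', 6);
-- Hypothetical precomputed summary: one row per customer and month.
CREATE TABLE sales_summary AS
  SELECT customer_id, month, SUM(quantity) AS quantity
  FROM sales GROUP BY customer_id, month;
""")

# A coarser query (per customer) answered from the detailed base table...
from_base = db.execute(
    "SELECT customer_id, SUM(quantity) FROM sales "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()

# ...and the same question answered from the smaller summary instead.
from_summary = db.execute(
    "SELECT customer_id, SUM(quantity) FROM sales_summary "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()

base_count = db.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
summary_count = db.execute("SELECT COUNT(*) FROM sales_summary").fetchone()[0]
print(from_base == from_summary, base_count, summary_count)
```

This works because `SUM` can be re-aggregated from partial sums; an average would need a stored sum and count rather than a stored average.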
26. 26
8.0 Technical Architecture
DATA CATALOG
Discover - Explore - Document
{ API ACCESS }
RESTful / OData
GraphQL / GeoJSON
SQL
CONSUMERS
DATA VIRTUALIZATION
CONNECTIVITY
LOGICAL DATA FABRIC
SOURCES
Traditional
DB & DW
150+
data
adapters
Cloud
Stores
Hadoop
& NoSQL
OLAP Files Apps Streaming SaaS
Query
Optimization
Security, AI/ML, Governance
Semantic
Layer
Real Time
Acceleration
Caching
DATA OPS
Deployment
Cloud PaaS
Containers/K8
On-Prem
Monitoring
Scheduling
Version Control
DEVELOPMENT
MODELING
DELIVERY
27. 27
“Denodo provides its customers with the
necessary product capabilities for
automating the data fabric design with
its core platform components – a unified
semantic catalog, a dynamic query
optimization engine and runtime
metadata-based ML algorithms. Its data
fabric design relies on data virtualization
to provide integrated data quickly to
business users to effect faster outcomes.”
2020 Gartner Magic Quadrant for Data Integration Tools
Denodo is Named a Leader
28. 28
Summary
The combination of a business delivery layer with an abstraction layer in a single
platform can efficiently address three business challenges:
1. Faster & more accurate decision making
▪ Self-service with proper guardrails
▪ Data models and catalog are two of the same
2. IT cost reduction
▪ Decouple IT from business, giving them freedom to
choose the right technology for the right problem
3. Regulations, enterprise-wide governance &
data security
▪ Controlled access to all data assets in a secure,
business-friendly format
▪ Full audit trails
30. 30
Data Virtualization use cases
From Data Storage & Management, to Data Consumers, going through Data Governance & Security
Decision (Real time)
Single View (Customer 360)
Agile BI (Self-service)
Data Science (ML & AI)
APPS (Mobile & web)
Mergers & Acquisitions
Data Marketplace
Compliance (IFRS17, GRC)
Data Security
APIfication (& SQLification)
Unified Data Layer
Agility & Simplicity
Real-time Delivery
Data Abstraction
Zero Replication
Data Governance
Sophisticated Optimizations
Logical Data Warehouse/Lake
Big Data Fabric
Hybrid Data Fabric
Data Integration
Data Migration
Refactoring & Replatforming
Data Consumption
Data Storage & Management
Data Governance, Manipulation & Access
Sales
HR
Executive
Marketing Apps/API
Data Science
AI/ML