Making the Most of Data in Multiple Data Sources (with Virtual Data Lakes)

1IBM Cloud / Data & AI / © 2019 IBM Corporation
Making the Most of Data in
Multiple Data Sources
(with Virtual Data Lakes)
Nagapriya Tiruthani
Offering Manager, IBM Db2 Big SQL
ntiruth@us.ibm.com
Prasad Pandit
Offering Management, Big Data & Db2
pprasad@us.ibm.com

Please
note
IBM’s statements regarding its plans, directions, and intent are subject to change
or withdrawal without notice and at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general
product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a commitment,
promise, or legal obligation to deliver any material, code or functionality. Information about
potential future products may not be incorporated into any contract.
The development, release, and timing of any future features or functionality described for our
products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in
a controlled environment. The actual throughput or performance that any user will
experience will vary depending upon many factors, including considerations such as the
amount of multiprogramming in the user’s job stream, the I/O configuration, the storage
configuration, and the workload processed. Therefore, no assurance can be given that an
individual user will achieve results similar to those stated here.
IBM Cloud / Data & AI / © 2019 IBM Corporation
2
2

“Data will be your
basis of competitive advantage”
- Ginni Rometty
Businesses are becoming more data-driven.
IBM Cloud / Data & AI / © 2019 IBM Corporation 3
Ideas. Insights. Innovation.

Data Driven Insight Driven Digital Transformation
Culture Change
Breaking Silos
Discover “what”
Understand “why”
Self Service
Reports
Business Intelligence
Cost Reduction
Modernization
Prediction
Optimization
Automation
Collaboration
Models
Visualization
Applications
New Business Models
Disruptive Technology
Real-Time Decisions
AI
Multi-Cloud
Outcomes
Capabilities
Drivers
As data becomes more ACCESSIBLE it provides more VALUE
Most are here
Competitive Market
Leader
Value from Data

IBM delivers the capabilities you
need to build a ladder to AI
IBM Cloud / © 2018 IBM Corporation
AI
Collect relevant data and make it
simple & accessible
Organize data so it can be trusted
Analyze insights on demand
Embed machine
learning everywhere
5

Live data is needed about customer behavior, buying
preferences, and sentiments to create more timely, relevant
offers
Businesses need to leverage all of their enterprise data in
order to make accurate decisions and properly analyze risk
Why Live Data?
6

Customers expect innovative and personalized digital
experiences….driven by modern databases
Personalized
and Engaging
(360* view)
Location Aware On Demand
7

Current Challenges
Data is proliferating, often stored in
different locations and formats.
It’s getting more difficult to provide
data access and analytics to the
business.
95% organizations see negative impacts from
poor data quality, resulting in wasted resources and
additional costs
40%companies are struggling to find and
retain talent to help them uncover meaningful
insights from all the data they own and collect
Most companies only analyze
12%
of the data they have

Db2 Big SQL offers a Unified Platform for Data Engineering and
Data Science
DATA
VIRTUALIZATION
Get an integrated
and live view of all
data across your
enterprise
SQL compatibility with
many SQL dialects,
enables reuse of talent
and applications
INDUSTRY LEADING
SQL ENGINE FOR
BIG DATA
ADVANCED
ANALYTICS AND ML
MADE SIMPLE
BI and data science
tools can access data
across the enterprise
for insights
FAST, INTEGRATED and SECURE DATA ACCESS LAYER

Db2 Big SQL
Data lakes Object stores NoSQL RDBMS
Modern Data Warehouse
10
10

IBM Db2 Big SQL
IBM Db2 Big SQL, an advanced SQL engine on Hadoop,
supercharges your analytical workloads on data lakes with
no vendor lock-in.
The core capabilities of Db2 Big SQL focusses on data virtualization, SQL
compatibility, scalability, performance, and of course enterprise
security/governance, making it a desirable query engine to seek insights from
disparate data sources including Hadoop
11

Here’s What Db2 Big SQL Offers
Infusing the power of
Db2 to Hadoop
Bi-directional
integration with Spark
Supports all open
source file formats – no
vendor lock-in
Innovation
Mature SQL engine for
complex SQLs
Performance at scale
Optimize resource
utilization
Performance
Reuse applications and
skills
Centralized Hive
metastore
Accelerate reports with
MQTs
Enable BI tools reporting
on Hadoop
Flexibility
Query data where it
resides with no data
movement
Virtualize siloed data
across organization
Enrich streaming data
with data at rest
Confluence
Deep analytics with a
single point of entry
Access Big Data from
notebooks via Big SQL
Machine learning using
SQL
Cognitive
Db2 Big SQL provides the stability that applications and users need
with changing platforms 12
12

No Vendor
Lock-in &
Seamless
Interoperability
with Hive
• Db2 Big SQL preserves open source foundation
• Separation of compute and storage - Data is part of Hadoop
• Alternate SQL execution engine
• Hive and Db2 Big SQL share common metadata
• Db2 Big SQL can invoke native Hive UDFs efficiently using
Parameter Style Hive
• Security and Governance can be done with Ranger/Atlas
SQL Execution Engines
Open Source Storage Model
CSV Parquet ORC
Others
…
Tab
Delim.
Hive Metastore
(open source)
Db2 Big SQL
(IBM)
Hive
(Open
Source)
13
Innovation
13

Self Tuning
Memory Manager
World Class
Cost Based Optimizer Query rewrite
Advanced Statistics
Native Row &
Columnar stores
Elastic Boost
SQL Compatibility
Hardened runtime
Advanced
Workload manager
Here’s why Db2 Big SQL can get you the best execution for complex queries and many
concurrent users with high performance
Materialized Query
Tables
Infuse the Power of Db2 in Big Data
Performance Concurrent users Complex query
Performance

Data warehouse offload to Hadoop is now made easy:
• Write one, run anywhere…
• Easy porting of applications
• Reuse skills of DBAs/ developers who know ANSI SQL
Db2 Big SQL is the best platform for
offloading
Oracle Data Marts and Warehouses
to Hadoop
Move Applications without Re-tooling
15
Flexibility
15

Virtualized
environment to
query
heterogeneous
data
Only SQL-on-Hadoop to virtualizes more than 10 different data sources: RDBMS, NoSQL,
HDFS or Object Store with efficient query rewrite and predicate pushdown
MS SQL
Server
Netezza
(PDA) Oracle PostgreSQL Teradata
DB2
LUW, Db2z, DB2 on iInformix
WebHDFS
Object Store
(S3)
Hive HBase HDFS
CDH or HDP
Db2 Big SQL
NoSQL
ML
Model
Confluence
16

Transparent
§ Appears to be one source
§ Programmers don’t need to know how
/ where data is stored
Heterogeneous
§ Accesses data from diverse sources
High Function
§ Full query support against all data
§ Capabilities of sources as well
Autonomous
§ Non-disruptive to data sources,
existing applications, systems.
High Performance
§ Optimization of distributed queries
SQLtools&applications
Datasources
Db2 Big SQL
Oracle
SQL
Server
Teradata
DB2
Others..
SAP Hana
WebHDFS
S3
What does Db2 Big SQL Federation get you?
Confluence
17

Bi-directional
integration with
Spark
• Db2 Big SQL is a self-tuning memory management SQL engine
that integrates with Spark in the platform
• Spark is a powerful analytic co-processor that complements the
rich SQL functionality of Big SQL
• Tight and scalable integration with Spark
• Enables exploiting Spark capabilities through Db2 Big SQL
• Operationalize ML models using SQL skills with Big SQL
Bi-directional
integration
allows Spark
jobs to be
executed from
Big SQL
Tight integration
with Spark
enables Big SQL
worker and
Spark Executor
to communicate
in memory
without writing
to disk
Db2
Big SQL
Share data
in memory
HDFS
Big SQL worker
Spark executor
Cognitive
18

For more details check the blog: https://developer.ibm.com/hadoop/2017/11/07/ibm-big-sql-machine-learning-demo/
Operationalize Machine Learning Models using SQL
Cognitive
19

Value Delivered to Customers
Db2
Big SQL
No vendor
lock-in
Query siloed
data across
organization
Combine
streaming
data with
data at rest
Reporting
with BI
tools
Operationali
ze ML model
with SQL
Accelerated
reporting
using MQTs
for
federated
data
Reuse
applications
& skills after
data offload
to Hadoop
• SQL compatibility with Oracle, Db2,
Netezza
• Applications / reports can be easily
ported to Hadoop
• Accelerated analytical
reports for historical data &
federated data
• Faster response for BI
• Invoke Big SQL directly from
Notebooks and make it easy for
data scientists to wrangle data
• Invoke Spark models directly from
Big SQL – make it easy for data
engineers to operationalize the
model
• BI tools (Tableau, Birst, etc.) have bad
performance when put directly on Hadoop
• Generate complex and star schema queries
• Enrich data lake with social media
data
• Add social sentiment data, click
stream data, log data or
unstructured data
• Overall provides a richer 360
degree view of customer
• Federation / Virtualization
• Avoid forced consolidation into Hadoop
• Make use of Hadoop infrastructure
• Centralized security & governance
• Separation of Compute from Hadoop storage (data is always
in Hadoop)
• Fully ANSI SQL engine – queries and skills can easily be
reused
20

Use Cases – Modernize EDW
A standardized (single) point of entry to your data lake
Db2 Big SQL
ANALYTICSDATASYSTEMS
Data
Marts
Business
Analytics
Visualization
& Dashboards
Systems of Record
RDBMS
ERP
CRM
Other
Clickstream Web & Social Geolocation Sensor
& Machine
Server
Logs
Unstructured
NEWSOURCES
HDP
ELT
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
N
Cold Data,
Deeper Archive
& New Sources
Enterprise Data
Warehouse
Hot
CDH
ELT
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
°
N
Cold Data,
Deeper Archive
& New Sources
• Access cold/archived
data on Hadoop
• Augment/enrich with
other data sources
with no data
movement
• Reuse SQL existing
skills
• Migrate applications to
Big Data from
traditional databases
• Standardize the data
environment for users
and applications with
changing platforms
Think 2019 / Session 4112 / Feb 13, 2019 / © 2019 IBM Corporation 21

Bloor Research - Db2 Big SQL is the Leader in SQL-on-Hadoop
https://www.bloorresearch.com/wp-content/uploads/2018/06/SQL-Engines-on-Hadoop-by-Philip-Howard-Bloor-Free-Download.pdf
22
22

Proven!
…
Customers: 200+
Data Scale: TBs (no architectural limit)
Cluster Size: <= 20 (BigInsights/ IOP)
<= 100 (HDP + migrations)
> 100 (new HDP customers)
Users: <= 100 (concurrent)
Big SQL 3.0 First Released April, 2014
Big SQL 4.0, 2015 (IOP)/ Big SQL 5.0 (HDP), 2017
Big SQL 6.0 (HDP 3.1), 2019
Big SQL 5.x (CDH 5.x), Q2 ‘19/ Big SQL 6.x (CDH 6.x), Q4 ‘19
-minimum 2 feature packs/ year + (on demand) cumulative fixpacks-
Source: Gartner Interactive Report, 2019
23

Customer Quotes
Big SQL’s response time for
queries is faster than others
99% of the time.
- American retail giant
Using Big SQL over HBase
enables fast key lookups to
meet end-to-end response
time of ~2secs requirement
- US national bank
High performance even
when 100s of users are
actively accessing the
datasets concurrently –
noticeably 4-5 times faster
than Hive
- US east coast bank
Modernized data warehouse
to run traditional data
warehouse analytics
(advanced SQL) using Big
SQL
- European bank
Only Big SQL could run all the
reports, unmodified on Day 1 of POC
when data was offloaded from
Oracle
- Software consulting company
Big SQL completely
CRASHED the
competition!
- Services company
Saved time and money by
reusing Netezza applications
on data offloaded to Hadoop
using Big SQL
- US utility company

Db2 Big SQL on CDH
Db2 Big SQL
Start/Stop Db2 Big SQL in
Cloudera Manager
Manage security using
Apache Sentry
25
25

Enhanced Performance for Object Store
Protocols Supported: S3
Create Tables over Data residing in Object Store directly (no copy required into Hadoop)
Once configured, Object Store tables work like any other table in Big SQL
Benefits:
– No need to copy data into Hadoop first! Query data where it resides.
– Partitioning supported!
Tradeoff:
– Expect reduced performance relative to Hive tables
CREATE HADOOP TABLE staff ( … )
LOCATION
's3a://s3atables/staff';
LOAD FROM
Object Store
also
supported!

Expand analytics to Blockchain data with
other enterprise data set (Hadoop, Spark,
RDBMs) for complex use cases without
additional work
• Empower end-users to query Blockchain data using
SQL seamlessly, like any other database
• Adopt blockchain as a source to perform
analytics/reporting while integrating with other data
stores to give 360 view of information
• Leveraging Db2 Big SQL’s federation to offer SQL
querying capabilities on Blockchain while preserving
the blockchain security model
Blockchain
Traditional
Databases
Reporting Tools
Data Access
Layer
Newer data
stores
Query Blockchain using SQL (Tech Preview)
LedgerSQL
Db2 Big SQL
NoSQL
27
27

Above all, bring stability to your applications even when the platform goes through updates......
Make Db2 Big SQL the point of entry to Big Data irrespective of which
platform has the data
Accelerate time to
market while
modernizing your
warehouse
Empower SQL users
to operationalize ML
models
Augment disparate
data for deep
analytics and AI
PERFORMANCE
SECURITY
Enable BI analytics
with high
performance and
enterprise security
Summary
With Db2 Big SQL, users can focus on what they want to do, and not worry about how it is executed..
28

Digital Transformation begins with a Single Step
Align Business & IT
Goals
Improve outcomes
Reduce costs
Empower end-users
1. Use all data
to deliver deeper insights and enhanced
data-driven decision making
2. Reduce TCO and risk
by integrating with existing investments
for greater ROI
3. Provide flexibility
in tools and applications for improved
productivity
Solution
29
29

Above all, bring stability to your applications even when the platform goes through updates......
Make Db2 Big SQL the point of entry to Big Data irrespective of which
platform has the data
Accelerate time to
market while
modernizing your
warehouse
Empower SQL users
to operationalize ML
models
Augment disparate
data for deep
analytics and AI
PERFORMANCE
SECURITY
Enable BI analytics
with high
performance and
enterprise security
Summary
With Db2 Big SQL, users can focus on what they want to do, and not worry about how it is executed..
30

DEMO
Technologies that help increase
business benefits and time to value
31
31

Anita is a data scientist at an e-commerce company PerfectGiftsRUs. She
is tasked with running retail analytics. On this Monday morning, Anita’s
boss has just asked her to look into company data to provide better
insights into customer behavior and purchases that can be used to
improve sales and inventory forecasts.
The Monday Morning Challenge

Anita’s Challenges – Performing Data Science over complete data which
are siloed
CRM + Product
Catalog
Point of SalesData Lake with
Historical Sales data

In the past, Anita’s attempts at merging information from multiple data
silos have been faced with the hassles of designing complex data
pipelines to access latest data for analysis. Disrupting overworked DBAs
is not what Anita has in mind.
What are Anita’s Options?

How IBM and Cloudera advance Client’s Analytics Journey
Big Data
Persistent Storage
Hortonworks
Data Platform
IBM
Db2 Big SQL
Big Data
Access Layer
IBM Watson Studio
(previously DSX)
Data Science and
Machine Learning IDE

CUSTOMER
CUSTOMER
ID
COUNTRY
Anita’s Solution?
PRODUCT
PRODUCT ID
PRODUCT
DESC
INVOICE
ID
PRODUCT
ID
PRODUCT
NAME
STO
CK
COD
E
PRODUCT
DESC
Customer
ID
Country Unit price
312 30001 Herb maker
basil
951 Cooking
Gear
Wrewr UK 56
100 30004 TrailChef Cup 951 Cooking
Gear
Wrewr USA 100
451 30006 TrailChef
Deluxe Cook
951 Cooking
Gear
hdfgfdg CANADA 20
INVOICE
ID
PRODU
CT ID
PRODUCT
NAME
STO
CK
COD
E
QUA
NTIT
Y
312 30001 Herb maker
basil
951 15
100 30004 TrailChef
Cup
951 1
451 30006 TrailChef
Deluxe
Cook
951 5
Db2 Big SQL provides
a unified view of the
combined data
Db2 Big SQL federates
access to the two data
sources
Watson Studio (DSX)
can run its machine
learning and analytics
on the combined data
using Db2 Big SQL

Anita saved the day. Using Big SQL’s analytics and federation
capabilities, she was able to generate accurate insights on the latest data
– just in time for the big sales meeting. Her boss was thrilled!
The Monday Afternoon Save

© 2018 International Business Machines Corporation. No part of this
document may be reproduced or transmitted in any form without
written permission from IBM.
U.S. Government Users Restricted Rights — use, duplication or
disclosure restricted by GSA ADP Schedule Contract with IBM.
Information in these presentations (including information relating to
products that have not yet been announced by IBM) has been reviewed
for accuracy as of the date of initial publication and could include
unintentional technical or typographical errors. IBM shall have no
responsibility to update this information. This document is distributed
“as is” without any warranty, either express or implied. In no event,
shall IBM be liable for any damage arising from the use of this
information, including but not limited to, loss of data, business
interruption, loss of profit or loss of opportunity. IBM products and
services are warranted per the terms and conditions of the agreements
under which they are provided.
IBM products are manufactured from new parts or new and used parts.
In some cases, a product may not be new and may have been previously
installed. Regardless, our warranty terms apply.”
Any statements regarding IBM's future direction, intent or product
plans are subject to change or withdrawal without notice.
Performance data contained herein was generally obtained in a
controlled, isolated environments. Customer examples are presented as
illustrations of how those
customers have used IBM products and the results they may have
achieved. Actual performance, cost, savings or other results in other
operating environments may vary.
References in this document to IBM products, programs, or services does
not imply that IBM intends to make such products, programs or services
available in all countries in which IBM operates or does business.
Workshops, sessions and associated materials may have been prepared
by independent session speakers, and do not necessarily reflect the views
of IBM. All materials and discussions are provided for informational
purposes only, and are neither intended to, nor shall constitute legal or
other guidance or advice to any individual participant or their specific
situation.
It is the customer’s responsibility to insure its own compliance with legal
requirements and to obtain advice of competent legal counsel as to
the identification and interpretation of any relevant laws and regulatory
requirements that may affect the customer’s business and any actions the
customer may need to take to comply with such laws. IBM does not
provide legal advice or represent or warrant that its services or products
will ensure that the customer follows any law.
Notices and disclaimers
38

Information concerning non-IBM products was
obtained from the suppliers of those products, their
published announcements or other publicly
available sources. IBM has not tested
those products about this publication and cannot
confirm the accuracy of performance, compatibility
or any other claims related to non-IBM
products. Questions on the capabilities of non-IBM
products should be addressed to the suppliers of
those products. IBM does not warrant the quality of
any third-party products, or the ability of any such
third-party products to interoperate with IBM’s
products. IBM expressly disclaims all warranties,
expressed or implied, including but not limited to,
the implied warranties of merchantability and
fitness for a purpose.
The provision of the information contained herein is
not intended to, and does not, grant any right or
license under any IBM patents, copyrights,
trademarks or other intellectual property right.
IBM, the IBM logo, ibm.com and [names of other referenced IBM products
and services used in the presentation] are trademarks of International
Business Machines Corporation, registered in many jurisdictions
worldwide. Other product and service names might be trademarks of IBM
or other companies. A current list of IBM trademarks is available on
the Web at "Copyright and trademark information" at:
www.ibm.com/legal/copytrade.shtml.
.
Notices and disclaimers continued
39

Making the Most of Data in Multiple Data Sources (with Virtual Data Lakes)

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie Making the Most of Data in Multiple Data Sources (with Virtual Data Lakes)

Ähnlich wie Making the Most of Data in Multiple Data Sources (with Virtual Data Lakes) (20)

Mehr von DataWorks Summit

Mehr von DataWorks Summit (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Making the Most of Data in Multiple Data Sources (with Virtual Data Lakes)