The data warehouse is likely your largest CAPEX and OPEX line item -- and if you haven't checked your warehouse capacity & utilization, it's likely running low.
Thanks to Big Data & the advent of Hadoop, it no longer makes economic sense to process bulk data transformations (often called ELT -- Extract, Load & Transform) using data warehouse compute.
Join others who have already offloaded storage & processing from Teradata, Oracle, Netezza & DB2 onto Hadoop to save millions by avoiding upgrades!
Offloading makes your data warehouse run faster for critical end-user queries & frees up storage for Big Data -- but how do you make the jump? What transformations are costing you the most? What data in your warehouse are you not using?
Learn how you can:
Find dormant data. Up to 50% of the data in your data warehouse and data marts is never queried by business users -- but you need the right tools to find it.
Identify transformations to offload. Quickly find out which ELT transformations you should shift to Hadoop.
Manage data movement & processing to Hadoop. Easily collect, process & distribute data in Hadoop with an intuitive graphical user interface. No coding or scripting required.
Deliver faster Hadoop performance per node. Find out how capabilities in the Apache core can help you accelerate batch Hadoop processing by up to 30% on existing hardware with no code changes, & without risk.
Offload the Data Warehouse in the Age of Hadoop
1. Santosh Chitakki, Vice President, Products at Appfluent
schitakki@appfluent.com
Steve Totman, Director of Strategy at Syncsort
@steventotman, stotman@syncsort.com
Presentation + Demo
2. The Data Warehouse Vision: A Single Version of The Truth
[Diagram: sources (Oracle, File, XML, ERP, Mainframe, Real-Time) flow through ETL into the Enterprise Data Warehouse, and through ETL again into Data Marts.]
2
3. The Data Warehouse Reality:
• Small sample of structured data, somewhat available
• Takes months to make any changes/additions
• Costs millions every year
[Diagram: the same source-to-warehouse-to-marts architecture, now strained by ELT, new reports, new columns, granular history, dead data, and SLAs.]
3
4. ELT Processing Is Driving Exponential Database Costs
The True Cost of ELT
[Chart: queries (analytics) and transformations (ELT) competing for capacity as costs ($$$) climb.]
• Manual coding/scripting costs
• Ongoing manual tuning costs
• Higher storage costs
• Hurts query performance
• Hinders business agility
And what if…?
• The batch window is delayed or needs to be re-run?
• Demands increase, causing more overlap between queries & the batch window?
• A critical business requirement results in longer/heavier queries?
4
5. Dormant Data Makes the Problem Even Worse
[Diagram: hot, warm, and cold data tiers, with transformations (ELT) of unused data and storage capacity consumed by dormant data.]
• The majority of data in the data warehouse is unused/dormant
• ETL/ELT processes for unused data unnecessarily consume CPU capacity
• Dormant data consumes unnecessary storage capacity
• Eliminate batch loads that are not needed
• Load and store unused data in Hadoop for active archival
5
6. The Impact of ELT & Dormant Data
Missing SLAs
• Slow response times
• With 40-60% of capacity used for ELT, fewer resources and less storage are available for end-user reports
Data Retention Windows
• Only the freshest data is stored “on-line”
• Historical data is archived (retention as low as 3 months)
• Granularity is lost: Hot / Warm / Cold / Dead
Lack of Agility
• 6 months (average) to add a new data source/column & generate a new report
• Best resources spent on SQL tuning, not new SQL creation
Constant Upgrades
• Data volume growth absorbs all resources just to keep existing analysis running / perform upgrades
• Exploration of data becomes a wish-list item
6
7. Offloading The Data Warehouse to Hadoop
[Diagram: Before – data sources feed the data warehouse via ETL, with heavy ETL/ELT inside the warehouse serving business intelligence and analytic query & reporting. After – the ETL/ELT moves to Hadoop, which feeds the data warehouse and analytic query & reporting directly.]
7
8. 20% of ETL Jobs Can Consume up to 80% of Resources
ETL is “T” intensive
– Sort, Join, Merge, Aggregate, Partition
Mappings start simple
– Performance demands add complexity
– Business logic gets “distributed”
“Spaghetti” architecture
– Impossible to govern
– Prohibitively expensive to maintain
For high impact, start with the greatest pain: focus on the 20%
8
9. The Opportunity
Transform the economics of data
Cost of managing 1TB of data: $15,000 – $80,000 on the EDW vs. $2,000 – $6,000 on Hadoop
But there’s more…
• Scalability for longer data retention
• Performance SLAs
• Business agility
10. Why Appfluent?
Appfluent transforms the economics of Big Data and Hadoop. We are the only company that can completely analyze how data is used to reduce costs and optimize performance.
12. Why Syncsort?
For 40 years we have been helping companies solve their big data issues… even before they knew the name Big Data!
• Speed leader in Big Data processing
• Fastest sort technology in the market
• Powering 50% of mainframes’ sort
• First-to-market, fully integrated approach to Hadoop ETL
• A history of innovation: 25+ issued & pending patents
• Large global customer base: 15,000+ deployments in 68 countries
Our customers are achieving the impossible, every day!
Key Partners
12
13. Syncsort DMX-h – Enabling the Enterprise Data Hub
Blazing Performance. Iron Security. Disruptive Economics.
• Access – One tool to access all your data, even mainframe
• Offload – Migrate complex ELT workloads to Hadoop without coding
• Accelerate – Seamlessly optimize new & existing batch workloads in Hadoop
PLUS…
• Smarter Architecture – ETL engine runs natively within MapReduce
• Smarter Productivity – Use Case Accelerators for common ETL tasks
• Smarter Security – Enterprise-grade security
13
14. How to Offload Workload & Data
1.
• Identify costly transformations
• Identify dormant data
2.
• Rewrite transformations in DMX-h
• Identify performance opportunities
• Move dormant-data ELT to Hadoop
3.
• Run the costliest transformations
• Store and manage dormant data
4.
• Repeat regularly for maximum results
15. 1. Identify
[Diagram: expensive transformations, unused data, cold historical data, costly end-user activity.]
• Identify expensive transformations, such as ELT, to offload to Hadoop
• Identify unused tables to find the useless transformations loading them; move to Hadoop or purge
• Identify unused historical data (by the date functions used) and move the loading & data to Hadoop
• Discover costly end-user activity and redirect workloads to Hadoop
16. Costly End-User Activity
Find relevant resource-consuming end-user workloads and offload data-sets and activity to Hadoop.
Example: identify SAS data extracts (i.e. SAS queries with no WHERE clause)
• SAS data extracts identified consuming 300 hours of server time
• Identify data sets associated with the extracts; replicate the identified data in Hadoop and offload the associated SAS workload
16
17. Expensive Transformations
Identify expensive transformations such as ELT to offload to Hadoop.
• ELT process consuming 65% of CPU time and 66% of I/O
• Drill into the process to identify expensive transformations to offload
18. Unused Data
Identify unused tables to move to Hadoop and offload batch loads for unused data into Hadoop.
• 87% of tables unused
• Largest unused table: 2 billion records
• Unused columns within tables
19. 2. Access & Move Virtually Any Data
One tool to quickly and securely move all your data, big or small. No coding, no scripting.
Connect to any source & target
• RDBMS • Mainframe • Files • Cloud • Appliances • XML
Extract & load to/from Hadoop
• Extract data & load into the cluster natively from Hadoop, or execute “off-cluster” on the ETL server
• Load data warehouses directly from Hadoop; no need for temporary landing areas
PLUS… Mainframe connectivity
• Directly read mainframe data
• Parse & translate
• Load into HDFS
Pre-process & compress
• Cleanse, validate, and partition for parallel loading
• Compress for storage savings
19
20. 3. Offload Heavy Transformations to Hadoop
Easily replicate & optimize existing workloads in Hadoop. No coding. No scripting.
• Develop MapReduce ETL processes without writing code
• Leverage existing ETL skills
• Develop and test locally in Windows; deploy in Hadoop
• Use Case Accelerators to fast-track development (Sort, Join, Aggregate, Copy, Merge)
• File-based metadata: create once, reuse many times!
• Development accelerators for CDC and other common data flows
20
22. Appfluent Offload Success
Large Financial Organization
Situation
• IBM DB2 Enterprise Data Warehouse (EDW) growing too quickly
• DB2 EDW upgrade/expansion too expensive
• Found cost per terabyte of Hadoop is 5x less than DB2 (fully burdened)
Solution
• Created business program called ‘Data Warehouse Modernization’
• Deployed Cloudera to extend EDW capacity
• Used Appfluent to find migration candidates to move to Hadoop
Benefits
• Capped the DB2 EDW at 200TB capacity and has not expanded it since
• Saved $MM that would have been spent on additional DB2
• Positioned to handle faster rates of data growth in the future
23. Offloading the EDW at a Leading Financial Organization
• Offload ELT processing from Teradata into CDH using DMX-h
• Implement a flexible architecture for staging and change data capture
• Ability to pull data directly from the mainframe
• No coding; easier to maintain & reuse
• Enable developers with a broader set of skills to build complex ETL workflows
[Charts: elapsed time of 360 min with HiveQL vs. 15 min with DMX-h (24x faster); development effort of 12 man-weeks with HiveQL vs. 4 with DMX-h.]
Impact on the Loans Application project:
• Cut development time to one-third (from 12 man-weeks to 4)
• Reduced complexity: from 140 HiveQL scripts to 12 DMX-h graphical jobs
• Eliminated the need for Java user-defined functions
23
24. Three Quick Takeaways
1. ELT and dormant data are driving data warehouse cost and capacity constraints
2. Offloading heavy transformations and “cold” data to Hadoop provides fast savings at minimum risk
3. Follow these 3 steps:
a. Identify dormant data and pinpoint heavy ELT workloads; focus on the top 20%
b. Access and move data to Hadoop
c. Deploy new workloads in Hadoop
24
25. The Data Warehouse Vision: A Single Version of The Truth
[Diagram: the same vision architecture as slide 2 – sources flowing through ETL into the Enterprise Data Warehouse and out to Data Marts.]
25
26. Next Steps
Sign up for a Data Warehouse Offload assessment!
http://bit.ly/DW-assessment
Our experts will help you:
• Collect critical information about your EDW environment
• Identify migration candidates & determine feasibility
• Develop an offload plan & establish a business case
26
Jennifer to address this slide: announce the session and introduce the speakers. Instruct on Q&A format.
Back when I started my career in data warehousing in the ’90s, this is what the business was promised. An enterprise data warehouse would bring together data from every different source system across an organization to create a single, trusted source of information. Data would be extracted, transformed, and loaded into the warehouse using ETL tools. These would be used instead of hand-coded SQL, COBOL, or other scripts because they would provide a graphical user interface that allowed anyone to develop flows with no need for rocket scientists; scalability to handle the growing data volumes; metadata to enable re-use and sharing; and connectivity to the different sources and targets. ETL would then be used to move data from the EDW to marts and deliver it to reporting tools.
So here’s the reality of data warehouses today. As one customer recently described it to me, their data warehouse has become like a huge oil tanker: slow moving and incredibly difficult to change direction. Because of data volume growth, the majority of ETL tools, commercial and open source, were unable to handle the processing within the batch windows. As a result, the only engines capable of handling the data volumes were the database engines, thanks to their optimizers. So transformation was pushed into the source, target, and especially the enterprise data warehouse databases as hand-coded SQL or BTEQ. This so-called ELT meant that many ETL tools became little more than expensive schedulers. The use of ELT resulted in a spaghetti-like architecture, clearly visible to end users in the fact that requests for new reports or the addition of a new column involve, on average, a six-month delay from the warehouse team. With so much hand-coded SQL, adding a new column becomes incredibly complex: it requires an addition to the enterprise data model, an update to the warehouse schema, and modifications to all the existing ELT scripts, and SLAs get abandoned.
As you can see in the chart, as ELT has grown, end-user reporting and analytics have had to compete for database storage and capacity. Databases are great when you have the classic use for SQL: big data input, big data input, small result set, which is exactly what you want to create an aggregated view in a reporting tool. But SQL is not ideal for ETL, where it’s typically big input, big input, even bigger output. At first there was less contention, as the analysts and warehouse business users ran queries during the day and ELT could run at night during the overnight batch window. But as data volumes increased, the batch runs started spilling into the day. Today many companies have more ELT than can fit into their overnight batch window, so they are always trying to catch up, and if a load fails it can literally take months to recover. It also creates a death spiral: you move your best resources to tuning the ELT SQL to improve performance, so your less skilled resources hand-code new ELT, which then needs to be tuned by your best resources. Every step of this hinders agility and increases cost.
Steve, you certainly bring up excellent points on how ELT processes are driving up data warehousing costs. Our experience analyzing data usage at large organizations shows that a significant amount of data is not being used, yet is continuously loaded on a daily basis. Dormant data not only takes up storage capacity; the bigger impact is the processing capacity, in terms of CPU and I/O, wasted running ELT on the data warehouse to load data that the business does not actively use. Admittedly, in many situations organizations are required for regulatory reasons to maintain a history of data even if it is not being used. So the best approach to significantly cut data warehousing costs is to eliminate batch loads for data that is not used and not needed, and, more importantly, offload the ELT processes for unused data that must be maintained: do it all on Hadoop and actively archive that unused data on Hadoop. This way you can recover all the wasted capacity from your expensive data warehouse systems.
Thanks, Santosh. So just to summarize, there are four dimensions to this problem. First, you’ll see that you’re missing SLAs, as ELT competes with end-user queries and analytics in the warehouse. Next, the warehouse team implements a data retention window, because there’s not enough space and it’s not cost effective to store all the data people want; so instead of the entire history you keep a rolling retention window, sometimes as small as a few days or weeks. On average today it takes six months to add a new report or a column to the warehouse. Customers describe this as the onion effect: each layer gets added because nobody wants to change the layer beneath, but when you have to, it makes everyone involved cry. Then, finally, you have the constant upgrade cycle. Because of data growth, the second you’ve completed an upgrade you’re already planning for your next one. The tough part is selling this to your CFO: if you have to explain that you need to spend another $3 million on the warehouse, and he asks why, and the answer is so the same report that ran yesterday will still run tomorrow, that’s not a good business case.
So as we’ve discussed, there’s the reality of what happens today in most data warehouses: the “before” seen here, where ETL and ELT in the database are the norm. But as Teradata’s CEO Mike Koehler remarked on a recent earnings call, they have found that ETL consumes about 20 to 40% of the workload of their Teradata data warehouses, with some outliers below and above that range. Teradata thinks that 20% of that 20-40% ETL workload is a good candidate for moving to Hadoop. Now, I personally have been involved with ETL my entire career, over 15 years now, and in my experience the ELT workload of most data warehouse databases is at least double that, so between 40 and 60%. Many of the customers we’re working with aren’t looking to move 20% but rather 100% of that ELT into Hadoop. But even if you could free just 20% of your capacity, you could postpone any major multi-million-dollar upgrades of Teradata, DB2, Oracle, etc. for a long time. So we’re seeing more and more customers adopt an architecture where the staging area — the dirty secret of every data warehouse, where the drops from data sources get stored and a lot of the heavy lifting as ELT occurs — gets migrated to an enterprise data hub in Hadoop, and the result is moved to the existing data warehouse, now with more capacity, or direct to reporting tools.
Now what’s really interesting about ETL and ELT is that the workload tends to be very transformation intensive: sorts, joins, merges, aggregations, partitioning, compression, etc. But the 80/20 rule applies: 20% of your ETL and ELT consumes 80% of your batch window, resources, tuning, and so on. The screenshot on the right is from a real customer, and the diagram (which they called the “Battlestar Galactica Cylon mothership” diagram because of the way it looks from a distance) is their nightly batch run sequence; every box on it is a Teradata ELT SQL script of several thousand lines of code. They found that 10% of their flows consumed 90% of their batch window. So it’s not that you have to migrate everything; you just start with the 20% and you’ll see a huge amount of benefit immediately.
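As a back-of-the-envelope illustration of that 80/20 concentration: once you have per-job resource figures from the warehouse’s workload history, the check is a few lines. The job counts and CPU numbers below are invented for the sketch, not from any customer.

```python
# Synthetic per-job CPU minutes for a nightly batch run (illustration only;
# real figures would come from the warehouse's query/workload history).
job_cpu_minutes = [1300, 1100, 150, 120, 90, 80, 70, 60, 50, 40]

total = sum(job_cpu_minutes)
top_n = max(1, len(job_cpu_minutes) // 5)            # the top 20% of jobs
top = sum(sorted(job_cpu_minutes, reverse=True)[:top_n])
print(f"Top {top_n} of {len(job_cpu_minutes)} jobs consume {top / total:.0%} of CPU")
# prints: Top 2 of 10 jobs consume 78% of CPU
```

With numbers this skewed, migrating only the top fifth of jobs already frees most of the batch window, which is the point of starting with the 20%.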
At the end of the day, time and resources consumed by inefficient processes have significant tangible costs. But Hadoop is quickly becoming a disruptive technology that presents a tremendous opportunity for enterprises. The economics of Hadoop compared to the enterprise data warehouse are quite remarkable. Today, fully burdened costs on the data warehouse vary from $15k on the low end to more than $80k per terabyte. Enterprises are finding that the cost on Hadoop can be 10 times less than on the data warehouse. So the question is: how can you take advantage of this opportunity, and where and how do you begin? Appfluent and Syncsort together have the complete solution you need.
Before we discuss and demonstrate the solution, let us briefly introduce Appfluent and Syncsort. Appfluent is a software company whose mission is to transform the economics of Big Data and Hadoop. Appfluent is the only company that can completely analyze how data is used, enabling large enterprises across various vertical industries to reduce costs and optimize performance.
The Appfluent Visibility product gives you the ability to assess and analyze expensive transformations and workloads, as well as identify unused data, which can serve as the blueprint to begin the process of offloading your data warehouse to Hadoop. The product non-intrusively monitors and correlates users’ application activity and ELT processes with data usage and the associated resource consumption. The solution provides this visibility across multiple platforms, including Teradata, Oracle/Exadata, DB2/Netezza, and Hadoop.
So by now some of you may be wondering who Syncsort is. We are a leading Big Data company, dedicated to helping our customers collect, process, and distribute extreme data volumes. We provide the fastest sort technology and the fastest data processing engine in the market, and most recently we released the first truly integrated approach to extracting, transforming, and loading data with Hadoop, and even on the cloud. Now, if you have a mainframe in your organization, then you probably know Syncsort, because we run on nearly 50% of the world’s mainframes; we’re the most trusted third-party software for mainframes. Our customers have been using us for over 10 years to accelerate ETL and ELT processing. Our product has a unique optimizer (similar to a database SQL optimizer) designed specifically to accelerate ETL and ELT processing. Our customers deal with some of the largest and most sophisticated data volumes; that’s why they’ve come to us, because we solve data problems that no one else can.
Every organization is trying to build infrastructure, technically and yet economically, to keep up with modern data by storing and managing it in a single system regardless of its format. The name people are giving to this is an Enterprise Data Hub, and in most cases it’s based on Hadoop. But to deliver on the business requirements for data, an Enterprise Data Hub requires components to access, offload, and accelerate data, while also providing Extract, Transform and Load (ETL) functionality, user productivity that doesn’t require a rocket scientist for simple tasks, and complete enterprise-level security. Syncsort enables all of this whether you’re running on Hadoop, cloud, mainframes, Unix, Windows, or Linux, and thanks to its unique transformation optimizer it can scale with no manual tuning.
Now that you know a little about Appfluent and Syncsort, let’s look at the process for offloading the data warehouse. You begin by using Appfluent to identify expensive transformations, as well as dormant data that is loaded unnecessarily into the warehouse. Once you have identified what can be offloaded — keeping in mind the 80/20 rule, where you focus your efforts on the 20% of processing and data that is causing 80% of your capacity constraints — you can use Syncsort to rewrite the expensive transformations in DMX-h on Hadoop before loading the data into the data warehouse. You can also move the dormant data to Hadoop and use DMX-h for transforming and loading that data, if you need to keep updating it. This way you can eliminate all of the ELT related to unused data from the data warehouse, run it on Hadoop, and store that data on Hadoop. Finally, this is typically not a one-time event. You can view Hadoop as an extension of your data warehouse; the two will co-exist for the foreseeable future. You can repeat this process continually to maximize performance and minimize the costs of your overall infrastructure.
Before we go into a demonstration of the solution, let’s take a look at some of the features that Appfluent provides to get started. Appfluent’s software parses all the activity on your data warehouse at very granular levels of detail. This enables you to obtain actionable information through the Appfluent Visibility web application. First, you can identify the ELT processes that are most expensive on your system and can be offloaded. Second, since all the SQL activity is parsed, you can identify unused data at a table and column level of granularity over specified time periods. Appfluent also parses the date functions being used to query data, so you can assess the amount of history being queried by users and guide your data retention policies. And finally, in addition to expensive ELT transformations, you can identify end-user workloads and associated data sets that can run just as well on Hadoop, freeing up capacity on your data warehouse.
Let’s take a look at some real-world examples. In this example, Appfluent was used to identify expensive data extracts being performed by users running SAS on a high-end data warehouse system. As you can see, the Appfluent Visibility web app was used to select applications named ‘sas’ and focus on workloads that had no constraints, meaning pure data extracts. What we found was that the SAS activity came from 5 servers, and just 42 unique SQL statements were consuming over 300 hours of server time. You can then use Appfluent to easily drill down on this information and find details such as which data sets were involved and which users were associated with the activity. It turned out this activity was related to just 7 tables, accessed by a handful of SAS users that Appfluent identified. In this way you can identify data sets to offload to Hadoop and redirect the application activity to Hadoop, enabling you to recover wasted data warehouse capacity.
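To make the filter from this example concrete, here is a minimal sketch of the same idea over a generic query log. The log entries, field layout, and numbers are hypothetical, invented for illustration; a product like Appfluent Visibility works against the warehouse’s real query history, not a hand-built list.

```python
import re

# Hypothetical query-log entries: (application, SQL text, server minutes)
query_log = [
    ("SAS",    "SELECT * FROM sales_detail", 840),
    ("SAS",    "SELECT id, amt FROM orders WHERE dt >= DATE '2014-01-01'", 12),
    ("Cognos", "SELECT region, SUM(amt) FROM orders GROUP BY region", 30),
]

def is_full_extract(app, sql, app_filter="SAS"):
    """Flag queries from the target app that have no WHERE clause,
    i.e. full-table data extracts."""
    return app == app_filter and not re.search(r"\bWHERE\b", sql, re.IGNORECASE)

extracts = [(app, sql, mins) for app, sql, mins in query_log
            if is_full_extract(app, sql)]
hours = sum(mins for _, _, mins in extracts) / 60
print(f"{len(extracts)} SAS extract(s) consuming {hours:.0f} server hour(s)")
```

Drilling from the flagged statements to their tables and users then tells you which data sets to replicate in Hadoop.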
The next example shows expensive ELT transformations. In this case, the ELT processes constituted less than 2% of the query workload but were consuming over 60% of CPU and I/O capacity. Think about that skew for a moment! Appfluent can identify the most expensive ELT by both resource consumption and complexity — for example by number of joins, subqueries, and other inefficiencies — and provide details about the ELT so you can begin the offloading process.
Finally, here is an example of identifying unused or dormant data. You can identify unused databases, schemas, tables, and even specific fields within tables, over time periods that are relevant to you. In this case, large tables were not only unused, but more data was continuing to be loaded into them on a daily basis, wasting ELT processing capacity and consuming unnecessary storage. These three examples hopefully gave you a brief glimpse of how Appfluent provides the first step: exposing the relevant information that can be used as a blueprint to begin offloading your data warehouse. Syncsort will now discuss the next two steps in this process.
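At its simplest, the dormant-data check is a set difference between the catalog and the tables actually referenced by parsed SQL over a chosen window. The table names and the “queried” set below are made up for the sketch; in practice both come from the warehouse’s catalog and parsed query history.

```python
# Hypothetical catalog and usage data (illustration only). In practice the
# "queried" set is extracted by parsing the warehouse's SQL activity over
# a chosen time window, e.g. the last 90 days.
catalog = {"orders", "customers", "sales_detail",
           "clickstream_2011", "legacy_gl", "staging_tmp"}
queried_last_90_days = {"orders", "customers"}

dormant = sorted(catalog - queried_last_90_days)
print(f"{len(dormant)}/{len(catalog)} tables unused: {dormant}")
```

The same idea extends to columns: parse which fields each query touches, and any column never referenced in the window is a candidate for offload or purge.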
Thanks, Santosh. The second stage in the framework for offloading data and workloads into Hadoop is Access & Move. Once you’ve identified the data, you then have to move it. While Hadoop provides a number of different utilities to move data, the reality is you will need multiple tools, and they don’t have a graphical user interface, so you’ll end up manually coding all the scripts; and for many critical sources, e.g. the mainframe, Hadoop offers no connectivity. Syncsort provides one solution that can access data regardless of where it resides: for example, we have native high-performance connectors to Teradata, DB2, Oracle, IBM mainframes, cloud, Salesforce, etc. These connectors let you extract data and load it natively into the Hadoop cluster on each node, or load the data warehouse or marts directly in parallel from Hadoop. We also see a lot of customers pre-processing and compressing data before loading it into Hadoop. One customer, comScore, who loads 1.5 trillion events — about 40% of the internet’s page views — through our product DMX-h into Hadoop and Greenplum, literally saves terabytes of storage every month just by sorting the data prior to compression.
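The comScore anecdote rests on a general property: compressors find repeated values far more easily once sorting has made them adjacent. A small, self-contained demonstration, with synthetic records and zlib standing in for whatever codec the cluster actually uses:

```python
import random
import zlib

random.seed(0)
# 50,000 synthetic "log records", each drawn from 100 distinct keys
records = [f"key{random.randrange(100):03d}\n" for _ in range(50_000)]

as_loaded = "".join(records).encode()           # arrival (random) order
pre_sorted = "".join(sorted(records)).encode()  # sorted before compression

size_loaded = len(zlib.compress(as_loaded))
size_sorted = len(zlib.compress(pre_sorted))
print(f"compressed: {size_loaded} bytes unsorted vs {size_sorted} bytes sorted")
```

Both inputs are the same bytes in a different order, yet the sorted version compresses dramatically better because duplicates become long runs. At terabyte scale, that ordering step is the storage saving described above.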
Once the data is in Hadoop, you need a way to easily replicate the workloads that previously ran in the DWH — typically sorts, joins, CDC, and aggregations — but now in Hadoop. Sure, you can manually write tons of scripts in HiveQL, Pig, and Java, but that means re-training a lot of your staff to scale the development process; a steep learning curve awaits, so getting productive will take time. Besides, why re-invent the wheel when you can easily leverage your existing staff and skills? Syncsort helps you get results quickly and with minimum effort, with an intuitive graphical user interface where you can create sophisticated data flows without writing a single line of code. You can even develop and test locally in Windows before deploying into Hadoop. In addition, we provide a set of Use Case Accelerators for common ETL use cases such as CDC, connectivity, aggregations, and more. Finally, once you offload from expensive legacy systems and data warehouses, you need enterprise-grade tools to manage, secure, and operationalize the enterprise data hub. With Syncsort you have file-based metadata, which means you can build once and reuse many times. We also provide full integration with management tools such as Cloudera Manager and the Hadoop JobTracker, to easily deploy, monitor, and administer your Hadoop cluster. And of course, iron security with leading support for Kerberos. When you put all these pieces together, that is what really makes this solution enterprise-ready!
Now Santosh and Jeff from Syncsort will do a quick demo of the combined solution
Now that you have seen a brief demo of how you can use Appfluent and Syncsort to offload your data warehouse, let’s talk about some customers who have done this successfully in production. A large financial organization we worked with found that their data growth and business needs had begun to grow at a rate that made it economically unsustainable to keep adding capacity to their enterprise data warehouse. Once they determined that managing data on Hadoop would be more than 5 times cheaper than on their data warehouse, they decided to cap the existing capacity of the data warehouse and implemented a strategy to deploy Hadoop to extend it. They started a data warehouse modernization project and systematically began analyzing and identifying data sets and expensive transformations using Appfluent, then offloaded them to Cloudera. The result was that they successfully capped the existing capacity of the data warehouse. They estimated that if they had not, they would have had to spend in excess of $15 million on additional capacity over an 18-month period. Instead, the Hadoop environment, now an extension of their data warehouse, costs 6-8 times less in total cost of ownership per terabyte.
This is another financial institution, one of the largest in the world. The bank had a significant amount of data hosted and batch processed on Teradata, but for them, like many Teradata customers, the cost was becoming unsustainable and they faced yet another multi-million-dollar upgrade. Having heard about Hadoop and its significantly lower cost per GB of data, they decided to migrate a loan marketing application to Cloudera’s distribution of Hadoop. While this proved the viability and massive cost savings of the Hadoop platform, they have hundreds more applications that need to be migrated. The loan application they moved across initially used Hive and HiveQL; it met the SLA but had much slower performance than Teradata and many maintainability concerns. The bank sought tools that could leverage existing staff skills (ETL) to facilitate migrating the remaining applications and avoid adding significant staff with new skills (MapReduce). The results were striking: significantly less development time was required for the DMX-h implementation of the loan project — 4 man-weeks versus 12 for the HiveQL implementation; a simplified process, with over 140 HiveQL scripts replaced by twelve graphical DMX-h jobs; and, most importantly, processing time reduced from 6 hours to 15 minutes.
So there are three key takeaways. First, be aware of the warehouse cost and capacity impacts of ELT and dormant data, and the way they affect your end users. Second, offloading ELT and unused data from your EDW to Hadoop has been proven as the lowest-risk, highest-return first project for a new Hadoop cluster, and the cost savings can justify further Hadoop investment and more moon-shot-like projects. Third, it’s three simple steps: identify, access, and deploy.
By following these simple steps, you can use an Enterprise Data Hub based on Hadoop together with your enterprise data warehouse — with Syncsort and Appfluent — to deliver something even better than the original vision of the enterprise data warehouse.