This document summarizes Syncsort's high-performance data integration solutions for Hadoop environments. Syncsort has more than 40 years of experience in performance innovation. Its DMExpress product provides high-speed connectivity to Hadoop and accelerates ETL workflows. It uses partitioning and parallelization to load data into HDFS up to 6x faster than the native HDFS put. DMExpress also improves usability with a graphical interface and accelerates MapReduce jobs by replacing the native sort. Syncsort cites typical TCO reductions of 50-75% and customer ROI within 12 months for DMExpress deployments.
Syncsort and the comScore case study
1. High Performance ETL in a #BigData #Hadoop context
Steven Haddad – Senior Software Architect
Stéphane Heckel – Partner Manager
Hadoop User Group - September 12th 2012
2. Syncsort – Solving Big Data Breakpoints for 40 years
Company Track Record
• Global Software Company
• 40+ Years of Performance Innovation
• 25+ Patents related to unique and unparalleled integration technology
Large Established Customer Base
• 16,000+ deployments
• 68 Countries
• Across all verticals
Expertise & Specialism
• Leading provider of high-performance data integration solutions
• Data Integration Acceleration and Cost Optimization
• Delivering Cost Reduction Initiatives whilst delivering superior performance
• Typical TCO reduction of 50% - 75%
• Customer ROI within 12 months
• DATA SERVICES
• FINANCE
• INSURANCE & HEALTHCARE
• TRAVEL & TRANSPORT
• RETAIL
• TELECOMMUNICATIONS
3. A Fully Integrated Architecture for High-performance ETL
User Interface
• Task Editor │ Job Editor
• SDK
• Template-driven Design
Shared File-based Metadata Repository
• Data Lineage
• Metadata Interchange
• Global Search
• Impact Analysis
High Performance Connectivity
• Mainframe
• Files / XML
• Appliances
• Hadoop
• Cloud
• Real Time
DMExpress Server Engine
• Small Footprint ETL Engine
• Self-tuning Optimizer
• Native, Direct I/O Access
• High Performance Transformations
• High Performance Functions
• Automatic Continuous Optimization
Install in Minutes. Deploy in Weeks. Never Tune Again.
5. Syncsort Goes Beyond Basic Connectivity to Enhance Hadoop and Facilitate Wider Adoption
• HDFS connectivity: Ability to move data in & out of the Hadoop file system
• Enhanced usability: Ability to create jobs using the DMExpress graphical user interface and run them in the Hadoop MapReduce framework
• Contribute to the Open Source Community: Enhance the Hadoop sort framework for everyone. Make it more modular, flexible, extensible
• Accelerate Hadoop: Address existing drawbacks in the Hadoop native sort by providing a simple, self-tuning alternative to increase overall MapReduce performance and facilitate ongoing development and maintenance
Syncsort Confidential and Proprietary - do not copy or distribute
6. Optimizing Hadoop Deployments
DMExpress delivers high-performance connectivity and processing capabilities to optimize Hadoop environments
Extract → Preprocess & Compress → Load
• Sources: RDBMS, Appliances, Cloud, XML, Mainframe, Files
• Preprocessing: Sort, Aggregate, Join, Compress, Partition
• Target: HDFS (Data Nodes)
– Connect to virtually any source
– Pre-process data to cleanse, validate, & partition for better and faster Hadoop processing and significant storage savings
– Load data up to 6x faster!
[Chart: Elapsed Processing Time. Load Time (min), axis 0 to 150; HDFS Put vs. DMExpress]
7. DMExpress – HDFS Connectivity
Load HDFS
– Partition the output for parallel loading
– Makes full use of network bandwidth with reduced elapsed time
– Hadoop/DMExpress can process wildcard input files from HDFS
Extract HDFS
– DMExpress can read wildcard inputs in parallel
Distributions supported
– Cloudera CDH3u3
– Hortonworks Data Platform 1.0.7
– Greenplum HD 1.1
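The idea of partitioning output for parallel loading, and of reading wildcard inputs in parallel, can be sketched as follows. This is only an illustration of the general technique, not DMExpress itself: a local directory stands in for HDFS, and the function names (`partition_records`, `parallel_load`, `parallel_extract`) and the `part-NNNNN` file naming are assumptions made for the example.

```python
import glob
import os
from concurrent.futures import ThreadPoolExecutor


def partition_records(records, num_partitions):
    """Split records round-robin into num_partitions lists."""
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[i % num_partitions].append(record)
    return partitions


def write_partition(target_dir, index, records):
    """Write one partition to its own file (stands in for one write stream)."""
    path = os.path.join(target_dir, f"part-{index:05d}")
    with open(path, "w", encoding="utf-8") as out:
        for record in records:
            out.write(record + "\n")
    return path


def parallel_load(records, target_dir, num_partitions=4):
    """Partition the data, then write all partitions concurrently."""
    partitions = partition_records(records, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        futures = [
            pool.submit(write_partition, target_dir, i, part)
            for i, part in enumerate(partitions)
        ]
        return [f.result() for f in futures]


def read_file(path):
    """Read all lines of one file, without trailing newlines."""
    with open(path, encoding="utf-8") as src:
        return src.read().splitlines()


def parallel_extract(pattern):
    """Expand a wildcard pattern and read the matching files concurrently."""
    paths = sorted(glob.glob(pattern))
    with ThreadPoolExecutor(max_workers=max(1, len(paths))) as pool:
        chunks = list(pool.map(read_file, paths))
    return [line for chunk in chunks for line in chunk]
```

With several independent write streams, no single stream has to carry the whole file, which is the intuition behind "makes full use of network bandwidth".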
10. Enabling Storage Savings and Accelerating Performance with DMExpress
DMExpress is enabling comScore to:
• Load data faster into HDFS
• Store twice as much data on the cluster
• Improve overall performance by pre-sorting, cleansing and partitioning
• Achieve higher rate of parallelism
• Realize up to 75TB of data storage savings a month
Volume: 32B records / day
Data flow: Load files → DMExpress (cleanse, sort, compress, partition) → Load to HDFS (Hadoop nodes) → Post-processing & analysis
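The preprocessing stage described on this slide (cleanse, sort, compress, partition before loading) could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not comScore's or Syncsort's pipeline: records are assumed to be comma-separated text with the key in the first field, and the function names and the CRC32-based partitioning are choices made for the example.

```python
import gzip
import zlib


def cleanse(records):
    """Strip surrounding whitespace and drop blank records."""
    return [r.strip() for r in records if r.strip()]


def partition_by_key(records, num_partitions):
    """Assign each record to a partition by a stable hash of its first field."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        key = record.split(",")[0]  # assumed layout: key is the first field
        bucket = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[bucket].append(record)
    return partitions


def preprocess(records, num_partitions=4):
    """Cleanse, partition, sort each partition, then gzip it."""
    blobs = []
    for part in partition_by_key(cleanse(records), num_partitions):
        payload = "\n".join(sorted(part)).encode("utf-8")
        blobs.append(gzip.compress(payload))
    return blobs
```

Sorting before compressing places similar records next to each other, which generally helps general-purpose compressors; the actual savings depend on the data.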
12. DMExpress Hadoop Integration
• Contribute MapReduce code changes to Apache Hadoop (JIRA MAPREDUCE-2454)
– Allow external sort to be plugged in
– Improve developer productivity
• Develop MapReduce jobs via DMExpress GUI
– Aggregations, cleansing/filtering, reformatting, etc.
– Seamlessly accelerate MapReduce performance
• Replace Map output sorter
• Replace Reduce input sorter
https://issues.apache.org/jira/browse/MAPREDUCE-2454
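To make the "pluggable sort" idea concrete, here is a toy, in-memory model of a MapReduce job in which both the map output sorter and the reduce input sorter are parameters that can be swapped. This is a conceptual sketch only: it does not use Hadoop's API, and every name in it (`run_job`, `default_map_sorter`, `default_reduce_sorter`) is hypothetical.

```python
import heapq
import itertools


def default_map_sorter(pairs):
    """Sort one map task's (key, value) output by key."""
    return sorted(pairs, key=lambda kv: kv[0])


def default_reduce_sorter(sorted_runs):
    """Merge already-sorted map outputs into one key-ordered list."""
    return list(heapq.merge(*sorted_runs, key=lambda kv: kv[0]))


def run_job(splits, mapper, reducer,
            map_sorter=default_map_sorter,
            reduce_sorter=default_reduce_sorter):
    """Run a toy job; each split plays the role of one map task."""
    runs = []
    for split in splits:
        output = [kv for record in split for kv in mapper(record)]
        runs.append(map_sorter(output))  # pluggable map output sorter
    merged = reduce_sorter(runs)         # pluggable reduce input sorter
    return [
        reducer(key, [value for _, value in group])
        for key, group in itertools.groupby(merged, key=lambda kv: kv[0])
    ]
```

Because the mapper and reducer never call a sorter directly, an alternative sort implementation can be passed in without changing job logic, which is the property the slide describes as "seamless".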
13. DMExpress Accelerates HDFS Loading
HDFS Load
– 20 partitions
– Uncompressed input file size from 100GB to 2100GB
Cluster Specifications
– Size: 10+1+1 nodes
– Hadoop distribution: CDH4
– HDFS block size: 256 MB
Hardware Specifications (Per Node)
– Red Hat EL 5.8
– 2 × Intel Xeon X5670
– 6 disks/node
– Write: 650 MB/s
– Memory: 94 GB
[Chart: HDFS Load using DMExpress]
6x Faster!
14. Accelerate Development & Remove Barriers to Adoption
Use DMExpress to Accelerate Development and Optimize MapReduce Jobs
MapReduce Development:
✗ Lots of manual coding: MapReduce, Pig, Java
✗ Limited skills supply
✗ Heavy learning curve
DMExpress Hadoop Edition:
✓ No coding required
✓ Leverages the same skills most IT organizations already have
✓ New resources can be trained in just 3 days
15. Native MapReduce DMExpress Execution
• DMExpress Hadoop does not generate code (e.g., Java, Pig, Python)
• DMExpress Hadoop runs natively on each data node in the cluster
– DMExpress is installed on each data node
– Same benefits as High-performance ETL
• Issues with code generation
– Requires re-compilation with every change
– May still require MR skills
– Ongoing issues with efficiency of generated code
[Diagram: Hadoop Cluster with DMX running on each data node]
16. DMExpress Hadoop Edition Provides Significant Performance Improvements
TPC-H Benchmark
– Filter & Aggregation
– GZIP compression
– Uncompressed input file size from 100GB to 2.4TB
Cluster Specifications
– Size: 10+1+1 nodes
– Hadoop distribution: CDH3U2
– HDFS block size: 256 MB
Hardware Specifications (Per Node)
– Red Hat EL 5.8
– 2 × Intel Xeon X5670
– 6 disks/node
– Read: 870 MB/s, Write: 660 MB/s
– Memory: 94 GB
[Chart: TPC-H - Aggregation. Elapsed Time (sec) vs. File Size (GB), both axes 0 to 3000; series: Java, Pig, DMExpress]
Almost 2x faster than Java; over 2x faster than Pig
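For readers unfamiliar with the workload shape, a filter-and-aggregation pass over GZIP-compressed input can be sketched as below. The slide does not say which TPC-H query was run, so the three-field, pipe-delimited row layout (flag, quantity, ship date) and the function name are assumptions made purely for illustration.

```python
import gzip
from collections import defaultdict


def filter_and_aggregate(gz_bytes, max_ship_date):
    """Sum quantity per flag for rows shipped on or before max_ship_date."""
    totals = defaultdict(float)
    text = gzip.decompress(gz_bytes).decode("utf-8")
    for line in text.splitlines():
        # Assumed layout: flag|quantity|ship_date (ISO date string)
        flag, quantity, ship_date = line.split("|")
        if ship_date <= max_ship_date:  # ISO dates order correctly as strings
            totals[flag] += float(quantity)
    return dict(totals)
```

In a MapReduce setting the filter would typically run in the map phase and the per-key sums in the reduce phase, with the sort in between being the part the benchmark stresses.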
18. DMExpress Hadoop Edition Benefits
High performance HDFS load and extract
– DMExpress partitioning allows taking advantage of full network bandwidth
– High performance parallel load from HDFS to GP DB
Integration with diverse set of sources
– Files, DBMS, mainframe
Ease of development (GUI vs. Java/Pig)
High performance ETL operations (MapReduce)
– Aggregation, sort, filter, copy, reformatting, join, merge
Seamless high performance sort