High Performance ETL in a #BigData #Hadoop context
Steven Haddad – Senior Software Architect
Stéphane Heckel – Partner Manager
Hadoop User Group - September 12th 2012
Syncsort – Solving Big Data Breakpoints for 40 years
Company Track Record
• Global Software Company
• 40+ Years of Performance Innovation
• 25+ Patents related to unique and
unparalleled integration technology
Large Established Customer Base
• 16,000+ deployments
• 68 Countries
• Across all verticals
Expertise & Specialism
• Leading provider of high-performance
data integration solutions
• Data Integration Acceleration and Cost
Optimization
• Delivering Cost Reduction Initiatives
whilst delivering superior performance
• Typical TCO reduction of 50% - 75%
• Customer ROI within 12 months
• DATA SERVICES
• FINANCE
• INSURANCE & HEALTHCARE
• TRAVEL & TRANSPORT
• RETAIL
• TELECOMMUNICATIONS
A Fully Integrated Architecture for High-performance ETL
User Interface
• Task Editor │ Job Editor │ SDK
Shared File-based Metadata Repository
• Data Lineage
• Metadata Interchange
• Global Search
• Impact Analysis
Small Footprint ETL Engine
• Self-tuning Optimizer
• Native, Direct I/O Access
Install in Minutes. Deploy in Weeks. Never Tune Again.
High Performance Connectivity
• Mainframe
• Files / XML
• Appliances
• Hadoop
• Cloud
• Real Time
Template-driven Design
DMExpress Server Engine
• High Performance Transformations
• High Performance Functions
• Automatic Continuous Optimization
Syncsort’s Hadoop value proposition
Syncsort Value proposition on Hadoop
Syncsort Goes Beyond Basic Connectivity to Enhance Hadoop and Facilitate Wider Adoption
• HDFS connectivity: Ability to move data in and out of the Hadoop file system
• Enhanced usability: Ability to create jobs using the DMExpress graphical user interface and run them in the Hadoop MapReduce framework
• Contribute to the Open Source Community: Enhance the Hadoop sort framework for everyone. Make it more modular, flexible, extensible
• Accelerate Hadoop: Address existing drawbacks in Hadoop native sort by providing a simple, self-tuning alternative to increase overall MapReduce performance and facilitate ongoing development and maintenance
Syncsort Confidential and Proprietary - do not copy or distribute
Optimizing Hadoop Deployments
DMExpress delivers high-performance connectivity and
processing capabilities to optimize Hadoop environments
[Diagram: Extract (RDBMS, Appliances, Cloud, XML, Mainframe, Files) → Preprocess & Compress (Sort, Aggregate, Join, Compress, Partition) → Load (HDFS data nodes)]
[Chart: Elapsed Processing Time. Load Time (min), HDFS Put vs. DMExpress]
Connect to virtually any
source
Pre-process data to cleanse,
validate, & partition for better
and faster Hadoop processing
and significant storage savings
Load data up to 6x faster!
DMExpress – HDFS Connectivity
[Diagram labels: HDFS, DMExpress, Input]
Load HDFS
– Partition the output for parallel loading
– Makes full use of network bandwidth with
reduced elapsed time
– Hadoop/DMExpress can process wildcard
input files from HDFS
Extract HDFS
– DMExpress can read wildcard inputs in
parallel
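The "partition the output for parallel loading" point under Load HDFS can be sketched roughly as follows. This is not DMExpress itself, only a minimal illustration of the idea: the helper names (`partition_records`, `write_partitions`, `load_in_parallel`), the `|`-delimited key, and the worker count are all assumptions made for the example. The only external command used is the standard `hdfs dfs -put`, which needs a Hadoop client on the machine.

```python
import subprocess
import zlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def partition_records(lines, num_partitions,
                      key_of=lambda line: line.split("|", 1)[0]):
    """Assign each record to a partition by hashing its key (stable across runs)."""
    partitions = [[] for _ in range(num_partitions)]
    for line in lines:
        index = zlib.crc32(key_of(line).encode("utf-8")) % num_partitions
        partitions[index].append(line)
    return partitions


def write_partitions(partitions, out_dir):
    """Write one local file per partition and return the file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, records in enumerate(partitions):
        path = out_dir / f"part-{i:05d}"
        path.write_text("".join(f"{r}\n" for r in records), encoding="utf-8")
        paths.append(path)
    return paths


def hdfs_put(local_path, hdfs_dir):
    """Upload one file with `hdfs dfs -put` (requires a Hadoop client)."""
    subprocess.run(["hdfs", "dfs", "-put", str(local_path), hdfs_dir], check=True)


def load_in_parallel(paths, hdfs_dir, uploader=hdfs_put, max_workers=4):
    """Upload all partition files concurrently so several streams share the network."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces evaluation so any upload error is raised here.
        list(pool.map(lambda p: uploader(p, hdfs_dir), paths))
```

The uploader is passed in as a parameter so the partitioning logic can be tried without a cluster; on a real system the number of partitions and workers would be chosen to match the available network bandwidth.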
Distributions supported
– Cloudera CDH3u3
– Hortonworks Data Platform 1.0.7
– Greenplum HD 1.1
DMExpress Accelerates Loading HDFS
HDFS Load
– 20 partitions
– Uncompressed input file size
from 10GB to 100GB
Cluster Specifications
– Size: 10+1+1 nodes
– Hadoop distribution: CDH4
– HDFS block size: 256 MB
Hardware Specifications (Per Node)
– Red Hat EL 5.8
– 2 × Intel Xeon X5670
– 6 disks/node
– Write: 650 MB/s
– Memory: 94 GB
[Chart: HDFS Load using DMExpress]
3x-6x faster!
DMExpress Accelerates Loading HDFS
HDFS Load
– 20 partitions
– Uncompressed input file size
from 100GB to 2100GB
Cluster Specifications
– Size: 10+1+1 nodes
– Hadoop distribution: CDH4
– HDFS block size: 256 MB
Hardware Specifications (Per Node)
– Red Hat EL 5.8
– 2 × Intel Xeon X5670
– 6 disks/node
– Write: 650 MB/s
– Memory: 94 GB
[Chart: HDFS Load using DMExpress]
6x faster!
Enabling Storage Savings and Accelerating Performance with DMExpress
DMExpress is enabling comScore to:
• Load data faster into HDFS
• Store twice as much data on the cluster
• Improve overall performance by pre-sorting, cleansing and partitioning
• Achieve higher rate of parallelism
• Realize up to 75TB of data storage savings a month
32B records / day
[Diagram: Load files → Cleanse, sort, compress, partition (DMExpress) → Load to HDFS → Post-processing & analysis (Hadoop nodes)]
Michael Brown, Chief
Scientist, comScore
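The slides do not explain why sorting before compressing leads to storage savings. One commonly observed effect is that sorted data places similar records next to each other, which general-purpose compressors such as gzip exploit well. The sketch below illustrates that effect on synthetic data only; the record layout, record count, and helper names are assumptions for the example and have nothing to do with comScore's actual data or with DMExpress internals.

```python
import gzip
import random


def make_records(count, seed=0):
    """Generate synthetic 'user|page' log lines in random arrival order."""
    rng = random.Random(seed)
    return [f"{rng.randrange(1000):04d}|/page/{rng.randrange(50)}"
            for _ in range(count)]


def compressed_size(records):
    """Return the gzip-compressed size in bytes of the newline-joined records."""
    payload = "\n".join(records).encode("utf-8")
    return len(gzip.compress(payload))


records = make_records(20000)
unsorted_size = compressed_size(records)
sorted_size = compressed_size(sorted(records))

print(f"gzip, arrival order: {unsorted_size} bytes")
print(f"gzip, pre-sorted:    {sorted_size} bytes")
```

How large the gain is depends entirely on the data, so the figures printed here say nothing about the savings quoted on the slide.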
DMExpress Hadoop Integration
Contribute MapReduce code changes to Apache
Hadoop (JIRA MAPREDUCE-2454)
– Allow external sort to be plugged in
– Improve developer productivity
• Develop MapReduce jobs via DMExpress GUI
– Aggregations, cleansing/filtering, reformatting,
etc.
– Seamlessly accelerate MapReduce performance
• Replace Map output sorter
• Replace Reduce input sorter
https://issues.apache.org/jira/browse/MAPREDUCE-2454
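The idea of letting an external sort be plugged into the map output and reduce input stages can be pictured with a toy, single-process model. This is a conceptual sketch in Python, not Hadoop's API and not the change proposed in the JIRA issue: `run_job`, `map_sorter`, `reduce_merger`, and `plugged_in_sorter` are hypothetical names invented for the illustration.

```python
import heapq
from itertools import groupby
from operator import itemgetter

first = itemgetter(0)  # key of a (key, value) pair


def default_map_sorter(pairs):
    """Built-in map-side sort: order one map task's output by key."""
    return sorted(pairs, key=first)


def default_reduce_merger(sorted_runs):
    """Built-in reduce-side sort: merge already-sorted map outputs by key."""
    return heapq.merge(*sorted_runs, key=first)


def run_job(splits, mapper, reducer,
            map_sorter=default_map_sorter,
            reduce_merger=default_reduce_merger):
    """Run a tiny map/sort/merge/reduce pipeline with replaceable sort steps."""
    sorted_runs = []
    for split in splits:
        pairs = [pair for record in split for pair in mapper(record)]
        sorted_runs.append(map_sorter(pairs))      # pluggable map output sorter
    merged = reduce_merger(sorted_runs)            # pluggable reduce input sorter
    return [reducer(key, [value for _, value in group])
            for key, group in groupby(merged, key=first)]


def word_mapper(line):
    return [(word, 1) for word in line.split()]


def count_reducer(word, ones):
    return (word, sum(ones))


def plugged_in_sorter(pairs):
    """Stand-in for an external sorter; any function with this contract works."""
    pairs = list(pairs)
    pairs.sort(key=first)
    return pairs


splits = [["b a b"], ["c a"]]
result = run_job(splits, word_mapper, count_reducer, map_sorter=plugged_in_sorter)
print(result)
```

The point of the design is that the mapper and reducer are untouched: only the sort steps are swapped, which is what makes the acceleration "seamless" from the job author's point of view.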
Accelerate Development & Remove Barriers to Adoption
Use DMExpress to Accelerate Development and Optimize MapReduce Jobs
MapReduce Development:
✗ Lots of manual coding: MapReduce, Pig, Java
✗ Limited skills supply
✗ Steep learning curve
DMExpress Hadoop Edition:
✓ No coding required
✓ Leverages the same skills most IT organizations already have
✓ New resources can be trained in just 3 days
Native MapReduce DMExpress Execution
DMExpress Hadoop does not generate code (e.g., Java, Pig, Python)
DMExpress Hadoop runs natively on each data node in the cluster
– DMExpress is installed on each data node
– Same benefits as high-performance ETL
Issues with code generation
– Requires recompilation with every change
– May still require MR skills
– Ongoing issues with efficiency of generated code
[Diagram: DMX running on each node of the Hadoop cluster]
[Chart: TPC-H - Aggregation. Elapsed Time (sec) vs. File Size (GB); series: Java, Pig, DMExpress]
DMExpress Hadoop Edition Provides Significant
Performance Improvements
TPC-H Benchmark
– Filter & Aggregation
– GZIP compression
– Uncompressed input file size
from 100GB to 2.4TB
Cluster Specifications
– Size: 10+1+1 nodes
– Hadoop distribution: CDH3U2
– HDFS block size: 256 MB
Hardware Specifications (Per Node)
– Red Hat EL 5.8
– 2 × Intel Xeon X5670
– 6 disks/node
– Read: 870 MB/s, Write: 660 MB/s
– Memory: 94 GB
Almost 2x faster than Java; over 2x faster than Pig
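For readers unfamiliar with the workload shape, a "filter and aggregation" pass over gzip-compressed input looks roughly like the sketch below. The slides do not say which TPC-H query was run, so this does not reproduce the benchmark: the three-column `return_flag|ship_date|quantity` layout is a simplified, hypothetical stand-in for a TPC-H table, and `filter_and_aggregate` is a name made up for the example.

```python
import gzip
import io
from collections import defaultdict


def filter_and_aggregate(gzip_bytes, max_ship_date):
    """Keep rows shipped on or before max_ship_date, then total them per flag."""
    totals = defaultdict(lambda: {"quantity": 0, "rows": 0})
    with gzip.open(io.BytesIO(gzip_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            return_flag, ship_date, quantity = line.rstrip("\n").split("|")
            # ISO-8601 dates compare correctly as plain strings.
            if ship_date <= max_ship_date:
                group = totals[return_flag]
                group["quantity"] += int(quantity)
                group["rows"] += 1
    return dict(totals)


# Tiny hand-made sample; real benchmark inputs ranged from 100GB to 2.4TB.
sample = "\n".join([
    "A|1998-08-01|10",
    "N|1998-09-15|5",
    "A|1998-10-01|7",
    "N|1998-07-04|3",
]).encode("utf-8")

summary = filter_and_aggregate(gzip.compress(sample), max_ship_date="1998-09-15")
print(summary)
```

In a MapReduce setting the filter would run in the map phase and the per-group totals would be combined in the reduce phase; the single-process version here only shows the logic being measured.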
Conclusion
Syncsort Value proposition on Hadoop
DMExpress Hadoop Edition Benefits
High performance HDFS load and extract
– DMExpress partitioning allows taking advantage of
full network bandwidth
– High performance parallel load from HDFS to GP
DB
Integration with diverse set of sources
– Files, DBMS, mainframe
Ease of development (GUI vs. Java/Pig)
High performance ETL operations (MapReduce)
– Aggregation, sort, filter, copy, reformatting, join,
merge
Seamless high performance sort
Thank you

Syncsort and the ComScore experience report

  • 1. High Performance ETL in a #BigData #Hadoop context
      Steven Haddad – Senior Software Architect
      Stéphane Heckel – Partner Manager
      Hadoop User Group - September 12th 2012
  • 2. Syncsort – Solving Big Data Breakpoints for 40 years
      Company Track Record
      • Global Software Company
      • 40+ Years of Performance Innovation
      • 25+ Patents related to unique and unparalleled integration technology
      Large Established Customer Base
      • 16,000+ deployments
      • 68 Countries
      • Across all verticals
      Expertise & Specialism
      • Leading provider of high-performance data integration solutions
      • Data Integration Acceleration and Cost Optimization
      • Delivering Cost Reduction Initiatives whilst delivering superior performance
      • Typical TCO reduction of 50% - 75%
      • Customer ROI within 12 months
      Data Services • Finance • Insurance & Healthcare • Travel & Transport • Retail • Telecommunications
  • 3. A Fully Integrated Architecture for High-performance ETL
      Install in Minutes. Deploy in Weeks. Never Tune Again.
      User Interface: Task Editor │ Job Editor, SDK
      Template-driven Design
      Shared File-based Metadata Repository: Data Lineage, Metadata Interchange, Global Search, Impact Analysis
      DMExpress Server Engine: High Performance Transformations, High Performance Functions, Automatic Continuous Optimization
      Small Footprint ETL Engine, Self-tuning Optimizer, Native Direct I/O Access
      High Performance Connectivity: Mainframe, Files / XML, Appliances, Hadoop, Cloud, Real Time
  • 4. Syncsort’s Hadoop value proposition
  • 5. Syncsort Goes Beyond Basic Connectivity to Enhance Hadoop and Facilitate Wider Adoption
      • HDFS connectivity: Ability to move data in and out of the Hadoop file system
      • Enhanced usability: Ability to create jobs using the DMExpress graphical user interface and run them in the Hadoop MapReduce framework
      • Contribute to the Open Source Community: Enhance the Hadoop sort framework for everyone. Make it more modular, flexible, extensible
      • Accelerate Hadoop: Address existing drawbacks in the Hadoop native sort by providing a simple, self-tuning alternative to increase overall MapReduce performance and facilitate ongoing development and maintenance
      Syncsort Confidential and Proprietary - do not copy or distribute
  • 6. Optimizing Hadoop Deployments
      DMExpress delivers high-performance connectivity and processing capabilities to optimize Hadoop environments
      – Extract: Connect to virtually any source (RDBMS, Appliances, Cloud, XML, Mainframe, Files)
      – Preprocess & Compress (Sort, Aggregate, Join, Compress, Partition): Pre-process data to cleanse, validate, & partition for better and faster Hadoop processing and significant storage savings
      – Load (HDFS data nodes): Load data up to 6x faster!
      [Chart: Elapsed Processing Time. Load Time (min), HDFS Put vs. DMExpress]
  • 7. DMExpress – HDFS Connectivity
      Load HDFS
      – Partition the output for parallel loading
      – Makes full use of network bandwidth with reduced elapsed time
      – Hadoop/DMExpress can process wildcard input files from HDFS
      Extract HDFS
      – DMExpress can read wildcard inputs in parallel
      Distributions supported
      – Cloudera CDH3u3
      – Hortonworks Data Platform 1.0.7
      – Greenplum HD 1.1
      [Diagram: Input, DMExpress, HDFS]
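
To make "partition the output for parallel loading" concrete, here is a minimal Python sketch of the general idea. It is not DMExpress and says nothing about how DMExpress does this internally: records are split into a fixed number of partitions by a stable hash of their key, and one `hdfs dfs -put` command is prepared per partition file. The function names, the comma-separated record layout, and the paths `/tmp/out/part` and `/data/in` are all hypothetical; the commands are only built, never executed.

```python
import zlib
from collections import defaultdict


def partition_records(records, num_partitions, key_of=lambda r: r.split(",", 1)[0]):
    """Assign each record to a partition by a stable hash of its key.

    Assumption: records are comma-separated strings whose first field is the key.
    """
    partitions = defaultdict(list)
    for record in records:
        # zlib.crc32 is stable across runs, unlike the built-in hash() for str.
        index = zlib.crc32(key_of(record).encode("utf-8")) % num_partitions
        partitions[index].append(record)
    return [partitions[i] for i in range(num_partitions)]


def put_commands(local_prefix, hdfs_dir, num_partitions):
    """Build one `hdfs dfs -put <local> <dest>` command per partition file.

    The commands are returned as argument lists and are not executed here.
    """
    return [
        ["hdfs", "dfs", "-put", f"{local_prefix}-{i:05d}", f"{hdfs_dir}/part-{i:05d}"]
        for i in range(num_partitions)
    ]


# Hypothetical sample data: 100 events spread over 7 users.
records = [f"user{i % 7},event{i}" for i in range(100)]
parts = partition_records(records, 4)
commands = put_commands("/tmp/out/part", "/data/in", 4)
```

Each command list could then be handed to `subprocess.run` from a thread pool, so that several partition files are written to HDFS at the same time instead of one large file being written sequentially.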
  • 8. DMExpress Accelerates Loading HDFS
      HDFS Load using DMExpress: 3x-6x Faster!
      HDFS Load
      – 20 partitions
      – Uncompressed input file size from 10GB to 100GB
      Cluster Specifications
      – Size: 10+1+1 nodes
      – Hadoop distribution: CDH4
      – HDFS block size: 256 MB
      Hardware Specifications (Per Node)
      – Red Hat EL 5.8
      – 2 x Intel Xeon X5670
      – 6 disks/node
      – Write: 650 MB/s
      – Memory: 94 GB
  • 9. DMExpress Accelerates Loading HDFS
      HDFS Load using DMExpress: 6x Faster!
      HDFS Load
      – 20 partitions
      – Uncompressed input file size from 100GB to 2100GB
      Cluster Specifications
      – Size: 10+1+1 nodes
      – Hadoop distribution: CDH4
      – HDFS block size: 256 MB
      Hardware Specifications (Per Node)
      – Red Hat EL 5.8
      – 2 x Intel Xeon X5670
      – 6 disks/node
      – Write: 650 MB/s
      – Memory: 94 GB
  • 10. Enabling Storage Savings and Accelerating Performance with DMExpress
      DMExpress is enabling comScore to:
      • Load data faster into HDFS
      • Store twice as much data on the cluster
      • Improve overall performance by pre-sorting, cleansing and partitioning
      • Achieve a higher rate of parallelism
      • Realize up to 75TB of data storage savings a month
      [Diagram: 32B records / day. Load files → DMExpress (cleanse, sort, compress, partition) → Load to HDFS → Hadoop nodes (post-processing & analysis)]
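
One reason pre-sorting before compression can save storage is that sorting groups identical or similar records together, which general-purpose compressors such as gzip exploit well. The sketch below illustrates only that general effect with Python's standard `gzip` module on synthetic data. The key names and record count are made up, it is not a reproduction of the comScore figures, and it only applies where record order does not matter.

```python
import gzip
import random


def gzip_size(lines):
    """Return the size in bytes of the gzip-compressed, newline-joined lines."""
    return len(gzip.compress("\n".join(lines).encode("utf-8")))


# Synthetic data (assumption): 10,000 records drawn from 100 distinct keys.
rng = random.Random(42)
keys = [f"site{n:03d}.example.com" for n in range(100)]
records = [rng.choice(keys) for _ in range(10_000)]

unsorted_size = gzip_size(records)
sorted_size = gzip_size(sorted(records))

# Sorting places equal keys next to each other, so the compressor sees long runs.
print(f"unsorted: {unsorted_size} bytes, sorted: {sorted_size} bytes")
```

The actual ratio depends heavily on the data. Highly repetitive keys, as here, exaggerate the gain compared with typical production records.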
  • 12. DMExpress Hadoop Integration
      Contribute MapReduce code changes to Apache Hadoop (JIRA MAPREDUCE-2454)
      – Allow external sort to be plugged in
      – Improve developer productivity
        • Develop MapReduce jobs via DMExpress GUI
        • Aggregations, cleansing/filtering, reformatting, etc.
      – Seamlessly accelerate MapReduce performance
        • Replace Map output sorter
        • Replace Reduce input sorter
      https://issues.apache.org/jira/browse/MAPREDUCE-2454
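
The idea of letting an external sort be plugged in can be illustrated with a toy map/sort/reduce pipeline in which the sort step is a replaceable function. This is a conceptual Python sketch only: it is not Hadoop's API, not the MAPREDUCE-2454 patch, and not DMExpress. All names (`run_job`, `builtin_sorter`, `heap_sorter`) are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def builtin_sorter(pairs):
    """Default sorter: order (key, value) pairs by key."""
    return sorted(pairs, key=itemgetter(0))


def heap_sorter(pairs):
    """Stand-in for an external sorter; any callable with the same contract works."""
    # The running index breaks ties so values are never compared and order is stable.
    heap = [(key, index, value) for index, (key, value) in enumerate(pairs)]
    heapq.heapify(heap)
    ordered = []
    while heap:
        key, _, value = heapq.heappop(heap)
        ordered.append((key, value))
    return ordered


def run_job(records, mapper, reducer, sorter=builtin_sorter):
    """Toy map -> sort -> reduce pipeline with a pluggable sort step."""
    mapped = [pair for record in records for pair in mapper(record)]
    ordered = sorter(mapped)  # the pluggable step
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


def word_mapper(line):
    """Emit (word, 1) for every whitespace-separated word."""
    return [(word, 1) for word in line.split()]


def sum_reducer(key, values):
    """Sum the counts for one key."""
    return (key, sum(values))


lines = ["sort join sort", "join merge sort"]
default_result = run_job(lines, word_mapper, sum_reducer)
plugged_result = run_job(lines, word_mapper, sum_reducer, sorter=heap_sorter)
```

Because the job only depends on the sorter's contract (pairs in, pairs ordered by key out), swapping the implementation changes performance characteristics without changing the mapper, the reducer, or the result.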
  • 13. DMExpress Accelerates HDFS Loading
      HDFS Load using DMExpress: 6x Faster!
      HDFS Load
      – 20 partitions
      – Uncompressed input file size from 100GB to 2100GB
      Cluster Specifications
      – Size: 10+1+1 nodes
      – Hadoop distribution: CDH4
      – HDFS block size: 256 MB
      Hardware Specifications (Per Node)
      – Red Hat EL 5.8
      – 2 x Intel Xeon X5670
      – 6 disks/node
      – Write: 650 MB/s
      – Memory: 94 GB
  • 14. Accelerate Development & Remove Barriers to Adoption
      Use DMExpress to Accelerate Development and Optimize MapReduce Jobs
      MapReduce Development:
      ✗ Lots of manual coding: MapReduce, Pig, Java
      ✗ Limited skills supply
      ✗ Heavy learning curve
      DMExpress Hadoop Edition:
      ✓ No coding required
      ✓ Leverages the same skills most IT organizations already have
      ✓ New resources can be trained in just 3 days
  • 15. Native MapReduce DMExpress Execution
      DMExpress Hadoop does not generate code (i.e., Java, Pig, Python)
      DMExpress Hadoop runs natively on each data node in the cluster
      – DMExpress is installed on each data node
      – Same benefits as High-performance ETL
      Issues with code generation
      – Requires re-compilation with every change
      – May still require MR skills
      – Ongoing issues with efficiency of generated code
      [Diagram: Hadoop cluster with DMX running on each node]
  • 16. DMExpress Hadoop Edition Provides Significant Performance Improvements
      TPC-H Benchmark: Almost 2x faster than Java; over 2x faster than Pig
      [Chart: TPC-H - Aggregation. Elapsed Time (sec) vs. File Size (GB); series: Java, Pig, DMExpress]
      TPC-H Benchmark
      – Filter & Aggregation
      – GZIP compression
      – Uncompressed input file size from 100GB to 2.4TB
      Cluster Specifications
      – Size: 10+1+1 nodes
      – Hadoop distribution: CDH3U2
      – HDFS block size: 256 MB
      Hardware Specifications (Per Node)
      – Red Hat EL 5.8
      – 2 x Intel Xeon X5670
      – 6 disks/node
      – Read: 870 MB/s, Write: 660 MB/s
      – Memory: 94 GB
  • 18. DMExpress Hadoop Edition Benefits
      High performance HDFS load and extract
      – DMExpress partitioning allows taking advantage of full network bandwidth
      – High performance parallel load from HDFS to GP DB
      Integration with diverse set of sources
      – Files, DBMS, mainframe
      Ease of development (GUI vs. Java/Pig)
      High performance ETL operations (MapReduce)
      – Aggregation, sort, filter, copy, reformatting, join, merge
      Seamless high performance sort