Making Hadoop Ready for Prime Time
Hadoop Summit Amsterdam March 2013
Steve Totman
Director Of Strategy
Syncsort

March 20th 2013

Photo Credit Aaron Sikkink http://www.flickr.com/people/housequakecom/
Syncsort Confidential and Proprietary - do not copy or distribute

The Big Data Continuum
Integrating Big Data… Smarter

[Diagram: five stages plotted against Data (Min to Max) and Value]

Stages: Awakening, Advancing, Plateauing, Dynamic, Evolved

Stage descriptions:
•Hand-coding: SQL, JCL. Basic ETL Tools
•Traditional BI Standardization & Heavy Platforms.
•Hitting arch limits + exponential costs. Demand for MF data. Growing MIPS
•Early Hadoop adoption, prototyping & experimentation
•Big Data is the new standard for both MF & open systems data

Challenges:
•Handcoding nightmare
•Long development cycles
•Unsustainable costs
•Hadoop connectivity & sort gaps
•Efficiency, ETL & skills gaps

Solutions:
•SQL Migration
•High-performance ETL
•ETL & Rehosting Optimization
•Hadoop Sort & Connectivity
•Hadoop ETL

Products: DMExpress, MFX
Mandatory sort steps in MapReduce processing
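
As a rough illustration of where those sort steps sit, here is a minimal single-process Python sketch of a word count. It is a simplification made for this example, not Hadoop code: it assumes a single reducer, skips partitioning and spilling to disk, and the names `map_task` and `reduce_task` are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def map_task(lines):
    """Emit (word, 1) pairs, then sort them by key (the map-side sort)."""
    pairs = [(word, 1) for line in lines for word in line.split()]
    return sorted(pairs, key=itemgetter(0))

def reduce_task(sorted_runs):
    """Merge the already-sorted map outputs, then sum the counts per key."""
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))  # reduce-side merge
    return {key: sum(count for _, count in group)
            for key, group in groupby(merged, key=itemgetter(0))}

# Two input splits, each handled by its own map task.
splits = [["big data big"], ["data sort"]]
runs = [map_task(split) for split in splits]
counts = reduce_task(runs)
print(counts)
```

The reducer can group equal keys in one pass only because every map output arrives sorted, which is why the sort cannot be skipped even for jobs that do not need ordered results.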

Smart Contributions to Improve Hadoop

Native Sort:
✗ Not modular
✗ Limited capabilities
✗ Difficult to fine-tune & configure (requires coding & compilation)

Hadoop Contribution:
✓ Modular
✓ Extensible
✓ Configurable through use of external sorters on MapReduce nodes

JIRA    Description
4807    Allow MapOutputBuffer to be pluggable
4808    Allow Reduce-side merge to be pluggable
4809    Make classes required for 2454 public
4812    Create reduce input merger plug-in
4842    Shuffle race can hang reducer
2461    HDFS file name globbing in libhdfs
4482    Backport of 2454 to MapReduce 1 & 1.2
…and more!!

[Diagram: Hadoop Node boxes, each labeled Native Sort]

First Included - Hadoop distribution, CDH4.2, on February 26th
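
The general idea behind making the sort pluggable can be sketched in Python. This is an illustration only and not Hadoop's API: `builtin_sorter`, `chunked_merge_sorter`, and `run_map_output` are hypothetical names, and the chunk size is arbitrary.

```python
import heapq
from operator import itemgetter

def builtin_sorter(records):
    """Stand-in for a framework's fixed, built-in sort."""
    return sorted(records, key=itemgetter(0))

def chunked_merge_sorter(records, chunk_size=2):
    """Hypothetical plug-in: sort small chunks, then merge the sorted chunks."""
    records = list(records)
    chunks = [sorted(records[i:i + chunk_size], key=itemgetter(0))
              for i in range(0, len(records), chunk_size)]
    return list(heapq.merge(*chunks, key=itemgetter(0)))

def run_map_output(records, sorter=builtin_sorter):
    """The framework calls whichever sorter the job was configured with."""
    return sorter(records)

records = [("b", 1), ("a", 1), ("b", 2)]
default_out = run_map_output(records)
plugged_out = run_map_output(records, sorter=chunked_merge_sorter)
print(default_out == plugged_out)
```

Both sorters honor the same contract (records in, key-ordered records out), so the caller does not change when the implementation is swapped.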
Benefits to the Community

[Chart: TeraSort Benchmark, Elapsed Time (min, 0 to 250) against File Size (GB, 0 to 5000)]

Operations: MATCH, COMPRESSION, MERGE, RANK, LOOKUP, JOIN, AGGREGATION, CDC
Data Access:

Mainframes

Today

50%

Run

Syncsort. A Bridge to Scalable, Cost-effective Big Data
Connect
•HDFS Connectivity
•Mainframe
•Teradata
•Files
•RDBMS, Appliances

Pre-process
•Sort, Join
•Aggregate
•Compress
•Partition

Facilitate
•Graphical UI
•No Manual Coding
•No Tuning

Optimize
•Up to 6x Faster Load
•Up to 2x Faster Sort
•Faster MapReduce Jobs
•Less Storage

Over 40 Years Solving Big Data Challenges with Fast. Efficient. Simple. Cost Effective DI Technology
Hourly Load into comScore’s Hadoop Cluster
Syncsort’s DMExpress saves comScore over 4TB of data per day!
That’s 1,460TB a year, or about 1.42 petabytes.
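
The yearly figure can be checked with a few lines of Python. This sketch assumes exactly 4 TB saved per day, a 365-day year, and 1,024 TB per petabyte; none of those assumptions are stated on the slide.

```python
TB_SAVED_PER_DAY = 4   # the slide says "over 4TB", so this is a lower bound
DAYS_PER_YEAR = 365
TB_PER_PB = 1024       # binary convention, assumed here

tb_per_year = TB_SAVED_PER_DAY * DAYS_PER_YEAR
pb_per_year = tb_per_year / TB_PER_PB
print(f"{tb_per_year} TB/year = {pb_per_year:.4f} PB/year")
```

Under these assumptions the result is 1,460 TB, or roughly 1.4258 PB, which the slide shortens to 1.42.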
[Chart: Input Data in Bytes and Output Data in Bytes for each hour, 1 to 24; y-axis from 50,000,000,000 to 500,000,000,000 bytes]

© comScore, Inc. Proprietary.

comScore’s Daily Trend of Event Volume

[Chart: # of panel records (0 to 6,000,000,000) and # of census records (0 to 60,000,000,000); series: Beacon Records, Panel Records]

Please Attend Mike Brown’s Session Analyzing 1.4 Trillion Events with Hadoop Tomorrow

(No elephants were harmed during
the creation of this talk but some
are now a lot faster & meaner)
Please visit our booth to register for a free evaluation

Hadoop Summit Amsterdam 2013 - Making Hadoop Ready for Prime Time - Syncsort Lightening Talk


Editor's notes

  1. Organizations typically struggle with data processing at all stages of the Big Data Continuum