How to Succeed in Hadoop

© comScore, Inc. Proprietary.
Syncsort & MapR @ comScore
Michael Brown, CTO | July 9th, 2014

The comScore Story
Analytics for a Digital World™

The Digital World is Complex

comScore’s Mission
Be the Leader in Digital Media Analytics.
Measure all forms of media (content and advertising) at scale, across all platforms, in real time, globally.

comScore Brings it Together
PC/Mac, Tablet, TV, Smartphone, Gaming

comScore is a leading internet technology company that provides Analytics for a Digital World™
NASDAQ: SCOR
Clients: 2,400+ Worldwide
Employees: 1,200+
Headquarters: Reston, Virginia, USA
Global Coverage: Measurement from 172 Countries; 44 Markets Reported
Local Presence: 32 Locations in 23 Countries

Providing Analytics For More Than 2,400 Clients Globally
Media, Agencies, Telecom/Mobile, Financial, Retail, Travel, CPG, Health, Technology

Turning Big Data into Powerful Insight
Collection
• Census: Tags & Data Feeds
• Panels: PC, iOS, Android
• Survey: Non-behavioral elements
Calibration
• Methods: Aggregation, Dictionaries, Taxonomies
• Models: Weighting, Projection, De-Duplication, Attribution
Delivery
• Syndicated Data Platform: Media Metrix, vCE
• Consulting: Analysis
• Client Analytics Platform: Digital Analytix

Panel Heat Map

Average Records Captured per Day (2005-2009)
[Chart: average records captured per day, 9/26/2005 through 3/26/2009; y-axis from 0 to 1,800,000,000 records]

Unified Digital Measurement™ (UDM) Establishes Platform For Panel + Census Data Integration
Unified Digital Measurement (UDM): Patent-Pending Methodology
PANEL: Global PERSON Measurement
CENSUS: Global DEVICE Measurement
Adopted by 90% of Top 100 U.S. Media Properties

Beacon Heat Map

Monthly Records Collection
[Chart: number of records collected per month, split into Beacon Records and Panel Records; y-axis from 0 to 2,000 billion]
Total records collected in June 2014 = 1,726,563,202,649
Total records collected YTD 2014 = 10,037,131,368,475

DMX @ comScore

DMX use at comScore
Purchased our first 4 licenses in 2000!
We use DMX from Syncsort across hundreds of servers for efficient data
processing and aggregation.
We currently run more than 100 unique jobs every day.
With these jobs we process over 150 billion rows of data through DMX!
Connect, Design, Process, Accelerate

Compression w/Sorting
Compress Log Files when processing large volumes of log data
Several advantages to sorting data first:
• Reduces the size of the data
• Improves application performance
Examples:
• 1 hour of one source of our data: 2,315 GB raw (2.9 billion rows)
• Standard compression of time-ordered data: 509 GB (22% of original)
• Standard compression on a sorted set: 324 GB (14% of original)
When applied to all our sources we save:
• 5.0 TB per day
• 155 TB per month
• 460 TB per quarter
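
The effect described on this slide can be reproduced on a small scale. The sketch below is only an illustration under stated assumptions: the rows are synthetic, the field layout (`cookie`, site, event) and the helper names `make_rows`, `compressed_size` and `percent_of_original` are hypothetical, and `zlib` stands in for whatever "standard compression" was actually used. It compresses the same rows once in arrival order and once sorted, then rechecks the percentages quoted above.

```python
import random
import zlib


def make_rows(n_rows=20000, n_cookies=500, seed=7):
    """Build synthetic tab-separated log rows in random (arrival) order."""
    rng = random.Random(seed)
    sites = ["news.example", "shop.example", "video.example", "mail.example"]
    events = ["view", "click", "play"]
    return [
        "cookie%06d\t%s\t%s"
        % (rng.randrange(n_cookies), rng.choice(sites), rng.choice(events))
        for _ in range(n_rows)
    ]


def compressed_size(rows):
    """Size in bytes of the rows after DEFLATE compression (zlib level 6)."""
    return len(zlib.compress("\n".join(rows).encode("utf-8"), 6))


def percent_of_original(compressed, raw):
    """Compressed size as a rounded percentage of the raw size."""
    return round(100.0 * compressed / raw)


rows = make_rows()
raw_size = len("\n".join(rows).encode("utf-8"))
unsorted_size = compressed_size(rows)
sorted_size = compressed_size(sorted(rows))  # same rows, sorted before compressing

print("raw bytes:           ", raw_size)
print("compressed, unsorted:", unsorted_size)
print("compressed, sorted:  ", sorted_size)

# Recheck the figures quoted on the slide (sizes in GB).
print(percent_of_original(509, 2315), "% of original, time ordered")
print(percent_of_original(324, 2315), "% of original, sorted")
```

Sorting places identical and near-identical rows next to each other, which gives a dictionary-based compressor longer matches to exploit. The exact ratio depends heavily on the data, so the synthetic numbers should not be read as a prediction for real logs.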

Hadoop @ comScore

Why Hadoop?
• comScore built our own distributed computing stack in 2002.
• In 2009 we decided it was better to leverage the efforts of the Hadoop community instead of building our own stack.
• We recognized the benefit of switching to Hadoop, which would allow for seamless scaling of our infrastructure to meet the needs of the business.
• Hadoop allows us to add compute, storage and memory linearly and to process data at tremendous scale.
• Partnered with Syncsort on their Hadoop efforts since Oct 2010
• Evaluated the beta of MapR in the fall of 2011

90 Days of Data
[Chart: 90 days of data by year; y-axis from 0 to 6,000 trillion; x-axis labels 2009, 2010, 2011, 2012, 2013, 2014, 2016; data labels 1,148, 1,919, 3,049, 4,862, 5,084]

High Level Data Flow
Panel
Census
Custom Code +
ADW
EDW
Delivery

Our Cluster
Production Hadoop Cluster
• 400+ nodes: mix of Dell R720xd, R710 and R510 servers
• Each R720xd has 24 x 1.2 TB drives, 128 GB RAM and 32 cores
• 13,800+ total CPUs
• 31.6 TB total memory
• 8.2 PB total disk space
• Our distro is MapR M5 2.1.3

Leveraging Partitions from MapR

Validation Funnel & Target Effectiveness

Our growth
As our volume has grown, we have the following stats:
• Over 683 billion events per month
• Daily aggregate: 1.8 billion
• 160 billion aggregate records for 92 days
• 146K campaigns
• Over 50 countries
• We see 15 billion distinct cookies in a month
• We only need to output 26 million rows

Solution to reduce the shuffle
The Problem:
• Most aggregations within comScore cannot take advantage of combiners, leading to large shuffles and job performance issues
The Idea:
• Partition and sort the data by cookie on a daily basis
• Create a custom InputFormat to merge daily partitions for monthly aggregations
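
The merge step in this idea can be sketched outside Hadoop. The example below is a pure-Python analogy, not the custom InputFormat itself: it assumes each daily partition is a stream of `(cookie, count)` pairs already sorted by cookie, and the function name `monthly_totals` and the sample data are hypothetical. Because every input is sorted on the same key, a streaming k-way merge brings all records for a cookie together without a shuffle.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def monthly_totals(daily_partitions):
    """Merge per-day (cookie, count) streams, each sorted by cookie,
    and sum the counts for every cookie in a single pass."""
    merged = heapq.merge(*daily_partitions, key=itemgetter(0))
    for cookie, group in groupby(merged, key=itemgetter(0)):
        yield cookie, sum(count for _, count in group)


# Three small "daily partitions", each already sorted by cookie.
day1 = [("a", 2), ("b", 1), ("d", 4)]
day2 = [("a", 1), ("c", 3)]
day3 = [("b", 5), ("d", 1)]

print(list(monthly_totals([day1, day2, day3])))
# [('a', 3), ('b', 6), ('c', 3), ('d', 5)]
```

The merge only ever holds one record per input stream in memory, which is why sorting the daily data up front pays off at aggregation time.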

Custom Input Format with Map Side Aggregation
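[Diagram: map, combiner and reduce stages over key groups A, B and C. In one flow the mappers receive mixed keys (CB, BA, AC); in the other each map task and its combiner handle a single key group before the reduce.]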

Risks for Partitioning
Data locality
• Custom InputFormat requires reading blocks of the partitioned data over the network
• This was solved using a feature of the MapR file system. We created volumes and set the chunk size to zero, which guarantees that the data written to a volume will stay on one node
Map failures might result in long run times
• Size of the map inputs is no longer set by block size
• This was solved by creating a large number (10K) of volumes to limit the size of data processed by each mapper
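
The slides do not say how cookies are assigned to the roughly 10K volumes. One common approach, shown here purely as an assumption, is a stable hash of the cookie ID modulo the number of volumes. The names `volume_for` and `split_by_volume` are hypothetical, and nothing below touches the MapR file system.

```python
import zlib

NUM_VOLUMES = 10_000  # the slide mentions roughly 10K volumes


def volume_for(cookie, num_volumes=NUM_VOLUMES):
    """Map a cookie ID to a volume index with a stable hash (assumed scheme)."""
    return zlib.crc32(cookie.encode("utf-8")) % num_volumes


def split_by_volume(records, num_volumes=NUM_VOLUMES):
    """Group (cookie, value) records by volume index; each bucket is sorted by cookie."""
    buckets = {}
    for cookie, value in records:
        buckets.setdefault(volume_for(cookie, num_volumes), []).append((cookie, value))
    for bucket in buckets.values():
        bucket.sort()
    return buckets


records = [("cookie-1", 3), ("cookie-2", 1), ("cookie-1", 7), ("cookie-3", 2)]
for volume, bucket in sorted(split_by_volume(records, num_volumes=4).items()):
    print(volume, bucket)
```

With a fixed mapping like this, the same cookie lands in the same volume every day, so one mapper can merge that volume's daily files, and a larger number of volumes keeps each mapper's input (and the cost of a retry) small.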

Partitioning Summary
Benefits:
• A large portion of the aggregation can be completed in the map phase
• Applications can now take advantage of combiners
• Shuffle sizes are minimal
Results:
• Took a job from 35 hours to 3 hours with no hardware changes
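
A toy calculation shows why partitioning by key makes combiners effective. This is a simplified model with made-up data, not a measurement: it assumes a combiner emits one record per distinct key in its split, and the function name `shuffle_records` is hypothetical.

```python
def shuffle_records(splits):
    """Records leaving the map phase if each split is combined locally:
    one record per distinct key per split."""
    return sum(len(set(split)) for split in splits)


# 60 events spread evenly over 6 cookies.
events = ["c%d" % (i % 6) for i in range(60)]

# Arbitrary splits: 5 mappers, and every mapper sees every cookie.
arbitrary = [events[k::5] for k in range(5)]

# Splits partitioned by cookie: 3 mappers, each cookie lives in exactly one split.
partitioned = [[e for e in events if int(e[1:]) % 3 == k] for k in range(3)]

print("no combiner:        ", len(events))                    # 60
print("combiner, arbitrary:", shuffle_records(arbitrary))     # 30
print("combiner, by cookie:", shuffle_records(partitioned))   # 6
```

When keys are scattered across splits, each combiner can only collapse the few repeats it happens to see. When each key lives in one split, the combiner output is already the final per-key result, so the shuffle carries one record per key.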

DMX-h @ comScore

Reasons for comScore selecting DMX-h
Performance
• DMX-h as the pluggable sort in Hadoop allows us to increase throughput on our existing platform; this reduces capital and ongoing operational expenses.
• The increase in throughput also allows us to deliver our data more quickly to our customers, which makes the data more valuable to them.
Speed of Development
• The ability to quickly build out applications in the DMX-h GUI allows us to iterate and respond more quickly to the needs of the business.
• The ease of development also allows us to democratize access to the Hadoop platform by leveraging a point-and-click GUI.

Performance - DMX-h Pluggable Sort Testing Results
First Comparison Run on our Dev Cluster
Pig scripts called with the Syncsort plug-in
GroupBy / Distinct Operations
• Counting uniques
• These have large shuffle steps, which lead to more data to sort.
• Observed up to a 20% decrease in job runtime
Filter Operations
• Searching for a specific value
• Observed a 5% – 10% decrease in job runtime
• Dependent on type of filter and size of job output
40GB compressed data, base run is 86 min, test run is 68 min; Savings of 20%
Results from 7 Nodes; 56 cores; 433 GB RAM; 28 TB disk; MapR M5 3.0.2; DMX-h 7.12

Speed of Development - POC
We took an existing process that runs in our Hadoop cluster and converted
that to DMX-h to validate the new capabilities.
The existing process:
• Written in 75 lines of Pig with 3 Java UDFs
• Developed in about 25 hours
• Processes 3.5 billion input rows per day
• Takes 35 minutes to run on a daily basis

DMX-h Process

Speed of Development - POC
The new process in DMX-h:
• Developed a new job with 13 tasks
• No Java UDF required
• Runs on the same data and in the same environment.
• Developed in 12 hours.
• Runs in 11 minutes! About 1/3 of the time of the Pig & Java code.
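
The comparisons quoted on these slides can be rechecked with a few lines of arithmetic. The sketch below only restates numbers that appear in the deck; the helper name `percent_decrease` is hypothetical.

```python
def percent_decrease(before, after):
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Figures quoted on the slides.
sort_test = percent_decrease(86, 68)     # dev-cluster pluggable sort run, minutes
poc_runtime = percent_decrease(35, 11)   # Pig + Java vs. DMX-h daily run, minutes
poc_dev_time = percent_decrease(25, 12)  # development effort, hours

print("pluggable sort runtime: %.1f%% lower" % sort_test)
print("POC runtime:            %.1f%% lower" % poc_runtime)
print("POC development time:   %.1f%% lower" % poc_dev_time)
print("DMX-h runtime as a fraction of Pig runtime: %.2f" % (11 / 35))
```

The 11-minute run is about 0.31 of the 35-minute run, which matches the "1/3 of the time" claim, and 86 to 68 minutes is a reduction of just under 21%, consistent with the rounded 20% savings.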

Useful Factoids
Visit www.comscoredatamine.com or follow @datagems for the latest gems.
Colorful, bite-sized graphical representations of the best discoveries we unearth.

Thank You!
Michael Brown
CTO
comScore, Inc.
mbrown@comscore.com

© 2014 MapR Technologies
Today’s Presenters
Steve Wooledge
VP - Product Marketing
@swooledge
Jorge Lopez
Director - Product Marketing
@zanilli
Mike Brown
CTO

comScore


Leveraging MapR and Syncsort

Big Data is Overwhelming Traditional Systems
• Mission-critical reliability
• Transaction guarantees
• Deep security
• Real-time performance
• Backup and recovery
• Interactive SQL
• Rich analytics
• Workload management
• Data governance
• Backup and recovery
[Diagram, Trend 1: Enterprise Data Architecture, with outside sources and enterprise users connected to operational systems and analytical systems, each with its own production requirements]

Trend 2: Hadoop: The Disruptive Technology at the Core of Big Data
[Chart: job trends from Indeed.com, Jan '06 through Jan '14]

Reality 1: Hadoop Relieves the Pressure from Enterprise Systems
[Diagram: enterprise users, operational systems, analytical systems]
• Data staging
• Archive
• Data transformation
• Data exploration
• Streaming, interactions
Keys for Production Success
1. Reliability and DR
2. Interoperability
3. High performance
4. Supports operations and analytics

Reality 2: Architecture Matters for Success
FOUNDATION
• Data protection & security
• High performance
• Multi-tenancy
• Operational & Analytical Workloads
• Open standards for integration
New applications, SLAs, trusted information, lower TCO

The Power of the Open Source Community
[Diagram: Apache Hadoop and OSS ecosystem running on the MapR Data Platform, with a Management layer]
Headings shown: Execution Engines; Data Governance and Operations
Category labels shown: Batch; Streaming; NoSQL & Search; ML, Graph; SQL; Provisioning & coordination; Workflow & Data Governance; Data Integration & Access; Security
Components shown: YARN, Pig, Cascading, Spark, Spark Streaming, Storm*, HBase, Solr, Juju, Savannah*, Mahout, MLLib, GraphX, MapReduce v1 & v2, Tez*, Accumulo*, Hive, Impala, Shark, Drill*, Sentry*, Oozie, ZooKeeper, Sqoop, Knox*, Whirr, Falcon*, Flume, HttpFS, Hue
* Certification/support planned for 2014

MapR Distribution for Hadoop
Enterprise-grade
• High availability
• Data protection
• Disaster recovery
Interoperability
• Standard file access
• Standard database access
• Pluggable services
• Broad developer support
Security
• Enterprise security authorization
• Wire-level authentication
• Data governance
Operational
• Ability to support predictive analytics, real-time database operations, and high arrival rate data
Multi-tenancy
• Ability to logically divide a cluster to support different use cases, job types, user groups, and administrators
Performance
• 2X to 7X higher performance
• Consistent, low latency

MapR: Best Solution for Customer Success
• Top Ranked
• Exponential Growth
• 500+ Customers
• Premier Investors
• 3X bookings Q1 ‘13 – Q1 ‘14
• 80% of accounts expand 3X
• 90% software licenses
• <1% lifetime churn
• >$1B in incremental revenue generated by 1 customer

MapR and Syncsort Reference Architecture
Sources
• Relational, SaaS, Mainframe
• Documents, Emails
• Log Files, Clickstreams
• Blogs, Tweets, Link Data
MapR Distribution for Hadoop
• Batch (MR, Spark, Hive, Pig, …)
• Interactive (Impala, Drill, …)
• Streaming (Spark Streaming, Storm…)
MapR Data Platform
• MapR-DB
• MapR-FS
Targets
• Data Marts
• Data Warehouse
• Business Intelligence / Visualization

Do You Know Syncsort?
• Syncsort provides fast, secure, enterprise-grade software spanning “Big Iron to Big Data”
• Fastest sort technology in the market
• Powering 50% of mainframes’ sort
• A history of innovation
• 25+ issued & pending patents
• Large global customer base
• 12,000+ deployments in 80 countries and serving 87 of the Fortune 100
• First-to-market, fully integrated approach to Hadoop ETL
• Top 7 contributor to Hadoop, based on number of lines of code changed in 2013
Our customers are achieving the impossible, every day!
Key Partners

The Hadoop Challenge
Most organizations use Hadoop to…
Extract, Transform, Load (ETL)
• Collect
• Process: Sort, Join, Aggregate, Copy, Merge
• Distribute

Turning Hadoop into a Feature-rich ETL Solution
Collect
• Broad based connectivity with automated parallelism
• Best in class mainframe data access & translation
Process & Distribute
• No manual coding. GUI for developing & maintaining MR jobs
• No code generation. Engine runs natively on each node
• Develop & test locally in Windows; run natively on Hadoop
Optimize & Secure
• Faster throughput per node
• Full support for Kerberos & LDAP
• Web‐based monitoring console
• Sort‐work compression for storage savings
[Diagram: DMX-h ETL, with the stages Collect, Process & Distribute, and Optimize & Secure]

A Roadmap to Hadoop Success
• Agile Data Exploration & Visualization
• Next-gen Analytics
• Cheap Storage
• Offload Data Warehouse
Enabling The Data-driven Organization
Solving The Intractable IT Problem

MapR + Syncsort Solutions
• Data Warehouse Optimization: Shift ELT Workloads to Hadoop
• Mainframe Offload: Access, Translate & Analyze Mainframe Data with Hadoop
• Click-stream Analysis: Collect, Process & Analyze More Data from Your Website

Q&A
Engage with us!
1. Download the MapR Sandbox for Hadoop: www.mapr.com/sandbox
2. Try Syncsort’s Hadoop ETL in the MapR Sandbox: www.syncsort.com/mapr
3. Learn best practices for Hadoop ETL: www.mapr.com/EDH
