© comScore, Inc. Proprietary.
Syncsort & MapR @ comScore
Michael Brown, CTO | July 9th, 2014
The comScore Story
Analytics for a Digital World™
The Digital World is Complex
comScore’s Mission
Be the Leader in
Digital Media Analytics.
Measure all forms of
media—content and
advertising—at scale,
across all platforms, in
real-time, globally.
comScore Brings it Together
Tablet, PC/Mac, TV, Smartphone, Gaming
comScore is a leading internet technology company that
provides Analytics for a Digital World™
NASDAQ SCOR
Clients 2,400+ Worldwide
Employees 1,200+
Headquarters Reston, Virginia, USA
Global Coverage Measurement from 172 Countries; 44 Markets Reported
Local Presence 32 Locations in 23 Countries
Providing Analytics For More Than 2,400 Clients Globally
Media Agencies Telecom/Mobile Financial Retail Travel CPG Health Technology
Turning Big Data into Powerful Insight
Collection
• Census: Tags & Data Feeds
• Panels: PC, iOS, Android
• Survey: Non-behavioral elements
Calibration
• Methods: Aggregation, Dictionaries, Taxonomies
• Models: Weighting, Projection, De-Duplication, Attribution
Delivery
• Syndicated Data Platform: Media Metrix, vCE
• Consulting, Analysis
• Client Analytics Platform: Digital Analytix
Panel Heat Map
Average Records Captured per Day (2005-2009)
[Chart: average records captured per day, 9/26/2005 through 3/26/2009; y-axis from 0 to 1,800,000,000 records]
Unified Digital Measurement™ (UDM) Establishes Platform For
Panel + Census Data Integration
Unified Digital Measurement (UDM): Patent-Pending Methodology
• PANEL: Global PERSON Measurement
• CENSUS: Global DEVICE Measurement
Adopted by 90% of Top 100 U.S. Media Properties
Beacon Heat Map
Monthly Records Collection
[Chart: monthly record counts for Beacon Records and Panel Records; y-axis (# of records) from 0 to 2,000 billion]
Total records collected in June 2014 = 1,726,563,202,649
Total records collected YTD 2014 = 10,037,131,368,475
DMX @ comScore
DMX use at comScore
Purchased our first 4 licenses in 2000!
We use DMX from Syncsort across hundreds of servers for efficient data
processing and aggregation.
We currently run more than 100 unique jobs every day.
With these jobs we process over 150 billion rows of data through DMX!
Connect
Design
Process Accelerate
Compression w/Sorting
Compress log files when processing large volumes of log data
Several advantages to sorting data first:
• Reduces the size of the data
• Improves application performance
Example:
• 1 hour of one source of our data is 2,315 GB raw (2.9 billion rows)
• Standard compression of the time-ordered data gives 509 GB (22% of original)
• Standard compression of the sorted set gives 324 GB (14% of original)
When applied to all our sources we save:
• 5.0 TB per day
• 155 TB per month
• 460 TB per quarter
Hadoop @ comScore
Why Hadoop?
• comScore built our own distributed
computing stack in 2002.
• In 2009 we decided it was better to leverage
the efforts of the Hadoop community instead
of building our own stack.
• We recognized the benefit of switching to
Hadoop which would allow for seamless
scaling of our infrastructure to meet the
needs of the business.
• Hadoop allows us to add compute, storage
and memory linearly and to process data
at tremendous scale.
• Partnered with Syncsort on their Hadoop
efforts since October 2010
• Evaluated the beta of MapR in the fall of 2011
90 Days of Data
[Chart: 90 days of data by year, x-axis labeled 2009 through 2016; values shown: 1,148; 1,919; 3,049; 4,862; 5,084; y-axis from 0 to 6,000 trillion]
High Level Data Flow
Panel
Census
Custom Code +
ADW
EDW
Delivery
Our Cluster
Production Hadoop Cluster
• 400+ nodes: mix of Dell R720xd, R710 and R510 servers
• Each R720xd has 24 x 1.2 TB drives, 128 GB RAM and 32 cores
• 13,800+ total CPUs
• 31.6 TB total memory
• 8.2 PB total disk space
• Our distro is MapR M5 2.1.3
Leveraging Partitions from MapR
Validation Funnel & Target Effectiveness
Our growth
As our volume has grown, we now see the following:
• Over 683 billion events per month
• Daily aggregate of 1.8 billion records
• 160 billion aggregate records for 92 days
• 146K campaigns
• Over 50 countries
• 15 billion distinct cookies in a month
• We only need to output 26 million rows
Solution to reduce the shuffle
The Problem:
• Most aggregations within comScore cannot take advantage of combiners, leading to large shuffles and job performance issues
The Idea:
• Partition and sort the data by cookie on a daily basis
• Create a custom InputFormat to merge daily partitions for monthly aggregations
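The idea above can be sketched outside Hadoop. This is an illustration in plain Python, not the custom InputFormat itself: it assumes each daily partition is a list of `(cookie, count)` pairs already sorted by cookie (the record layout and the names are hypothetical), merges the days with a streaming k-way merge, and sums per cookie without any global re-sort, which is what lets the aggregation happen on the map side.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_daily_partitions(daily_partitions):
    """Merge per-day lists of (cookie, count), each sorted by cookie,
    and yield one (cookie, total) pair per cookie."""
    # heapq.merge streams the inputs in key order without loading them all.
    merged = heapq.merge(*daily_partitions, key=itemgetter(0))
    # Equal cookies are now adjacent, so a single pass can total them.
    for cookie, records in groupby(merged, key=itemgetter(0)):
        yield cookie, sum(count for _, count in records)

# Hypothetical data: the same cookie-range partition on three different days.
day1 = [("cookie_a", 3), ("cookie_c", 1)]
day2 = [("cookie_a", 2), ("cookie_b", 5)]
day3 = [("cookie_b", 1), ("cookie_c", 4)]

monthly = list(merge_daily_partitions([day1, day2, day3]))
print(monthly)
```

Because every day uses the same partitioning and sort order, all records for one cookie sit in the same partition number on each day, so one task can see them all.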
Custom Input Format with Map Side Aggregation
[Diagram: Map, Combiner and Reduce stages processing partitions A, B and C]
Risks for Partitioning
Data locality
• A custom InputFormat would require reading blocks of the partitioned data over the network
• This was solved using a feature of the MapR file system: we created volumes and set the chunk size to zero, which guarantees that the data written to a volume stays on one node
Map failures might result in long run times
• The size of the map inputs is no longer set by the block size
• This was solved by creating a large number (10K) of volumes to limit the size of the data processed by each mapper
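The slides do not say how cookies are assigned to the 10K volumes. One common approach, shown here purely as an assumption, is a stable hash of the cookie id modulo the number of volumes, so a cookie lands in the same volume every day and each mapper handles a bounded slice. The function name and cookie ids below are hypothetical.

```python
import zlib
from collections import Counter

NUM_VOLUMES = 10_000  # the "large number (10K) of volumes" from the slide

def volume_for(cookie_id: str, num_volumes: int = NUM_VOLUMES) -> int:
    """Map a cookie to a volume index with a hash that is stable across runs."""
    # zlib.crc32 is deterministic, unlike Python's built-in hash() for strings.
    return zlib.crc32(cookie_id.encode("utf-8")) % num_volumes

# Hypothetical cookie ids, only to show how the work spreads out.
cookies = [f"cookie-{i}" for i in range(200_000)]
load = Counter(volume_for(c) for c in cookies)

print("volumes used:", len(load))
print("largest volume:", max(load.values()), "cookies")
print("average per volume:", len(cookies) / NUM_VOLUMES)
```

More volumes mean smaller map inputs, so a failed map task has less work to repeat.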
Partitioning Summary
Benefits:
• A large portion of the aggregation can be completed in the map phase
• Applications can now take advantage of combiners
• Shuffle sizes are minimal
Results:
• Took a job from 35 hours to 3 hours with no hardware changes
DMX-h @ comScore
Reasons for comScore selecting DMX-h
Performance
• DMX-h as the pluggable sort in Hadoop allows us to increase throughput on
our existing platform; this reduces capital and ongoing operational
expenses
• The increase in throughput also allows us to deliver our data more quickly
to our customers, which makes the data more valuable to them.
Speed of Development
• The ability to quickly build out applications in the DMX-h GUI allows us to
iterate and respond more quickly to the needs of the business.
• The ease of development also allows us to democratize access to the
Hadoop platform through a point-and-click GUI.
Performance - DMX-h Pluggable Sort Testing Results
First comparison run on our dev cluster
Pig scripts called with the Syncsort plug-in
GroupBy / Distinct Operations
• Counting uniques
• These have large shuffle steps, which leads to more data to sort
• Observed up to a 20% decrease in job runtime
Filter Operations
• Searching for a specific value
• Observed a 5% – 10% decrease in job runtime
• Dependent on type of filter and size of job output
40 GB compressed data: base run is 86 min, test run is 68 min; savings of about 20%
Results from 7 nodes; 56 cores; 433 GB RAM; 28 TB disk; MapR M5 3.0.2; DMX-h 7.12
Speed of Development - POC
We took an existing process that runs in our Hadoop cluster and converted
that to DMX-h to validate the new capabilities.
The existing process:
• Written in 75 lines of Pig with 3 Java UDFs
• Developed in about 25 hours
• Processes 3.5 billion input rows per day
• Takes 35 minutes to run on a daily basis
DMX-h Process
Speed of Development - POC
The new process in DMX-h:
• Developed a new job with 13 tasks
• No Java UDF required
• Runs on the same data and in the same environment.
• Developed in 12 hours.
• Runs in 11 minutes, about 1/3 of the time of the Pig & Java code.
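As a quick check of the comparison, the figures quoted for the two versions of the process (35 vs 11 minutes of daily runtime, about 25 vs 12 hours of development) can be turned into ratios. The helper name `reduction` is just for this sketch.

```python
def reduction(before: float, after: float) -> float:
    """Fraction saved when a quantity drops from `before` to `after`."""
    return (before - after) / before

runtime_saved = reduction(35, 11)    # daily runtime in minutes: Pig + Java vs DMX-h
dev_time_saved = reduction(25, 12)   # development effort in hours

print(f"runtime: {11 / 35:.2f} of the original, {runtime_saved:.0%} saved")
print(f"development: {12 / 25:.2f} of the original, {dev_time_saved:.0%} saved")
```

The runtime ratio of 11/35 is slightly under one third, which matches the "1/3 of the time" claim.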
Useful Factoids
Visit www.comscoredatamine.com or follow @datagems for the latest gems.
Colorful, bite-sized graphical representations of the best discoveries we unearth.
Thank You!
Michael Brown
CTO
comScore, Inc.
mbrown@comscore.com
© 2014 MapR Technologies
Today’s Presenters
Steve Wooledge
VP - Product Marketing
@swooledge
Jorge Lopez
Director - Product Marketing
@zanilli
Mike Brown
CTO
comScore
Leveraging MapR and Syncsort
Trend 1: Big Data is Overwhelming Traditional Systems
Enterprise Data Architecture: enterprise users, operational systems, analytical systems, outside sources
Production requirements
• Mission-critical reliability
• Transaction guarantees
• Deep security
• Real-time performance
• Backup and recovery
Production requirements
• Interactive SQL
• Rich analytics
• Workload management
• Data governance
• Backup and recovery
Trend 2: Hadoop: The Disruptive Technology at the Core of Big Data
[Chart: job trends from Indeed.com, Jan ‘06 through Jan ‘14]
Reality 1: Hadoop Relieves the Pressure from Enterprise Systems
[Diagram: enterprise users, operational systems, analytical systems]
• Data staging
• Archive
• Data transformation
• Data exploration
• Streaming, interactions
Keys for Production Success
1 Reliability and DR
2 Interoperability
3 High performance
4 Supports operations and analytics
Reality 2: Architecture Matters for Success
FOUNDATION
• Data protection & security
• High performance
• Multi-tenancy
• Operational & analytical workloads
• Open standards for integration
NEW APPLICATIONS, SLAs, TRUSTED INFORMATION, LOWER TCO
The Power of the Open Source Community
Management
MapR Data Platform
APACHE HADOOP AND OSS ECOSYSTEM
Security
YARN
Pig
Cascading
Spark
Batch
Spark
Streaming
Storm*
Streaming
HBase
Solr
NoSQL &
Search
Juju
Provisioning
&
coordination
Savannah*
Mahout
MLLib
ML, Graph
GraphX
MapReduce
v1 & v2
EXECUTION ENGINES DATA GOVERNANCE AND OPERATIONS
Workflow
& Data
Governance
Tez*
Accumulo*
Hive
Impala
Shark
Drill*
SQL
Sentry* Oozie ZooKeeper Sqoop
Knox* Whirr Falcon* Flume
Data
Integration
& Access
HttpFS
Hue
* Certification/support planned for 2014
MapR Distribution for Hadoop
[Diagram: the Apache Hadoop and OSS ecosystem on the MapR Data Platform, as on the previous slide; * = certification/support planned for 2014]
Enterprise-grade
• High availability
• Data protection
• Disaster recovery
Interoperability
• Standard file access
• Standard database access
• Pluggable services
• Broad developer support
Security
• Enterprise security authorization
• Wire-level authentication
• Data governance
Operational
• Ability to support predictive analytics, real-time database operations, and high arrival rate data
Multi-tenancy
• Ability to logically divide a cluster to support different use cases, job types, user groups, and administrators
Performance
• 2X to 7X higher performance
• Consistent, low latency
MapR: Best Solution for Customer Success
Top Ranked
Exponential
Growth
500+
Customers
Premier
Investors
3X bookings Q1 ‘13 – Q1 ‘14
80% of accounts expand 3X
90% software licenses
<1% lifetime churn
>$1B in incremental revenue generated by 1 customer
MapR and Syncsort Reference Architecture
Sources
RELATIONAL,
SAAS,
MAINFRAME
DOCUMENTS,
EMAILS
LOG FILES,
CLICKSTREAMS
BLOGS,
TWEETS,
LINK DATA
DATA MARTS DATA WAREHOUSE
MapR Data Platform
Business
Intelligence /
Visualization
MapR-DB MapR-FS
Batch
(MR, Spark, Hive, Pig,
…)
Interactive
(Impala, Drill, …)
Streaming
(Spark Streaming,
Storm…)
MAPR DISTRIBUTION FOR HADOOP
Do You Know Syncsort?
• Syncsort provides fast, secure, enterprise-grade software spanning “Big Iron to Big Data”
• Fastest sort technology in the market
• Powering 50% of mainframes’ sort
• A history of innovation
• 25+ issued & pending patents
• Large global customer base
• 12,000+ deployments in 80 countries, serving 87 of the Fortune 100
• First-to-market, fully integrated approach to Hadoop ETL
• A top-7 contributor to Hadoop, based on number of lines of code changed in 2013
Our customers are achieving the impossible, every day!
Key Partners
The Hadoop Challenge
Most organizations use Hadoop to…
• COLLECT (Extract)
• PROCESS (Transform): Sort, Join, Aggregate, Copy, Merge
• DISTRIBUTE (Load)
Turning Hadoop into a Feature-rich ETL Solution
Collect
• Broad-based connectivity with automated parallelism
• Best-in-class mainframe data access & translation
Process & Distribute
• No manual coding. GUI for developing & maintaining MR jobs
• No code generation. Engine runs natively on each node
• Develop & test locally in Windows; run natively on Hadoop
Optimize & Secure
• Faster throughput per node
• Full support for Kerberos & LDAP
• Web-based monitoring console
• Sort-work compression for storage savings
A Roadmap to Hadoop Success
Agile Data Exploration & Visualization
Next-gen Analytics
Cheap Storage
Offload Data Warehouse
Enabling The Data-driven Organization
Solving The Intractable IT Problem
MapR + Syncsort Solutions
• Data Warehouse Optimization: Shift ELT Workloads to Hadoop
• Mainframe Offload: Access, Translate & Analyze Mainframe Data with Hadoop
• Click-stream Analysis: Collect, Process & Analyze More Data from Your Website
Q&A
Engage with us!
1. Download the MapR Sandbox for Hadoop: www.mapr.com/sandbox
2. Try Syncsort’s Hadoop ETL in the MapR Sandbox: www.syncsort.com/mapr
3. Learn best practices for Hadoop ETL: www.mapr.com/EDH

Weitere ähnliche Inhalte

Was ist angesagt?

Key trends in Big Data and new reference architecture from Hewlett Packard En...
Key trends in Big Data and new reference architecture from Hewlett Packard En...Key trends in Big Data and new reference architecture from Hewlett Packard En...
Key trends in Big Data and new reference architecture from Hewlett Packard En...Ontico
 
Spark meetup - Zoomdata Streaming
Spark meetup  - Zoomdata StreamingSpark meetup  - Zoomdata Streaming
Spark meetup - Zoomdata StreamingZoomdata
 
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...MSAdvAnalytics
 
Hadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapRHadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapRData Con LA
 
Preventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive IndustryPreventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive IndustryDataWorks Summit/Hadoop Summit
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoopgregchanan
 
Achieving cloud scale with microservices based applications on azure
Achieving cloud scale with microservices based applications on azureAchieving cloud scale with microservices based applications on azure
Achieving cloud scale with microservices based applications on azureUtkarsh Pandey
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...DataWorks Summit
 
Hadoop Ecosystem at a Glance
Hadoop Ecosystem at a GlanceHadoop Ecosystem at a Glance
Hadoop Ecosystem at a GlanceNeev Technologies
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Big Data Case Study: Fortune 100 Telco
Big Data Case Study: Fortune 100 TelcoBig Data Case Study: Fortune 100 Telco
Big Data Case Study: Fortune 100 TelcoBlueData, Inc.
 
Seamless, Real-Time Data Integration with Connect
Seamless, Real-Time Data Integration with ConnectSeamless, Real-Time Data Integration with Connect
Seamless, Real-Time Data Integration with ConnectPrecisely
 
Splice Machine Overview
Splice Machine OverviewSplice Machine Overview
Splice Machine OverviewKunal Gupta
 
Scaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedInScaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedInDataWorks Summit
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoophadooparchbook
 

Was ist angesagt? (20)

Key trends in Big Data and new reference architecture from Hewlett Packard En...
Key trends in Big Data and new reference architecture from Hewlett Packard En...Key trends in Big Data and new reference architecture from Hewlett Packard En...
Key trends in Big Data and new reference architecture from Hewlett Packard En...
 
Spark meetup - Zoomdata Streaming
Spark meetup  - Zoomdata StreamingSpark meetup  - Zoomdata Streaming
Spark meetup - Zoomdata Streaming
 
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa...
 
Hadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapRHadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapR
 
Preventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive IndustryPreventative Maintenance of Robots in Automotive Industry
Preventative Maintenance of Robots in Automotive Industry
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
 
Achieving cloud scale with microservices based applications on azure
Achieving cloud scale with microservices based applications on azureAchieving cloud scale with microservices based applications on azure
Achieving cloud scale with microservices based applications on azure
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...
 
Hadoop Ecosystem at a Glance
Hadoop Ecosystem at a GlanceHadoop Ecosystem at a Glance
Hadoop Ecosystem at a Glance
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Big Data Case Study: Fortune 100 Telco
Big Data Case Study: Fortune 100 TelcoBig Data Case Study: Fortune 100 Telco
Big Data Case Study: Fortune 100 Telco
 
Seamless, Real-Time Data Integration with Connect
Seamless, Real-Time Data Integration with ConnectSeamless, Real-Time Data Integration with Connect
Seamless, Real-Time Data Integration with Connect
 
Splice Machine Overview
Splice Machine OverviewSplice Machine Overview
Splice Machine Overview
 
Scaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedInScaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedIn
 
Big Data at your Desk with KNIME
Big Data at your Desk with KNIMEBig Data at your Desk with KNIME
Big Data at your Desk with KNIME
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Big Data Ready Enterprise
Big Data Ready Enterprise Big Data Ready Enterprise
Big Data Ready Enterprise
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
 

Ähnlich wie How to Suceed in Hadoop

Syncsort & comScore Big Data Warehouse Meetup Sept 2013
Syncsort & comScore Big Data Warehouse Meetup Sept 2013Syncsort & comScore Big Data Warehouse Meetup Sept 2013
Syncsort & comScore Big Data Warehouse Meetup Sept 2013Steven Totman
 
BigData @ comScore
BigData @ comScoreBigData @ comScore
BigData @ comScoreeaiti
 
Using Hadoop
Using HadoopUsing Hadoop
Using Hadoopeaiti
 
November 2013 HUG: Real-time analytics with in-memory grid
November 2013 HUG: Real-time analytics with in-memory gridNovember 2013 HUG: Real-time analytics with in-memory grid
November 2013 HUG: Real-time analytics with in-memory gridYahoo Developer Network
 
Control m customers using big data
Control m customers using big dataControl m customers using big data
Control m customers using big dataJuliette Smit
 
Initiative Based Technology Consulting Case Studies
Initiative Based Technology Consulting Case StudiesInitiative Based Technology Consulting Case Studies
Initiative Based Technology Consulting Case Studieschanderdw
 
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflows
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflowsCloud nativecomputingtechnologysupportinghpc cognitiveworkflows
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflowsYong Feng
 
Using Mainframe Data in the Cloud: Design Once, Deploy Anywhere in a Hybrid W...
Using Mainframe Data in the Cloud: Design Once, Deploy Anywhere in a Hybrid W...Using Mainframe Data in the Cloud: Design Once, Deploy Anywhere in a Hybrid W...
Using Mainframe Data in the Cloud: Design Once, Deploy Anywhere in a Hybrid W...Precisely
 
IMS01 IMS Keynote
IMS01   IMS KeynoteIMS01   IMS Keynote
IMS01 IMS KeynoteRobert Hain
 
Mainframe Optimization in 2017
Mainframe Optimization in 2017Mainframe Optimization in 2017
Mainframe Optimization in 2017Precisely
 
Solving enterprise challenges through scale out storage &amp; big compute final
Solving enterprise challenges through scale out storage &amp; big compute finalSolving enterprise challenges through scale out storage &amp; big compute final
Solving enterprise challenges through scale out storage &amp; big compute finalAvere Systems
 
RightScale Roadtrip Boston: Accelerate to Cloud
RightScale Roadtrip Boston: Accelerate to CloudRightScale Roadtrip Boston: Accelerate to Cloud
RightScale Roadtrip Boston: Accelerate to CloudRightScale
 
Real-time analysis using an in-memory data grid - Cloud Expo 2013
Real-time analysis using an in-memory data grid - Cloud Expo 2013Real-time analysis using an in-memory data grid - Cloud Expo 2013
Real-time analysis using an in-memory data grid - Cloud Expo 2013ScaleOut Software
 
Data & Analytics - Session 1 - Big Data Analytics
Data & Analytics - Session 1 -  Big Data AnalyticsData & Analytics - Session 1 -  Big Data Analytics
Data & Analytics - Session 1 - Big Data AnalyticsAmazon Web Services
 
Mainframe Optimization in 2017
Mainframe Optimization in 2017Mainframe Optimization in 2017
Mainframe Optimization in 2017Precisely
 
From Disaster to Recovery: Preparing Your IT for the Unexpected
From Disaster to Recovery: Preparing Your IT for the UnexpectedFrom Disaster to Recovery: Preparing Your IT for the Unexpected
From Disaster to Recovery: Preparing Your IT for the UnexpectedDataCore Software
 
MongoDB World 2018: Breaking the Mold - Redesigning Dell's E-Commerce Platform
MongoDB World 2018: Breaking the Mold - Redesigning Dell's E-Commerce PlatformMongoDB World 2018: Breaking the Mold - Redesigning Dell's E-Commerce Platform
MongoDB World 2018: Breaking the Mold - Redesigning Dell's E-Commerce PlatformMongoDB
 
Learn the new rules of cloud storage
Learn the new rules of cloud storageLearn the new rules of cloud storage
Learn the new rules of cloud storageBuurst
 
FInal Project - USMx CC605x Cloud Computing for Enterprises - Hugo Aquino
FInal Project - USMx CC605x Cloud Computing for Enterprises - Hugo AquinoFInal Project - USMx CC605x Cloud Computing for Enterprises - Hugo Aquino
FInal Project - USMx CC605x Cloud Computing for Enterprises - Hugo AquinoHugo Aquino
 
Going Remote: Running VFX Virtual Workstations
Going Remote: Running VFX Virtual WorkstationsGoing Remote: Running VFX Virtual Workstations
Going Remote: Running VFX Virtual WorkstationsAmazon Web Services
 

Ähnlich wie How to Suceed in Hadoop (20)

Syncsort & comScore Big Data Warehouse Meetup Sept 2013
Syncsort & comScore Big Data Warehouse Meetup Sept 2013Syncsort & comScore Big Data Warehouse Meetup Sept 2013
Syncsort & comScore Big Data Warehouse Meetup Sept 2013
 
BigData @ comScore
BigData @ comScoreBigData @ comScore
BigData @ comScore
 
Using Hadoop
Using HadoopUsing Hadoop
Using Hadoop
 
November 2013 HUG: Real-time analytics with in-memory grid
November 2013 HUG: Real-time analytics with in-memory gridNovember 2013 HUG: Real-time analytics with in-memory grid
November 2013 HUG: Real-time analytics with in-memory grid
 
Control m customers using big data
Control m customers using big dataControl m customers using big data
Control m customers using big data
 
Initiative Based Technology Consulting Case Studies
Initiative Based Technology Consulting Case StudiesInitiative Based Technology Consulting Case Studies
Initiative Based Technology Consulting Case Studies
 
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflows
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflowsCloud nativecomputingtechnologysupportinghpc cognitiveworkflows
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflows
 
Using Mainframe Data in the Cloud: Design Once, Deploy Anywhere in a Hybrid W...
Using Mainframe Data in the Cloud: Design Once, Deploy Anywhere in a Hybrid W...Using Mainframe Data in the Cloud: Design Once, Deploy Anywhere in a Hybrid W...
Using Mainframe Data in the Cloud: Design Once, Deploy Anywhere in a Hybrid W...
 
IMS01 IMS Keynote
IMS01   IMS KeynoteIMS01   IMS Keynote
IMS01 IMS Keynote
 
Mainframe Optimization in 2017
Mainframe Optimization in 2017Mainframe Optimization in 2017
Mainframe Optimization in 2017
 
Solving enterprise challenges through scale out storage &amp; big compute final
Solving enterprise challenges through scale out storage &amp; big compute finalSolving enterprise challenges through scale out storage &amp; big compute final
Solving enterprise challenges through scale out storage &amp; big compute final
 
RightScale Roadtrip Boston: Accelerate to Cloud
RightScale Roadtrip Boston: Accelerate to CloudRightScale Roadtrip Boston: Accelerate to Cloud
RightScale Roadtrip Boston: Accelerate to Cloud
 

How to Succeed in Hadoop

  • 1. © comScore, Inc. Proprietary. Syncsort & MapR @ comScore Michael Brown, CTO | July 9th, 2014
  • 2. The comScore Story: Analytics for a Digital World™
  • 3. The Digital World is Complex
  • 4. comScore's Mission: Be the leader in digital media analytics. Measure all forms of media (content and advertising) at scale, across all platforms, in real time, globally.
  • 5. comScore Brings it Together: Tablet, PC/Mac, TV, Smartphone, Gaming
  • 6. comScore is a leading internet technology company that provides Analytics for a Digital World™. NASDAQ: SCOR. Clients: 2,400+ worldwide. Employees: 1,200+. Headquarters: Reston, Virginia, USA. Global coverage: measurement from 172 countries; 44 markets reported. Local presence: 32 locations in 23 countries.
  • 7. Providing Analytics for More Than 2,400 Clients Globally: Media, Agencies, Telecom/Mobile, Financial, Retail, Travel, CPG, Health, Technology
  • 8. Turning Big Data into Powerful Insight: Collection, Calibration, Delivery. Census (tags & data feeds); Panels (PC, iOS, Android); Survey (non-behavioral elements); Methods (aggregation, dictionaries, taxonomies); Models (weighting, projection, de-duplication, attribution); Syndicated Data Platform (Media Metrix, vCE); Consulting (analysis); Client Analytics Platform (Digital Analytix)
  • 10. Panel Heat Map
  • 11. Average Records Captured per Day (2005-2009) [chart: y-axis 0 to 1,800,000,000 records; x-axis 9/26/2005 to 3/26/2009]
  • 12. Unified Digital Measurement™ (UDM) Establishes Platform for Panel + Census Data Integration. Patent-pending methodology. Adopted by 90% of Top 100 U.S. media properties. Panel: global PERSON measurement. Census: global DEVICE measurement.
  • 13. Beacon Heat Map
  • 14. Monthly Records Collection [chart: beacon records and panel records per month; y-axis 0 to 2,000 billion records]. Total records collected in June 2014 = 1,726,563,202,649. Total records collected YTD 2014 = 10,037,131,368,475.
  • 15. DMX @ comScore
  • 16. DMX use at comScore: Purchased our first 4 licenses in 2000! We use DMX from Syncsort across hundreds of servers for efficient data processing and aggregation. We currently run more than 100 unique jobs every day. With these jobs we process over 150 billion rows of data through DMX! Connect, Design, Process, Accelerate.
  • 17. Compression w/ Sorting: Compress log files when processing large volumes of log data. Several advantages to sorting data first: reduces the size of the data; improves application performance. Examples: 1 hour of one source of our data is 2,315 GB raw (2.9 billion rows); standard compression of time-ordered data is 509 GB (22% of original); standard compression on a sorted set is 324 GB (14% of original). When applied to all our sources we save 5.0 TB per day, 155 TB per month, 460 TB per quarter.
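
The effect described on slide 17 can be illustrated on a toy scale. The sketch below is not comScore's pipeline: it assumes a synthetic log of hypothetical `cookie`/`page` rows and uses Python's standard `zlib` as a stand-in for "standard compression". It also recomputes the percentages quoted on the slide from the reported sizes.

```python
import random
import zlib


def compressed_size(rows):
    """Return the zlib-compressed size in bytes of newline-joined rows."""
    return len(zlib.compress("\n".join(rows).encode("utf-8"), 6))


def percent_of_original(compressed_gb, raw_gb):
    """Compressed size as a percentage of the raw size."""
    return 100.0 * compressed_gb / raw_gb


# Synthetic stand-in for a log (assumption): rows arrive in arbitrary
# "time" order, keyed by a hypothetical cookie id and page path.
rng = random.Random(42)
time_ordered = [
    f"cookie{rng.randrange(1000):04d}\t/page/{rng.randrange(50)}"
    for _ in range(20000)
]
sorted_rows = sorted(time_ordered)

size_time_ordered = compressed_size(time_ordered)
size_sorted = compressed_size(sorted_rows)
print("time-ordered:", size_time_ordered, "bytes; sorted:", size_sorted, "bytes")

# Percentages from the sizes reported on the slide (2,315 GB raw).
print(round(percent_of_original(509, 2315)))  # 22
print(round(percent_of_original(324, 2315)))  # 14
```

Sorting places similar rows next to each other, so the compressor finds longer and closer repeats; the exact gain on real data depends on the sort key and the codec.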
  • 18. Hadoop @ comScore
  • 19. Why Hadoop? comScore built its own distributed computing stack in 2002. In 2009 we decided it was better to leverage the efforts of the Hadoop community instead of building our own stack. We recognized the benefit of switching to Hadoop, which would allow seamless scaling of our infrastructure to meet the needs of the business. Hadoop allows us to add compute, storage and memory linearly and to process data at tremendous scale. Partnered with Syncsort on their Hadoop efforts from Oct 2010. Evaluated the beta of MapR in the fall of 2011.
  • 20. 90 Days of Data [chart by year; x-axis labels 2009, 2010, 2011, 2012, 2013, 2014, 2016; y-axis 0 to 6,000 trillion; data labels 1,148, 1,919, 3,049, 4,862, 5,084]
  • 21. High Level Data Flow: Panel Census Custom Code + ADW EDW Delivery
  • 22. Our Cluster: Production Hadoop cluster. 400+ nodes: mix of Dell R720xd, R710 and R510 servers. Each R720xd has 24 x 1.2 TB drives, 128 GB RAM, 32 cores. 13,800+ total CPUs. 31.6 TB total memory. 8.2 PB total disk space. Our distro is MapR M5 2.1.3.
  • 23. Leveraging Partitions from MapR
  • 25. Validation Funnel & Target Effectiveness
  • 26. Our growth: As our volume has grown we have the following stats: over 683 billion events per month; daily aggregate 1.8 billion; 160 billion aggregate records for 92 days; 146K campaigns; over 50 countries; we see 15 billion distinct cookies in a month; we only need to output 26 million rows.
  • 27. Solution to reduce the shuffle. The Problem: most aggregations within comScore cannot take advantage of combiners, leading to large shuffles and job performance issues. The Idea: partition and sort the data by cookie on a daily basis; create a custom InputFormat to merge daily partitions for monthly aggregations.
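
The idea on slide 27 (daily partitions sorted by cookie, merged for a monthly aggregation) can be sketched outside Hadoop. This is a minimal Python illustration, not the custom InputFormat itself: the `(cookie, count)` row shape and the function name `merge_daily_partitions` are assumptions made for the example.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_daily_partitions(daily_partitions):
    """Yield (cookie, total) from daily inputs that are each sorted by cookie.

    Because every input is already sorted, a streaming k-way merge brings
    all rows for one cookie together without a global re-sort or shuffle.
    """
    merged = heapq.merge(*daily_partitions, key=itemgetter(0))
    for cookie, rows in groupby(merged, key=itemgetter(0)):
        yield cookie, sum(count for _, count in rows)


# Three hypothetical days of the same partition, each sorted by cookie.
day1 = [("a", 2), ("b", 1), ("d", 4)]
day2 = [("a", 1), ("c", 5)]
day3 = [("b", 3), ("d", 1)]

monthly = list(merge_daily_partitions([day1, day2, day3]))
print(monthly)  # [('a', 3), ('b', 4), ('c', 5), ('d', 5)]
```

Since each cookie's rows are adjacent after the merge, most of the aggregation can happen where the data is read, which is the property the slides rely on to shrink the shuffle.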
  • 28. Custom Input Format with Map Side Aggregation [diagram: mappers with combiners feeding reducers, with data partitioned by keys A, B, C]
  • 29. Risks for Partitioning. Data locality: a custom InputFormat requires reading blocks of the partitioned data over the network. This was solved using a feature of the MapR file system: we created volumes and set the chunk size to zero, which guarantees that the data written to a volume will stay on one node. Map failures might result in long run times: the size of the map inputs is no longer set by block size. This was solved by creating a large number (10K) of volumes to limit the size of data processed by each mapper.
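
The slides do not say how cookies are assigned to the 10K volumes. One common approach, assumed here purely for illustration, is a stable hash of the cookie id modulo the number of volumes, so that a cookie lands in the same partition every day. The sketch also works out the average load per volume from the monthly figures on slide 26.

```python
import zlib

NUM_VOLUMES = 10_000  # "a large number (10K) of volumes" from the slide


def volume_for(cookie, num_volumes=NUM_VOLUMES):
    """Map a cookie id to a volume index (hypothetical assignment scheme).

    zlib.crc32 is used instead of the built-in hash() because hash() of a
    str is salted per process, so it would not be stable from day to day.
    """
    return zlib.crc32(cookie.encode("utf-8")) % num_volumes


def average_per_volume(total, num_volumes=NUM_VOLUMES):
    """Average number of items each volume (and so each mapper) would hold."""
    return total / num_volumes


print(volume_for("example-cookie-id"))
print(average_per_volume(15_000_000_000))   # distinct cookies per volume
print(average_per_volume(160_000_000_000))  # 92-day aggregate records per volume
```

With 15 billion distinct cookies a month, 10K volumes works out to roughly 1.5 million cookies per volume on average, which bounds how much work is lost when a single map task fails.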
  • 30. Partitioning Summary. Benefits: a large portion of the aggregation can be completed in the map phase; applications can now take advantage of combiners; shuffle sizes are minimal. Results: took a job from 35 hours to 3 hours with no hardware changes.
  • 31. DMX-h @ comScore
  • 32. Reasons for comScore selecting DMX-h. Performance: DMX-h as the pluggable sort in Hadoop allows us to increase throughput on the existing platform; this reduces capital and ongoing operational expenses. The increase in throughput also allows us to deliver our data more quickly to our customers. These things make the data more valuable to our clients. Speed of Development: the ability to quickly build out applications in the DMX-h GUI allows us to iterate and respond more quickly to the needs of the business. The ease of development also allows us to democratize access to the Hadoop platform by leveraging a point-and-click GUI.
  • 33. Performance - DMX Pluggable Sort Testing Results. First comparison: Pig scripts run on our dev cluster and called with the Syncsort plug-in. GroupBy / Distinct operations: counting uniques; these have large shuffle steps, which leads to more data to sort; observed up to a 20% decrease in job runtime. Filter operations: searching for a specific value; observed a 5% to 10% decrease in job runtime; dependent on type of filter and size of job output. 40 GB compressed data: base run is 86 min, test run is 68 min; savings of 20%. Results from 7 nodes; 56 cores; 433 GB RAM; 28 TB disk; MapR M5 3.0.2; DMX-h 7.12.
  • 34. Speed of Development - POC: We took an existing process that runs in our Hadoop cluster and converted it to DMX-h to validate the new capabilities. The existing process: written in 75 lines of Pig with 3 Java UDFs; developed in about 25 hours; processes 3.5 billion input rows per day; takes 35 minutes to run on a daily basis.
  • 35. DMX-h Process
  • 36. Speed of Development - POC: The new process in DMX-h: developed a new job with 13 tasks; no Java UDF required; runs on the same data and in the same environment; developed in 12 hours; runs in 11 minutes, about 1/3 of the time of the Pig & Java code.
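
The before/after figures quoted on slides 30, 33 and 36 can be checked with a few lines of arithmetic. The helper names below are illustrative only; the inputs are the numbers reported in the slides.

```python
def reduction_pct(before, after):
    """Percentage decrease in runtime going from `before` to `after`."""
    return 100.0 * (before - after) / before


def speedup(before, after):
    """How many times faster the new run is."""
    return before / after


# Pluggable sort test: 86 min base run vs 68 min test run.
print(f"{reduction_pct(86, 68):.1f}% less runtime")    # 20.9%

# POC: 35 min (Pig + Java UDFs) vs 11 min (DMX-h).
print(f"{11 / 35:.2f} of the original runtime")         # 0.31, about 1/3
print(f"{reduction_pct(35, 11):.1f}% less runtime")    # 68.6%

# Partitioning: 35 hours vs 3 hours.
print(f"{speedup(35, 3):.1f}x faster")                  # 11.7x
```

The first result is consistent with the "savings of 20%" on slide 33, and the second with the "1/3 of the time" on slide 36.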
  • 37. Useful Factoids: Visit www.comscoredatamine.com or follow @datagems for the latest gems. Colorful, bite-sized graphical representations of the best discoveries we unearth.
  • 38. Thank You! Michael Brown, CTO, comScore, Inc. mbrown@comscore.com
  • 39. © 2014 MapR Technologies
  • 40. Today's Presenters: Steve Wooledge, VP - Product Marketing, @swooledge; Jorge Lopez, Director - Product Marketing, @zanilli; Mike Brown, CTO
  • 41. comScore
  • 42. Syncsort & MapR @ comScore: Michael Brown, CTO | July 9th, 2014
  • 43. Leveraging MapR and Syncsort
  • 44. Trend 1: Big Data is Overwhelming Traditional Systems. Enterprise data architecture: enterprise users, outside sources, operational systems, analytical systems. Production requirements (operational systems): mission-critical reliability; transaction guarantees; deep security; real-time performance; backup and recovery. Production requirements (analytical systems): interactive SQL; rich analytics; workload management; data governance; backup and recovery.
  • 45. Trend 2: Hadoop: The Disruptive Technology at the Core of Big Data [chart: job trends from Indeed.com, Jan '06 to Jan '14]
  • 46. Reality 1: Hadoop Relieves the Pressure from Enterprise Systems (operational systems, analytical systems, enterprise users): data staging; archive; data transformation; data exploration; streaming, interactions. Keys for production success: 1. Reliability and DR; 2. Interoperability; 3. High performance; 4. Supports operations and analytics.
  • 47. Reality 2: Architecture Matters for Success. Foundation: data protection & security; high performance; multi-tenancy; operational & analytical workloads; open standards for integration. New applications; SLAs; trusted information; lower TCO.
  • 48. The Power of the Open Source Community [diagram: MapR Data Platform with management, the Apache Hadoop and OSS ecosystem, execution engines, and data governance and operations]. Categories shown: Security; Batch; Streaming; NoSQL & Search; Provisioning & coordination; ML, Graph; SQL; Workflow & Data Governance; Data Integration & Access. Components shown: YARN, Pig, Cascading, Spark, Spark Streaming, Storm*, HBase, Solr, Juju, Savannah*, Mahout, MLLib, GraphX, MapReduce v1 & v2, Tez*, Accumulo*, Hive, Impala, Shark, Drill*, Sentry*, Oozie, ZooKeeper, Sqoop, Knox*, Whirr, Falcon*, Flume, HttpFS, Hue. (* Certification/support planned for 2014)
  • 49. MapR Distribution for Hadoop [same ecosystem diagram as slide 48]. Enterprise-grade: high availability; data protection; disaster recovery. Interoperability: standard file access; standard database access; pluggable services; broad developer support. Security: enterprise security authorization; wire-level authentication; data governance. Operational: ability to support predictive analytics, real-time database operations, and high arrival rate data. Multi-tenancy: ability to logically divide a cluster to support different use cases, job types, user groups, and administrators. Performance: 2X to 7X higher performance; consistent, low latency.
  • 50. MapR: Best Solution for Customer Success. Top Ranked; Exponential Growth; 500+ Customers; Premier Investors. 3X bookings, Q1 ’13 – Q1 ’14; 80% of accounts expand 3X; 90% software licenses; <1% lifetime churn; >$1B in incremental revenue generated by 1 customer.
  • 51. MapR and Syncsort Reference Architecture. Sources: relational, SaaS, mainframe; documents, emails; log files, clickstreams; blogs, tweets, link data. MapR Distribution for Hadoop: Batch (MR, Spark, Hive, Pig, …); Interactive (Impala, Drill, …); Streaming (Spark Streaming, Storm, …). MapR Data Platform: MapR-DB, MapR-FS. Also shown: data marts; data warehouse; business intelligence / visualization.
  • 52. Do You Know Syncsort? • Syncsort provides fast, secure, enterprise-grade software spanning “Big Iron to Big Data” • Fastest sort technology in the market • Powers 50% of mainframes’ sort • A history of innovation • 25+ issued & pending patents • Large global customer base • 12,000+ deployments in 80 countries, serving 87 of the Fortune 100 • First-to-market, fully integrated approach to Hadoop ETL • Top 7 contributor to Hadoop, based on number of lines of code changed in 2013. Our customers are achieving the impossible, every day! Key Partners
  • 53. The Hadoop Challenge. Most organizations use Hadoop to… Extract, Transform, Load (ETL). Stages: Collect; Process (Sort, Join, Aggregate, Copy, Merge); Distribute.
  • 54. Turning Hadoop into a Feature-rich ETL Solution (DMX-h ETL). Collect: • Broad-based connectivity with automated parallelism • Best-in-class mainframe data access & translation. Process & Distribute: • No manual coding. GUI for developing & maintaining MR jobs • No code generation. Engine runs natively on each node • Develop & test locally in Windows; run natively on Hadoop. Optimize & Secure: • Faster throughput per node • Full support for Kerberos & LDAP • Web-based monitoring console • Sort-work compression for storage savings
  • 55. A Roadmap to Hadoop Success. Agile Data Exploration & Visualization; Next-gen Analytics; Cheap Storage; Offload Data Warehouse. Themes: Enabling the Data-driven Organization; Solving the Intractable IT Problem.
  • 56. MapR + Syncsort Solutions. Data Warehouse Optimization: shift ELT workloads to Hadoop. Mainframe Offload: access, translate & analyze mainframe data with Hadoop. Click-stream Analysis: collect, process & analyze more data from your website.
  • 57. Q&A. Engage with us! 1. Download the MapR Sandbox for Hadoop: www.mapr.com/sandbox 2. Try Syncsort’s Hadoop ETL in the MapR Sandbox: www.syncsort.com/mapr 3. Learn best practices for Hadoop ETL: www.mapr.com/EDH
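
The "Process" stage on slide 53 names sort, join, and aggregate as the core ETL operations. As a rough illustration only, the sketch below runs those three steps in plain Python on a handful of in-memory click-stream records. It is not how DMX-h or MapReduce executes them; the record layout, the `CLICKS` and `USERS` sample data, and every function name are hypothetical and chosen just for this example.

```python
from collections import defaultdict
from operator import itemgetter

# Hypothetical click-stream records: (user_id, url, seconds_on_page)
CLICKS = [
    ("u2", "/home", 12),
    ("u1", "/pricing", 30),
    ("u1", "/home", 5),
    ("u3", "/home", 8),
]

# Hypothetical reference data to join against: user_id -> country
USERS = {"u1": "US", "u2": "DE", "u3": "US"}


def sort_records(records):
    """Sort: order records by user_id, then url."""
    return sorted(records, key=itemgetter(0, 1))


def join_country(records, users):
    """Join: attach each user's country; unmatched users get 'unknown'."""
    return [
        (uid, url, secs, users.get(uid, "unknown"))
        for uid, url, secs in records
    ]


def aggregate_by_country(joined):
    """Aggregate: total seconds on page per country."""
    totals = defaultdict(int)
    for _uid, _url, secs, country in joined:
        totals[country] += secs
    return dict(totals)


def run_pipeline(clicks, users):
    """Chain the three steps: sort, then join, then aggregate."""
    return aggregate_by_country(join_country(sort_records(clicks), users))


if __name__ == "__main__":
    print(run_pipeline(CLICKS, USERS))
```

On a cluster the same three operations run in parallel across nodes, with the sort and join typically happening during the shuffle between map and reduce; the single-process version here only shows the logical order of the steps.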