3. About Conduit
Over 250 million active end users
More than 260,000 publishers
Over 3 billion monthly user interactions
Deployed in 120 countries
Founded in 2005
Acquired Wibiya in 2011
6. Tip #1
Don't buy the hype of 'big data' and throw millions of dollars away, but don't stand still.
7. Tip #1
Select one well-defined use case
A small, super-smart team
Experiment in the cloud
Quantify the effort and value for your organization
‘fail faster while failing forward’
10. Conduit’s Data Platform in Numbers
• Hardware:
125 Nodes (+70 after DR) on 6 racks
500 TB used / 1.2 PB total
• Daily processed data:
50,000 files
500,000,000 records
700 GB
• Daily jobs submitted: Over 5,000
• Data freshness: 60 minutes
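
A quick back-of-envelope check on those numbers (derived only from the figures above; the 64/128 MB HDFS block sizes mentioned in the comments are the usual defaults, not a stated detail): the average record is small and the average file sits well below one HDFS block, so file counts and job counts matter as much as raw volume.

# Derived from the daily figures above.
files_per_day = 50000
records_per_day = 500000000
bytes_per_day = 700 * 1024**3                          # 700 GB

avg_record_bytes = bytes_per_day / records_per_day     # ~1.5 KB per record
avg_file_mb = bytes_per_day / files_per_day / 2**20    # ~14 MB per file, far below
                                                       # a 64/128 MB HDFS block
jobs_per_hour = 5000 / 24                              # ~200 jobs per hour

print("avg record: %.0f B, avg file: %.0f MB, jobs/hour: %.0f"
      % (avg_record_bytes, avg_file_mb, jobs_per_hour))
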
11. Tip #2
Data is turning challenges into business opportunities.
12. Use Cases
[Bar chart: survey of Hadoop use cases, each cited by roughly 8%–19% of respondents: mine data for business intelligence, reduce cost of data analysis, log analysis, ETL, improve scientific research, include more semi-structured/unstructured info into decision making, customer intelligence for more targeted marketing, analyze complete rather than partial data sets, other.]
14. But…
Hadoop in the Enterprise ecosystem – a lot of the features enterprises need or want take a back seat
Hadoop is NOT cheap (hardware & operations cost) – make sure the company's decision makers are on board
Hadoop is still rough around the edges – tooling may not be as mature as enterprises are used to
Data access is batch oriented
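
To make "batch oriented" concrete: in this model even a simple question is answered by a scheduled MapReduce job over a day's worth of files rather than a low-latency query. Below is a minimal sketch using the open source mrjob library (an illustration of the pattern, not Conduit's actual code); the tab-separated log layout is an assumption.

from mrjob.job import MRJob

class DailyEventsPerCountry(MRJob):
    # Counts a day's events per country from tab-separated log lines.
    # Assumed layout: timestamp <TAB> user_id <TAB> country <TAB> event.

    def mapper(self, _, line):
        parts = line.split("\t")
        if len(parts) >= 3:              # tolerate short or dirty lines
            yield parts[2], 1

    def reducer(self, country, counts):
        yield country, sum(counts)

if __name__ == "__main__":
    # Typically launched by a scheduler against an HDFS path, e.g.:
    #   python daily_events.py -r hadoop hdfs:///logs/2013-06-01/
    DailyEventsPerCountry.run()
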
16. Tip #3
Nurture your ‘big brains’
Hadoop is cutting-edge technology – investment in related skills and training is crucial
Good data scientists are "unicorns"
Embrace the open source culture – it will pay off
The BI team is essential for connecting the dots
17. Data Roles @ Conduit
[Org chart: central Data Infra, Data BI, and Data Science teams, with a BI analyst and a data scientist aligned to each product area (Mobile, Wibiya, Quick Launch, Toolbar, and other products).]
19. Tip #4
Complex decision making is time consuming, so you can't react in real time
Real time is expensive!
Tailor the right solution to accommodate the required data freshness
Focus on big things!
20. Data Maturity vs. Freshness @Conduit
[Chart: data freshness in minutes (0, 10, 60) vs. data maturity (Low, Medium, High – structured, cleansed & complete). Plotted items: real-time monitoring, Hue/Hive, reporting service, advanced analytics models, and the business objective, spread across the Kafka, Hadoop, and DWH platforms.]
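
Purely as an illustration of "tailor the solution to the required data freshness", here is a toy routing rule read off this chart; the thresholds and the structured-vs-unstructured split are assumptions, not Conduit's actual logic.

def serving_tier(max_staleness_minutes, needs_structured_data):
    # Pick a platform along the maturity-vs-freshness trade-off (illustrative).
    if max_staleness_minutes < 10:
        return "Kafka"     # near-real-time monitoring on low-maturity data
    if needs_structured_data:
        return "DWH"       # structured, cleansed & complete, but least fresh
    return "Hadoop"        # Hue/Hive and reporting at roughly 10-60 minute freshness

assert serving_tier(0, False) == "Kafka"
assert serving_tier(60, True) == "DWH"
assert serving_tier(30, False) == "Hadoop"
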
22. Tip #5
Data will be dirty, schema-less, and without foreign keys
And yet, we are standing on a mountain of gold!
Do your best cleaning the data, but know when to shift to analysis
Tune your algorithms to tolerate data deficiencies, then hunt for insights
Big data is not a data warehouse
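
A minimal sketch of what "tolerating data deficiencies" can look like in code (the field names and defaults are made up for illustration): parse every raw record defensively, fall back to defaults, and count bad rows instead of crashing on them.

import json

def parse_record(line):
    # Return a cleaned dict, or None if the record is unusable.
    try:
        raw = json.loads(line)
    except ValueError:
        return None                                  # malformed line: skip, don't crash
    user = raw.get("user_id")                        # no schema guarantees the field exists
    if not user:
        return None
    try:
        clicks = int(raw.get("clicks") or 0)
    except (TypeError, ValueError):
        clicks = 0                                   # coerce dirty numerics to a default
    return {
        "user_id": str(user),
        "country": raw.get("country", "unknown"),    # default instead of failing
        "clicks": clicks,
    }

def clean_stream(lines):
    good, bad = [], 0
    for line in lines:
        rec = parse_record(line)
        if rec is None:
            bad += 1                                 # track deficiencies, keep going
        else:
            good.append(rec)
    return good, bad
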
28. Tip #6
Break down the barriers that prevent your users and applications from using their valuable data more effectively to glean meaningful insights
Provide your users with advanced self-service tools to access the data
The Hadoop ecosystem is evolving as we speak
Your performance is measured by the tools' effectiveness and ease of use
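
As one concrete form of self-service, analysts can query Hive directly, through Hue's web UI or programmatically. Here is a minimal sketch using the open source PyHive client; the host, table, and column names below are made-up examples, not Conduit's.

from pyhive import hive   # pip install 'pyhive[hive]'

# Hypothetical connection details and schema, for illustration only.
conn = hive.Connection(host="hive.example.internal", port=10000, username="analyst")
cursor = conn.cursor()

# An ad-hoc question answered without a ticket to the data infra team.
cursor.execute("""
    SELECT country, COUNT(DISTINCT user_id) AS users
    FROM daily_events
    WHERE dt = '2013-06-01'
    GROUP BY country
    ORDER BY users DESC
    LIMIT 20
""")
for country, users in cursor.fetchall():
    print(country, users)
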
29. To Summarize…
• Start small
• Identify the opportunities
• Invest in people & related skills
• Adjust processes to the organization's needs
• Know your data limits
• Self-service tools are extremely important
Hadoop in the Enterprise Ecosystem
Hadoop is designed to solve Big Data problems encountered by web and social companies. In doing so, a lot of the features enterprises need or want take a back seat. For example, HDFS does not offer native support for security and authentication.
Hadoop is NOT cheap
Hardware cost – let's say a Hadoop node costs $5,000; a 100-node cluster would then be $500,000 for hardware.
IT and operations costs – teams such as network admins, IT, security admins, and system admins. One also needs to think about operational costs like data center expenses: cooling, electricity, etc.
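
A back-of-envelope version of that reasoning; only the $5,000-per-node price and the 100-node cluster come from the note above, while the staffing and data-center figures are placeholder assumptions.

# Only NODES and HW_PER_NODE come from the note; the rest are illustrative guesses.
NODES = 100
HW_PER_NODE = 5000        # $ per node (from the note)
OPS_STAFF = 3             # assumed admins / ops engineers
COST_PER_PERSON = 120000  # assumed fully loaded annual cost per person, $
DC_PER_NODE_YEAR = 500    # assumed power, cooling and rack cost per node per year, $

hardware_once = NODES * HW_PER_NODE                                   # $500,000 one-time
ops_per_year = OPS_STAFF * COST_PER_PERSON + NODES * DC_PER_NODE_YEAR
print("hardware: $%d, ops per year: $%d" % (hardware_once, ops_per_year))
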
Hadoop is still rough around the edges
The development and admin tools for Hadoop are still pretty new. Companies like Cloudera, Hortonworks, MapR, and Karmasphere have been working on this issue. However, the tooling may not be as mature as enterprises are used to (say, Oracle admin tools, etc.).