BIG DATA:
IT’S MORE THAN VOLUME
Nachum Shacham
PayPal
Big Data Innovation Summit
April 2013
IT’S BIG-DATA TIME!
Volume → big platforms
Variety → multiple data types
Velocity → fast response
Value → a treasure of patterns
TECHNOLOGY HYPE CYCLE
3 DM Tech Forum
BIG DATA
MIXED SIGNALS FROM THE PUNDITS
• Data Lake
• “Needle in a haystack”
• “All hay no needles”
• “Yet another fad”
• “Noth’n new: we’ve been analyzing data for 30 years”
• “Store’em and they’ll come”
• “Don’t ever discard data”
• “$524.752MM ROI in 3 years”
• “Smart” …
• “Hadoop is free”
• “Just…”
USE YOUR OWN FILTER
• Sift facts from MBS
• Seek factual 1-liners
• See through metaphors
• Discount “Smart” (data, algorithms, systems)
• Be skeptical
UNLOCK THE VALUE IN BIG DATA
• Data Trumps Algorithms
• Sufficient data further down the long tail
• Wisdom of the crowd → effective recommendations
• Combine signals from different media
BUSINESS VALUE IN BIG DATA
RISK ANALYSIS
IDENTIFY INFLUENCERS IN SOCIAL GRAPH
ONLINE ADS
REVENUE OPTIMIZATION
FRAUD DETECTION AND PREVENTION
LET’S DIG INTO BIG DATA
• Define KPIs
• Explore
• Model & Measure
• Visualize signals
• Test
• Question test results
• Rinse and Repeat
BIG-DATA ANALYTICS
FROM SEMI-STRUCTURED DATA TO BUSINESS SIGNALS
MapAttempt TASK_TYPE="SETUP" TASKID="task_201212150932_52151_m_000051" TASK_ATTEMPT_ID="attempt_201212150932_52151_m_000051_0" TASK_STATUS="SUCCESS"
Task TASKID="task_201212150932_52151_m_000051" TASK_TYPE="SETUP" TASK_STATUS="SUCCESS" FINISH_TIME="1355822133162" COUNTERS="{(FileSystemCounters)(FileSystemCounters)[(FILE_BYTES_WRITTEN)
Cloud
RDBMS Data Warehouse Hadoop
MPP PLATFORMS AS WORKBENCHES
FOR BIG DATA AND THEIR TOOLS
CLASSES OF ANALYTICS JOBS
Big Data
• Data organization for BI
• A few large models
• Many small models
DATA MANIPULATION
GRAPHICS
MODEL BUILDING
CROSS VALIDATION
PROBLEM MR
FORMULATION
MATCH THE JOB TO THE PLATFORM
Data Sourcing
Data Preparation
Exploratory Data Analysis
Predictive Models
Visualization
Reporting
R: THE TOOL FOR ALL ANALYTICS STEPS
R
Mapper.py (text processing):
• data files
• process lines
• set sorting key and value
• output <key, value>
Shuffle sort
Reducer.R (model per segment):
• Collect segment data marked by key
• Process segment data
• Output processed segment data
BI-LINGUAL HADOOP STREAMING:
LARGE SCALE PARALLEL PREDICTIVE MODELING
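The mapper, shuffle sort and reducer flow on this slide can be sketched locally in a single language. This is a minimal illustration, not the speaker's code: the reducer is written in Python as a stand-in for Reducer.R, and the input layout (`segment,x,y`), the function names and the per-segment model (a simple least-squares line) are all assumptions made for the example.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper step: parse each raw line and emit "key<TAB>value" records."""
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) != 3:
            continue  # skip malformed lines
        segment, x, y = fields
        yield f"{segment}\t{x},{y}"


def fit_line(points):
    """Least-squares fit y = intercept + slope * x for one segment's points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def reduce_lines(sorted_lines):
    """Reducer step: collect each key's records and fit one model per segment."""
    records = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for segment, group in groupby(records, key=lambda kv: kv[0]):
        points = [tuple(float(v) for v in value.split(",")) for _, value in group]
        intercept, slope = fit_line(points)
        yield f"{segment}\t{intercept:.3f}\t{slope:.3f}"


def run_local(lines):
    """Simulate map -> shuffle sort -> reduce in one process."""
    mapped = sorted(map_lines(lines))  # sorting stands in for the shuffle sort
    return list(reduce_lines(mapped))
```

On a cluster, the two steps would live in separate scripts that read `sys.stdin` and print to standard output, passed to Hadoop Streaming through its `-mapper` and `-reducer` options. Because the reducer only sees one segment's records at a time, each model needs only that segment's data in memory.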
SEMI-STRUCTURED DATA → TABULAR DATA
Meta VERSION="1" .
Job JOBID="job_201212150932_52151" JOBNAME="DataFilter" USER="user1234" SUBMIT_TIME="1355822133394"
JOBCONF="hdfs://tmp/hadoop-hadoop/mapred/staging/user1234/.staging/job_201212150932_52151/job.xml"
VIEW_JOB=" " MODIFY_JOB=" " JOB_QUEUE="B" .
Job JOBID="job_201212150932_52151" JOB_PRIORITY="NORMAL" .
Job JOBID="job_201212150932_52151"
LAUNCH_TIME="1355822223576" TOTAL_MAPS="50" TOTAL_REDUCES="0" JOB_STATUS="PREP" .
Task TASKID="task_201212150932_52151_m_000051" TASK_TYPE="SETUP" START_TIME="1355822133148" SPLITS="" .
MapAttempt TASK_TYPE="SETUP" TASKID="task_201212150932_52151_m_000051"
TASK_ATTEMPT_ID="attempt_201212150932_52151_m_000051_0"
START_TIME="1355822133545"
TRACKER_NAME="tracker_dn0492.ebay.com:localhost.localdomain/127.0.0.1:33613" HTTP_PORT="50060" .
MapAttempt TASK_TYPE="SETUP" TASKID="task_201212150932_52151_m_000051"
TASK_ATTEMPT_ID="attempt_201212150932_52151_m_000051_0" TASK_STATUS="SUCCESS"
Task TASKID="task_201212150932_52151_m_000051" TASK_TYPE="SETUP" TASK_STATUS="SUCCESS"
FINISH_TIME="1355822133162" COUNTERS="{(FileSystemCounters)(FileSystemCounters)[(FILE_BYTES_WRITTEN)
(FILE_BYTES_WRITTEN)(27089)]}{(org.apache.hadoop.mapred.Task$Counter)
(Map-Reduce Framework)[(SPILLED_RECORDS)(Spilled Records)(0)]}" .
Job JOBID="job_201212150932_52151" JOB_STATUS="RUNNING" .
Task TASKID="task_201212150932_52151_m_000001" TASK_TYPE="MAP" START_TIME="1355822133163"
attempt,201212171719,248176,m,000013,0,1355499674337,1355499903213,MAP,SUCCESS,default,rack3,lvsaishdc3dn0109,0109
attempt,2012121771719,248176,m,000464,0,1355501042650,1355501253259,MAP,SUCCESS,default,rack5,lvsaishdc3dn0217,0217
attempt,2012121771719,248176,m,000626,0,1355501212902,1355501366476,MAP,SUCCESS,default,rack17,lvsaishdc3dn0776,0776
attempt,2012121771719,248176,m,001193,0,1355499673762,1355499887662,MAP,SUCCESS,default,rack8,lvsaishdc3dn0366,036
attempt,2012121771719,248176,m,001355,0,1355499673545,1355499908182,MAP,SUCCESS,default,rack9,lvsaishdc3dn0386,0386
attempt,2012121771719,248176,m,001517,0,1355501266524,1355501470527,MAP,SUCCESS,default,rack5,lvsaishdc3dn0236,0236
attempt,2012121771719,248176,m,001850,0,1355501303142,1355501486691,MAP,SUCCESS,default,rack5,lvsaishdc3dn0235,0235
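One way to turn job-history lines like those above into tabular rows is to extract the KEY="value" pairs with a regular expression and merge the lines that belong to the same attempt. The sketch below is an assumption about how such a conversion could be done, not the tool used for the slide; the function names and the chosen output columns are hypothetical and simpler than the CSV layout shown.

```python
import csv
import io
import re

# Matches KEY="value" pairs; values may contain anything except a double quote.
PAIR = re.compile(r'(\w+)="([^"]*)"')


def parse_record(line):
    """Split one history line into its record type and a dict of its pairs."""
    record_type = line.split(" ", 1)[0]
    return record_type, dict(PAIR.findall(line))


def attempts_to_rows(lines):
    """Merge all MapAttempt lines that share an attempt ID into one flat row."""
    attempts = {}
    for line in lines:
        record_type, fields = parse_record(line)
        attempt_id = fields.get("TASK_ATTEMPT_ID")
        if record_type != "MapAttempt" or attempt_id is None:
            continue
        attempts.setdefault(attempt_id, {}).update(fields)
    # Hypothetical column selection for the output table.
    columns = ["TASK_ATTEMPT_ID", "TASK_TYPE", "START_TIME", "TASK_STATUS"]
    return [[a.get(c, "") for c in columns] for a in attempts.values()]


def to_csv(rows):
    """Serialize rows as CSV text."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()
```

The start and status of an attempt arrive on different lines, so the rows are keyed by attempt ID and filled in as each line is read.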
FROM TABULAR DATA TO BI
PARALLEL SEGMENTED MODELING
R
R
R
R
R
MAPPERS
REDUCERS
MODELS BUILT ON LARGE DATASETS
Meta VERSION="1" .
Job JOBID="job_201112150932_52151"
JOBNAME="DataFilter"
USER="user1234"
LAUNCH_TIME="1324801865576"
TIME INTERVAL DATA
CONCURRENCY
PERCENTILES
TIME SERIES
WORD COUNT REPRESENTATION
AVOID RAM LIMITATIONS
R STAT PROCESSING
Cloud
R LEVERAGING RDBMS POWER
teradataR Scidb-R
TERADATAR FUNCTIONS (SAMPLE)
Function Name    What it does
td.zscore        Z-score Transformation
td.t.paired      T Test Paired
td.cor           Correlation Matrix
td.f.oneway      One way F Test
td.factanal      Factor Analysis
td.freq          Frequency Analysis
td.hist          Histograms
td.kmeans        K-Means Clustering
td.ks            Kolmogorov-Smirnov Test
td.mode          Mode Value of Column
td.tapply        Apply a function over a database column
td.summary       Like R summary()
td.quantiles     Quantile Values
td.rank          Rank
ANALYSIS OF A TABLE WITH > 1B ROWS
> library(RJDBC)
> library(teradataR)
> tdConnect("TD_WH", uid = tdlogin, pwd = tdpwd, database = "myVDM")
> system.time(myTbldf <- td.data.frame("myTbl"))
user system elapsed
0.092 0.054 140.071
> dim(myTbldf)
[1] 1,131,670,269 9
> system.time(cor <- td.cor(myTbldf[3:9]))
user system elapsed
0.021 0.003 6.722
C D E F G H I
C 1.0000000 0.7096425 0.22154483 0.24186862 0.13354501 0.4954111 0.19577803
D 0.7096425 1.0000000 0.24272691 0.27590234 0.13358632 0.4279517 0.14634683
E 0.2215448 0.2427269 1.00000000 0.08940507 0.03734827 0.1631614 0.04401034
F 0.2418686 0.2759023 0.08940507 1.00000000 0.07664496 0.1686094 0.04744032
G 0.1335450 0.1335863 0.03734827 0.07664496 1.00000000 0.1247046 0.05837435
H 0.4954111 0.4279517 0.16316144 0.16860940 0.12470460 1.0000000 0.35395733
I 0.1957780 0.1463468 0.04401034 0.04744032 0.05837435 0.3539573 1.00000000
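The general idea of delegating the heavy pass to the database can be sketched as follows: the database computes a handful of aggregates over all rows, and the client finishes the Pearson correlation from those six numbers. This is only an illustration of the pattern, with SQLite from the Python standard library standing in for the warehouse; it makes no claim about how teradataR implements `td.cor`, and the function, table and column names are hypothetical.

```python
import math
import sqlite3


def db_correlation(conn, table, col_x, col_y):
    """Pearson correlation from aggregates evaluated inside the database."""
    # Identifiers are interpolated directly, so pass only trusted names.
    sql = (
        f"SELECT COUNT(*), SUM({col_x}), SUM({col_y}), "
        f"SUM({col_x} * {col_y}), SUM({col_x} * {col_x}), SUM({col_y} * {col_y}) "
        f"FROM {table}"
    )
    # One row of six numbers comes back, however many rows the table holds.
    n, sx, sy, sxy, sxx, syy = conn.execute(sql).fetchone()
    cov = n * sxy - sx * sy
    var_x = n * sxx - sx * sx
    var_y = n * syy - sy * sy
    return cov / math.sqrt(var_x * var_y)
```

Only the aggregates cross the network, which is consistent with the small client-side user and system times in the session above. The sum-of-products formula can lose precision on badly scaled data, so this is a sketch rather than a numerically robust routine.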
CONCLUSION
• Big data is here. See through the hype
• Analyze big data to extract value
• Multiple technologies & analytics tools are out there
• Match platform, tools and approach
• Delegate massive processing to big clusters
QUESTIONS?
BIG DATA EMPOWERS ALGORITHMS
Banko & Brill, “Scaling to Very Very Large Corpora for Natural Language Disambiguation”

Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 

Big Data: It's More Than Volume, Paypal

  • 1. BIG DATA: IT’S MORE THAN VOLUME Nachum Shacham PayPal Big Data Innovation Summit April 2013
  • 2. IT’S BIG-DATA TIME! Volume → big platforms Variety → multiple data types Velocity → fast response Value → a treasure of patterns
  • 3. TECHNOLOGY HYPE CYCLE 3 DM Tech Forum BIG DATA
  • 4. MIXED SIGNALS FROM THE PUNDITS • Data Lake • “Needle in a hay stack” • “All hay no needles” • “Yet another fad” • “Noth’n new: we’ve been analyzing data for 30 years” • “Store’em and they’ll come” • “Don’t ever discard data” • “$524.752MM ROI in 3 years” • “Smart” … • “Hadoop is free” • “Just…”
  • 5. USE YOUR OWN FILTER • Sift facts from MBS • Seek factual 1-liners • See through metaphors • Discount “Smart” (data, algorithms, systems) • Be skeptical
  • 6. UNLOCK THE VALUE IN BIG DATA • Data Trumps Algorithms • Sufficient data further down the long tail • Wisdom of the crowd → effective recommendations • Combine signals from different media
  • 7. BUSINESS VALUE IN BIG DATA RISK ANALYSIS IDENTIFY INFLUENCERS IN SOCIAL GRAPH ONLINE ADS REVENUE OPTIMIZATION FRAUD DETECTION AND PREVENTION
  • 8. LET’S DIG INTO BIG DATA • Define KPIs • Explore • Model & Measure • Visualize signals • Test • Question test results • Rinse and Repeat
  • 9. BIG-DATA ANALYTICS FROM SEMI-STRUCTURED DATA TO BUSINESS SIGNALS MapAttempt TASK_TYPE="SETUP" TASKID="task_201212150932_52151_m_000051" TASK_ATTEMPT_ID="attempt_201212150932_52151_m_000051_0" TASK_STATUS="SUCCESS" Task TASKID="task_201212150932_52151_m_000051" TASK_TYPE="SETUP" TASK_STATUS="SUCCESS" FINISH_TIME="1355822133162" COUNTERS="{(FileSystemCounters)(FileSystemCounters)[(FILE_BYTES_WRITTEN)
  • 10. Cloud RDBMS Data Warehouse Hadoop MPP PLATFORMS AS WORKBENCHES FOR BIG DATA AND THEIR TOOLS
  • 11. CLASSES OF ANALYTICS JOBS Big Data Data organization for BI A few large models Many small models DATA MANIPULATION GRAPHICS MODEL BUILDING CROSS VALIDATION PROBLEM MR FORMULATION
  • 12. MATCH THE JOB TO THE PLATFORM
  • 14. BI-LINGUAL HADOOP STREAMING: LARGE SCALE PARALLEL PREDICTIVE MODELING Mapper.py (text processing): data files → process lines → set sorting key and value → output <key, value>. Shuffle sort. Reducer.R (model per segment): collect segment data marked by key → process segment data → output processed segment data.
  • 15. SEMI-STRUCTURED DATA → TABULAR DATA
      Meta VERSION="1" .
      Job JOBID="job_201212150932_52151" JOBNAME="DataFilter" USER="user1234" SUBMIT_TIME="1355822133394" JOBCONF="hdfs://tmp/hadoop-hadoop/mapred/staging/user1234/.staging/job_201212150932_52151/job.xml" VIEW_JOB=" " MODIFY_JOB=" " JOB_QUEUE="B" .
      Job JOBID="job_201212150932_52151" JOB_PRIORITY="NORMAL" .
      Job JOBID="job_201212150932_52151" LAUNCH_TIME="1355822223576" TOTAL_MAPS="50" TOTAL_REDUCES="0" JOB_STATUS="PREP" .
      Task TASKID="task_201212150932_52151_m_000051" TASK_TYPE="SETUP" START_TIME="1355822133148" SPLITS="" .
      MapAttempt TASK_TYPE="SETUP" TASKID="task_201212150932_52151_m_000051" TASK_ATTEMPT_ID="attempt_201212150932_52151_m_000051_0" START_TIME="1355822133545" TRACKER_NAME="tracker_dn0492.ebay.com:localhost.localdomain/127.0.0.1:33613" HTTP_PORT="50060" .
      MapAttempt TASK_TYPE="SETUP" TASKID="task_201212150932_52151_m_000051" TASK_ATTEMPT_ID="attempt_201212150932_52151_m_000051_0" TASK_STATUS="SUCCESS"
      Task TASKID="task_201212150932_52151_m_000051" TASK_TYPE="SETUP" TASK_STATUS="SUCCESS" FINISH_TIME="1355822133162" COUNTERS="{(FileSystemCounters)(FileSystemCounters)[(FILE_BYTES_WRITTEN)(FILE_BYTES_WRITTEN)(27089)]}{(org.apache.hadoop.mapred.Task$Counter)(Map-Reduce Framework)[(SPILLED_RECORDS)(Spilled Records)(0)]}" .
      Job JOBID="job_201212150932_52151" JOB_STATUS="RUNNING" .
      Task TASKID="task_201212150932_52151_m_000001" TASK_TYPE="MAP" START_TIME="1355822133163"

      attempt,201212171719,248176,m,000013,0,1355499674337,1355499903213,MAP,SUCCESS,default,rack3,lvsaishdc3dn0109,0109
      attempt,2012121771719,248176,m,000464,0,1355501042650,1355501253259,MAP,SUCCESS,default,rack5,lvsaishdc3dn0217,0217
      attempt,2012121771719,248176,m,000626,0,1355501212902,1355501366476,MAP,SUCCESS,default,rack17,lvsaishdc3dn0776,0776
      attempt,2012121771719,248176,m,001193,0,1355499673762,1355499887662,MAP,SUCCESS,default,rack8,lvsaishdc3dn0366,036
      attempt,2012121771719,248176,m,001355,0,1355499673545,1355499908182,MAP,SUCCESS,default,rack9,lvsaishdc3dn0386,0386
      attempt,2012121771719,248176,m,001517,0,1355501266524,1355501470527,MAP,SUCCESS,default,rack5,lvsaishdc3dn0236,0236
      attempt,2012121771719,248176,m,001850,0,1355501303142,1355501486691,MAP,SUCCESS,default,rack5,lvsaishdc3dn0235,0235
  • 16. FROM TABULAR DATA TO BI
  • 18. MODELS BUILT ON LARGE DATASETS Meta VERSION="1" . Job JOBID="job_201112150932_52151" JOBNAME="DataFilter" USER="user1234" LAUNCH_TIME="1324801865576" TIME INTERVAL DATA CONCURRENCY PERCENTILES TIME SERIES WORD COUNT REPRESENTATION AVOID RAM LIMITATIONS R STAT PROCESSING
  • 19. Cloud R LEVERAGING RDBMS POWER teradataR Scidb-R
  • 20. TERADATAR FUNCTIONS (SAMPLE)
      Function Name   What it does
      td.zscore       Z-score transformation
      td.t.paired     Paired t-test
      td.cor          Correlation matrix
      td.f.oneway     One-way F test
      td.factanal     Factor analysis
      td.freq         Frequency analysis
      td.hist         Histograms
      td.kmeans       K-means clustering
      td.ks           Kolmogorov-Smirnov test
      td.mode         Mode value of a column
      td.tapply       Apply a function over a database column
      td.summary      Like R summary()
      td.quantiles    Quantile values
      td.rank         Rank
  • 21. ANALYSIS OF A TABLE WITH > 1B ROWS
      > library(RJDBC)
      > library(teradataR)
      > tdConnect("TD_WH", uid = tdlogin, pwd = tdpwd, database = "myVDM")
      > system.time(myTbldf <- td.data.frame("myTbl"))
         user  system elapsed
        0.092   0.054 140.071
      > dim(myTbldf)
      [1] 1,131,670,269 9
      > system.time(cor <- td.cor(myTbldf[3:9]))
         user  system elapsed
        0.021   0.003   6.722
                C         D          E          F          G         H          I
      C 1.0000000 0.7096425 0.22154483 0.24186862 0.13354501 0.4954111 0.19577803
      D 0.7096425 1.0000000 0.24272691 0.27590234 0.13358632 0.4279517 0.14634683
      E 0.2215448 0.2427269 1.00000000 0.08940507 0.03734827 0.1631614 0.04401034
      F 0.2418686 0.2759023 0.08940507 1.00000000 0.07664496 0.1686094 0.04744032
      G 0.1335450 0.1335863 0.03734827 0.07664496 1.00000000 0.1247046 0.05837435
      H 0.4954111 0.4279517 0.16316144 0.16860940 0.12470460 1.0000000 0.35395733
      I 0.1957780 0.1463468 0.04401034 0.04744032 0.05837435 0.3539573 1.00000000
  • 22. CONCLUSION • Big data is here. See through the hype • Analyze big data to extract value • Multiple technologies & analytics tools are out there • Match platform, tools and approach • Delegate massive processing to big clusters
  • 24. BIG DATA EMPOWERS ALGORITHMS Banko & Brill “Scaling to Very Very Large Corpora for Natural Language Disambiguation”
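
The step from semi-structured log records to tabular rows on slide 15 can be illustrated with a short Python sketch. This is not the deck's code, only one plausible way to do it. It assumes each record is a record type followed by KEY="value" pairs, and that values never contain a double quote. The names `parse_record` and `to_row`, and the choice of output columns, are hypothetical.

```python
import re

# Assumption: every field has the form KEY="value" and values contain no double quote.
PAIR = re.compile(r'(\w+)="([^"]*)"')

def parse_record(line):
    """Split one job-history line into (record_type, {key: value})."""
    record_type, _, rest = line.strip().partition(" ")
    return record_type, dict(PAIR.findall(rest))

def to_row(fields, columns):
    """Flatten selected fields into one comma-separated row; missing fields stay empty."""
    return ",".join(fields.get(col, "") for col in columns)

# Sample record copied from slide 15 (COUNTERS omitted for brevity).
sample = ('Task TASKID="task_201212150932_52151_m_000051" TASK_TYPE="SETUP" '
          'TASK_STATUS="SUCCESS" FINISH_TIME="1355822133162" .')

record_type, fields = parse_record(sample)
row = to_row(fields, ["TASKID", "TASK_TYPE", "TASK_STATUS", "FINISH_TIME", "START_TIME"])
print(record_type, row)
```

Fields that a record does not carry (here `START_TIME`) come out as empty columns, which keeps every row the same width for later loading into a table.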

Editor's Notes

  1. Big data is here, and corporations leverage MPP platforms like Hadoop and Teradata for cost-effective storage and processing of vast amounts of data. However, mining the business benefits of big data requires new approaches for deep analytics, including predictive modeling and statistical analysis. Modeling big data requires a comprehensive process that handles noisy data of different structures and runs in parallel on a large number of processors, while still performing the analytics tasks in a cost-effective manner. We describe our experience in running statistical analysis and modeling of big data. We review and compare the platforms we use to store and process the data, then describe integrating processing with R, Python and SQL on Hadoop and Teradata for a range of analytics tasks.
  2. The large volumes of data need to be stored and processed on data platforms, which are clusters of computers with vast storage and processing power. The data consist of a combination of structured, semi-structured and unstructured data that needs special processing: cleansing the data, reshaping it for modeling, and applying a large set of algorithms to extract its value. Big data contain enough information to analyze market segments that would otherwise be too small. The sheer number of combinations of those segments can yield a wealth of patterns that can be mined for the corporation. The more people get to view and explore the data, the more patterns will be identified, increasing the value to the corporation. Thus, making big data analysis feasible for large groups of people, beyond a few developers, will lead to more interaction with the data and hence to more benefits.
  3. Big data offer corporations many opportunities to extract signals that guide profitable decisions. A large portion of the new big data comes from the wild in unstructured and semi-structured formats. These data need to be cleansed and structured to enable the computation of statistical metrics and the construction of predictive models. The volume, the formats and the wealth of analysis tasks require different tools and environments to store and process the data. The patterns and signals in the data are more likely to be extracted when a large number of analysts are given access and can construct their own models. Thus, make the tools available and accessible to the many.
  4. The most common architecture for big data is MPP. RDBMS and Hadoop are the most common MPP platforms. They are similar in employing a large number of processors and disks and in distributing the processing to where the data are. RDBMS and Hadoop offer different programming environments and performance characteristics. Companies are increasingly deploying both platforms to accommodate a wide spectrum of business analytics needs. When supporting multiple concurrent user jobs, they have to deliver not only data and computation but also a quality of service that matches users’ expectations. How to allocate workloads to platforms to maximize value is an area of active research. A large number of programming languages and tools have been developed for these platforms. Java, Pig, Hive and Scala are powerful tools that many organizations have adopted. We have found Python and R to be particularly attractive for the analytics tasks that we perform. They are well-established languages that many analysts have been using for years on smaller datasets. When combined in the Streaming framework, R and Python can be used to create models quickly, in code that is clear and concise. Their packages provide many models and processing tasks out of the box. Teradata offers a strong SQL implementation with many extension UDFs designed for processing semi-structured data in textual format. An R package was recently published that enables using the processing power of the cluster to run many statistical functions on massive datasets.
  5. This table compares the platforms based on the types of processing tasks. For example, scanning large tables of text is most suitable for Hadoop, whereas jobs that modify tables or search based on a primary index are performed more efficiently in Teradata (TD). Special functions can be written more easily for Hadoop, whereas a join of two large tables is more easily done on TD. When data are replicated across multiple platforms, such tables are used to decide on the best platform for a particular job.
  6. We now turn to the topic of creating and running the actual analytics tasks. R is a powerful language that was designed for data analysis and statistical modeling. It has functions and packages for processing data at every step of the data analysis cycle: from sourcing the data from an RDBMS, flat files or the web, through data preparation, exploratory data analysis, model creation for all imaginable statistical tests or algorithms, DOE, model validation and variable selection, all the way to the creation of charts and graphs for presenting the results. R is gaining in popularity and has been placed among the top 20 programming languages. However, in our experience Python is more effective for text processing, which calls for using both languages in Hadoop tasks.
  7. On Hadoop, the Streaming framework enables us to run the mapper and the reducer in different languages. In this environment, the mapper is written in Python and the reducer in R. The cleansed and filtered map data is sent to the framework with proper keys that deliver the data to the reducer in logical chunks, each of which is treated as a statistical dataset in the form of a data.frame. The model is built on these data frames in the same way it has traditionally been done in R, except that in Hadoop all reducers perform these tasks in parallel.
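
The map, shuffle and per-segment reduce flow described in this note can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the presenter's code. The reducer here is written in Python rather than R so the example stays in one language, and a per-segment mean stands in for the statistical model. The column meanings (fields 6 and 7 as start and end timestamps in milliseconds, field 11 as the rack) are guesses based on the tabular rows on slide 15. In a real Streaming job the mapper and reducer would be separate scripts reading stdin and writing stdout, and Hadoop would do the sort.

```python
from itertools import groupby
from statistics import mean

def mapper(lines):
    """Emit 'key<TAB>value' per record: key = rack, value = attempt duration in ms."""
    for line in lines:
        f = line.strip().split(",")
        if len(f) < 12 or f[9] != "SUCCESS":
            continue  # drop malformed rows and failed attempts
        yield f"{f[11]}\t{int(f[7]) - int(f[6])}"

def reducer(sorted_lines):
    """Consume key-sorted lines and yield one (key, summary) pair per segment."""
    pairs = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        durations = [int(value) for _, value in group]
        yield key, mean(durations)  # stand-in for fitting a model on the segment

# Rows copied from slide 15.
rows = [
    "attempt,201212171719,248176,m,000013,0,1355499674337,1355499903213,MAP,SUCCESS,default,rack3,lvsaishdc3dn0109,0109",
    "attempt,2012121771719,248176,m,000464,0,1355501042650,1355501253259,MAP,SUCCESS,default,rack5,lvsaishdc3dn0217,0217",
    "attempt,2012121771719,248176,m,001517,0,1355501266524,1355501470527,MAP,SUCCESS,default,rack5,lvsaishdc3dn0236,0236",
]

shuffled = sorted(mapper(rows))    # stands in for Hadoop's shuffle and sort
result = dict(reducer(shuffled))
print(result)
```

Because each key's records arrive contiguously after the sort, the reducer only ever holds one segment in memory, which is what lets many small models be built in parallel across reducers.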