Page 1
Data Science: A view from the trenches
Ram Sriharsha
Twitter: @halfabrane
Vinay Shukla
Twitter: @neomythos
Page 2
Agenda
•  Problems we work on
•  Common Challenges
•  Reductions
•  Handling label sparsity
–  Co Training
–  Adaptive Learning
•  When you have to be fast and accurate
–  Online Clustering
–  Sketches
–  Online Learning
•  Visualization
Page 3
Some Problems
• Search Advertising
– Click Prediction: Given a query, ad and user context, how likely is the user to click on the ad?
– Feature Engineering: Query/ad categorization, query -> feature vector
• Entity Resolution and Disambiguation
• Detection of over/under payment of claims
• Document Matching
• Login Risk Detection
Page 4
Common Challenges
•  Labeling is expensive and not clean
– Selectively ask for labels (active learning)
– Co-Training to expand label set
•  Not enough high quality implementations of algorithms
– Modular extensions of base implementations (Reductions)
– Boosting
•  Speed of training/ scoring important
– Online learning
– Online clustering
– Sketches
•  Freshness of models
– Online and adaptive learning
•  Visualizing performance and feature importance
– Zeppelin
Page 5
Reductions
• Let A = algorithm for optimizing 0/1 loss
• OVR (one-vs-rest): run a copy of A per class (A, A, …), then randomize over the classifiers that output yes
• Importance weighting: R → A → R^-1
– Let R = rejection sampling algorithm
– For each example h, sample according to the cost of h and feed to the 0/1 classifier
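The importance-weighting reduction can be sketched as below. This is a minimal sketch, assuming the sampler R keeps each example with probability cost / max cost; the function name `rejection_sample` and the toy data are hypothetical and not from the talk.

```python
import random

def rejection_sample(examples, costs, rng):
    """Keep each example with probability cost / max_cost (assumed form of R).

    The surviving examples can be fed, unweighted, to any 0/1 classifier A.
    """
    max_cost = max(costs)
    if max_cost <= 0:
        return []
    return [ex for ex, c in zip(examples, costs) if rng.random() < c / max_cost]

rng = random.Random(0)
examples = ["cheap", "costly"]
costs = [1.0, 10.0]

# Count how often each example survives over many independent draws.
counts = {"cheap": 0, "costly": 0}
for _ in range(2000):
    for ex in rejection_sample(examples, costs, rng):
        counts[ex] += 1
```

The costly example is always kept, while the cheap one survives roughly one draw in ten, so an unweighted learner sees examples in proportion to their cost.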
Page 6
Active Learning
• Given a pool of examples, determine which ones the classifier is least confident about
• Ask for those examples to be labeled, and feed them to training
• Choose query points that shrink the space of classifiers rapidly
• Exploit natural structure in data
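Picking the least-confident examples can be sketched as follows. This is a minimal sketch of uncertainty sampling for a binary classifier, assuming confidence is measured by how far the predicted probability is from 0.5; `least_confident` and `toy_proba` are hypothetical names.

```python
def least_confident(pool, predict_proba, n_queries):
    """Return the n_queries pool items whose predicted probability is closest to 0.5."""
    return sorted(pool, key=lambda x: abs(predict_proba(x) - 0.5))[:n_queries]

# Hypothetical 1-D classifier: probability of the positive class rises with x.
def toy_proba(x):
    return min(1.0, max(0.0, x / 10.0))

pool = [0.5, 2.0, 4.8, 5.3, 9.0]
to_label = least_confident(pool, toy_proba, 2)  # the two points nearest the decision boundary
```

Only the selected points are sent for labeling; the rest of the pool stays unlabeled until a later round.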
Page 7
Co-Training
• Suppose you have two “views” of the data
– e.g., web pages have content, and hyperlinks pointing to and from them
– Suppose the problem is to label a web page as being about literature or not (binary classification)
• One approach:
– Label web pages manually. Train a classifier to use both content text and hyperlinks as features
– This requires a large number of labeled pages
• Other approach:
– Since we have two views, try to learn two classifiers
– Each classifier learns on a subset of labeled examples.
– The scores of each classifier are used to label a subset of unlabeled web pages and extend the labels for the other classifier.
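The second approach can be sketched as the loop below. This is a simplified sketch, not the speakers' implementation: each view is a single number, each classifier is a 1-D nearest-centroid rule, and both classifiers add their most confident labels to one shared labeled set. All names and the toy data are hypothetical.

```python
def fit_centroids(values, labels):
    """1-D nearest-centroid classifier: returns the mean of each class."""
    means = {}
    for cls in (0, 1):
        members = [v for v, y in zip(values, labels) if y == cls]
        means[cls] = sum(members) / len(members)
    return means

def predict(means, v):
    """Return (label, confidence); confidence is the gap between the two centroid distances."""
    d0, d1 = abs(v - means[0]), abs(v - means[1])
    return (0 if d0 < d1 else 1), abs(d0 - d1)

def co_train(labeled, unlabeled, rounds=3, per_round=1):
    """labeled: list of ((view_a, view_b), label); unlabeled: list of (view_a, view_b)."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not unlabeled:
                return labeled
            means = fit_centroids([x[view] for x, _ in labeled], [y for _, y in labeled])
            # The classifier for this view labels the examples it is most sure about...
            ranked = sorted(unlabeled, key=lambda x: predict(means, x[view])[1], reverse=True)
            for x in ranked[:per_round]:
                # ...and they join the shared labeled set, so the other view trains on them next.
                labeled.append((x, predict(means, x[view])[0]))
                unlabeled.remove(x)
    return labeled

seed = [((0.0, 0.0), 0), ((1.0, 1.0), 1)]
pool = [(0.1, 0.45), (0.95, 0.6), (0.4, 0.05), (0.65, 0.9)]
result = dict(co_train(seed, pool))
```

In this toy run each unlabeled point is ambiguous in one view but clear in the other, so the view that is confident supplies the label.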
Page 8
Sketches
• Store a “summary” of the dataset
• Querying the sketch is “almost” as good as querying the dataset
• Example: frequent items in a stream
– Initialize associative array A with room for at most k-1 keys
– Process: for each item j in the stream
-  if j is in keys(A), A[j] += 1
-  else if |keys(A)| < k - 1, A[j] = 1
-  else, for each l in keys(A):
»  A[l] -= 1
»  if A[l] = 0, remove l
– Done: the keys left in A are the candidate frequent items
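The frequent-items procedure above is commonly known as the Misra-Gries summary. A minimal Python sketch follows; the function name `frequent_items` and the toy stream are illustrative.

```python
def frequent_items(stream, k):
    """Frequent-items summary that keeps at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

summary = frequent_items("abacabadaea", 3)
```

Here `a` occurs 6 times in a stream of 11 items and survives with a count of 4. Each stored count underestimates the true frequency by at most n/k, which is what makes querying the sketch “almost” as good as querying the data.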
Page 9
Clustering is not fast enough
• Sample and then cluster
• Do clusters need to dynamically adapt?
– Online clustering
– Streaming K Means
Page 10
K Means
• Initialize cluster centers somehow
– random
– k-means++
• Alternate
– Assign each point to closest cluster center
– Move cluster center to average of points assigned to center
• Stop when convergence criteria reached
– Points don’t move “much”
– Number of iterations reached.
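The alternation above can be sketched in plain Python. This is a minimal sketch, assuming random initialization, squared Euclidean distance, and "centers stop changing" as the convergence test; the names and the toy points are illustrative.

```python
import random

def closest(point, centers):
    """Index of the center nearest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))

def k_means(points, k, max_iter=100, seed=0):
    centers = random.Random(seed).sample(points, k)  # random initialization
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest(p, centers)].append(p)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: nothing moved
            break
        centers = new_centers
    return centers

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = k_means(points, 2)
```

On these two well-separated groups the centers settle on the two group means after a few iterations.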
Page 11
Initialize Cluster Centers
[Figure: scatter plot on X/Y axes. Pick 3 initial cluster centers (k1, k2, k3) randomly.]
Page 12
Assign each point
[Figure: scatter plot on X/Y axes with centers k1, k2, k3. Assign each point to the closest cluster center.]
Page 13
Recompute Cluster Centers
[Figure: scatter plot on X/Y axes showing old and new positions of k1, k2, k3. Move each cluster center to the mean of each cluster.]
Page 14
Streaming K Means
• For each new point
– Assign to closest cluster center
– Update cluster center to incrementally move in direction of new point
• Online version of Lloyd’s algorithm
• Good enough in practice
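The per-point update can be sketched as below. This is a minimal sketch, assuming a running-mean step size of 1/count for the chosen center; a constant step size would instead let old points fade. `streaming_update` and the toy data are hypothetical.

```python
def streaming_update(centers, counts, point):
    """Assign point to its closest center and move that center toward it, in place."""
    i = min(range(len(centers)),
            key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])))
    counts[i] += 1
    rate = 1.0 / counts[i]  # running-mean step (assumption of this sketch)
    centers[i] = [c + rate * (p - c) for p, c in zip(point, centers[i])]
    return i

# Each center starts as one already-seen point, hence counts of 1.
centers = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]
for p in [(2.0, 0.0), (10.0, 12.0), (1.0, 3.0)]:
    streaming_update(centers, counts, p)
```

With the 1/count step, each center stays equal to the mean of the points assigned to it so far, without storing those points.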
Page 15
Recompute Cluster Centers
[Figure: scatter plot on X/Y axes with centers k1, k2, k3. Move each cluster center to the mean of each cluster.]
Online Clustering (Liberty, Sriharsha, Sviridenko)
• Initialization Phase:
– First point is its own cluster
– Pick some Normalization factor f
• Update Phase for point p:
– Let d = distance from p to closest center so far
– With probability min(d/f, 1), form a new cluster center at p
– Otherwise, attach p to the closest center
• Merge Phase:
– Once sufficient clusters have opened up, or sufficient cost accumulated, merge clusters
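The initialization and update phases can be sketched as below. This is a rough sketch, not the published algorithm: it assumes a fixed normalization factor f, takes d as plain Euclidean distance, opens a new center with probability min(d/f, 1) so that far-away points are the ones likely to start a cluster, and leaves out the merge phase. The function name and the toy points are hypothetical.

```python
import math
import random

def online_cluster(points, f, rng):
    """Update phase only: each point either joins its closest center or opens a new one."""
    centers, assignment = [], []
    for p in points:
        if not centers:
            centers.append(p)            # the first point is its own cluster
            assignment.append(0)
            continue
        dists = [math.dist(p, c) for c in centers]
        nearest = min(range(len(centers)), key=dists.__getitem__)
        if rng.random() < min(dists[nearest] / f, 1.0):
            centers.append(p)            # far from every center: likely to open a new cluster
            assignment.append(len(centers) - 1)
        else:
            assignment.append(nearest)   # close to an existing center: attach to it
    return centers, assignment

points = [(0.0, 0.0), (0.0, 0.0), (100.0, 100.0), (100.0, 100.0)]
centers, assignment = online_cluster(points, f=1.0, rng=random.Random(0))
```

In the toy run the duplicates sit at distance 0 from an existing center and always attach, while the distant point is farther than f from every center and always opens a cluster.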
Page 17
Properties
• Provably close to optimal in online setting
• Does not open more than O(log(OPT)) clusters, and pays O(OPT) cost
• Very efficient to implement
• Adaptive algorithm
• Forgetfulness can be introduced in the merge process
• Leaving out the merge process still produces a clustering that might be indicative of structure, i.e., useful as a machine learning feature
Page 18
My classifier is not fast enough
• Even for batch problems online learning might be good enough!
• For real time problems, online learning or incremental learning is needed.
Page 19
What is online learning?
• Batch Learning:
– Classifier sees a set of labeled examples, and trains a model
– Predicts on trained model for unseen examples
• Online Learning:
– Classifier sees an example at a time.
– Limited look back window (often 0)
– Predicts on the example, then the cost is revealed
– Learns from its mistakes
– Yields a one-pass batch learning algorithm: simply run the online algorithm on each example in the batch.
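The predict, reveal, learn loop can be illustrated with a perceptron, used here only as a familiar stand-in for an online learner; the talk does not name a specific algorithm. Labels are assumed to be +1 or -1, and the function name and toy stream are hypothetical.

```python
def perceptron_online(stream, n_features):
    """Process (features, label) pairs one at a time; labels are +1 or -1."""
    w = [0.0] * n_features
    mistakes = 0
    for x, y in stream:
        score = sum(wi * xi for wi, xi in zip(w, x))
        prediction = 1 if score > 0 else -1   # predict before seeing the label
        if prediction != y:                   # cost revealed: learn from the mistake
            mistakes += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w, mistakes

stream = [((1.0, 0.0), 1), ((0.0, 1.0), -1), ((2.0, 0.5), 1), ((0.5, 2.0), -1)]
weights, mistakes = perceptron_online(stream, 2)
```

The look-back window is zero: each example is used once and then discarded, so running this over a stored dataset gives a one-pass batch learner.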
Page 20
Challenges of online learning
• Normalization
– In a batch setup, can normalize data by making a pass over the full dataset
– In online setting, cannot make a second pass
– Solution: Adaptive normalization
• Late arriving features
– In batch setting, all features are recorded in the dataset
– In online setting different features may arrive at different times
– Solution: AdaGrad (adaptive gradient technique)
• Stochastic Gradient Descent convergence can be slow
– More data helps
– Adaptive normalization improves convergence
– Adagrad improves convergence and reduces sensitivity to step size
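Both remedies can be sketched in a few lines. This is a minimal sketch under stated assumptions: `adagrad_step` is the standard per-coordinate AdaGrad update, while `RunningScaler` uses Welford's running mean and variance as one simple stand-in for adaptive normalization, which may differ from the scheme the speakers had in mind. All names and values are hypothetical.

```python
import math

def adagrad_step(w, g_sq, grad, lr=0.1, eps=1e-8):
    """One AdaGrad update in place: each coordinate gets its own step size."""
    for i, g in enumerate(grad):
        g_sq[i] += g * g
        w[i] -= lr * g / (math.sqrt(g_sq[i]) + eps)

class RunningScaler:
    """Welford's running mean/variance: normalize with statistics seen so far."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def scale(self, x):
        std = math.sqrt(self.m2 / self.n) if self.n > 1 and self.m2 > 0 else 1.0
        return (x - self.mean) / std

# The second feature "arrives late": its gradient is zero on the first step.
w, g_sq = [0.0, 0.0], [0.0, 0.0]
adagrad_step(w, g_sq, [1.0, 0.0])
adagrad_step(w, g_sq, [1.0, 1.0])

scaler = RunningScaler()
for value in [1.0, 2.0, 3.0]:
    scaler.update(value)
```

On the second step the late feature still takes a full-size step of about 0.1, while the already-seen feature takes a smaller one of about 0.07, which is why a per-coordinate step size helps with features that show up at different times.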
Page 21
Visualization
• Speed up feature discovery
• Intuitive visualization of model performance
• Improve debuggability
Page 22
The Data Science Workflow…
[Figure: workflow diagram running from “Start here” to “End here”, with steps labeled “Script” or “Visualize”. Stages shown:]
• Plan: What is the question I'm answering? What data will I need?
• Acquire the data
• Clean data: analyze data quality, reformat, impute, etc.
• Analyze data, visualize
• Create features
• Create model
• Evaluate results
• Create report
• Deploy in production
• Publish & share
Page 23
Introducing Apache Zeppelin: Web-based Notebook for interactive analytics
Use Case
Data exploration and discovery
Visualization
Interactive snippet-at-a-time experience
“Modern Data Science Studio”
Page 24
Zeppelin today in Data Science Workflow…
[Figure: the workflow diagram from Page 22, repeated.]
Page 25
Zeppelin – Road Ahead
Enterprise Ready
Operations
-  Deploy to the cluster with Ambari
Security
-  Authentication against LDAP
-  SSL
-  Run in Kerberized Cluster
-  Authorization of notebooks
Sharing/Collaboration
-  Share selected notebooks with selected users/groups
-  Ability to read/publish notebooks to GitHub
Ease of Use
Data Import
-  Visual data import/download
-  Clean data as it comes
Usability
-  Summary Data – See column summary
-  Keyboard shortcuts, auto complete, syntax highlighting, line numbers
Visualization
-  Pluggable visualization & more charts, maps & tables.
R support
-  Harden SparkR interpreter
Page 26
Upcoming Work
• Entity Resolution package GA
– Supports Entity Graph based resolution
– Includes Random Walk algorithm for computing similarity score
• Online learning and clustering Spark Packages
• Contribute more Reduction algorithms to Spark ML
– Cost Sensitive Classification
– Filter tree based Multiclass Reduction
• Zeppelin GA
Page 27
Thank you!
• Ram Sriharsha
@halfabrane
• Vinay Shukla
@neomythos

Diamond Application Development Crafting Solutions with Precision
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 

Apache con big data 2015 - Data Science from the trenches

  • 1. Page 1 Data Science: A view from the trenches Ram Sriharsha Twitter: @halfbrane Vinay Shukla Twitter: @neomythos
  • 2. Page 2 Agenda
      • Problems we work on
      • Common Challenges
      • Reductions
      • Handling label sparsity
        – Co Training
        – Adaptive Learning
      • When you have to be fast and accurate
        – Online Clustering
        – Sketches
        – Online Learning
      • Visualization
  • 3. Page 3 Some Problems
      • Search Advertising
        – Click Prediction: given a query, ad, and user context, how likely is the user to click on the ad?
        – Feature Engineering: query/ad categorization, query -> feature vector
      • Entity Resolution and Disambiguation
      • Over/Under Payment of claims detection
      • Document Matching
      • Login Risk Detection
  • 4. Page 4 Common Challenges
      • Labeling is expensive and not clean
        – Selectively ask for labels (active learning)
        – Co-Training to expand the label set
      • Not enough high-quality implementations of algorithms
        – Modular extensions of base implementations (Reductions)
        – Boosting
      • Speed of training/scoring is important
        – Online learning
        – Online clustering
        – Sketches
      • Freshness of models
        – Online and adaptive learning
      • Visualizing performance and feature importance
        – Zeppelin
  • 5. Page 5 Reductions
      • Let A = algorithm for optimizing 0/1 loss
      • OVR (one-vs-rest): several copies of A (A, A, …); randomize over the classifiers that output yes
      • Importance weighting: R, then A, then R^-1
        – Let R = rejection sampling algorithm
        – For each example h, sample according to the cost of h and feed to the 0/1 classifier
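The rejection sampling step of the importance weighting reduction can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `rejection_sample`, the `(x, y, cost)` tuple layout, and the toy data are assumptions. Each example survives with probability proportional to its cost, and the survivors can be handed to any learner that minimizes plain 0/1 loss.

```python
import random

def rejection_sample(examples, rng):
    """Keep each (x, y, cost) example with probability cost / max_cost.

    The surviving (x, y) pairs carry no costs, so they can be fed to an
    ordinary 0/1-loss learner; costly examples are simply more likely to survive.
    """
    max_cost = max(cost for _, _, cost in examples)
    if max_cost <= 0:
        raise ValueError("at least one example needs a positive cost")
    kept = []
    for x, y, cost in examples:
        if rng.random() < cost / max_cost:
            kept.append((x, y))
    return kept

# Hypothetical data: (features, label, misclassification cost).
examples = [((0.1,), 0, 1.0), ((0.9,), 1, 10.0), ((0.2,), 0, 0.0)]
sample = rejection_sample(examples, random.Random(0))
```

The example with the maximum cost is always kept and a zero-cost example is never kept; everything in between is kept at random.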
  • 6. Page 6 Active Learning
      • Given a pool of examples, determine which ones the classifier is least confident about
      • Ask for those examples to be labeled, and feed them to training
      • Choose query points that shrink the space of classifiers rapidly
      • Exploit natural structure in data
      • [Figure: regions labeled 45%, 45%, 2.5%, 2.5%, 5%]
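One common way to pick the examples a classifier is least confident about is uncertainty sampling, sketched below for a binary classifier. The names `least_confident` and `toy_proba` and the pool values are illustrative assumptions; the slides do not say which selection rule was used.

```python
def least_confident(pool, predict_proba, k):
    """Return the k pool items whose predicted P(y=1) is closest to 0.5."""
    return sorted(pool, key=lambda x: abs(predict_proba(x) - 0.5))[:k]

# Hypothetical 1-D classifier: P(y=1 | x) rises linearly with x, clipped to [0, 1].
def toy_proba(x):
    return min(1.0, max(0.0, x))

pool = [0.05, 0.45, 0.95, 0.6, 0.3]
to_label = least_confident(pool, toy_proba, 2)  # these would be sent for labeling
```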
  • 7. Page 7 Co Training
      • Suppose you have two “views” of the data
        – e.g., web pages have content, and hyperlinks pointing to and from them
        – Suppose the problem is to label a web page as about literature or not (binary classification)
      • One approach:
        – Label web pages manually. Train a classifier to use both content text and hyperlinks as features
        – This requires a large number of labeled pages
      • Other approach:
        – Since we have two views, try to learn two classifiers
        – Each classifier learns on a subset of labeled examples
        – The scores of each classifier are used to label a subset of unlabeled web pages and extend the labels for the other classifier
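The second approach can be sketched as below. This is a toy illustration under stated assumptions: each view is a single number, the per-view classifier is a hypothetical midpoint threshold (`train_threshold`), and each round moves one confidently scored example into the other view's training set. A real setup would use text and link features with a proper learner.

```python
def train_threshold(labeled, view):
    """Toy per-view classifier: threshold at the midpoint of the class means."""
    pos = [x[view] for x, y in labeled if y == 1]
    neg = [x[view] for x, y in labeled if y == 0]
    mid = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: x[view] - mid  # signed score, > 0 means class 1

def co_train(labeled, unlabeled, rounds=3, per_round=1):
    sets = [list(labeled), list(labeled)]  # training set for view 0 and view 1
    pool = list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not pool:
                break
            score = train_threshold(sets[view], view)
            # Most confident predictions of this view extend the OTHER view's labels.
            pool.sort(key=lambda x: abs(score(x)), reverse=True)
            for x in pool[:per_round]:
                sets[1 - view].append((x, int(score(x) > 0)))
            pool = pool[per_round:]
    return train_threshold(sets[0], 0), train_threshold(sets[1], 1)

# Hypothetical data: (view 0 value, view 1 value), with label 1 or 0.
labeled = [((0.0, 0.0), 0), ((1.0, 1.0), 1)]
unlabeled = [(0.9, 0.8), (0.1, 0.2), (0.8, 0.9), (0.2, 0.1)]
clf_a, clf_b = co_train(labeled, unlabeled)
```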
  • 8. Page 8 Sketches
      • Store a “summary” of the dataset
      • Querying the sketch is “almost” as good as querying the dataset
      • Example: frequent items in a stream
        – Initialize an associative array A that holds at most k - 1 keys
        – Process: for each item j in the stream
          - if j is in keys(A), A[j] += 1
          - else if |keys(A)| < k - 1, A[j] = 1
          - else, for each l in keys(A):
            » A[l] -= 1
            » if A[l] = 0, remove l
        – done
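The frequent-items procedure on this slide is the Misra-Gries summary. A direct Python transcription follows; the function name `frequent_items` and the sample stream are illustrative. With at most k - 1 counters, any item that occurs more than n/k times in a stream of length n is guaranteed to remain in the summary, and each stored count undercounts the true count by at most n/k.

```python
def frequent_items(stream, k):
    """Misra-Gries summary using at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a"]
summary = frequent_items(stream, k=3)
```

Here "a" occurs 5 times out of 9, which is more than 9/3, so it survives in the summary with a count of 3.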
  • 9. Page 9 Clustering is not fast enough
      • Sample and then cluster
      • Do clusters need to dynamically adapt?
        – Online clustering
        – Streaming K Means
  • 10. Page 10 K Means
      • Initialize cluster centers somehow
        – random
        – k-means++
      • Alternate
        – Assign each point to the closest cluster center
        – Move each cluster center to the average of the points assigned to it
      • Stop when a convergence criterion is reached
        – Points don’t move “much”
        – Maximum number of iterations reached
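The alternating procedure (Lloyd's algorithm) can be sketched in plain Python as follows. Random initialization is used here; the names `closest` and `k_means`, the seed, and the toy points are assumptions for illustration.

```python
import random

def closest(point, centers):
    """Index of the center nearest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))

def k_means(points, k, max_iters=100, seed=0):
    centers = random.Random(seed).sample(points, k)  # random initialization
    for _ in range(max_iters):
        # Assignment step: each point goes to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest(p, centers)].append(p)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # nothing moved: converged
            break
        centers = new_centers
    return centers

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = k_means(points, k=2)
```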
  • 11. Page 11 Initialize Cluster Centers: pick 3 initial cluster centers (randomly) [Figure: scatter plot with axes X and Y and centers k1, k2, k3]
  • 12. Page 12 Assign each point: assign each point to the closest cluster center [Figure: scatter plot with axes X and Y and centers k1, k2, k3]
  • 13. Page 13 Recompute Cluster Centers: move each cluster center to the mean of its cluster [Figure: scatter plot with axes X and Y showing old and new positions of k1, k2, k3]
  • 14. Page 14 Streaming K Means
      • For each new point
        – Assign it to the closest cluster center
        – Update that cluster center to move incrementally in the direction of the new point
      • Online version of Lloyd’s algorithm
      • Good enough in practice
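A minimal sketch of the per-point update follows. The step size 1/count, which makes each center the running mean of the points assigned to it, is an assumption; the slide only says the center moves incrementally toward the new point. The function name `streaming_update` and the data are illustrative.

```python
def streaming_update(centers, counts, point):
    """Assign point to its closest center and nudge that center toward it."""
    i = min(range(len(centers)),
            key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])))
    counts[i] += 1
    rate = 1.0 / counts[i]  # running-mean step size (assumed, not from the slides)
    centers[i] = tuple(c + rate * (p - c) for p, c in zip(point, centers[i]))
    return i

centers = [(0.0, 0.0), (10.0, 10.0)]
counts = [1, 1]  # each initial center counts as one observed point
for point in [(0.0, 2.0), (10.0, 12.0), (0.0, 4.0)]:
    streaming_update(centers, counts, point)
```

A constant or decaying rate could be used instead to make the centers forget old data faster.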
  • 15. Page 15 Recompute Cluster Centers: move each cluster center to the mean of its cluster [Figure: scatter plot with axes X and Y and centers k1, k2, k3]
  • 16. Page 16 Online Clustering (Liberty, Sriharsha, Sviridenko)
      • Initialization Phase:
        – First point is its own cluster
        – Pick some normalization factor f
      • Update Phase for point p:
        – Let d = distance from p to the closest center so far
        – With probability min(d/f, 1), form a new cluster center at p
        – Otherwise, attach p to the closest center
      • Merge Phase:
        – Once sufficient clusters have opened up, or sufficient cost has accumulated, merge clusters
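A minimal sketch of the initialization and update phases, leaving out the merge phase. It assumes that a new center opens with probability min(d/f, 1), so far-away points are more likely to start a cluster, and that d is the squared Euclidean distance; both are assumptions of this sketch, as are the function name `online_cluster` and the toy data.

```python
import random

def online_cluster(points, f, rng):
    """Initialization and update phases only; f is the normalization factor."""
    centers = [points[0]]  # the first point is its own cluster
    labels = [0]
    for p in points[1:]:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        nearest = min(range(len(centers)), key=dists.__getitem__)
        d = dists[nearest]  # squared distance to the closest center (assumed)
        if rng.random() < min(d / f, 1.0):
            centers.append(p)  # open a new cluster center at p
            labels.append(len(centers) - 1)
        else:
            labels.append(nearest)  # attach p to the closest center
    return centers, labels

points = [(0.0, 0.0), (0.0, 0.0), (100.0, 100.0), (100.0, 100.0)]
centers, labels = online_cluster(points, f=1.0, rng=random.Random(0))
```

A point at distance zero never opens a cluster and a point with d at least f always does, so this toy input gives the same result for any random seed.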
  • 17. Page 17 Properties
      • Provably close to optimal in the online setting
      • Opens no more than O(log(OPT)) clusters and pays O(OPT) cost
      • Very efficient to implement
      • Adaptive algorithm
      • Forgetfulness can be introduced in the merge process
      • Leaving out the merge process still produces a clustering that might be indicative of structure, i.e., useful as a machine learning feature
  • 18. Page 18 My classifier is not fast enough
      • Even for batch problems, online learning might be good enough!
      • For real-time problems, online learning or incremental learning is needed
  • 19. Page 19 What is online learning?
      • Batch Learning:
        – Classifier sees a set of labeled examples and trains a model
        – Uses the trained model to predict on unseen examples
      • Online Learning:
        – Classifier sees one example at a time
        – Limited look-back window (often 0)
        – Predicts on the example and is then shown the cost
        – Learns from its mistakes
        – Yields a one-pass batch learning algorithm: simply run the online algorithm on each example in the batch
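The predict, observe, and update loop can be illustrated with a perceptron, a classic online learner. This is one possible instance rather than the algorithm used by the authors; the function name `online_perceptron` and the toy stream are assumptions.

```python
def online_perceptron(stream, dim):
    """One pass over (x, y) pairs with y in {-1, +1}; returns weights and mistake count."""
    w = [0.0] * dim
    mistakes = 0
    for x, y in stream:
        # Predict first, using only what has been learned so far.
        score = sum(wi * xi for wi, xi in zip(w, x))
        prediction = 1 if score > 0 else -1
        # The true label is revealed; learn only when the prediction was wrong.
        if prediction != y:
            mistakes += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w, mistakes

stream = [((1.0, 1.0), 1), ((-1.0, -1.0), -1), ((2.0, 1.0), 1), ((-1.0, -2.0), -1)]
w, mistakes = online_perceptron(stream, dim=2)
```

Running the same loop once over a stored dataset gives the one-pass batch learner mentioned on the slide.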
  • 20. Page 20 Challenges of online learning
      • Normalization
        – In the batch setting, data can be normalized by making a pass over the full dataset
        – In the online setting, a second pass is not possible
        – Solution: adaptive normalization
      • Late-arriving features
        – In the batch setting, all features are recorded in the dataset
        – In the online setting, different features may arrive at different times
        – Solution: Adagrad (adaptive gradient technique)
      • Stochastic Gradient Descent convergence can be slow
        – More data helps
        – Adaptive normalization improves convergence
        – Adagrad improves convergence and reduces sensitivity to step size
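A minimal sketch of an Adagrad update shows why it helps with late-arriving features: each feature keeps its own sum of squared gradients, so a feature that first appears late still takes a large step. Squared loss, the sparse dictionary representation, and the names `adagrad_step`, `eta`, and `eps` are assumptions made for illustration.

```python
import math
from collections import defaultdict

def adagrad_step(w, g2, x, y, eta=0.5, eps=1e-8):
    """One Adagrad update for squared loss on a sparse example x (feature -> value)."""
    pred = sum(w[f] * v for f, v in x.items())
    err = pred - y
    for f, v in x.items():
        grad = err * v
        g2[f] += grad * grad  # per-feature sum of squared gradients
        w[f] -= eta * grad / (math.sqrt(g2[f]) + eps)  # per-feature step size
    return err

w, g2 = defaultdict(float), defaultdict(float)
adagrad_step(w, g2, {"bias": 1.0}, 1.0)
adagrad_step(w, g2, {"bias": 1.0, "late_feature": 1.0}, 1.0)
```

In the second step both features see the same gradient, but "late_feature" moves by about 0.5 while "bias" moves by about 0.22, because "bias" has already accumulated gradient history.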
  • 21. Page 21 Visualization
      • Speed up feature discovery
      • Intuitive visualization of model performance
      • Improve debuggability
  • 22. Page 22 The Data Science Workflow…
      [Workflow diagram running from “Start here” to “End here”, with steps annotated Script, Visualize, Script]
      • Plan: What is the question I'm answering? What data will I need?
      • Acquire the data
      • Clean Data: analyze data quality, reformat, impute, etc.
      • Analyze data, Visualize
      • Create model
      • Evaluate results
      • Create features
      • Create report
      • Deploy in Production
      • Publish & Share
  • 23. Page 23 Introducing Apache Zeppelin
      • Web-based notebook for interactive analytics
      • Use cases: data exploration and discovery, visualization
      • Interactive snippet-at-a-time experience
      • “Modern Data Science Studio”
  • 24. Page 24 Zeppelin today in Data Science Workflow…
      [Workflow diagram running from “Start here” to “End here”, with steps annotated Script, Visualize, Script]
      • Plan: What is the question I'm answering? What data will I need?
      • Acquire the data
      • Clean Data: analyze data quality, reformat, impute, etc.
      • Analyze data, Visualize
      • Create model
      • Evaluate results
      • Create features
      • Create report
      • Deploy in Production
      • Publish & Share
  • 25. Page 25 Zeppelin – Road Ahead (themes: Enterprise Ready, Ease of Use)
      • Operations
        - Deploy to the cluster with Ambari
      • Security
        - Authentication against LDAP
        - SSL
        - Run in Kerberized cluster
        - Authorization of notebooks
      • Sharing/Collaboration
        - Share selected notebooks with selected users/groups
        - Ability to read/publish notebooks to GitHub
      • Data Import
        - Visual data import/download
        - Clean data as it comes
      • Usability
        - Summary data: see column summary
        - Keyboard shortcuts, autocomplete, syntax highlighting, line numbers
      • Visualization
        - Pluggable visualization and more charts, maps, and tables
      • R support
        - Harden SparkR interpreter
  • 26. Page 26 Upcoming Work
      • Entity Resolution package GA
        – Supports Entity Graph based resolution
        – Includes Random Walk algorithm for computing similarity score
      • Online learning and clustering Spark Packages
      • Contribute more Reduction algorithms to Spark ML
        – Cost Sensitive Classification
        – Filter tree based Multiclass Reduction
      • Zeppelin GA
  • 27. Page 27 Thank you! • Ram Sriharsha @halfabrane • Vinay Shukla @neomythos