Using Machine Learning and R
Finding Order in the Chaos
Harshad Saykhedkar
Source of text and applications
● Emails: spam detection
● Product descriptions / reviews: sentiment analysis, recommendation
● Blogs / informational content: content recommendations
● Web pages / news articles: topic identification, trending topics
● Tweets / comments / social content: sentiment analysis, named entity recognition
(Text mining) is a wonderful
world. Let's go exploring...!
Itinerary
● R you ready?
● Prep camp
● The wandering traveller
● The seeker
R you ready?
Packing our bags: Checks
● Starting R
● Loading required packages
● Check sessionInfo()
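The checks above can be sketched roughly as follows. The slides do not say which packages the talk needs, so the `required` vector is a hypothetical placeholder that lists only packages shipped with R.

```r
# Hypothetical package list: the slides do not name the packages used in the talk.
# Only packages that ship with R are listed here so the check always passes.
required <- c("stats", "utils")

for (pkg in required) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("Package not installed: ", pkg)
  }
  library(pkg, character.only = TRUE)
}

# sessionInfo() reports the R version, platform, and attached packages.
info <- sessionInfo()
r_version <- info$R.version$version.string
print(r_version)
print(info$basePkgs)
```

Swapping the placeholder names for the text mining packages you actually plan to use gives a quick pre-flight check before a live session.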
Packing our bags: Datatypes
● Atomic vectors
● Lists
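A small illustration of the two datatypes, using made-up values (the variable names and review text are only examples, not taken from the talk's live demo):

```r
# Atomic vectors hold values of a single type.
ratings <- c(4.5, 3, 5)                  # double
tokens  <- c("great", "location")        # character
flags   <- c(TRUE, FALSE)                # logical

# Mixing types in c() coerces everything to the most general type.
mixed <- c(1, "a", TRUE)                 # becomes a character vector

# Lists can hold elements of different types and lengths.
review <- list(id = 1L, text = "Great location, fabulous staff", tokens = tokens)

str(review)
```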
"Let's try our hands"
Packing our bags: Functions
● Expressions which are evaluated
● Can be passed around
● Definitions can be nested
Details not covered: argument matching, call by value, environments and lexical scoping, promises, etc.
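The three points about functions can be seen in a few lines. This is an illustrative sketch; the function names (`count_matches`, `apply_to_docs`, `make_counter`) are invented for the example.

```r
# A function body is an expression; its value is the last expression evaluated.
count_matches <- function(tokens, vocab) {
  sum(tokens %in% vocab)
}

# Functions are values, so they can be passed to other functions.
apply_to_docs <- function(docs, f, ...) {
  vapply(docs, f, integer(1), ...)
}

# Definitions can be nested: the inner function remembers `vocab`.
make_counter <- function(vocab) {
  function(tokens) count_matches(tokens, vocab)
}

docs <- list(c("great", "staff"), c("bad", "service"), c("great", "great"))
count_great <- make_counter("great")
counts <- apply_to_docs(docs, count_great)
print(counts)
```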
Prep Camp
Prep camp: Sentiment Analysis
● Bag of words model
● Simple aggregated score
'terrible service & disorganised'
'OK - some good some bad'
'Great location, fabulous staff'
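A minimal sketch of the bag-of-words score applied to the three example reviews. The `positive` and `negative` word lists are a tiny hypothetical lexicon chosen for this illustration; the slides do not say which lexicon the talk used.

```r
# Hypothetical mini-lexicon, for illustration only.
positive <- c("good", "great", "fabulous")
negative <- c("bad", "terrible", "disorganised")

# Bag of words: lower-case the text, split on non-letters, ignore word order.
tokenize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens[nzchar(tokens)]
}

# Simple aggregated score: positive word count minus negative word count.
score_sentiment <- function(text) {
  tokens <- tokenize(text)
  sum(tokens %in% positive) - sum(tokens %in% negative)
}

reviews <- c(
  "terrible service & disorganised",
  "OK - some good some bad",
  "Great location, fabulous staff"
)
scores <- vapply(reviews, score_sentiment, integer(1), USE.NAMES = FALSE)
print(data.frame(review = reviews, score = scores))
```

Every matching word counts equally here, which is the equal-weightage limitation that the next slide raises.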
Prep camp: Improvements
● Part of speech ambiguity
● Further exploration?
● Equal weightage model
● Double negations?
The Wandering Traveller
Wandering traveller: Unsupervised Learning
● Entity as point in space
● Can define distance
[Figure: entities shown as points on axes Feature 1 and Feature 2]
How to derive this model for text?
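As a rough illustration of "entity as point in space", here are three made-up entities with two features each and the Euclidean distances between them. The coordinates are invented for the example.

```r
# Three hypothetical entities described by two numeric features.
points <- rbind(
  a = c(feature1 = 1, feature2 = 2),
  b = c(feature1 = 4, feature2 = 6),
  c = c(feature1 = 1, feature2 = 3)
)

# dist() computes pairwise distances between the rows.
d <- as.matrix(dist(points, method = "euclidean"))
print(d)

# The entity closest to "a" is the one with the smallest distance.
nearest_to_a <- names(which.min(d["a", c("b", "c")]))
print(nearest_to_a)
```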
Wandering traveller: Vector Space Model
● Dimensions: word, phrase, theme
● Points: comments, blogs, tweets
Wandering traveller: TfIdf and other details
"But how to measure the importance of a word for a doc?"
● Binary: Is the 'word' in the 'doc'?
● Tf: How many times does the 'word' occur in the 'doc'?
● TfIdf: Penalize the obvious!
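The three weightings can be compared on a tiny count matrix. The slides do not give a formula, so this sketch assumes one common variant, tf × log(N / df), where N is the number of documents and df is the number of documents containing the word. The counts are invented.

```r
# Hypothetical term counts: rows are documents, columns are words.
tf <- rbind(
  d1 = c(the = 2, camera = 1, washer = 0),
  d2 = c(the = 1, camera = 0, washer = 3),
  d3 = c(the = 1, camera = 0, washer = 0)
)

# Binary: is the word in the doc?
binary <- (tf > 0) * 1

# Document frequency and inverse document frequency (assumed variant).
df  <- colSums(tf > 0)
idf <- log(nrow(tf) / df)

# TfIdf: scale each column of counts by that word's idf.
tfidf <- sweep(tf, 2, idf, `*`)
print(round(tfidf, 3))
```

The word "the" occurs in every document, so its idf is log(1) = 0 and its weight vanishes, which is the sense in which TfIdf penalizes the obvious.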
Wandering traveller: Hierarchical Clustering
● Define distance measure
● Keep merging based on similarity
[Figure: clustering example with the items Washing Machine, Washer, Dryer, Camera]
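A minimal sketch of the two steps with `dist()` and `hclust()` from base R. The feature values for the four items are invented, and Euclidean distance with average linkage is an assumption; the slides do not say which distance or linkage the talk used.

```r
# Hypothetical feature weights for four product names.
items <- c("washing machine", "washer", "dryer", "camera")
features <- c("laundry", "wash", "dry", "photo")
x <- matrix(
  c(3, 2, 0, 0,
    3, 3, 0, 0,
    3, 0, 3, 0,
    0, 0, 0, 4),
  nrow = 4, byrow = TRUE, dimnames = list(items, features)
)

# Step 1: define a distance measure.
d <- dist(x, method = "euclidean")

# Step 2: repeatedly merge the two closest clusters.
tree <- hclust(d, method = "average")

# Cut the tree into two groups and inspect the assignment.
groups <- cutree(tree, k = 2)
print(groups)
```

Calling `plot(tree)` draws the dendrogram, which shows the order of the merges.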
Wandering traveller: Improvements
● Stemming, lemmatization: "Cameras" vs "Camera"
● Latent semantic analysis: "Phone", "Touch Screen"
The Seeker
Seeker: Supervised Learning
● Labels given with features
● Find rule, classify unobserved case
[Figure: axes Feature 1 and Feature 2]
Seeker: Naive Bayes Classifier
● Independence of features
● Train the model on training set
● Test accuracy on a holdout sample

            Predicted 0   Predicted 1
Actual 0    F(0, 0)       F(0, 1)
Actual 1    F(1, 0)       F(1, 1)
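The train, holdout, and confusion-table steps can be sketched with a small hand-written multinomial Naive Bayes in base R. This is an illustrative implementation, not the code from the talk: the word counts, the labels (1 = spam, 0 = not spam), and the helper names `train_nb` and `predict_nb` are all invented, and Laplace smoothing is an added assumption.

```r
# Multinomial Naive Bayes with Laplace smoothing (alpha), in base R.
train_nb <- function(x, y, alpha = 1) {
  classes <- sort(unique(y))
  # Prior: share of training documents in each class.
  prior <- vapply(classes, function(cl) mean(y == cl), numeric(1))
  # One row per class: smoothed log P(word | class).
  loglik <- t(vapply(classes, function(cl) {
    counts <- colSums(x[y == cl, , drop = FALSE]) + alpha
    log(counts / sum(counts))
  }, numeric(ncol(x))))
  list(classes = classes, logprior = log(prior), loglik = loglik)
}

predict_nb <- function(model, x) {
  # Independence of features: per-word log likelihoods simply add up.
  scores <- x %*% t(model$loglik)
  scores <- sweep(scores, 2, model$logprior, `+`)
  model$classes[max.col(scores, ties.method = "first")]
}

words <- c("free", "win", "meeting", "report")
x_train <- matrix(
  c(2, 1, 0, 0,  1, 2, 0, 0,  1, 1, 0, 1,
    0, 0, 2, 1,  0, 0, 1, 2,  1, 0, 1, 1),
  ncol = 4, byrow = TRUE, dimnames = list(NULL, words)
)
y_train <- c(1, 1, 1, 0, 0, 0)

# Holdout sample, not seen during training.
x_test <- matrix(
  c(1, 1, 0, 0,  0, 0, 1, 1,  2, 0, 0, 1,  0, 1, 2, 0),
  ncol = 4, byrow = TRUE, dimnames = list(NULL, words)
)
y_test <- c(1, 0, 1, 0)

model <- train_nb(x_train, y_train)
pred <- predict_nb(model, x_test)

# Confusion table in the same layout as the slide.
confusion <- table(
  Actual = factor(y_test, levels = c(0, 1)),
  Predicted = factor(pred, levels = c(0, 1))
)
accuracy <- mean(pred == y_test)
print(confusion)
print(accuracy)
```

The diagonal cells F(0, 0) and F(1, 1) count correct predictions, so accuracy is their sum divided by the holdout size.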
Learnings
● How to clean up and preprocess data in text form?
● How to model the data?
● How to cluster the data?
● How to classify the data?
Source of text and applications
● Emails: spam detection
● Product descriptions / reviews: sentiment analysis, recommendation
● Blogs / informational content: content recommendations
● Web pages / news articles: topic identification, trending topics
● Tweets / comments / social content: sentiment analysis, named entity recognition
Questions?
About me
"Avid R learner, trying to apply a bunch of these techniques to the digital ads world"
Contact: harshad.saykhedkar@sokrati.com

Machine learning applications on text data
