Tag Extraction
George McBay, Naoki Nakatani
San Jose State University
CS185C Spring 2014
Agenda
Problem Description
ETL
Data Analysis
Machine Learning
Optimization
Feature Engineering
Title vs Body
Stop Words
Multi-label Classification
Apache Spark
Problem
Given a question with a title and a body, can we
automatically generate tags for it?
Where can I find the
LaTeX3 manual?
Few month ago I saw a big pdf-manual of all
LaTeX3-packages and the new syntax. I think
it was bigger than 300 pages. I can't find it on
the web.
Does anyone have a link?
Documentation
latex3
expl3
Dataset
File :
● Train.csv
● Test.csv
Fields :
● id, title, body, tags (Train)
● id, title, body (Test)
Characteristics :
● Quoted csv
● Body contains \n
● Tags separated by space
● Entry delimited by \0
[Figure: layout of the quoted CSV file; each record has four quoted fields, and a quoted body can continue onto the next line]
Working Environment
● Mac OS 10.9.1
● Apache Hadoop 1.2.1
● Apache Mahout 0.8
● Apache Spark 0.9
ETL
Extract : Assume data is extracted from website
Transform : Use OpenCSV
1. Remove whitespaces (' ', '\n', '\t')
2. Combine fields with '\t'
3. Write to tsv file
Load : Upload to HDFS
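The Transform step can be sketched as follows. The slides used OpenCSV (Java); this sketch swaps in Python's standard csv module, and it assumes that "remove whitespaces" means collapsing each run of whitespace inside a field into a single space. The names clean_field and csv_to_tsv are hypothetical.

```python
import csv
import io
import re

def clean_field(text):
    # Assumption: collapse every run of whitespace (' ', '\n', '\t') into one space
    return re.sub(r"\s+", " ", text).strip()

def csv_to_tsv(src, dst):
    # csv.reader handles quoted fields, including line breaks inside a body
    for row in csv.reader(src):
        dst.write("\t".join(clean_field(field) for field in row) + "\n")

# Made-up record in the Train.csv field order: id, title, body, tags
sample = io.StringIO(
    '"1","A title","<p>line one\nline two</p>","php sql"\r\n', newline=""
)
out = io.StringIO()
csv_to_tsv(sample, out)
print(out.getvalue())
```

After this step each question sits on exactly one line, which is what line-oriented map-reduce input needs.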
Data Analysis
Tag Occurrence Count
TSV File
Map-Reduce
• Input : <index, question>
• Mapper output : <tag, 1> for each tag
• Reducer output : <tag, count> for each tag
7785 c#
6788 java
6575 php
6135 javascript
5317 android
4949 jquery
3278 c++
3082 python
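A minimal sketch of the tag-count job in plain Python, not the Hadoop code from the project. It assumes each TSV line holds id, title, body, tags in that order, with tags separated by spaces; run_job is a hypothetical stand-in for the shuffle phase.

```python
from collections import defaultdict

def mapper(index, question):
    # question is one TSV line: id, title, body, tags (assumed column order)
    tags = question.rstrip("\n").split("\t")[3].split()
    for tag in tags:
        yield tag, 1

def reducer(tag, counts):
    return tag, sum(counts)

def run_job(lines):
    grouped = defaultdict(list)  # stands in for the shuffle between map and reduce
    for index, line in enumerate(lines):
        for tag, one in mapper(index, line):
            grouped[tag].append(one)
    return dict(reducer(tag, counts) for tag, counts in grouped.items())

# Made-up input lines
lines = ["1\tt1\tb1\tjava android", "2\tt2\tb2\tjava", "3\tt3\tb3\tphp"]
tag_counts = run_job(lines)
print(tag_counts)
```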
Question Filtering for ML
TSV File
Map-Reduce
• Input : <index, question>
• Mapper output : <index, question> if the question has a top-5 tag
• Reducer output : <index, question>
TSV File with the questions that have one of the top 5 tags
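The filtering job can be sketched in the same style. The top-5 set is taken from the tag counts listed earlier; the TSV column order (id, title, body, tags) and the function names are assumptions.

```python
# Five most frequent tags according to the tag occurrence counts
TOP5 = {"c#", "java", "php", "javascript", "android"}

def filter_mapper(index, question):
    # Keep the question only if at least one of its tags is a top-5 tag
    tags = question.rstrip("\n").split("\t")[3].split()
    if TOP5.intersection(tags):
        yield index, question

def identity_reducer(index, questions):
    # Nothing to aggregate: pass every surviving question through
    for question in questions:
        yield index, question

# Made-up input lines
lines = ["1\tt1\tb1\tjava spring", "2\tt2\tb2\tlatex3 expl3"]
kept = [pair for i, line in enumerate(lines) for pair in filter_mapper(i, line)]
print(kept)
```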
Machine Learning
● Problem
○ Can we classify questions into one of 5 categories (tags)?
Classification
● Naive Bayes Classifier
● Detail in Mahout Classification Presentation
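For readers without the Mahout presentation at hand, here is a rough sketch of a multinomial Naive Bayes classifier. It is not Mahout's implementation: it assumes raw word counts and add-one smoothing, and the training data is made up.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: iterable of (label, list_of_words)."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, words in examples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def predict(model, words):
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        # log prior + sum of log likelihoods (add-one smoothing is an assumption)
        score = math.log(label_counts[label] / total_docs)
        for w in words:
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train([
    ("php", ["echo", "array", "php"]),
    ("android", ["activity", "intent", "android"]),
])
print(predict(model, ["echo", "php"]))
```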
Machine Learning
Correctly Classified Instances : 10209 81.8816%
Incorrectly Classified Instances : 2259 18.1184%
Total Classified Instances : 12468
Title vs Body
Intuitively…
Title is a short summary describing the body of the question
⇒ Title must be more important than body!
How to put more emphasis on the title?
● Build separate models for title & body, with more weight on the
title model?
● Prepend the title several times and feed the result into the regular model?
Two models approach
Title model not accurate…
● Too short for model to
distinguish labels
● Longer text wins!
Repeated title approach
Slight improvement!
● Testing against train-set: ~ 93% ⇒ ~ 95%
● Testing against test-set: ~ 80% ⇒ ~ 82%
Repeating the title gives
● more stop words ⇒ No effect
● more keywords (if the title has any)
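The repeated-title approach amounts to a one-line text transformation before training. A minimal sketch; the repeat count of 3 is an assumption, since the slides do not say how many copies were used.

```python
from collections import Counter

def build_text(title, body, title_repeats=3):
    # Prepend the title several times so its words count more in the model
    return " ".join([title] * title_repeats + [body])

# Made-up question
text = build_text("latex3 manual", "where is the manual", title_repeats=2)
print(text)
print(Counter(text.split())["manual"])
```

Words that appear only in the body keep a count of one, while every title word is counted once per copy.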
Diving into model
● Top 10 words from each category
● Popular (redundant) words
showing up in all categories (I, it,
code, etc)
BUT
● Some words specific to each
category (activity for android,
jquery for javascript, echo for php)
Which words to drop?
Word count against TrainSmall.tsv?
● Total count : 19276034
Top 5:
● p - 827029
● the - 545950
● i - 476056
● to - 393027
● a - 362328
Problem
● Key words have high count too
○ 39th - http - 51412wc
○ 63rd - java - 35076wc
○ 91st - php - 25135wc
Can’t even throw away first 100
words...
Which words to drop?
Word count against ordinary english text?
● 20 books from gutenberg.org
● Total count : 1041565
● A lot less technical! (only 4wc for java,
probably the island in Indonesia?)
● Safe to throw away 1959 words (> 50wc)
BUT
Not much improvement...
● Due to tf-idf measurement
○ Less weight for words appearing in many documents
○ More weight for words appearing only in specific
documents
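Building the stop-word list from ordinary English text can be sketched as below. The threshold of 50 occurrences comes from the slide; the tokenizer, the function names and the tiny reference corpus are assumptions made for illustration.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9#+]+")  # assumed tokenizer; keeps tags like c# and c++

def word_counts(texts):
    counts = Counter()
    for text in texts:
        counts.update(TOKEN.findall(text.lower()))
    return counts

def stop_words(reference_counts, min_count=50):
    # Words seen more than min_count times in non-technical text are dropped
    return {w for w, c in reference_counts.items() if c > min_count}

def remove_stop_words(text, stops):
    return [w for w in TOKEN.findall(text.lower()) if w not in stops]

# Made-up stand-in for the 20 Gutenberg books
reference = ["the cat and the dog"] * 30 + ["java is an island"]
stops = stop_words(word_counts(reference))
print(stops)
print(remove_stop_words("The Java class", stops))
```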
Any room for improvement?
What is the source of error?
● android ⇔ java ==> both java
● javascript ⇔ php ===> both web-related
● java classified as c# ===> many questions have both tags
Any room for improvement?
No problem if we can give multiple labels
to one question!
Multi-label classification
● Modification from previous classification task
○ Top5 tags ⇒ Top1000 tags
○ 1 tag for 1 question ⇒ 5 tags for 1 question
(Pick 5 most probable tags)
○ 1 question learned only once ⇒ 1 question with
multiple tags learned multiple times
[Figure: (tag1, body) and (tag2, body) are fed to the model as separate training examples]
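The two changes to the task can be sketched as small helpers: one training example per tag, and the 5 highest-scoring tags at prediction time. The function names and the scores are hypothetical; in the project the scores would come from the classifier.

```python
import heapq

def expand(tags, text):
    # A question with several tags is learned once per tag
    return [(tag, text) for tag in tags]

def top_k_tags(scores, k=5):
    # scores: {tag: score}; keep the k most probable tags
    return heapq.nlargest(k, scores, key=scores.get)

examples = expand(["iphone", "ios"], "upgrade iphone 3gs")
print(examples)
print(top_k_tags({"ios": 0.5, "php": 0.1, "iphone": 0.3}, k=2))
```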
Good outcome (Example 1)
TITLE: Upgrade iPhone 3GS from iOS 4.2.1 to 4.3.x
BODY: <p>Hi folks I have a iPhone 3GS at 4.2.1 and want to upgrade it to 4.3.x
for testing. I have read some articles about it but it seems that those are too old
and cannot work. </p> <p>Does anyone have some experience in doing this or
does apple provide tutorials for developers in this?</p> <p>A lot of thanks.</p>
Actual tags
● iphone
● ios
● upgrade
Predicted tags
● iphone
● ios
● osx
● objective-c
● php
GREAT outcome (Example 2)
TITLE: Is it possible to display an image in text field in html?
BODY: <p>Can we display image inside a text field in <code>html</code>?
</p> <p><strong>Edit</strong></p> <p>What I want to do is to have an
<code>editable</code> area, and want to add <code>html</code> objects
inside it(i.e. button, image ..etc)</p>
Actual tags
● javascript
● jquery
● html
● css
● web
Predicted tags
● javascript
● jquery ===> Never appears in text!
● html
● c#
● php
Stats
Row : # actual tags assigned to one question
Col : # predicted tags which are also in actual tag set
[Ex] Out of the 32798 questions that have 2 tags:
● For 14541 questions, the model suggested both actual tags.
● For 13922 questions, the model suggested 1 of the 2 actual tags.
● For 4335 questions, the model suggested neither actual tag.
How to evaluate
Generous evaluator
If model gets at least 1 correct, approve it!
Total accuracy = 83.55% (B)
How to evaluate
Strict evaluator
Never approve unless model gets all correct!
Total accuracy = 43.04% (F)
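The two evaluators can be written as set operations. A minimal sketch, assuming "all correct" means every actual tag appears among the predicted tags; the two test questions are Example 1 and Example 2 from the earlier slides.

```python
def generous(actual, predicted):
    # Approve if at least one predicted tag is an actual tag
    return bool(set(actual) & set(predicted))

def strict(actual, predicted):
    # Approve only if every actual tag was predicted (assumed reading of "all correct")
    return set(actual) <= set(predicted)

def accuracy(evaluator, pairs):
    return sum(evaluator(a, p) for a, p in pairs) / len(pairs)

pairs = [
    (["iphone", "ios", "upgrade"],
     ["iphone", "ios", "osx", "objective-c", "php"]),
    (["javascript", "jquery", "html", "css", "web"],
     ["javascript", "jquery", "html", "c#", "php"]),
]
print(accuracy(generous, pairs), accuracy(strict, pairs))
```

Both examples pass the generous evaluator and fail the strict one, which illustrates why the two totals differ so much.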
Conclusion for performance
● Overall, good!
○ Predicted tag set is relatively close to the actual tag
set (Apple-related, Web-related)
● but, not there yet...
○ Almost impossible to distinguish versions (c#-3.0,
c#-4.0 ⇒ c#), or sub-tags (facebook-graph-api,
facebook-like ⇒ facebook)
○ Still showing unrelated tags (php, python
everywhere!)
Spark
Advantages:
- Easy to get started with
- Interactive shell
- Less code to write
Spark
Disadvantages:
- Not many references for MLlib
- Still new
Spark
● Used PySpark, the Python interface to Spark
● Implemented the ML model from the ground up
using Python dictionaries and map-reduce
procedures
How It Works
5 basic procedures used:
● map
● flatMap
● reduce
● reduceByKey
● collectAsMap
How It Works
LINE
key_val = line.flatMap(~).map(~)
⇒ (a, 1) (b, 1) (c, 1) (d, 1) (a, 1) (d, 1)
key_val = key_val.reduceByKey(~)
⇒ (a, 2) (b, 1) (c, 1) (d, 2)
How It Works
dict = key_val.collectAsMap()
(a, 2) (b, 1) (c, 1) (d, 2) ⇒ {a : 2, b : 1, c : 1, d : 2}
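To make the data flow concrete without a Spark installation, the same pipeline can be imitated in plain Python. This is an emulation for illustration only, not PySpark code; reduce_by_key is a hypothetical helper, and the two input lines are chosen so that the result matches the dictionary shown above.

```python
from itertools import chain

lines = ["a b c d", "a d"]  # made-up input lines

# flatMap: every line yields several words, flattened into one sequence
words = list(chain.from_iterable(line.split() for line in lines))

# map: turn every word into a (key, value) pair
key_val = [(w, 1) for w in words]

def reduce_by_key(pairs, func):
    # reduceByKey: merge all values that share a key with func
    out = {}
    for k, v in pairs:
        out[k] = func(out[k], v) if k in out else v
    return list(out.items())

reduced = reduce_by_key(key_val, lambda x, y: x + y)

# collectAsMap: bring the pairs back to the driver as a dictionary
result = dict(reduced)
print(result)
```

In PySpark the corresponding calls are flatMap, map, reduceByKey and collectAsMap on an RDD.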
How It Works
Model:
- statistical model
- matrix of weights
- uses tf-idf
How It Works
[Figure: matrix of weights; one axis: Tags, other axis: Words from document, cell values: Relevance]
How It Works
Implemented as → { tag : { word : weight } }
How It Works
● The most relevant tag is chosen by the sum of the
weights associated with the words contained in the
document
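With the nested-dictionary layout { tag : { word : weight } }, scoring a document is a sum of lookups. A minimal sketch; the weights are made up, and counting a repeated word once per occurrence is an assumption.

```python
def score_tags(model, words):
    # model: {tag: {word: weight}}; unknown words contribute nothing
    return {
        tag: sum(weights.get(w, 0.0) for w in words)
        for tag, weights in model.items()
    }

def most_relevant(model, words):
    scores = score_tags(model, words)
    return max(scores, key=scores.get)

# Made-up weights
model = {
    "python": {"def": 0.5, "import": 0.25},
    "math": {"vector": 0.5, "calculate": 0.25},
}
doc = ["import", "def", "x"]
print(score_tags(model, doc))
print(most_relevant(model, doc))
```

Sorting the scores instead of taking the maximum gives the top 5 tags for a question.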
How It Works
Now, how are the weights calculated?
● First, calculate idf (inverse document
frequency) for each word
● Next, calculate tf (term frequency) associated
with each tag
● Multiply each entry by idf, then normalize
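The last step can be sketched for a single tag as follows. The slides do not say which normalization was used; dividing by the sum of the row so that the weights add up to 1 is an assumption, as are the function name and the numbers.

```python
def weights_for_tag(tf_row, idf):
    # tf_row: {word: tf for this tag}; idf: {word: idf}
    raw = {w: count * idf.get(w, 0.0) for w, count in tf_row.items()}
    total = sum(raw.values())
    # Assumed normalization: scale the row so that its weights sum to 1
    return {w: v / total for w, v in raw.items()} if total else raw

# Made-up numbers: "the" is frequent but has idf 0, so it gets no weight
row = weights_for_tag({"echo": 2, "the": 10}, {"echo": 1.5, "the": 0.0})
print(row)
```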
How It Works
idf for a word
defined by:
idf(word) = log(D/F(word))
where,
D = total number of documents in the training set
F(word) = number of documents that contain word
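The idf definition translates directly into a few lines of Python. The natural logarithm is an assumption, since the slide does not give the base, and the documents are made up.

```python
import math
from collections import Counter

def idf(docs):
    # docs: list of word lists; D = number of documents
    D = len(docs)
    F = Counter()
    for doc in docs:
        F.update(set(doc))  # count each word at most once per document
    return {word: math.log(D / f) for word, f in F.items()}

docs = [["the", "java"], ["the", "php"], ["the", "java", "class"], ["the"]]
scores = idf(docs)
print(scores)
```

A word that occurs in every document, such as "the" here, gets idf 0 and drops out of the model on its own.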
How It Works
Two ways to calculate tf:
1) number of times you see the term
associated with a tag
2) number of documents in which you see the
term associated with a tag (in other words,
count it only once per doc)
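The two tf definitions differ only in whether repeated words inside one document are counted again. A minimal sketch with made-up questions; the function names are hypothetical.

```python
from collections import Counter, defaultdict

def tf_by_occurrence(examples):
    # 1) every occurrence of a word counts toward each tag of the question
    tf = defaultdict(Counter)
    for tags, words in examples:
        for tag in tags:
            tf[tag].update(words)
    return tf

def tf_by_document(examples):
    # 2) a word counts at most once per document
    tf = defaultdict(Counter)
    for tags, words in examples:
        for tag in tags:
            tf[tag].update(set(words))
    return tf

examples = [
    (["php"], ["echo", "echo", "array"]),
    (["php", "html"], ["echo", "div"]),
]
print(tf_by_occurrence(examples)["php"])
print(tf_by_document(examples)["php"])
```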
Results
TITLE: Upgrade iPhone 3GS from iOS 4.2.1 to 4.3.x
BODY: <p>Hi folks I have a iPhone 3GS at 4.2.1 and want to upgrade it to 4.3.x
for testing. I have read some articles about it but it seems that those are too old
and cannot work. </p> <p>Does anyone have some experience in doing this or
does apple provide tutorials for developers in this?</p> <p>A lot of thanks.</p>
Actual tags
● iphone
● ios
● upgrade
Predicted tags
● ios4.3
● iphone-3gs
● cocoa-touch
● ios4
● upgrade
Results
TITLE: Is it possible to display an image in text field in html?
BODY: <p>Can we display image inside a text field in <code>html</code>?
</p> <p><strong>Edit</strong></p> <p>What I want to do is to have an
<code>editable</code> area, and want to add <code>html</code> objects
inside it(i.e. button, image ..etc)</p>
Actual tags
● javascript
● jquery
● html
● css
● web
Predicted tags
● html
● img
● alignment
● get
● web
Results
[Figure: predicted tags (top) compared with actual tags (below)]
Results
● Not perfect
● But very close
● Relevant words for tags look right
Results
most relevant words for tag “python”:
[u'python', u'def', u'import', u'print', u'module', u'file', u'self', … ]
most relevant words for tag “math”:
[u'vector', u'x', u'math', u'calculate', u'number', u'mathematical', u'example',
u'matlab', ... ]
Adjusting
What can be adjusted?
● Pretty much anything!
● I tried playing with: tf, idf, tag_frequency,
normalization, cleaning text, etc.
Conclusion
● Adjusting the metrics to get the right model
can be time consuming (many things can be
adjusted)!
● But still, the Naive Bayes algorithm is well suited
to the keyword extraction problem (and text
classification in general), because of how
tf-idf is defined.
Tag Extraction Final Presentation - CS185CSpring2014

 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 

Tag Extraction Final Presentation - CS185CSpring2014

  • 1. Tag Extraction George McBay, Naoki Nakatani San Jose State University CS185C Spring 2014
  • 2. Agenda Problem Description ETL Data Analysis Machine Learning Optimization Feature Engineering Title vs Body Stop Words Multi-label Classification Apache Spark
  • 4. Problem Given question with title and body, can we automatically generate tags for it? Where can I find the LaTeX3 manual? Few month ago I saw a big pdf-manual of all LaTeX3-packages and the new syntax. I think it was bigger than 300 pages. I can't find it on the web. Does anyone have a link? Documentation latex3 expl3
  • 5. Dataset File : ● Train.csv ● Test.csv Fields : ● id, title, body, tags (Train) ● id, title, body (Test) Characteristics : ● Quoted csv ● Body contains \n ● Tags separated by space ● Entry delimited by 0 [Figure: layout of the quoted CSV records; a body field may span several lines]
  • 6. Working Environment ● Mac OS 10.9.1 ● Apache Hadoop 1.2.1 ● Apache Mahout 0.8 ● Apache Spark 0.9
  • 8. ETL Extract : Assume data is extracted from website Transform : Use OpenCSV 1. Remove whitespaces (‘ ’, ‘\n’, ‘\t’) 2. Combine fields with ‘\t’ 3. Write to tsv file Load : Upload to HDFS
  • 10. Data Analysis Tag Occurrence Count: TSV File → Map-Reduce • Input : <index, question> • Mapper output : <tag, 1> for each tag • Reducer output : <tag, count> for each tag. Top tags by count: c# (7785), java (6788), php (6575), javascript (6135), android (5317), jquery (4949), c++ (3278), python (3082)
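
The tag count job above can be sketched in plain Python. This is a minimal illustration, not the authors' Hadoop code: the function names (`mapper`, `reducer`, `count_tags`) and the sample records are hypothetical, and it assumes each TSV record holds id, title, body, and tags in that order, as the ETL slide suggests.

```python
from collections import defaultdict

def mapper(index, question):
    """Emit (tag, 1) for each tag of one TSV record (assumed: id, title, body, tags)."""
    fields = question.rstrip("\n").split("\t")
    tags = fields[3].split()  # tags are separated by spaces
    return [(tag, 1) for tag in tags]

def reducer(pairs):
    """Sum the emitted counts per tag."""
    counts = defaultdict(int)
    for tag, one in pairs:
        counts[tag] += one
    return dict(counts)

def count_tags(lines):
    """Run the mapper over every record, then reduce all emitted pairs."""
    pairs = []
    for index, line in enumerate(lines):
        pairs.extend(mapper(index, line))
    return reducer(pairs)

# Hypothetical sample records, not taken from the dataset.
sample = [
    "1\tTitle A\tBody A\tjava android",
    "2\tTitle B\tBody B\tjava",
    "3\tTitle C\tBody C\tphp",
]
tag_counts = count_tags(sample)
```

Sorting `tag_counts` by value in descending order would give a ranking like the one listed on the slide.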
  • 12. Question Filtering for ML: TSV File → Map-Reduce • Input : <index, question> • Mapper output : <index, question> if question contains a top5 tag • Reducer output : <index, question> → TSV File with questions that have one of the top5 tags
  • 13. Machine Learning ● Problem ○ Can we classify questions into one of 5 categories (tags)? Classification ● Naive Bayes Classifier ● Details in the Mahout Classification Presentation
  • 14. Machine Learning Correctly Classified Instances : 10209 81.8816% Incorrectly Classified Instances : 2259 18.1184% Total Classified Instances : 12468
  • 16. Title vs Body Intuitively… Title is a short summary describing the body of the question ⇒ Title must be more important than body! How to put more emphasis on title? ● Build separate models for title & body + more weight for title model? ● Prepend title several times and feed into regular model?
  • 17. Two models approach Title model not accurate… ● Too short for model to distinguish labels ● Longer text wins!
  • 18. Repeated title approach Slight improvement! ● Testing against train-set ~ 93% ⇒ ~ 95% ● Testing against test-set ~ 80% ⇒ ~ 82% Repeating the title adds: ● more stop words ⇒ no effect ● more keywords (if the title has any)
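
The repeated title approach could look roughly like this. The helper name `build_document` and the default of three repeats are assumptions; the slides only say the title is prepended "several times".

```python
def build_document(title, body, title_repeats=3):
    """Prepend the title several times so its words carry more weight.

    title_repeats=3 is an illustrative default, not a value from the slides.
    """
    return " ".join([title] * title_repeats + [body])

doc = build_document("Where can I find the LaTeX3 manual?",
                     "Does anyone have a link?")
```

Because the classifier sees the combined text as a bag of words, each title word simply counts `title_repeats` extra times.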
  • 20. Diving into model ● Top 10 words from each category ● Popular (redundant) words showing up in all categories (I, it, code, etc) BUT ● Some words specific to each category (activity for android, jquery for javascript, echo for php)
  • 21. Which words to drop? Word count against TrainSmall.tsv? ● Total count : 19276034 Top 5: ● p - 827029 ● the - 545950 ● i - 476056 ● to - 393027 ● a - 362328 Problem ● Key words have high count too ○ 39th - http - 51412wc ○ 63rd - java - 35076wc ○ 91st - php - 25135wc Can’t even throw away first 100 words...
  • 22. Which words to drop? Word count against ordinary English text? ● 20 books from gutenberg.org ● Total count : 1041565 ● A lot less technical! (only 4wc for java, probably the island in Indonesia?) ● Safe to throw away 1959 words (> 50wc)
  • 23. BUT
  • 24. Not much improvement... ● Due to tf-idf measurement ○ Less weight for words appearing in many documents ○ More weight for words appearing only in specific documents
  • 26. Any room for improvement? What is the source of error? ● android ⇔ java ==> both java ● javascript ⇔ php ===> both web-related ● java classified as c# ===> many questions have both tags
  • 27. Any room for improvement? No problem if we can give multiple labels to one question!
  • 28. Multi-label classification ● Modification from previous classification task ○ Top5 tags ⇒ Top1000 tags ○ 1 tag for 1 question ⇒ 5 tags for 1 question (Pick 5 most probable tags) ○ 1 question learned only once ⇒ 1 question with multiple tags learned multiple times [Figure: each (tag, body) pair of a question is fed to the model separately]
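
A minimal sketch of the two changes described above: a question with several tags becomes one training example per tag, and prediction keeps the five highest-scoring tags. The function names and the score dictionary are hypothetical; how the scores are produced is left to the classifier.

```python
import heapq

def expand_training_examples(questions):
    """Turn (text, [tags]) records into one (tag, text) example per tag."""
    return [(tag, text) for text, tags in questions for tag in tags]

def top_k_tags(tag_scores, k=5):
    """Return the k tags with the highest score (k=5 as on the slide)."""
    return heapq.nlargest(k, tag_scores, key=tag_scores.get)

examples = expand_training_examples([
    ("question one", ["iphone", "ios"]),
    ("question two", ["html"]),
])
```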
  • 29. Good outcome (Example 1) TITLE: Upgrade iPhone 3GS from iOS 4.2.1 to 4.3.x BODY: <p>Hi folks I have a iPhone 3GS at 4.2.1 and want to upgrade it to 4.3.x for testing. I have read some articles about it but it seems that those are too old and cannot work. </p> <p>Does anyone have some experience in doing this or does apple provide tutorials for developers in this?</p> <p>A lot of thanks.</p> Actual tags ● iphone ● ios ● upgrade Predicted tags ● iphone ● ios ● osx ● objective-c ● php
  • 30. GREAT outcome (Example 2) TITLE: Is it possible to display an image in text field in html? BODY: <p>Can we display image inside a text field in <code>html</code>? </p> <p><strong>Edit</strong></p> <p>What I want to do is to have an <code>editable</code> area, and want to add <code>html</code> objects inside it(i.e. button, image ..etc)</p> Actual tags ● javascript ● jquery ● html ● css ● web Predicted tags ● javascript ● jquery ===> Never appears in text! ● html ● c# ● php
  • 31. Stats Row : # actual tags assigned to one question Col : # predicted tags which are also in actual tag set [Ex] Out of total 32798 questions which have 2 tags: ● For 14541 questions, model suggested both actual tags. ● For 13922 questions, model suggested 1 of 2 actual tags. ● For 4335 questions, model suggested neither actual tag.
  • 32. How to evaluate Generous evaluator If model gets at least 1 correct, approve it! Total accuracy = 83.55% (B)
  • 33. How to evaluate Strict evaluator Never approve unless model gets all correct! Total accuracy = 43.04% (F)
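
The two evaluators can be written as set operations. This is a sketch under the assumption that "all correct" means every actual tag appears among the predicted tags; the function names are hypothetical, and the third example question is made up for illustration (the first two come from the example slides).

```python
def generous_accuracy(actual, predicted):
    """Share of questions where at least one predicted tag is an actual tag."""
    hits = sum(1 for a, p in zip(actual, predicted) if set(a) & set(p))
    return hits / len(actual)

def strict_accuracy(actual, predicted):
    """Share of questions where every actual tag was predicted."""
    hits = sum(1 for a, p in zip(actual, predicted) if set(a) <= set(p))
    return hits / len(actual)

actual = [
    ["iphone", "ios", "upgrade"],
    ["javascript", "jquery", "html", "css", "web"],
    ["java"],  # made-up third question
]
predicted = [
    ["iphone", "ios", "osx", "objective-c", "php"],
    ["javascript", "jquery", "html", "c#", "php"],
    ["java", "android", "c#", "php", "python"],  # made-up prediction
]
generous = generous_accuracy(actual, predicted)
strict = strict_accuracy(actual, predicted)
```

On these three questions the generous evaluator approves all of them, while the strict one approves only the last.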
  • 34. Conclusion for performance ● Overall, good! ○ Predicted tag set is relatively close to the actual tag set (Apple-related, Web-related) ● but, not there yet... ○ Almost impossible to distinguish versions (c#-3.0, c#-4.0 ⇒ c#), or sub-tags (facebook-graph-api, facebook-like ⇒ facebook) ○ Still showing unrelated tags (php python everywhere!)
  • 36. Spark Advantages: - Easy to get started with - Interactive shell - Less code to write
  • 37. Spark Disadvantages: - Not many references for MLlib - Still new
  • 38. Spark ● Used PySpark, the Python interface to Spark ● Implemented ML model from the ground up using Python dictionaries and map-reduce procedures
  • 39. How It Works 5 basic procedures used: ● map ● flatMap ● reduce ● reduceByKey ● collectAsMap
  • 40. How It Works key_val = line.flatMap(~).map(~) key_val = key_val.reduceByKey(~) LINE → (a, 1) (b, 1) (c, 1) (d, 1) (a, 1) (d, 1) → (a, 2) (b, 1) (c, 1) (d, 2)
  • 41. How It Works dict = key_val.collectAsMap() (a, 2) (b, 1) (c, 1) (d, 2) → {a : 2, b : 1, c : 1, d : 2}
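
The pipeline on the last two slides can be traced without a Spark installation. The sketch below is a plain-Python stand-in, not PySpark itself: `flat_map` and `reduce_by_key` are hypothetical helpers that imitate what `flatMap` and `reduceByKey` do on an RDD, and the input lines are chosen to reproduce the pairs shown on the slide.

```python
from itertools import chain
from operator import add

def flat_map(func, records):
    """Apply func to each record and flatten the results (like RDD.flatMap)."""
    return list(chain.from_iterable(func(r) for r in records))

def reduce_by_key(func, pairs):
    """Merge the values of equal keys with func (like RDD.reduceByKey)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())

lines = ["a b c d", "a d"]
words = flat_map(str.split, lines)       # flatMap: lines -> words
key_val = [(w, 1) for w in words]        # map: word -> (word, 1)
key_val = reduce_by_key(add, key_val)    # reduceByKey: sum the ones per word
counts = dict(key_val)                   # collectAsMap: pairs -> dictionary
```

In PySpark the same steps would be chained on an RDD, and `collectAsMap` would bring the result back to the driver as an ordinary Python dictionary.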
  • 42. How It Works Model: - statistical model - matrix of weights - uses tf-idf
  • 44. How It Works Tags Words from document
  • 45. How It Works Tags Relevance Words from document
  • 46. How It Works Implemented as → { tag : { word : weight } }
  • 47. How It Works ● Most relevant tag chosen by sum of weights associated to words contained in the document
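
With the model stored as `{ tag : { word : weight } }`, choosing a tag reduces to a sum per tag. A minimal sketch, assuming words missing from a tag's dictionary contribute zero and that repeated words are counted each time they occur; the function names and the toy weights are made up for illustration.

```python
def score_tags(model, words):
    """Sum, for every tag, the weights of the words found in the document."""
    return {
        tag: sum(weights.get(word, 0.0) for word in words)
        for tag, weights in model.items()
    }

def most_relevant_tag(model, words):
    """Return the tag with the highest summed weight."""
    scores = score_tags(model, words)
    return max(scores, key=scores.get)

# Toy model with made-up weights.
toy_model = {
    "python": {"def": 0.6, "import": 0.4},
    "php": {"echo": 0.9, "import": 0.1},
}
best = most_relevant_tag(toy_model, ["def", "import", "something"])
```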
  • 48. How It Works Now, how are the weights calculated? ● First calculate idf (inverse document frequency) for each word ● Next calculate tf (term frequency) associated with each tag ● Multiply each tf entry by the word's idf, then normalize
  • 49. How It Works idf for a word defined by: idf(word) = log(D/F(word)) where, D = total # of doc in the training set F(word) = # of doc which contains word
  • 50. How It Works Two ways to calculate tf: 1) number of times you see the term associated with a tag 2) number of documents you see the term associated with a tag (in other words only count one time per doc)
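
The weight calculation can be sketched as follows, using the idf definition from the slide and both tf variants. The slides do not say how the weights are normalized, so dividing by each tag's total weight is an assumption; the function names and the two sample documents are also hypothetical.

```python
import math
from collections import Counter, defaultdict

def compute_idf(docs):
    """idf(word) = log(D / F(word)); docs is a list of (tags, words)."""
    total_docs = len(docs)
    doc_freq = Counter()
    for _, words in docs:
        doc_freq.update(set(words))  # count each word once per document
    return {w: math.log(total_docs / f) for w, f in doc_freq.items()}

def compute_tf(docs, once_per_doc=False):
    """Count terms per tag: every occurrence, or only once per document."""
    tf = defaultdict(Counter)
    for tags, words in docs:
        terms = set(words) if once_per_doc else words
        for tag in tags:
            tf[tag].update(terms)
    return tf

def build_model(docs, once_per_doc=False):
    """Return {tag: {word: weight}} with weight = tf * idf, normalized per tag."""
    idf = compute_idf(docs)
    model = {}
    for tag, counts in compute_tf(docs, once_per_doc).items():
        weights = {w: c * idf[w] for w, c in counts.items()}
        total = sum(weights.values())
        # Assumed normalization: divide by the tag's total weight.
        model[tag] = {w: v / total for w, v in weights.items()} if total else weights
    return model

docs = [
    (["python"], ["def", "print", "def"]),
    (["php"], ["echo", "print"]),
]
model = build_model(docs)
```

In this toy input "print" occurs in every document, so its idf is log(1) = 0 and it carries no weight for either tag.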
  • 51. Results TITLE: Upgrade iPhone 3GS from iOS 4.2.1 to 4.3.x BODY: <p>Hi folks I have a iPhone 3GS at 4.2.1 and want to upgrade it to 4.3.x for testing. I have read some articles about it but it seems that those are too old and cannot work. </p> <p>Does anyone have some experience in doing this or does apple provide tutorials for developers in this?</p> <p>A lot of thanks.</p> Actual tags ● iphone ● ios ● upgrade Predicted tags ● ios4.3 ● iphone-3gs ● cocoa-touch ● ios4 ● upgrade
  • 52. Results TITLE: Is it possible to display an image in text field in html? BODY: <p>Can we display image inside a text field in <code>html</code>? </p> <p><strong>Edit</strong></p> <p>What I want to do is to have an <code>editable</code> area, and want to add <code>html</code> objects inside it(i.e. button, image ..etc)</p> Actual tags ● javascript ● jquery ● html ● css ● web Predicted tags ● html ● img ● alignment ● get ● web
  • 54. Results ● Not perfect ● But very close ● Relevant words for tags look right
  • 55. Results most relevant words for tag “python”: [u'python', u'def', u'import', u'print', u'module', u'file', u'self', … ] most relevant words for tag “math”: [u'vector', u'x', u'math', u'calculate', u'number', u'mathematical', u'example', u'matlab', ... ]
  • 56. Adjusting What can be adjusted? ● Pretty much anything! ● I tried playing with: tf, idf, tag_frequency, normalization, cleaning text, etc.
  • 57. Conclusion ● Adjusting the metrics to get the right model can be time consuming (many things can be adjusted)! ● But still, the Naive Bayes algorithm is well suited to the keyword extraction problem (and text classification in general), because of how tf-idf is defined.