Tag Extraction Final Presentation - CS185CSpring2014

Tag Extraction
George McBay, Naoki Nakatani
San Jose State University
CS185C Spring 2014

Agenda
Problem Description
ETL
Data Analysis
Machine Learning
Optimization
Feature Engineering
Title vs Body
Stop Words
Multi-label Classification
Apache Spark


Problem
Given a question with a title and body, can we
automatically generate tags for it?
TITLE: Where can I find the LaTeX3 manual?
BODY: Few month ago I saw a big pdf-manual of all LaTeX3-packages and the new syntax. I think
it was bigger than 300 pages. I can't find it on the web. Does anyone have a link?
Tags
● documentation
● latex3
● expl3

Dataset
File :
● Train.csv
● Test.csv
Fields :
● id, title, body, tags (Train)
● id, title, body (Test)
Characteristics :
● Quoted csv
● Body contains \n
● Tags separated by space
● Entry delimited by 0
[Figure: record layout of the quoted CSV. Each record has four quoted fields, the body field may span several lines, and each record ends with the delimiter.]

Working Environment
● Mac OS 10.9.1
● Apache Hadoop 1.2.1
● Apache Mahout 0.8
● Apache Spark 0.9

ETL
Extract : Assume data is extracted from website
Transform : Use OpenCSV
1. Remove whitespaces (‘ ’, ‘\n’, ‘\t’)
2. Combine fields with ‘\t’
3. Write to tsv file
Load : Upload to HDFS
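
The Transform step could be sketched as below. This is only an illustration: the slides use OpenCSV (Java), while the sketch swaps in Python's standard `csv` module. It also assumes that "remove whitespaces" means collapsing runs of spaces, newlines and tabs into a single space. The names `clean_field` and `csv_to_tsv` are hypothetical.

```python
import csv
import io
import re

def clean_field(text):
    # Assumption: collapse every run of whitespace (' ', '\n', '\t') into one space.
    return re.sub(r"\s+", " ", text).strip()

def csv_to_tsv(csv_text):
    # csv.reader copes with quoted fields that contain embedded newlines.
    reader = csv.reader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        lines.append("\t".join(clean_field(field) for field in row))
    return "\n".join(lines)

# One quoted record whose body spans two lines.
sample = '"1","Title here","line one\nline two","tag1 tag2"\n'
print(csv_to_tsv(sample))
```

Once the fields contain no tabs or newlines, each question fits on exactly one TSV line, which is what the later Map-Reduce jobs rely on.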

Data Analysis
Tag Occurrence Count
TSV File
Map-Reduce
• Input : <index, question>
• Mapper output :
<tag, 1> for each tag
• Reducer output :
<tag, count> for each
tag
7785 c#
6788 java
6575 php
6135 javascript
5317 android
4949 jquery
3278 c++
3082 python
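
A minimal plain-Python sketch of the tag count job, not the actual Hadoop code. It assumes each TSV line holds id, title, body and tags in that order, as on the Dataset slide. `mapper`, `reducer` and `run_job` are hypothetical names, and `run_job` only imitates the shuffle step that Hadoop performs between the two phases.

```python
from collections import defaultdict

def mapper(index, question):
    # question is one TSV line: id, title, body, tags (assumed column order).
    tags = question.split("\t")[3].split()
    for tag in tags:
        yield tag, 1

def reducer(tag, ones):
    # Sum the 1s emitted for this tag.
    return tag, sum(ones)

def run_job(lines):
    # Stand-in for the shuffle: group mapper output by key.
    grouped = defaultdict(list)
    for index, line in enumerate(lines):
        for tag, one in mapper(index, line):
            grouped[tag].append(one)
    return dict(reducer(tag, ones) for tag, ones in grouped.items())

lines = ["1\ttitle\tbody\tjava android", "2\ttitle\tbody\tjava"]
print(run_job(lines))
```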

Question Filtering for ML
TSV File
Map-Reduce
• Input : <index, question>
• Mapper output :
<index, question> if question
contains top5 tag
• Reducer output :
<index, question>
TSV file with questions
that have one of the
top5 tags
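
The filtering step might look like the sketch below. The top5 set is read off the tag occurrence counts above. The column order (id, title, body, tags) and the function names are assumptions made for illustration.

```python
# Five most frequent tags according to the occurrence counts.
TOP5 = {"c#", "java", "php", "javascript", "android"}

def has_top5_tag(question, top_tags=TOP5):
    # question is one TSV line: id, title, body, tags (assumed column order).
    tags = question.split("\t")[3].split()
    return any(tag in top_tags for tag in tags)

def filter_questions(lines):
    # Keep only the questions that carry at least one of the top5 tags.
    return [line for line in lines if has_top5_tag(line)]

questions = [
    "1\ttitle\tbody\tjava swing",
    "2\ttitle\tbody\tlatex3 expl3",
]
print(filter_questions(questions))
```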

Machine Learning
● Problem
○ Can we classify questions into one of 5 categories
(tags) ?
Classification
● Naive Bayes Classifier
● Details in the Mahout Classification presentation

Machine Learning
Correctly Classified Instances : 10209 81.8816%
Incorrectly Classified Instances : 2259 18.1184%
Total Classified Instances : 12468

Title vs Body
Intuitively…
Title is a short summary describing the body of the question
⇒ Title must be more important than body!
How to put more emphasis on title?
● Build separate models for title & body + more weight for
title model?
● Prepend title several times and feed into regular model?

Two models approach
Title model not accurate…
● Too short for model to
distinguish labels
● Longer text wins!

Repeated title approach
Slight improvement!
● Testing against train-set
~ 93% ⇒ ~ 95%
● Testing against test-set
~ 80% ⇒ ~ 82%
Repeating the title adds:
● more stop words ⇒ no effect
● more keywords (if the title has any)
Diving into model
● Top 10 words from each category
● Popular (redundant) words
showing up in all categories (I, it,
code, etc)
BUT
● Some words specific to each
category (activity for android,
jquery for javascript, echo for php)

Which words to drop?
Word count against TrainSmall.tsv?
● Total count : 19276034
Top 5:
● p - 827029
● the - 545950
● i - 476056
● to - 393027
● a - 362328
Problem
● Key words have high count too
○ 39th - http - 51412wc
○ 63rd - java - 35076wc
○ 91st - php - 25135wc
Can’t even throw away first 100
words...

Which words to drop?
Word count against ordinary english text?
● 20 books from gutenberg.org
● Total count : 1041565
● A lot less technical! (only 4wc for java,
probably the island in Indonesia?)
● Safe to throw away 1959 words (> 50wc)
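
One way to derive such a stop word list is sketched below: count words over the reference texts and keep every word seen more than 50 times. The tokenizer regex and the function names are assumptions, and the toy input stands in for the 20 books.

```python
import re
from collections import Counter

def stop_words_from_corpus(texts, min_count=50):
    # Count words over ordinary English text; frequent ones become stop words.
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return {word for word, count in counts.items() if count > min_count}

def drop_stop_words(text, stop_words):
    # Remove stop words from a question before training.
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

books = ["the cat the dog the java"]  # toy stand-in for the reference books
stops = stop_words_from_corpus(books, min_count=2)
print(drop_stop_words("The java code", stops))
```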

Not much improvement...
● Due to tf-idf measurement
○ Less weight for words appearing in many documents
○ More weight for words appearing only in specific
documents

Any room for improvement?
What is the source of error?
● android ⇔ java ==> both java
● javascript ⇔ php ===> both web-related
● java classified as c# ===> many questions have both tags

Any room for improvement?
No problem if we can give multiple labels
to one question!

Multi-label classification
● Modification from previous classification task
○ Top5 tags ⇒ Top1000 tags
○ 1 tag for 1 question ⇒ 5 tags for 1 question
(Pick 5 most probable tags)
○ 1 question learned only once ⇒ 1 question with
multiple tags learned multiple times
[Figure: one training row per tag (tag1 + body, tag2 + body) fed into the model]
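
The two changes to the task can be sketched as follows: a question is expanded into one training row per tag, and the 5 highest scoring tags are returned at prediction time. The per-tag scores are assumed to come from the classifier. Both function names are hypothetical.

```python
def expand_training_rows(body, tags):
    # A question with several tags is learned once per tag.
    return [(tag, body) for tag in tags]

def top_k_tags(scores, k=5):
    # scores maps tag -> score from the model; pick the k most probable tags.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [tag for tag, _ in ranked[:k]]

rows = expand_training_rows("some question text", ["iphone", "ios"])
best = top_k_tags({"ios": 0.5, "php": 0.1, "iphone": 0.7}, k=2)
print(rows, best)
```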

Good outcome (Example 1)
TITLE: Upgrade iPhone 3GS from iOS 4.2.1 to 4.3.x
BODY: Hi folks I have a iPhone 3GS at 4.2.1 and want to upgrade it to 4.3.x
for testing. I have read some articles about it but it seems that those are too old
and cannot work. Does anyone have some experience in doing this or
does apple provide tutorials for developers in this? A lot of thanks.
Actual tags
● iphone
● ios
● upgrade
Predicted tags
● iphone
● ios
● osx
● objective-c
● php

GREAT outcome (Example 2)
TITLE: Is it possible to display an image in text field in html?
BODY: Can we display image inside a text field in <code>html</code>?
 Edit What I want to do is to have an
<code>editable</code> area, and want to add <code>html</code> objects
inside it(i.e. button, image ..etc)
Actual tags
● javascript
● jquery
● html
● css
● web
Predicted tags
● javascript
● jquery ===> Never appears in text!
● html
● c#
● php

Stats
Row : # actual tags assigned to one question
Col : # predicted tags which are also in actual tag set
[Ex] Out of total 32798 questions which have 2 tags:
● For 14541 questions, the model suggested both actual tags.
● For 13922 questions, the model suggested 1 of the 2 actual tags.
● For 4335 questions, the model suggested neither of the actual tags.

How to evaluate
Generous evaluator
If model gets at least 1 correct, approve it!
Total accuracy = 83.55% (B)

How to evaluate
Strict evaluator
Never approve unless model gets all correct!
Total accuracy = 43.04% (F)
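
The two evaluators can be written down in a few lines. This is a sketch with hypothetical function names; the sample question is Example 1 from the earlier slide.

```python
def generous_hit(actual, predicted):
    # Approve if at least one predicted tag is an actual tag.
    return len(set(actual) & set(predicted)) >= 1

def strict_hit(actual, predicted):
    # Approve only if every actual tag was predicted.
    return set(actual) <= set(predicted)

def accuracy(pairs, hit):
    # pairs is a list of (actual tags, predicted tags), one per question.
    return sum(hit(actual, predicted) for actual, predicted in pairs) / len(pairs)

example1 = (["iphone", "ios", "upgrade"],
            ["iphone", "ios", "osx", "objective-c", "php"])
print(generous_hit(*example1), strict_hit(*example1))
```

Example 1 passes the generous evaluator but fails the strict one, because "upgrade" is missing from the predicted set.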

Conclusion for performance
● Overall, good!
○ Predicted tag set is relatively close to the actual tag
set (Apple-related, Web-related)
● but, not there yet...
○ Almost impossible to distinguish versions (c#-3.0,
c#-4.0 ⇒ c#), or sub-tags (facebook-graph-api,
facebook-like ⇒ facebook)
○ Still showing unrelated tags (php, python
everywhere!)

Spark
Advantages:
- Easy to get started with
- Interactive shell
- Less code to write

Spark
Disadvantages:
- Not many references for MLlib
- Still new

Spark
● Used PySpark, the Python interface to
Spark
● Implemented the ML model from the ground up
using Python dictionaries and map-reduce
procedures

How It Works
5 basic procedures used:
● map
● flatMap
● reduce
● reduceByKey
● collectAsMap

How It Works
key_val = line.flatMap(~).map(~)
key_val = key_val.reduceByKey(~)
LINE ⇒ (a, 1) (b, 1) (c, 1) (d, 1) (a, 1) (d, 1)
⇒ (a, 2) (b, 1) (c, 1) (d, 2)

How It Works
dict = key_val.collectAsMap()
(a, 2) (b, 1) (c, 1) (d, 2) ⇒ {a : 2, b : 1, c : 1, d : 2}
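
The word count pipeline can be imitated in plain Python, which makes it easier to see what each step does. This is a stand-in, not PySpark itself: `flat_map` and `reduce_by_key` are hypothetical helpers that mimic the RDD methods `flatMap` and `reduceByKey`, the list comprehension plays the role of `map`, and `dict(...)` plays the role of `collectAsMap`.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain

def flat_map(func, items):
    # Apply func to each item and flatten the resulting lists.
    return list(chain.from_iterable(func(item) for item in items))

def reduce_by_key(func, pairs):
    # Group values by key, then fold each group with func.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return [(key, reduce(func, values)) for key, values in grouped.items()]

lines = ["a b c d", "a d"]
words = flat_map(str.split, lines)                    # flatMap
key_val = [(word, 1) for word in words]               # map
key_val = reduce_by_key(lambda x, y: x + y, key_val)  # reduceByKey
counts = dict(key_val)                                # collectAsMap
print(counts)  # {'a': 2, 'b': 1, 'c': 1, 'd': 2}
```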

How It Works
Model:
- statistical model
- matrix of weights
- uses tf-idf

How It Works
[Figure: matrix with tags on one axis and words from the document on the other]

How It Works
[Figure: the same matrix of tags and words from the document, labeled "Relevance"]

How It Works
Implemented as → { tag : { word : weight } }

How It Works
● The most relevant tag is chosen by the sum of the weights
associated with the words contained in the
document
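
With the model stored as { tag : { word : weight } }, scoring a document could look like the sketch below. The weights are made-up toy values, the function names are hypothetical, and the sketch assumes each word occurrence in the document contributes its weight once.

```python
def score_tags(model, words):
    # Sum, per tag, the weights of the words found in the document.
    return {
        tag: sum(weights.get(word, 0.0) for word in words)
        for tag, weights in model.items()
    }

def most_relevant_tag(model, words):
    scores = score_tags(model, words)
    return max(scores, key=scores.get)

# Toy model with made-up weights, for illustration only.
model = {
    "python": {"def": 0.6, "import": 0.4},
    "php": {"echo": 0.9},
}
print(most_relevant_tag(model, ["def", "import", "x"]))
```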

How It Works
Now, how are the weights calculated?
● First calculate idf (inverse document
frequency) for each word
● Next calculate tf (term frequency) associated
with each tag
● Multiply each entry by idf, then normalize
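
The last step might be sketched as follows, taking tf and idf as already computed dictionaries. The slides do not say how the weights are normalized, so dividing by the per-tag sum is an assumption, as is the name `build_weights`.

```python
def build_weights(tf, idf):
    # tf:  { tag : { word : count } }
    # idf: { word : idf value }
    model = {}
    for tag, counts in tf.items():
        raw = {word: count * idf.get(word, 0.0) for word, count in counts.items()}
        total = sum(raw.values())
        # Assumption: normalize so that the weights of each tag sum to 1.
        model[tag] = {w: (v / total if total else 0.0) for w, v in raw.items()}
    return model

tf = {"python": {"def": 2, "the": 5}}
idf = {"def": 1.0, "the": 0.0}
print(build_weights(tf, idf))
```

A word such as "the" that appears in every document has idf 0, so its weight vanishes however often it occurs.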

How It Works
idf for a word
defined by:
idf(word) = log(D/F(word))
where,
D = total # of docs in the training set
F(word) = # of docs that contain word
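
A direct transcription of the idf definition, as a sketch. The slides do not state the base of the logarithm, so the natural log is assumed here. Each document is represented as a list of words.

```python
import math

def inverse_document_frequency(documents):
    # documents is a list of word lists, one per training document.
    total = len(documents)                      # D
    doc_freq = {}                               # F(word)
    for words in documents:
        for word in set(words):                 # count each word once per doc
            doc_freq[word] = doc_freq.get(word, 0) + 1
    # Assumption: natural logarithm.
    return {word: math.log(total / freq) for word, freq in doc_freq.items()}

docs = [["a", "b"], ["a"]]
idf = inverse_document_frequency(docs)
print(idf)
```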

How It Works
Two ways to calculate tf:
1) number of times you see the term
associated with a tag
2) number of documents you see the term
associated with a tag (in other words only
count one time per doc)
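
Both ways of counting tf fit in one function with a switch. This is a sketch: `term_frequency` and the `once_per_doc` flag are hypothetical names, and each training item is assumed to be a pair of (tags, words).

```python
from collections import Counter, defaultdict

def term_frequency(tagged_docs, once_per_doc=False):
    # tagged_docs is a list of (tags, words) pairs.
    tf = defaultdict(Counter)
    for tags, words in tagged_docs:
        # Way 2 counts each term at most once per document.
        counted = set(words) if once_per_doc else words
        for tag in tags:
            tf[tag].update(counted)
    return tf

tagged_docs = [(["python"], ["def", "def", "x"])]
by_occurrence = term_frequency(tagged_docs)
by_document = term_frequency(tagged_docs, once_per_doc=True)
print(by_occurrence["python"]["def"], by_document["python"]["def"])
```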

Results
TITLE: Upgrade iPhone 3GS from iOS 4.2.1 to 4.3.x
BODY: Hi folks I have a iPhone 3GS at 4.2.1 and want to upgrade it to 4.3.x
for testing. I have read some articles about it but it seems that those are too old
and cannot work. Does anyone have some experience in doing this or
does apple provide tutorials for developers in this? A lot of thanks.
Actual tags
● iphone
● ios
● upgrade
Predicted tags
● ios4.3
● iphone-3gs
● cocoa-touch
● ios4
● upgrade

Results
TITLE: Is it possible to display an image in text field in html?
BODY: Can we display image inside a text field in <code>html</code>?
 Edit What I want to do is to have an
<code>editable</code> area, and want to add <code>html</code> objects
inside it(i.e. button, image ..etc)
Actual tags
● javascript
● jquery
● html
● css
● web
Predicted tags
● html
● img
● alignment
● get
● web

Results
[Figure: tag comparison, predicted (top) vs. actual (below)]

Results
● Not perfect
● But very close
● Relevant words for tags look right

Results
most relevant words for tag “python”:
[u'python', u'def', u'import', u'print', u'module', u'file', u'self', … ]
most relevant words for tag “math”:
[u'vector', u'x', u'math', u'calculate', u'number', u'mathematical', u'example',
u'matlab', ... ]

Adjusting
What can be adjusted?
● Pretty much anything!
● I tried playing with: tf, idf, tag_frequency,
normalization, cleaning text, etc.

Conclusion
● Adjusting the metrics to get the right model
can be time consuming (many things can be
adjusted)!
● But still, the Naive Bayes algorithm is well suited
to the keyword extraction problem (and text
classification in general), because of how
tf-idf is defined.
