These slides were presented in class on May 7th 2014.
Task allocation
• George : ETL, Data Analysis, Machine Learning, Multi-label classification with Apache Spark
• Naoki : ETL, Data Analysis, Machine Learning, Feature Engineering, Multi-label classification with Apache Mahout
4. Problem
Given a question with a title and body, can we
automatically generate tags for it?
Example question
TITLE: Where can I find the LaTeX3 manual?
BODY: Few month ago I saw a big pdf-manual of all LaTeX3-packages and the new syntax. I think
it was bigger than 300 pages. I can't find it on the web. Does anyone have a link?
Tags
● Documentation
● latex3
● expl3
12. Question Filtering for ML
TSV File ⇒ Map-Reduce ⇒ TSV File with questions that have one of the top5 tags
Map-Reduce
• Input : <index, question>
• Mapper output : <index, question> if the question contains a top5 tag
• Reducer output : <index, question>
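The filtering step above can be sketched in plain Python as a map phase followed by a pass-through reduce. The TSV layout (title, body, space-separated tags), the helper names, and the tag set passed in are assumptions for illustration; the slides do not show the column order or the actual job code.

```python
# Sketch of the filter as map + reduce over (index, question) pairs.
# Assumed layout: each question is a TSV line "title<TAB>body<TAB>tags",
# with tags separated by spaces. The tag set used below is illustrative.

def mapper(index, question, top_tags):
    """Emit (index, question) only if the question carries a top tag."""
    tags = question.rstrip("\n").split("\t")[-1].split()
    if any(tag in top_tags for tag in tags):
        yield index, question

def reducer(index, questions):
    """Identity reducer: pass every surviving question through."""
    for question in questions:
        yield index, question

def filter_questions(records, top_tags):
    # Map phase: group mapper output by key, as a shuffle would.
    mapped = {}
    for index, question in records:
        for key, value in mapper(index, question, top_tags):
            mapped.setdefault(key, []).append(value)
    # Reduce phase: visit keys in sorted order.
    out = []
    for key in sorted(mapped):
        out.extend(reducer(key, mapped[key]))
    return out

records = [
    (1, "How to start an activity?\t<p>...</p>\tandroid java"),
    (2, "Where is the LaTeX3 manual?\t<p>...</p>\tlatex3 expl3"),
    (3, "echo vs print\t<p>...</p>\tphp"),
]
kept = filter_questions(records, top_tags={"android", "java", "php"})
```

Only questions 1 and 3 survive, since question 2 has no tag in the given set.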
13. Machine Learning
● Problem
○ Can we classify questions into one of 5 categories
(tags) ?
Classification
● Naive Bayes Classifier
● Details in the Mahout Classification presentation
16. Title vs Body
Intuitively…
Title is a short summary describing the body of the question
⇒ Title must be more important than body!
How to put more emphasis on title?
● Build separate models for title and body, with more weight on the
title model?
● Prepend the title several times and feed it into the regular model?
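The second option can be sketched as a small preprocessing step. The function name and the repeat count are hypothetical; the slides do not say how many times the title was prepended.

```python
from collections import Counter

def build_document(title, body, title_repeats=3):
    """Return the training text with the title prepended title_repeats times."""
    return " ".join([title] * title_repeats + [body])

# Repeating the title raises the term counts of its words,
# so a count-based model gives them more weight.
doc = build_document("jquery selector", "how do I select a div", title_repeats=2)
counts = Counter(doc.split())
```

Here every title word is counted twice before the body is even read, which is the whole effect of the trick.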
17. Two models approach
Title model is not accurate…
● Titles are too short for the model to
distinguish labels
● Longer text wins!
18. Repeated title approach
Slight improvement!
● Testing against train-set: ~ 93% ⇒ ~ 95%
● Testing against test-set: ~ 80% ⇒ ~ 82%
Repeating the title adds
● more stop words ⇒ no effect
● more keywords (if the title has any)
20. Diving into model
● Top 10 words from each category
● Popular (redundant) words
showing up in all categories (I, it,
code, etc)
BUT
● Some words specific to each
category (activity for android,
jquery for javascript, echo for php)
21. Which words to drop?
Word count against TrainSmall.tsv?
● Total count : 19276034
Top 5:
● p - 827029
● the - 545950
● i - 476056
● to - 393027
● a - 362328
Problem
● Keywords have high counts too
○ 39th - http - 51412 wc
○ 63rd - java - 35076 wc
○ 91st - php - 25135 wc
Can’t even throw away the first 100
words...
22. Which words to drop?
Word count against ordinary english text?
● 20 books from gutenberg.org
● Total count : 1041565
● A lot less technical! (only 4 wc for java,
probably the island in Indonesia?)
● Safe to throw away the 1959 words with > 50 wc
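A minimal sketch of the idea above: count words in ordinary English text and treat anything above a threshold as a stop word. The tokenizer regex and the function names are assumptions; only the "> 50 wc" threshold comes from the slide, and the toy example uses a much smaller one.

```python
from collections import Counter
import re

def word_counts(text):
    """Lowercase the text and count word-like tokens (assumed tokenizer)."""
    return Counter(re.findall(r"[a-z0-9#+]+", text.lower()))

def stop_words(reference_counts, min_count=50):
    """Words seen more than min_count times in the reference corpus."""
    return {w for w, c in reference_counts.items() if c > min_count}

# Toy reference text standing in for the 20 Gutenberg books.
reference = word_counts("the cat sat on the mat and the dog sat too")
drop = stop_words(reference, min_count=2)  # tiny threshold for the toy text

question = "the java code in the loop"
kept = [w for w in question.split() if w not in drop]
```

Because technical keywords such as java are rare in ordinary English, they fall below the threshold and are kept.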
24. Not much improvement...
● Due to tf-idf weighting, which already discounts common words
○ Less weight for words appearing in many documents
○ More weight for words appearing only in specific
documents
26. Any room for improvement?
What is the source of error?
● android ⇔ java ⇒ both java
● javascript ⇔ php ⇒ both web-related
● java classified as c# ⇒ many questions have both tags
27. Any room for improvement?
No problem if we can give multiple labels
to one question!
28. Multi-label classification
● Modification from previous classification task
○ Top5 tags ⇒ Top1000 tags
○ 1 tag for 1 question ⇒ 5 tags for 1 question
(Pick 5 most probable tags)
○ 1 question learned only once ⇒ 1 question with
multiple tags learned multiple times
(tag1, body), (tag2, body) ⇒ model
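The two changes above (one training example per tag, and picking the 5 most probable tags) can be sketched as follows. The function names and the scores are hypothetical; the real scores would come from the trained model.

```python
import heapq

def expand(questions):
    """Turn each (tags, body) question into one (tag, body) pair per tag."""
    return [(tag, body) for tags, body in questions for tag in tags]

def top_k_tags(scores, k=5):
    """Pick the k highest-scoring tags from a {tag: score} mapping."""
    return heapq.nlargest(k, scores, key=scores.get)

questions = [
    (["iphone", "ios"], "upgrade iphone"),
    (["php"], "echo hello"),
]
pairs = expand(questions)

# Made-up scores for one question, for illustration only.
scores = {"iphone": 0.9, "ios": 0.8, "osx": 0.3, "php": 0.1}
best = top_k_tags(scores, k=2)
```

A question with two tags is therefore seen twice during training, once under each tag.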
29. Good outcome (Example 1)
TITLE: Upgrade iPhone 3GS from iOS 4.2.1 to 4.3.x
BODY: <p>Hi folks I have a iPhone 3GS at 4.2.1 and want to upgrade it to 4.3.x
for testing. I have read some articles about it but it seems that those are too old
and cannot work. </p> <p>Does anyone have some experience in doing this or
does apple provide tutorials for developers in this?</p> <p>A lot of thanks.</p>
Actual tags
● iphone
● ios
● upgrade
Predicted tags
● iphone
● ios
● osx
● objective-c
● php
30. GREAT outcome (Example 2)
TITLE: Is it possible to display an image in text field in html?
BODY: <p>Can we display image inside a text field in <code>html</code>?
</p> <p><strong>Edit</strong></p> <p>What I want to do is to have an
<code>editable</code> area, and want to add <code>html</code> objects
inside it(i.e. button, image ..etc)</p>
Actual tags
● javascript
● jquery
● html
● css
● web
Predicted tags
● javascript
● jquery ⇒ Never appears in text!
● html
● c#
● php
31. Stats
Row : # actual tags assigned to one question
Col : # predicted tags which are also in actual tag set
[Ex] Out of a total of 32798 questions that have 2 tags:
● For 14541 questions, the model suggested both actual tags.
● For 13922 questions, the model suggested 1 of the 2 actual tags.
● For 4335 questions, the model suggested neither actual tag.
32. How to evaluate
Generous evaluator
If the model gets at least 1 tag correct, approve it!
Total accuracy = 83.55% (B)
33. How to evaluate
Strict evaluator
Never approve unless the model gets all tags correct!
Total accuracy = 43.04% (F)
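The two evaluators can be sketched as set comparisons. This assumes "all correct" means every actual tag appears among the predicted tags; the function names are hypothetical, and the data is the two example questions from the earlier slides.

```python
def generous(actual, predicted):
    """Approve if at least one actual tag was predicted."""
    return len(set(actual) & set(predicted)) >= 1

def strict(actual, predicted):
    """Approve only if every actual tag was predicted."""
    return set(actual) <= set(predicted)

def accuracy(examples, approve):
    """Fraction of (actual, predicted) examples the evaluator approves."""
    return sum(approve(a, p) for a, p in examples) / len(examples)

examples = [
    (["iphone", "ios", "upgrade"],
     ["iphone", "ios", "osx", "objective-c", "php"]),
    (["javascript", "jquery", "html", "css", "web"],
     ["javascript", "jquery", "html", "c#", "php"]),
]
generous_acc = accuracy(examples, generous)
strict_acc = accuracy(examples, strict)
```

Both examples pass the generous evaluator and fail the strict one, which illustrates why the two totals are so far apart.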
34. Conclusion for performance
● Overall, good!
○ Predicted tag set is relatively close to the actual tag
set (Apple-related, Web-related)
● but, not there yet...
○ Almost impossible to distinguish versions (c#-3.0,
c#-4.0 ⇒ c#), or sub-tags (facebook-graph-api,
facebook-like ⇒ facebook)
○ Still showing unrelated tags (php, python
everywhere!)
38. Spark
● Used PySpark, the Python interface to
Spark
● Implemented the ML model from the ground up
using Python dictionaries and map-reduce
procedures
39. How It Works
5 basic procedures used:
● map
● flatMap
● reduce
● reduceByKey
● collectAsMap
40. How It Works
key_val = line.flatMap(~).map(~)
key_val = key_val.reduceByKey(~)
LINE ⇒ flatMap, map ⇒ (a, 1) (b, 1) (c, 1) (d, 1) (a, 1) (d, 1)
⇒ reduceByKey ⇒ (a, 2) (b, 1) (c, 1) (d, 2)
41. How It Works
dict = key_val.collectAsMap()
(a, 2) (b, 1) (c, 1) (d, 2) ⇒ {a : 2, b : 1, c : 1, d : 2}
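The word-count flow on the last two slides can be emulated in plain Python so that it runs without a cluster. The functions passed to each step are assumptions, since the slides elide them with "~"; the helper reduce_by_key is a stand-in for the RDD method, not the Spark implementation.

```python
from itertools import chain

lines = ["a b c d", "a d"]

# flatMap: split every line into words and flatten the result.
words = list(chain.from_iterable(line.split() for line in lines))

# map: pair every word with a count of 1.
key_val = [(word, 1) for word in words]

def reduce_by_key(pairs, func):
    """Combine the values of equal keys with func, like reduceByKey."""
    grouped = {}
    for key, value in pairs:
        grouped[key] = func(grouped[key], value) if key in grouped else value
    return list(grouped.items())

# reduceByKey: add up the counts for each word.
key_val = reduce_by_key(key_val, lambda x, y: x + y)

# collectAsMap: bring the pairs back as a dictionary.
counts = dict(key_val)
```

The resulting dictionary matches the one on the slide: a and d are counted twice, b and c once.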
47. How It Works
● The most relevant tag is chosen by the sum of
the weights associated with the words contained
in the document
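A minimal sketch of this scoring rule, assuming the model is stored as a nested dictionary {tag: {word: weight}} and that every occurrence of a word contributes its weight. The weights shown are made up for illustration.

```python
def score_tags(words, weights):
    """Sum, for each tag, the weights of the words in the document."""
    return {
        tag: sum(word_weights.get(word, 0.0) for word in words)
        for tag, word_weights in weights.items()
    }

# Hypothetical weights; real ones would come from training.
weights = {
    "python": {"def": 0.5, "import": 0.25},
    "php": {"echo": 0.5, "import": 0.125},
}
scores = score_tags("def import".split(), weights)
best = max(scores, key=scores.get)
```

Words unknown to a tag contribute nothing, so the tag whose vocabulary overlaps the document most strongly wins.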
48. How It Works
Now, how are the weights calculated?
● First, calculate idf (inverse document
frequency) for each word
● Next, calculate tf (term frequency) associated
with each tag
● Multiply each tf entry by idf, then normalize
49. How It Works
idf for a word is defined by:
idf(word) = log(D/F(word))
where,
D = total # of docs in the training set
F(word) = # of docs which contain word
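The definition above translates directly into code. This sketch assumes documents are already tokenized and uses the natural logarithm; the slide does not state the base.

```python
import math

def idf(documents):
    """documents: list of token lists. Returns {word: log(D / F(word))}."""
    D = len(documents)
    F = {}
    for doc in documents:
        for word in set(doc):  # count each word once per document
            F[word] = F.get(word, 0) + 1
    return {word: math.log(D / f) for word, f in F.items()}

docs = [
    ["the", "java", "code"],
    ["the", "php", "echo"],
    ["the", "java", "loop"],
    ["the", "jquery"],
]
idf_weights = idf(docs)
```

A word that appears in every document, such as "the" here, gets idf 0, while a word that appears in a single document gets the largest value.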
50. How It Works
Two ways to calculate tf:
1) number of times you see the term
associated with a tag
2) number of documents in which you see the term
associated with a tag (in other words, count
only once per doc)
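The two variants differ only in whether repeated words inside one document are collapsed before counting. A minimal sketch, with hypothetical function names and (tag, words) pairs as the assumed training format:

```python
from collections import Counter, defaultdict

def tf_by_occurrence(pairs):
    """Variant 1: count every occurrence of a term under each tag."""
    tf = defaultdict(Counter)
    for tag, words in pairs:
        tf[tag].update(words)
    return tf

def tf_by_document(pairs):
    """Variant 2: count each term at most once per document."""
    tf = defaultdict(Counter)
    for tag, words in pairs:
        tf[tag].update(set(words))
    return tf

pairs = [("php", ["echo", "echo", "print"]), ("php", ["echo"])]
tf1 = tf_by_occurrence(pairs)
tf2 = tf_by_document(pairs)
```

In this toy data "echo" counts 3 under the first variant and 2 under the second, so the second is less sensitive to a single document repeating one word.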
51. Results
TITLE: Upgrade iPhone 3GS from iOS 4.2.1 to 4.3.x
BODY: <p>Hi folks I have a iPhone 3GS at 4.2.1 and want to upgrade it to 4.3.x
for testing. I have read some articles about it but it seems that those are too old
and cannot work. </p> <p>Does anyone have some experience in doing this or
does apple provide tutorials for developers in this?</p> <p>A lot of thanks.</p>
Actual tags
● iphone
● ios
● upgrade
Predicted tags
● ios4.3
● iphone-3gs
● cocoa-touch
● ios4
● upgrade
52. Results
TITLE: Is it possible to display an image in text field in html?
BODY: <p>Can we display image inside a text field in <code>html</code>?
</p> <p><strong>Edit</strong></p> <p>What I want to do is to have an
<code>editable</code> area, and want to add <code>html</code> objects
inside it(i.e. button, image ..etc)</p>
Actual tags
● javascript
● jquery
● html
● css
● web
Predicted tags
● html
● img
● alignment
● get
● web
55. Results
most relevant words for tag “python”:
[u'python', u'def', u'import', u'print', u'module', u'file', u'self', … ]
most relevant words for tag “math”:
[u'vector', u'x', u'math', u'calculate', u'number', u'mathematical', u'example',
u'matlab', ... ]
56. Adjusting
What can be adjusted?
● Pretty much anything!
● I tried playing with: tf, idf, tag_frequency,
normalization, cleaning text, etc.
57. Conclusion
● Adjusting the metrics to get the right model
can be time consuming (many things can be
adjusted)!
● But still, the Naive Bayes algorithm is well suited
to the keyword extraction problem (and text
classification in general), because of how tf-idf
is defined.