1) The document discusses using collaborative tagging to design new web items that attract the maximum number of desirable tags. The problem is defined as predicting a new item's attribute values so as to maximize the desirable tags it receives.
2) It proposes two algorithms: an Exact Two-Tier Top-K algorithm (ETT) with worst-case exponential running time, and a Polynomial Time Approximation Scheme (PTAS) that runs in polynomial time with provable error bounds.
3) Experiments test the algorithms on synthetic and real camera datasets, analyzing performance quantitatively and using a user study to assess the results qualitatively.
1. Leveraging Collaborative Tagging for Web Item Design
Mahashweta Das, Gautam Das, Vagelis Hristidis
Presenter: Ajith C Ajjarani [1000-727269]
1/15/2012
2. Outline: Organization of Presentation
• Motivation & Problem Definition
• Naïve Bayes Classifier
• Tag Maximization: NP-Complete
• Moderate Instances: Exact Two-Tier Top-K Algorithm
• Larger Instances: Approximation Algorithm
• Experiment & Result Tabulation
3. Motivation
"Can I design a new camera which attracts & maximizes the tags?"
"Let's define this opportunity as a problem!"
4. Problem Construction
Training data:
• Attributes are the product definition
• Tags are user-defined
Now, given a subset of subjective "desired" tags, predict a new item (a combination of attribute values).
Extend this to a "top-k" version: find the k potential items with the highest expected number of desirable tags.
5. Problem Statement
• Given a database of tagged products, the task is to design k new products (i.e., their attribute values) that are likely to attract the maximum number of desirable tags
– tag-desirability is just one aspect of product design consideration
• Applications
– electronics, autos, apparel
– musical artists, bloggers
[Figure: example camera attributes — Zoom? Flash? Resolution? Light Sensitivity? Shooting mode?]
6. Tag Maximization
Technically challenging, as complex dependencies exist between tags and items: it is difficult to determine a combination of attribute values that maximizes the expected number of desirable tags.
A "Naïve Bayes" classifier is used for tag prediction.
Even for this classifier (with its simplistic conditional-independence assumption), the tag maximization problem is NP-Complete.
The researchers have NOT resorted to heuristics; they developed principled algorithms.
7. Proposed Solution
Exact Two-Tier Top-K Algorithm (ETT): performs significantly better than the naïve brute-force algorithm (no need to compute all possible products). It applies Rank-Join and the TA top-k algorithm in a two-tier architecture; in the worst case it may still have exponential running time.
Approximation Algorithm (Polynomial Time Approximation Scheme) for large datasets: comes with provable error bounds. Its overall running time is exponential only in the (constant) size of the attribute groups, and can be reduced to polynomial time complexity.
8. Problem Framework: Boolean Dataset
• D = {o1, o2, ..., on} — the items
• A = {A1, A2, ..., Am} — the Boolean attributes
• T = {T1, T2, ..., Tr} — the tags
Each item is thus a Boolean vector of size (m + r).
• Such a dataset is used as a training set to build Naïve Bayes Classifiers (NBC) and compute P(Tag | Attributes).
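The statistics an NBC needs from such a Boolean dataset can be sketched as follows. This is an illustrative implementation, not the paper's code; the function name `nb_stats` and the Laplace smoothing are my assumptions.

```python
def nb_stats(items, num_attrs, num_tags):
    """items: list of 0/1 vectors of length num_attrs + num_tags
    (attributes first, then tags). Returns per-tag priors Pr(T_j)
    and conditionals Pr(A_i = 1 | T_j), with Laplace smoothing."""
    n = len(items)
    priors = []
    cond = []  # cond[j][i] = Pr(A_i = 1 | T_j)
    for j in range(num_tags):
        tagged = [o for o in items if o[num_attrs + j] == 1]
        priors.append((len(tagged) + 1) / (n + 2))  # smoothed prior
        row = []
        for i in range(num_attrs):
            ones = sum(o[i] for o in tagged)
            row.append((ones + 1) / (len(tagged) + 2))  # smoothed conditional
        cond.append(row)
    return priors, cond
```

With three items, two attributes and one tag, `nb_stats([[1,0,1],[1,1,1],[0,0,0]], 2, 1)` yields a prior of 0.6 for the tag and conditionals 0.75 and 0.5 for the two attributes.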
9. Derived Results
The probability that a new item o = (a1, ..., am) is annotated by the tag Tj (by Bayes' rule under the conditional-independence assumption; the slide's equation images are reconstructed here):
Pr(Tj | o) ∝ Pr(Tj) · ∏i Pr(ai | Tj)
The probability Pr(Tj′ | o) of an item o not having tag Tj is the analogous product over the complement class:
Pr(Tj′ | o) ∝ Pr(Tj′) · ∏i Pr(ai | Tj′)
10. Derived Results
For convenience, define the odds ratio Rj = Pr(Tj | o) / Pr(Tj′ | o), so that Pr(Tj | o) = Rj / (1 + Rj).
For a set of desirable tags Td = {T1, ..., Tz} ⊆ T, the expected number of desirable tags a new item o is annotated with is, by linearity of expectation:
E[#desirable tags] = Σ_{j=1..z} Pr(Tj | o)
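The derived quantities above can be computed directly. A minimal sketch, assuming the standard Naïve Bayes odds-ratio form (function names are illustrative):

```python
def tag_prob(o, prior_j, cond_j, cond_not_j):
    """Pr(T_j | o) for a Boolean item o under Naive Bayes.
    Odds ratio R_j = Pr(T_j) prod_i Pr(a_i|T_j) /
                     (Pr(T_j') prod_i Pr(a_i|T_j')),
    then Pr(T_j | o) = R_j / (1 + R_j)."""
    num = prior_j
    den = 1 - prior_j
    for i, a in enumerate(o):
        num *= cond_j[i] if a else (1 - cond_j[i])
        den *= cond_not_j[i] if a else (1 - cond_not_j[i])
    r = num / den
    return r / (1 + r)

def expected_desirable_tags(o, tag_params):
    # Linearity of expectation: the expected number of desirable tags
    # is the sum of the per-tag probabilities Pr(T_j | o).
    return sum(tag_prob(o, p, c, cn) for (p, c, cn) in tag_params)
```

For example, with one attribute set to 1, a tag with prior 0.5, Pr(a=1|T)=0.8 and Pr(a=1|T′)=0.2 gives Pr(T|o) = 0.4/0.1 / (1 + 4) = 0.8.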
11. Exact Algorithm
• Naïve brute-force
– Considers all 2^m possible products and computes the expected score for each
– Exponential complexity
• Exact Two-Tier Top-K (ETT)
– Applies Rank-Join and the TA top-k algorithm in a two-tier architecture
– Does not need to compute all possible products
• performs significantly better than naïve brute-force
– Works well for moderate data instances, but does not scale to larger data
• in the worst case, may have exponential running time
12. ETT: Two-Tier Architecture
Tier 1: determine the "best" partial items for each tag (T1, T2, ..., Tz).
Tier 2: match these items to compute the global best product across all tags.
z = number of desirable tags; the attributes are partitioned into l groups of m′ = m / l attributes each.
13. ETT Algorithm (Exemplification)
• Database: attributes {A1, A2, A3, A4}, tags {T1, T2}, top-1 query
– Partition the attributes into 2 groups {A1, A2} and {A3, A4} to form 2 lists of partial products
– Each list has 2^m′ = 2^2 = 4 entries (partial products)
– Run the NBC to compute a score for each partial product for each tag, and sort each list in descending order
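The example above can be sketched in code. This is a simplified illustration, not the paper's ETT: it scores the two lists of partial products and joins them exhaustively, whereas the real ETT sorts each list and uses Rank-Join / TA thresholds to stop early. It relies on per-tag scores decomposing additively across attribute groups (as log-odds do under Naïve Bayes); the `score` callback is an assumed stand-in.

```python
from itertools import product

def top1_by_join(score):
    """score(partial, tag) -> additive score contribution of a partial
    product for one tag. Returns the best full product and its total."""
    group1 = list(product([0, 1], repeat=2))  # partial products over A1, A2
    group2 = list(product([0, 1], repeat=2))  # partial products over A3, A4
    tags = [0, 1]                             # T1, T2
    best, best_score = None, float("-inf")
    for p1 in group1:          # a real ETT would walk the sorted lists
        for p2 in group2:      # and prune with a TA-style threshold
            s = sum(score(p1, t) + score(p2, t) for t in tags)
            if s > best_score:
                best, best_score = p1 + p2, s
    return best, best_score
```

With a toy score that simply counts set bits, the top-1 product is all ones.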
18. Approximation Algorithm
• The z desirable tags are split into z/z′ subgroups of z′ tags each (T1, ..., Tz′; Tz′+1, ..., T2z′; ...).
• Each subgroup is solved using a PTAS in polynomial time, defined for an approximation factor ε = 2σm, where σ is the compression factor.
• The top-k items of each subgroup (O11, ..., O1k; O21, ..., O2k; ...) are then combined into the overall top-k items O1, ..., Ok.
19. PTAS Algorithm Design
Consider the sub-problem with k = 1, a single subgroup (z = z′ tags T1, ..., Tz′), and ε > 0.
The PTAS should run in polynomial time and return the top-1 item for this subgroup, satisfying the invariant:
ExactScore(Oa) ≥ (1 − ε) · ExactScore(Og)
where Oa is the PTAS-returned item and Og is the optimal item.
20. PTAS Algorithm Design
A simple exponential-time exact top-1 algorithm for the sub-problem is created first, and then reduced to a PTAS.
Given m Boolean attributes and z′ tags, the exponential-time algorithm makes m iterations:
• Initial step: produces the set S0 consisting of the single item {0^m}, along with its z′ scores, one for each tag.
• First iteration: produces the set of two items S1 = {0^m, 1·0^(m−1)}, each accompanied by its z′ scores, one for each tag.
• i-th iteration: produces the set of items Si = {0,1}^i × {0}^(m−i), along with their z′ scores, one for each tag.
• The final set Sm contains all 2^m items along with their exact scores, from which the top-1 item can be returned.
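The iterative construction above can be sketched as follows. This is a hedged illustration: scores come from an assumed per-bit additive contribution table, and the optional `compress` hook stands in for the PTAS's compression step (merging items with near-equal score vectors so that each S_i stays polynomial in size); the real algorithm's details differ.

```python
def iterate_items(m, contrib, compress=None):
    """contrib[i][b] = additive score contribution of setting bit i to b.
    Builds S_0, S_1, ..., S_m by extending every item with a 0 and a 1 bit,
    then returns the top-1 item. Without compression this enumerates all
    2^m items (the exact exponential algorithm)."""
    S = [((), 0.0)]  # S_0: the empty prefix with score 0
    for i in range(m):
        nxt = []
        for bits, s in S:
            for b in (0, 1):
                nxt.append((bits + (b,), s + contrib[i][b]))
        if compress is not None:
            nxt = compress(nxt)  # PTAS idea: merge near-equal scores
        S = nxt
    return max(S, key=lambda x: x[1])
```

For m = 2 with contributions [[0.1, 0.9], [0.5, 0.2]], the best item sets bit 0 and clears bit 1, scoring 0.9 + 0.5 = 1.4.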
21. PTAS Algorithm Design
Worked example (the slide's table is not reproduced in this transcription):
z = z′ = 2, σ = 0.5, m = 4, ε = 2σm = 4
Optimal item Og = {1110}, with exact score 1.77 = 0.89 + 0.88
PTAS-returned item Oa = {1111}, with exact score 1.75 = 0.82 + 0.93
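A quick arithmetic check of the worked example, confirming that the PTAS-returned item's score is within about 1% of the optimum:

```python
# Per-tag scores from the slide's example
og_score = 0.89 + 0.88  # optimal item Og = 1110, total 1.77
oa_score = 0.82 + 0.93  # PTAS item   Oa = 1111, total 1.75
ratio = oa_score / og_score  # observed approximation ratio, about 0.989
```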
23. Experiment
Synthetic and real datasets are used for quantitative and qualitative analysis of the proposed algorithms.
Quantitative performance indicators:
• efficiency of the proposed exact and approximation algorithms
• the approximation factor actually obtained by the approximation algorithm
Qualitative assessment:
• an Amazon Mechanical Turk user study to assess the results of the algorithms
24. Experiment
Real camera dataset:
• Crawled a real dataset of 100 cameras listed at Amazon.
• Each listed camera contains technical details (attributes) and the tags customers associate with it.
• The tags are sanitized to remove synonyms and unintelligible or undesirable tags such as "Nikon coolpix", "quali", "bad", etc.
Synthetic dataset:
• A Boolean matrix of dimension 10,000 (items) × 100 (50 attributes + 50 tags)
• The 50 independently distributed attributes are split into 4 groups, where the value is set to 1 with probability 0.75, 0.15, 0.10, and 0.05 respectively
• The 50 tags have predefined relations, formed by randomly picking a set of correlated attributes for each tag
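A generator matching this description can be sketched as follows. The group sizes (roughly equal), the number of correlated attributes per tag, and the majority rule linking tags to attributes are my assumptions; the slide does not specify them.

```python
import random

def make_synthetic(n_items=10_000, seed=0):
    """Boolean matrix of n_items x (50 attributes + 50 tags)."""
    rng = random.Random(seed)
    # 4 groups of attributes with P(value = 1) of 0.75 / 0.15 / 0.10 / 0.05
    probs = [0.75] * 13 + [0.15] * 13 + [0.10] * 12 + [0.05] * 12
    # each tag is correlated with a randomly chosen set of 3 attributes
    tag_attrs = [rng.sample(range(50), 3) for _ in range(50)]
    rows = []
    for _ in range(n_items):
        attrs = [1 if rng.random() < p else 0 for p in probs]
        # assumed rule: a tag fires when a majority of its attributes are set
        tags = [1 if sum(attrs[i] for i in ta) >= 2 else 0 for ta in tag_attrs]
        rows.append(attrs + tags)
    return rows
```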
25. Quantitative: Performance
Exact algorithm:
• Synthetic dataset with 1,000 items, 16 attributes, and 8 tags
[Figure: running time, Naïve vs. ETT]
26. Quantitative: Performance
The figure reveals that ETT becomes extremely slow beyond m = 16 attributes, while the approximation algorithm (PA) with approximation factor 0.5 continues to return guaranteed results in reasonable time as the number of attributes m increases.
27. Quantitative: Performance
[Figure: execution time and obtained approximation factor]
Synthetic dataset with 1,000 items, 20 attributes, and 8 tags; the top-1 item is considered.
28. Qualitative: User Study
First part of the user study:
• The PA algorithm, with approximation factor ε = 0.5, is run on tag sets corresponding to compact cameras and SLR cameras respectively.
• The 4 new cameras it builds (2 digital compact and 2 digital SLR) are compared against 4 existing popular cameras.
• 65% of users choose the new cameras.
29. Qualitative: User Study
Second part of the study:
• 6 new cameras are designed for three groups, 2 potential new cameras for each:
1. young students
2. old retired people
3. professional photographers
• When users are asked to assign at least five tags, the majority of them correctly classify the six cameras into the three groups.
30. Conclusion
• Defines the Tag Maximization problem and investigates its computational complexity.
• Proposes 2 novel algorithms and shows their practicability.
• This work is a preliminary look at a novel area of research and promises exciting directions for future research.
• Future work: use decision trees, SVMs, and regression-tree classifiers and repeat the experiments.
Tag Maximization problem: decide the attribute values of new items (Ox) and return the top-k "best" items that are likely to attract the maximum number of desirable tags.