Leveraging Collaborative Tagging
      for Web Item Design

     Mahashweta Das, Gautam Das, Vagelis Hristidis


       Presenter : Ajith C Ajjarani
            [1000-727269]
                                      1/15/2012
                                                  1
Outline: Organization of the Presentation
• Motivation & Problem Definition
• Naïve Bayes Classifier
• Tag Maximization: NP-Complete
   – Moderate instances: Exact Two-Tier Top-K Algorithm
   – Larger instances: Approximation Algorithm
• Experiments & Tabulated Results

Motivation




            Can I design a new camera that attracts the
            maximum number of desirable tags?
            Let's define this opportunity as a problem!

Problem Construction
(Figure: training data)

• Attributes define the product
• Tags are user-defined

   Given a subset of subjective "desired" tags, predict a
   new item (a combination of attribute values).
   Extend this to a "top-k" version: the k potential items with the
   highest expected number of desirable tags.

Problem Statement
• Given a database of tagged products, the task is to design k new
  products (attribute values) that are likely to attract the maximum
  number of desirable tags
   – tag desirability is just one aspect of product design
• Applications
   – electronics, autos, apparel
   – musical artists, bloggers
• Example design questions for a camera
   – Zoom? Flash? Resolution? Light sensitivity? Shooting mode?

Tag Maximization
• Technically challenging, because complex dependencies exist between
  tags and items
• It is difficult to determine a combination of attribute values that
  maximizes the expected number of desirable tags
• A Naïve Bayes classifier is used for tag prediction
• Even for this classifier (with its simplistic conditional-independence
  assumption), the tag maximization problem is NP-complete
• The authors do not resort to heuristics; they develop principled algorithms

Proposed Solution

• Exact Two-Tier Top-K algorithm (ETT), for moderate instances
   – Performs significantly better than the naïve brute-force algorithm
     (no need to compute all possible products)
   – Applies the Rank-Join and TA top-k algorithms in a two-tier architecture
   – In the worst case, may have exponential running time

• Approximation algorithm, for large datasets
   – Built on a polynomial-time approximation scheme (PTAS) with
     provable error bounds
   – The overall running time is exponential only in the (constant)
     size of the tag groups, which keeps it polynomial overall

Problem Framework
(Figure: Boolean dataset)

• D = {o1, o2, ..., on}: the n items
• A = {A1, A2, ..., Am}: the m Boolean attributes
• T = {T1, T2, ..., Tr}: the r Boolean tags

• Each item is thus a Boolean vector of size (m + r)
• E.g.: (Figure: example dataset)

• Such a dataset is used as the training set to build
  Naïve Bayes classifiers (NBC) and compute Pr(Tag | Attributes)
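
A minimal sketch of how such a classifier could be trained on a Boolean dataset and queried for Pr(Tag | Attributes). The function names, the Laplace smoothing parameter `alpha`, and the toy rows are assumptions made for illustration; none of them come from the slides.

```python
from typing import List, Tuple

Row = Tuple[List[int], List[int]]  # (attribute bits, tag bits)


def train_nbc(rows: List[Row], tag: int, alpha: float = 1.0):
    """Estimate Pr(T), Pr(A_i=1 | T) and Pr(A_i=1 | not T) for one tag.

    alpha is Laplace smoothing (an assumption, not taken from the slides).
    """
    m = len(rows[0][0])
    with_tag = [attrs for attrs, tags in rows if tags[tag] == 1]
    without = [attrs for attrs, tags in rows if tags[tag] == 0]
    prior = (len(with_tag) + alpha) / (len(rows) + 2 * alpha)

    def cond(subset):
        # Smoothed frequency of A_i = 1 inside the subset, for every attribute.
        return [(sum(a[i] for a in subset) + alpha) / (len(subset) + 2 * alpha)
                for i in range(m)]

    return prior, cond(with_tag), cond(without)


def prob_tag(item: List[int], model) -> float:
    """Pr(T | item) under the conditional-independence assumption."""
    prior, p_yes, p_no = model
    like_yes, like_no = prior, 1.0 - prior
    for bit, py, pn in zip(item, p_yes, p_no):
        like_yes *= py if bit else 1.0 - py
        like_no *= pn if bit else 1.0 - pn
    return like_yes / (like_yes + like_no)


# Toy training set with 2 attributes and 1 tag (hypothetical data).
data = [([1, 0], [1]), ([1, 1], [1]), ([0, 1], [0]), ([0, 0], [0])]
model = train_nbc(data, tag=0)
print(prob_tag([1, 0], model))
```

One model is trained per tag, so a dataset with r tags yields r independent classifiers.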

Derived Results
• Pr(Tj | o): the probability that a new item o is annotated
  by the tag Tj
• Pr(Tj' | o): the probability that an item o does not have
  the tag Tj
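
The formulas for these two probabilities did not survive extraction. Under the naïve Bayes conditional-independence assumption they would take the standard form below, where $a_i$ denotes the value of attribute $A_i$ in item $o$ (this notation is assumed here, not copied from the slide):

```latex
\Pr(T_j \mid o) = \frac{\Pr(T_j)\,\prod_{i=1}^{m} \Pr(a_i \mid T_j)}{\Pr(o)},
\qquad
\Pr(T_j' \mid o) = \frac{\Pr(T_j')\,\prod_{i=1}^{m} \Pr(a_i \mid T_j')}{\Pr(o)}
```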




Derived Results
• Derived: a closed form for Pr(Tj | o), using the shorthand Rj
  introduced for convenience
• Expected number of desirable tags Td = {T1, ..., Tz} ⊆ T
  that a new item o is annotated with
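
The derived formulas are missing from the extracted slide. One consistent reading, assuming the standard naïve Bayes posteriors and writing $a_i$ for the value of attribute $A_i$ in item $o$ (assumed notation), uses the fact that $\Pr(T_j \mid o) + \Pr(T_j' \mid o) = 1$, so the common denominator $\Pr(o)$ cancels:

```latex
R_j = \frac{\Pr(T_j' \mid o)}{\Pr(T_j \mid o)}
    = \frac{\Pr(T_j')}{\Pr(T_j)} \prod_{i=1}^{m} \frac{\Pr(a_i \mid T_j')}{\Pr(a_i \mid T_j)},
\qquad
\Pr(T_j \mid o) = \frac{1}{1 + R_j}

E(o, T_d) = \sum_{j=1}^{z} \Pr(T_j \mid o) = \sum_{j=1}^{z} \frac{1}{1 + R_j}
```

The last line follows from linearity of expectation: each desirable tag contributes its own probability of being attached to $o$.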
Exact Algorithm
• Naïve brute force
   – Consider all 2^m possible products and compute the expected number
     of desirable tags for each
   – Exponential complexity


• Exact two-tier top-k (ETT)
   – Applies the Rank-Join and TA top-k algorithms in a two-tier
     architecture
   – Does not need to compute all possible products
       • Performs significantly better than naïve brute force
   – Works well for moderate data instances but does not scale to larger data
       • In the worst case, may have exponential running time
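
The brute-force baseline can be sketched as follows. The callables in `tag_probs` stand in for Pr(Tj | item) and are hypothetical; any trained classifier could take their place. The loop visits all 2^m items, which is why this approach is only workable for small m.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Item = Tuple[int, ...]


def brute_force_top_k(m: int, k: int,
                      tag_probs: Sequence[Callable[[Item], float]]
                      ) -> List[Tuple[float, Item]]:
    """Score all 2^m Boolean items and return the k best.

    tag_probs[j](item) stands in for Pr(T_j | item).
    """
    scored = []
    for item in product((0, 1), repeat=m):
        expected = sum(p(item) for p in tag_probs)  # expected number of tags
        scored.append((expected, item))
    scored.sort(reverse=True)
    return scored[:k]


# Hypothetical tag models over m = 3 attributes (not from the slides).
tag_probs = [
    lambda o: 0.9 if o[0] and o[1] else 0.2,   # tag 1 favors A1 and A2 together
    lambda o: 0.8 if o[2] == 0 else 0.3,       # tag 2 favors A3 = 0
]
best = brute_force_top_k(m=3, k=1, tag_probs=tag_probs)
print(best)
```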
ETT: Two-Tier Architecture
• Tier 1: determine the "best" item for each
  tag (T1, T2, ..., Tz)
• Tier 2: match these items to compute the
  global best product across all tags

• z: number of desirable tags
• m' = m / l: attributes per group when the m attributes are split into l groups
ETT Algorithm (Worked Example)
• Database: attributes {A1, A2, A3, A4}, tags {T1, T2}, top-1
     – Partition the attributes into 2 groups, {A1, A2} and {A3, A4}, to form 2 lists of
       partial products
     – Each list has 2^m' = 2^2 = 4 entries (partial products)
     – Run the NBC to compute a score for each partial product for each tag,
       and sort in descending order
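
A sketch of how the sorted lists of partial products might be built for one tag. The scoring formula on the slide was lost in extraction, so the score used here (the product of per-attribute likelihood ratios within a group) is an assumption, as are the function name and the input values.

```python
from itertools import product
from math import prod
from typing import List, Tuple

Entry = Tuple[Tuple[int, ...], float]


def build_lists(ratio_one: List[float], ratio_zero: List[float],
                group_size: int) -> List[List[Entry]]:
    """Return one descending-sorted list of (partial product, score) per group.

    ratio_one[i]  = Pr(A_i=1 | T) / Pr(A_i=1 | not T)   (assumed scoring)
    ratio_zero[i] = Pr(A_i=0 | T) / Pr(A_i=0 | not T)
    """
    m = len(ratio_one)
    lists = []
    for start in range(0, m, group_size):
        idx = range(start, min(start + group_size, m))
        entries = []
        for bits in product((0, 1), repeat=len(idx)):
            score = prod(ratio_one[i] if b else ratio_zero[i]
                         for i, b in zip(idx, bits))
            entries.append((bits, score))
        entries.sort(key=lambda e: e[1], reverse=True)
        lists.append(entries)
    return lists


# Hypothetical likelihood ratios for m = 4 attributes and one tag.
lists = build_lists([2.0, 0.5, 3.0, 1.5], [0.5, 1.5, 0.4, 0.8], group_size=2)
for entries in lists:
    print(entries)
```

With m = 4 and groups of size 2, this yields 2 lists of 4 entries each, matching the counts in the example.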
ETT: initial state (all tables start empty)

Tier 2
       Buffer columns: Product, Complete Score
       Output: Top-K()
       MUS: sum of the last-seen scores from all GetNext() calls

Tier 1 (one pair of sorted lists per tag; entries are "partial product, score")

                    T1                                T2
       L11 (A1 A2)     L12 (A3 A4)       L21 (A1 A2)     L22 (A3 A4)
       10, 1.97        10, 1.97          11, 2.76        11, 4.57
       00, 0.84        00, 0.84          01, 1.18        10, 2.53
       11, 0.84        11, 0.84          10, 1.18        01, 0.91
       01, 0.36        01, 0.36          00, 0.51        00, 0.51

       Join table columns (per tag): Join, Product, Partial Score, MPFS
       GetNext() (per tag): returns the next product with its actual (complete) score

Iteration 1

Tier 1 (Rank Join per tag over the sorted lists L11, L12 for T1 and L21, L22 for T2)

       Tag   Join   Product   Partial Score        MPFS
       T1    1      1010      0.95            >=   0.95
       T2    1      1111      0.93            >=   0.93

       T1: GetNext() = 1010
       T2: GetNext() = 1111

Tier 2

       Buffer:   Product   Complete Score
                 1111      1.75
                 1010      1.70

       MinK (1.75) <= MUS (1.88): return to Tier 1

Iteration 2

Tier 1 (Rank Join per tag over the sorted lists L11, L12 for T1 and L21, L22 for T2)

       Tag   Join   Product   Partial Score        MPFS
       T1    1      1010      0.95                 0.95
       T1    2      1011      0.92            >=   0.92
       T2    1      1111      0.93                 0.93
       T2    2      1110      0.88            >=   0.88

       T1: GetNext() = 1011
       T2: GetNext() = 1110

Tier 2

       Buffer:   Product   Complete Score
                 1110      1.77
                 1011      1.76

       MinK (1.77) <= MUS (1.79): return to Tier 1

Iteration 3

Tier 1 (Rank Join per tag over the sorted lists L11, L12 for T1 and L21, L22 for T2)

       Tag   Join   Product   Partial Score        MPFS
       T1    1      1010      0.95                 0.95
       T1    2      1011      0.92                 0.92
       T1    3      0010      0.89            >=   0.89
       T2    1      1111      0.93                 0.93
       T2    2      1110      0.88                 0.88
       T2    3      0111      0.84            >=   0.84

       T1: GetNext() = 0010
       T2: GetNext() = 0111

Tier 2

       Buffer:   Product   Complete Score
                 0111      1.77
                 0010      1.76

       MinK (1.77) > MUS (1.74): ETT terminates

Thus, ETT returns the best item (0111 or 1110) in just 6 item look-ups.
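
The Tier 2 stopping rule (stop once MinK is at least MUS) follows the pattern of the Threshold Algorithm (TA). Below is a minimal TA-style sketch, not the authors' ETT implementation: it assumes each tag exposes a complete table of per-product scores, whereas ETT obtains products through Rank Join in Tier 1. The four-product tables are hypothetical; their totals echo some complete scores from the example, but the per-tag split is assumed.

```python
import heapq
from typing import Dict, List, Tuple


def threshold_top_k(per_tag: List[Dict[str, float]], k: int
                    ) -> Tuple[List[Tuple[float, str]], int]:
    """TA-style top-k over one score table per tag.

    Assumes every product appears in every table. Returns the top-k
    (complete score, product) pairs and the number of sorted accesses.
    """
    streams = [sorted(t.items(), key=lambda kv: kv[1], reverse=True)
               for t in per_tag]
    top: List[Tuple[float, str]] = []   # min-heap holding the best k so far
    seen = set()
    lookups = 0
    for depth in range(len(streams[0])):
        last_seen = []
        for stream in streams:
            name, score = stream[depth]          # sorted access, like GetNext()
            lookups += 1
            last_seen.append(score)
            if name not in seen:
                seen.add(name)
                complete = sum(t[name] for t in per_tag)   # random access
                heapq.heappush(top, (complete, name))
                if len(top) > k:
                    heapq.heappop(top)
        mus = sum(last_seen)    # upper bound on the score of any unseen product
        if len(top) == k and top[0][0] >= mus:
            break               # MinK >= MUS: no unseen product can do better
    return sorted(top, reverse=True), lookups


# Hypothetical per-tag scores for four products.
t1 = {"1010": 0.95, "1011": 0.92, "1110": 0.89, "1111": 0.82}
t2 = {"1111": 0.93, "1110": 0.88, "1011": 0.84, "1010": 0.75}
result, lookups = threshold_top_k([t1, t2], k=1)
print(result, lookups)
```

The min-heap keeps only the current best k products, so its smallest element plays the role of MinK.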
Approximation Algorithm
• The z desirable tags are split into z/z' subgroups of z' tags each
• Each subgroup is solved using the PTAS, in polynomial time for a given
  approximation factor ε, giving the top-k items of that subgroup:
  (O11, O12, ..., O1k), (O21, O22, ..., O2k), ...
• The overall top-k items (O1, O2, ..., Ok) are derived from the subgroup results
• ε = 2σm, where σ is the compression factor
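
The grouping scheme can be sketched as below. Two things here are assumptions rather than facts from the slides: `solve_group` uses brute force as a stand-in for the PTAS, and the pooled candidates are rescored against all z tags to pick the overall top-k. The `linear_tag` models are hypothetical.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Item = Tuple[int, ...]
TagProb = Callable[[Item], float]


def solve_group(m: int, k: int, tags: Sequence[TagProb]) -> List[Item]:
    """Stand-in for the per-group solver (the slides use a PTAS here).

    Brute force keeps the sketch short; it is only viable for small m.
    """
    items = sorted(product((0, 1), repeat=m),
                   key=lambda o: sum(p(o) for p in tags), reverse=True)
    return items[:k]


def grouped_top_k(m: int, k: int, tags: Sequence[TagProb],
                  group_size: int) -> List[Item]:
    """Solve each tag subgroup, pool the winners, rescore on all tags."""
    candidates = set()
    for start in range(0, len(tags), group_size):
        candidates.update(solve_group(m, k, tags[start:start + group_size]))
    ranked = sorted(candidates,
                    key=lambda o: sum(p(o) for p in tags), reverse=True)
    return ranked[:k]


def linear_tag(base: float, weights: Sequence[float]) -> TagProb:
    """Hypothetical tag model: base + weighted sum of attribute bits."""
    return lambda o: base + sum(w * b for w, b in zip(weights, o))


tags = [
    linear_tag(0.1, (0.5, 0.2, 0.1)),
    linear_tag(0.1, (0.3, 0.4, -0.05)),
    linear_tag(0.5, (-0.2, 0.1, 0.3)),
    linear_tag(0.4, (0.1, -0.3, 0.2)),
]
overall = grouped_top_k(m=3, k=1, tags=tags, group_size=2)
print(overall)
```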
PTAS Algorithm Design
• Setting: k = 1 and a single subgroup (z = z' tags T1, T2, ..., Tz'), with ε > 0
• Oa: the item returned by the PTAS
• Og: the optimal item
• The PTAS should run in polynomial time and maintain the invariant
     ExactScore(Oa) >= (1 − ε) · ExactScore(Og)

PTAS Algorithm Design
• A simple exponential-time exact top-1 algorithm for the sub-problem is built first
  and then reduced to a PTAS
• Given m Boolean attributes and z' tags,
  the exponential-time algorithm makes m iterations
   – Initial step: produce the set S0 consisting of the single item {0^m} along with
     its z' scores, one for each tag
   – First iteration: produce the set of two items S1 = {0^m, 10^(m−1)},
     each accompanied by its z' scores, one for each tag
   – i-th iteration: produce the set of items
     Si = {0, 1}^i × 0^(m−i), along with their z' scores, one for each tag
   – The final set Sm contains all 2^m items along with their exact scores, from which
     the top-1 item can be returned
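
The iterative construction S0, S1, ..., Sm can be sketched as follows. Carrying a per-tag ratio R_j with each item and scoring a tag as 1 / (1 + R_j) is an assumption based on the naïve Bayes setting; `r_zero`, `flip`, and the sample numbers are hypothetical.

```python
from typing import Dict, List, Tuple

Item = Tuple[int, ...]


def enumerate_items(r_zero: List[float],
                    flip: List[List[float]]) -> Dict[Item, List[float]]:
    """Grow S_0 = {0^m} into S_m (all 2^m items), one attribute per iteration.

    r_zero[j]  : ratio R_j of the all-zero item for tag j (assumed given)
    flip[j][i] : factor applied to R_j when attribute i changes from 0 to 1
    """
    m = len(flip[0])
    current = {(0,) * m: list(r_zero)}           # S_0
    for i in range(m):                           # builds S_(i+1) from S_i
        grown = dict(current)
        for item, ratios in current.items():
            flipped = item[:i] + (1,) + item[i + 1:]
            grown[flipped] = [r * flip[j][i] for j, r in enumerate(ratios)]
        current = grown
    return current


def expected_tags(ratios: List[float]) -> float:
    """Sum of per-tag scores 1 / (1 + R_j)."""
    return sum(1.0 / (1.0 + r) for r in ratios)


def exact_top1(r_zero: List[float],
               flip: List[List[float]]) -> Tuple[Item, float]:
    items = enumerate_items(r_zero, flip)
    best = max(items, key=lambda o: expected_tags(items[o]))
    return best, expected_tags(items[best])


# Hypothetical input: m = 2 attributes, 1 tag.
items = enumerate_items([1.0], [[0.5, 2.0]])
best_item, best_score = exact_top1([1.0], [[0.5, 2.0]])
print(len(items), best_item, best_score)
```

Each iteration doubles the set, so the final set has 2^m entries; the PTAS avoids this growth by compressing the set as it goes.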

PTAS Algorithm Design
(Figure: table of items and scores)

Example parameters:
• z = z' = 2
• σ = 0.5
• m = 4
• ε = 2σm = 4

Optimal item:         Og = {1110}, score 1.77 = 0.89 + 0.88
PTAS-returned item:   Oa = {1111}, score 1.75 = 0.82 + 0.93
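
As a quick check on the two scores above, the ratio of the returned item's score to the optimal item's score in this example is:

```latex
\frac{\mathrm{Score}(O_a)}{\mathrm{Score}(O_g)} = \frac{1.75}{1.77} \approx 0.989
```

The returned item therefore falls short of the optimum by roughly 1.1% here.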
PTAS Algorithm Design
• The exact score of the item kept for a cluster (underlined in the figure)
  should be close to the exact score of each item deleted from that cluster
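
One common way to realize this kind of clustering is geometric bucketing of scores, as in classic trimming-based approximation schemes. The sketch below illustrates that general idea and is not confirmed to be the authors' exact rule; the function name, the bucket definition, and the sample scores are assumptions.

```python
import math
from typing import Dict, List, Tuple

Item = Tuple[int, ...]


def compress(scored: Dict[Item, List[float]],
             sigma: float) -> Dict[Item, List[float]]:
    """Keep one representative per cluster of items with similar scores.

    Two items share a cluster when, for every tag, their scores fall in the
    same geometric bucket [(1+sigma)^b, (1+sigma)^(b+1)). Scores must be > 0.
    """
    kept: Dict[Tuple[int, ...], Item] = {}
    for item, scores in scored.items():
        key = tuple(math.floor(math.log(s) / math.log(1.0 + sigma))
                    for s in scores)
        current = kept.get(key)
        # Keep the cluster member with the highest total score.
        if current is None or sum(scores) > sum(scored[current]):
            kept[key] = item
    return {item: scored[item] for item in kept.values()}


# Hypothetical per-tag scores for three items and two tags.
scored = {
    (0, 0): [0.50, 0.50],
    (0, 1): [0.52, 0.51],
    (1, 0): [0.90, 0.10],
}
compact = compress(scored, sigma=0.5)
print(compact)
```

Within a bucket, any two scores differ by less than a factor of (1 + σ), which is what keeps a deleted item's score close to its representative's.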




Experiment
• Synthetic and real datasets are used for quantitative and
  qualitative analysis of the proposed algorithms

• Quantitative performance indicators:
   – Efficiency of the proposed exact and approximation
     algorithms
   – Approximation factor actually obtained by
     the approximation algorithm

• Qualitative results:
   – An Amazon Mechanical Turk user study assesses the
     results of the algorithms

Experiment
• Real camera dataset
   – Crawled from Amazon: a real dataset of 100 listed cameras
   – Each listing contains technical details (attributes) and the tags
     customers associate with the camera
   – The tags are sanitized to remove synonyms, unintelligible tags, and
     undesirable tags such as "Nikon coolpix", "quali", "bad", etc.

• Synthetic dataset
   – Boolean matrix of dimension 10,000 (items) × 100 (50 attributes + 50 tags)
   – The 50 independently distributed attributes are split into 4 groups, in which the
     value is set to 1 with probability 0.75, 0.15, 0.10, and 0.05, respectively
   – The 50 tags follow predefined relations, built by randomly picking sets of
     attributes that are correlated
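
A sketch of how a synthetic dataset with this shape could be generated. The slides do not give the group sizes or the exact tag relations, so the near-equal split of attributes across the 4 groups and the rule "a tag is set when at least 2 of its 3 randomly chosen attributes are 1" are assumptions made only for illustration.

```python
import random
from typing import List


def make_synthetic(n_items: int = 10_000, n_attrs: int = 50,
                   n_tags: int = 50, seed: int = 0) -> List[List[int]]:
    """Return n_items Boolean rows of n_attrs attribute bits plus n_tags tag bits."""
    rng = random.Random(seed)
    probs = (0.75, 0.15, 0.10, 0.05)
    # Assumed split: attributes are divided as evenly as possible into 4 groups.
    attr_prob = [probs[i * len(probs) // n_attrs] for i in range(n_attrs)]
    # Assumed relation: each tag depends on 3 randomly picked attributes.
    tag_attrs = [rng.sample(range(n_attrs), 3) for _ in range(n_tags)]
    rows = []
    for _ in range(n_items):
        attrs = [1 if rng.random() < p else 0 for p in attr_prob]
        tags = [1 if sum(attrs[i] for i in chosen) >= 2 else 0
                for chosen in tag_attrs]
        rows.append(attrs + tags)
    return rows


rows = make_synthetic(n_items=1000)
print(len(rows), len(rows[0]))
```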
Quantitative: Performance
Exact algorithm:
• Synthetic dataset with 1000 items, 16 attributes, and 8 tags
(Figure: naïve vs. ETT)

Quantitative: Performance
• The figure shows that ETT becomes extremely slow beyond
  m = 16 attributes
• PA, the approximation algorithm, with approximation factor 0.5 continues to return
  guaranteed results in reasonable time as the number of attributes m increases

Quantitative: Performance
• Execution time and obtained approximation factor
• Synthetic dataset: 1000 items, 20 attributes, and 8 tags
• Only the top-1 item is considered

Qualitative: User Study
First part of the user study:

• The PA algorithm is run with approximation factor ε = 0.5 on the tag sets
  corresponding to compact cameras and SLR cameras, respectively
• It builds 4 new cameras (2 digital compact and 2 digital SLR), which are
  compared against 4 existing popular cameras
• 65% of users chose the new cameras

Qualitative: User Study
Second part of the study:

• 6 new cameras are designed for three groups, 2 potential new cameras per group:
   1. young students
   2. older retirees
   3. professional photographers
• Users were asked to assign at least five tags.
  Observation: the majority of users correctly classified the six cameras
  into the three groups

Conclusion
• Defines the Tag Maximization problem and investigates its
  computational complexity
• Proposes 2 novel algorithms and shows their practicality
• This work is a preliminary look at a novel area of
  research and promises exciting directions for future
  research
• Future work: repeat the experiments with other classifiers, such as decision
  trees, SVMs, and regression trees

References
http://crystal.uta.edu/~gdas/Courses/websitepages/fall10DBIR.html




Questions?

Thank You

Weitere ähnliche Inhalte

Andere mochten auch

Distribution and delivery
Distribution and deliveryDistribution and delivery
Distribution and delivery
Kyle Academy
 
Developing a food safetytraining program for volunteers larry ramdin and y...
Developing a food safetytraining program  for volunteers   larry ramdin and y...Developing a food safetytraining program  for volunteers   larry ramdin and y...
Developing a food safetytraining program for volunteers larry ramdin and y...
lramdin
 
Improving food safety
Improving food safetyImproving food safety
Improving food safety
lramdin
 
Developing a food safetytraining program for volunteers larry ramdin and y...
Developing a food safetytraining program  for volunteers   larry ramdin and y...Developing a food safetytraining program  for volunteers   larry ramdin and y...
Developing a food safetytraining program for volunteers larry ramdin and y...
lramdin
 
Hvordan kan man arbeide med lesing av fagtekster
Hvordan kan man arbeide med lesing av fagteksterHvordan kan man arbeide med lesing av fagtekster
Hvordan kan man arbeide med lesing av fagtekster
hkbr
 
Dcnpl hills vistaa Indore- A Premium Township near Super Corridor having 2 & ...
Dcnpl hills vistaa Indore- A Premium Township near Super Corridor having 2 & ...Dcnpl hills vistaa Indore- A Premium Township near Super Corridor having 2 & ...
Dcnpl hills vistaa Indore- A Premium Township near Super Corridor having 2 & ...
HVantage Technologies Inc. USA.
 
PrezentāCija Daba
PrezentāCija DabaPrezentāCija Daba
PrezentāCija Daba
pavasaris
 
Merdzenes pagasts
Merdzenes pagastsMerdzenes pagasts
Merdzenes pagasts
merdzene
 
Prezentācija daba
Prezentācija dabaPrezentācija daba
Prezentācija daba
pavasaris
 

Andere mochten auch (15)

Just in time pp
Just in time ppJust in time pp
Just in time pp
 
Distribution and delivery
Distribution and deliveryDistribution and delivery
Distribution and delivery
 
Developing a food safetytraining program for volunteers larry ramdin and y...
Developing a food safetytraining program  for volunteers   larry ramdin and y...Developing a food safetytraining program  for volunteers   larry ramdin and y...
Developing a food safetytraining program for volunteers larry ramdin and y...
 
Improving food safety
Improving food safetyImproving food safety
Improving food safety
 
Developing a food safetytraining program for volunteers larry ramdin and y...
Developing a food safetytraining program  for volunteers   larry ramdin and y...Developing a food safetytraining program  for volunteers   larry ramdin and y...
Developing a food safetytraining program for volunteers larry ramdin and y...
 
Hvordan kan man arbeide med lesing av fagtekster
Hvordan kan man arbeide med lesing av fagteksterHvordan kan man arbeide med lesing av fagtekster
Hvordan kan man arbeide med lesing av fagtekster
 
Dcnpl hills vistaa Indore- A Premium Township near Super Corridor having 2 & ...
Dcnpl hills vistaa Indore- A Premium Township near Super Corridor having 2 & ...Dcnpl hills vistaa Indore- A Premium Township near Super Corridor having 2 & ...
Dcnpl hills vistaa Indore- A Premium Township near Super Corridor having 2 & ...
 
En nationell infrastruktur för arkeologiska undersökningsdata
En nationell infrastruktur för arkeologiska undersökningsdataEn nationell infrastruktur för arkeologiska undersökningsdata
En nationell infrastruktur för arkeologiska undersökningsdata
 
PrezentāCija Daba
PrezentāCija DabaPrezentāCija Daba
PrezentāCija Daba
 
Merdzenes pagasts
Merdzenes pagastsMerdzenes pagasts
Merdzenes pagasts
 
Prezentācija daba
Prezentācija dabaPrezentācija daba
Prezentācija daba
 
Vs 13
Vs 13Vs 13
Vs 13
 
Vs 13
Vs 13Vs 13
Vs 13
 
Pavasaris
PavasarisPavasaris
Pavasaris
 
Cudderisback
CudderisbackCudderisback
Cudderisback
 

Ähnlich wie Leveraging collaborativetaggingforwebitemdesign ajithajjarani

Real-time Face Recognition & Detection Systems 1
Real-time Face Recognition & Detection Systems 1Real-time Face Recognition & Detection Systems 1
Real-time Face Recognition & Detection Systems 1
Suvadip Shome
 
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTKStatistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Olivier Grisel
 
Auto-Scaling Apache Spark cluster using Deep Reinforcement Learning.pdf
Auto-Scaling Apache Spark cluster using Deep Reinforcement Learning.pdfAuto-Scaling Apache Spark cluster using Deep Reinforcement Learning.pdf
Auto-Scaling Apache Spark cluster using Deep Reinforcement Learning.pdf
Kundjanasith Thonglek
 
Observations
ObservationsObservations
Observations
butest
 

Ähnlich wie Leveraging collaborativetaggingforwebitemdesign ajithajjarani (20)

Fast Optimization Intevac
Fast Optimization IntevacFast Optimization Intevac
Fast Optimization Intevac
 
R user group meeting 25th jan 2017
R user group meeting 25th jan 2017R user group meeting 25th jan 2017
R user group meeting 25th jan 2017
 
Optimization Intevac Aug23 7f
Optimization Intevac Aug23 7fOptimization Intevac Aug23 7f
Optimization Intevac Aug23 7f
 
Designing Architecture-aware Library using Boost.Proto
Designing Architecture-aware Library using Boost.ProtoDesigning Architecture-aware Library using Boost.Proto
Designing Architecture-aware Library using Boost.Proto
 
Easydd program3
Easydd program3Easydd program3
Easydd program3
 
Real-time Face Recognition & Detection Systems 1
Real-time Face Recognition & Detection Systems 1Real-time Face Recognition & Detection Systems 1
Real-time Face Recognition & Detection Systems 1
 
[ppt]
[ppt][ppt]
[ppt]
 
Easydd program
Easydd programEasydd program
Easydd program
 
机器学习Adaboost
机器学习Adaboost机器学习Adaboost
机器学习Adaboost
 
OpenPOWER Workshop in Silicon Valley
OpenPOWER Workshop in Silicon ValleyOpenPOWER Workshop in Silicon Valley
OpenPOWER Workshop in Silicon Valley
 
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTKStatistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
 
Synthesis of Attributed Feature Models From Product Descriptions
Synthesis of Attributed Feature Models From Product DescriptionsSynthesis of Attributed Feature Models From Product Descriptions
Synthesis of Attributed Feature Models From Product Descriptions
 
200612_BioPackathon_ss
200612_BioPackathon_ss200612_BioPackathon_ss
200612_BioPackathon_ss
 
Auto-Scaling Apache Spark cluster using Deep Reinforcement Learning.pdf
Auto-Scaling Apache Spark cluster using Deep Reinforcement Learning.pdfAuto-Scaling Apache Spark cluster using Deep Reinforcement Learning.pdf
Auto-Scaling Apache Spark cluster using Deep Reinforcement Learning.pdf
 
XGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competitionXGBoost: the algorithm that wins every competition
XGBoost: the algorithm that wins every competition
 
Introduction to OpenCV
Introduction to OpenCVIntroduction to OpenCV
Introduction to OpenCV
 
Observations
ObservationsObservations
Observations
 
Hands on Mahout!
Hands on Mahout!Hands on Mahout!
Hands on Mahout!
 
UEMB270: Software Distribution Under The Hood
UEMB270: Software Distribution Under The HoodUEMB270: Software Distribution Under The Hood
UEMB270: Software Distribution Under The Hood
 

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Leveraging Collaborative Tagging for Web Item Design (Ajith Ajjarani)

  • 1. Leveraging Collaborative Tagging for Web Item Design. Mahashweta Das, Gautam Das, Vagelis Hristidis. Presenter: Ajith C Ajjarani [1000-727269]. 1/15/2012
  • 2. Outline: Motivation & Problem Definition; Naïve Bayes Classifier; Tag Maximization: NP-Complete; Moderate Instances: Exact Two-Tier Top-K Algorithm; Larger Instances: Approximation Algorithm; Experiments & Results
  • 3. Motivation. Can I design a new camera which attracts and maximizes the tags? Let's define this opportunity as a problem!
  • 4. Problem Construction. Training data: attributes are the product definition; tags are user-defined. Given a subset of subjective "desired" tags, predict a new item (a combination of attribute values). Extend this to a "top-k" version that returns the k potential items with the highest expected number of desirable tags.
  • 5. Problem Statement • Given a database of tagged products, the task is to design k new products (attribute values) that are likely to attract the maximum number of desirable tags – tag desirability is just one aspect of product design consideration • Applications – electronics, autos, apparel – musical artist, blogger • Camera attributes to decide (figure labels): Zoom? Flash? Resolution? Light sensitivity? Shooting mode?
  • 6. Tag Maximization. Technically challenging, because complex dependencies exist between tags and items. It is difficult to determine a combination of attribute values that maximizes the expected number of desirable tags. A Naïve Bayes classifier is used for tag prediction. Even for this classifier (with its simplistic conditional independence assumption), the tag maximization problem is NP-Complete. The authors do not resort to heuristics; they develop principled algorithms.
  • 7. Proposed Solution. Exact two-tier top-k algorithm (ETT): performs significantly better than the naïve brute-force algorithm (no need to compute all possible products); applies the Rank-Join and TA top-k algorithms in a two-tier architecture; in the worst case, may have exponential running time. Approximation algorithm: a polynomial-time approximation scheme (PTAS) with provable error bounds; its overall running time is exponential only in the (constant) size of the groups, and can be reduced to polynomial time complexity; intended for large datasets.
  • 8. Problem Framework. Boolean dataset: • D = {o1, o2, ..., on} is the set of items • A = {A1, A2, ..., Am} is the set of attributes • T = {T1, T2, ..., Tr} is the set of tags. Each item is thus a Boolean vector of size (m + r). Such a dataset is used as a training set to build Naïve Bayes classifiers (NBC) and compute P(Tag | Attributes).
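
The training step described on this slide can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: the function names, the toy dataset, and the add-one (Laplace) smoothing are all assumptions introduced for illustration.

```python
from typing import List, Sequence, Tuple


def train_nbc(items: Sequence[Sequence[int]], m: int, tag: int) -> Tuple[float, List[float], List[float]]:
    """Estimate Pr(T), Pr(A_i = 1 | T) and Pr(A_i = 1 | not T) for one tag.

    Each item is a Boolean vector of size m + r (attributes first, then tags).
    Add-one smoothing is an assumption, used only to avoid zero probabilities.
    """
    with_tag = [row for row in items if row[m + tag] == 1]
    without_tag = [row for row in items if row[m + tag] == 0]

    def conditionals(rows: Sequence[Sequence[int]]) -> List[float]:
        return [(sum(row[i] for row in rows) + 1) / (len(rows) + 2) for i in range(m)]

    prior = (len(with_tag) + 1) / (len(items) + 2)
    return prior, conditionals(with_tag), conditionals(without_tag)


def tag_probability(attrs: Sequence[int], prior: float,
                    p_tag: Sequence[float], p_not_tag: Sequence[float]) -> float:
    """Pr(T | attrs) under the conditional independence assumption."""
    like_tag, like_not = prior, 1.0 - prior
    for a, p1, p0 in zip(attrs, p_tag, p_not_tag):
        like_tag *= p1 if a else 1.0 - p1
        like_not *= p0 if a else 1.0 - p0
    return like_tag / (like_tag + like_not)


# Toy dataset (hypothetical): m = 2 attributes, r = 1 tag that follows A1.
toy = [(1, 0, 1), (1, 1, 1), (0, 0, 0), (0, 1, 0)]
prior, p_tag, p_not_tag = train_nbc(toy, m=2, tag=0)
print(tag_probability((1, 0), prior, p_tag, p_not_tag))
```

Normalizing by the sum of the two likelihoods avoids computing Pr(o) explicitly.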
  • 9. Derived Results. The probability Pr(Tj | o) that a new item o is annotated by the tag Tj, and the probability Pr(Tj' | o) that an item o does not have tag Tj.
  • 10. Derived Results. Derived: Rj, introduced for convenience, and the expected number of desirable tags Td = {T1, ..., Tz} ⊆ T that a new item o is annotated with.
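
The formulas on slides 9 and 10 did not survive in this transcript. Under the standard Naïve Bayes conditional independence assumption they would take roughly the following form; the notation (a_i for the value of attribute A_i in item o, T_j' for the absence of tag T_j, E(o, T_d) for the expectation) and the exact definition of R_j are assumptions chosen to match the slide's wording, not a copy of the original.

```latex
\Pr(T_j \mid o) = \frac{\Pr(T_j)\,\prod_{i=1}^{m}\Pr(a_i \mid T_j)}{\Pr(o)},
\qquad
\Pr(T_j' \mid o) = \frac{\Pr(T_j')\,\prod_{i=1}^{m}\Pr(a_i \mid T_j')}{\Pr(o)}

% The two probabilities sum to 1, so \Pr(o) cancels:
\Pr(T_j \mid o) = \frac{1}{1 + R_j},
\qquad
R_j = \frac{\Pr(T_j')}{\Pr(T_j)} \prod_{i=1}^{m} \frac{\Pr(a_i \mid T_j')}{\Pr(a_i \mid T_j)}

% Expected number of desirable tags for item o:
E(o, T_d) = \sum_{j=1}^{z} \Pr(T_j \mid o) = \sum_{j=1}^{z} \frac{1}{1 + R_j}
```

Because R_j is a product over attributes, it can be split into per-group factors, which is what makes partial products over attribute groups meaningful.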
  • 11. Exact Algorithm • Naïve brute-force – Consider all 2^m possible products and compute the expected number of desirable tags for each possible product – Exponential complexity • Exact two-tier top-k (ETT) – Applies the Rank-Join and TA top-k algorithms in a two-tier architecture – Does not need to compute all possible products • Performs significantly better than naïve brute-force – Works well for moderate data instances, but does not scale to larger data • In the worst case, may have exponential running time
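
The brute-force baseline is easy to state in code. The sketch below is illustrative only: `brute_force_top_k` and `toy_expected_tags` are hypothetical names, and the toy scoring function merely stands in for the NBC-based expected number of desirable tags.

```python
from itertools import product
from typing import Callable, List, Tuple

Item = Tuple[int, ...]


def brute_force_top_k(m: int, k: int,
                      expected_tags: Callable[[Item], float]) -> List[Tuple[float, Item]]:
    """Score all 2^m Boolean attribute vectors and return the k best."""
    scored = [(expected_tags(item), item) for item in product((0, 1), repeat=m)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def toy_expected_tags(item: Item) -> float:
    """Hypothetical stand-in for the expected number of desirable tags."""
    weights = (0.5, -0.2, 0.3)
    return sum(w * a for w, a in zip(weights, item))


top = brute_force_top_k(m=3, k=2, expected_tags=toy_expected_tags)
print([item for _, item in top])
```

The list `scored` always holds 2^m entries, which is the exponential cost ETT tries to avoid.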
  • 12. ETT: Two-Tier Architecture. Tier 1: determine the "best" item for each tag (T1, T2, ..., Tz). Tier 2: match these items to compute the global best product across all tags. z = number of desirable tags; m' = m / l (attributes per group when the m attributes are split into l groups).
  • 13. ETT Algorithm (Example) • Database: attributes {A1, A2, A3, A4}, tags {T1, T2}, and top-1 – Partition the attributes into 2 groups, {A1, A2} and {A3, A4}, to form 2 lists of partial products – Each list has 2^m' = 2^2 = 4 entries (partial products) – Run the NBC to compute the score of each partial product for each tag, and sort each list in descending order of score
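
The list-building step of this example can be sketched as follows. The helper name `build_sorted_lists` and the `partial_score` callback are assumptions; the T1 scores are the ones that appear in the Tier 1 lists of the walkthrough slides, used here as a lookup table instead of running an actual classifier.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Bits = Tuple[int, ...]


def build_sorted_lists(m: int, groups: int,
                       partial_score: Callable[[int, Bits], float]) -> List[List[Tuple[Bits, float]]]:
    """Build one list of 2^(m/groups) partial products per attribute group,
    each sorted by descending score. Assumes groups divides m evenly."""
    size = m // groups  # m' = m / l
    lists = []
    for g in range(groups):
        entries = [(bits, partial_score(g, bits)) for bits in product((0, 1), repeat=size)]
        entries.sort(key=lambda entry: entry[1], reverse=True)
        lists.append(entries)
    return lists


# Scores for tag T1 as listed on the walkthrough slides (identical for both groups).
T1_SCORES: Dict[Bits, float] = {(1, 0): 1.97, (0, 0): 0.84, (1, 1): 0.84, (0, 1): 0.36}
l1, l2 = build_sorted_lists(m=4, groups=2, partial_score=lambda g, bits: T1_SCORES[bits])
print(l1)
```

Each tag gets its own pair of lists, so with z tags and l groups Tier 1 holds z × l sorted lists.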
  • 14. ETT example, initial state. Tier 1 holds, for each tag, one sorted list of partial products per attribute group (partial product, score). T1, L1 (A1 A2): 10, 1.97; 00, 0.84; 11, 0.84; 01, 0.36. T1, L2 (A3 A4): 10, 1.97; 00, 0.84; 11, 0.84; 01, 0.36. T2, L1 (A1 A2): 11, 2.76; 01, 1.18; 10, 1.18; 00, 0.51. T2, L2 (A3 A4): 11, 4.57; 10, 2.53; 01, 0.91; 00, 0.51. For each tag, a join over its lists produces products with a partial score and an MPFS. Tier 2 runs Top-K() over the per-tag GetNext() calls and keeps a buffer of products with their complete (actual) scores. MUS: sum of the last seen scores from all tags.
  • 15. ETT example, iteration 1. Tier 1 (lists as on slide 14): the rank join for T1 returns GetNext() = 1010 (partial score 0.95, MPFS >= 0.95); the join for T2 returns GetNext() = 1111 (partial score 0.93, MPFS >= 0.93). Tier 2 Top-K() buffer (product, complete score): 1111, 1.75; 1010, 1.70. MinK (1.75) <= MUS (1.88), so return to Tier 1.
  • 16. ETT example, iteration 2. Tier 1 (lists as on slide 14): the rank join for T1 returns GetNext() = 1011 (partial score 0.92, MPFS >= 0.92); the join for T2 returns GetNext() = 1110 (partial score 0.88, MPFS >= 0.88). Tier 2 Top-K() buffer adds (product, complete score): 1110, 1.77; 1011, 1.76. MinK (1.77) <= MUS (1.79), so return to Tier 1.
  • 17. ETT example, iteration 3. Tier 1 (lists as on slide 14): the rank join for T1 returns GetNext() = 0010 (partial score 0.89, MPFS >= 0.89); the join for T2 returns GetNext() = 0111 (partial score 0.84, MPFS >= 0.84). Tier 2 Top-K() buffer adds (product, complete score): 0111, 1.77; 0010, 1.76. MinK (1.77) > MUS (1.74), so ETT terminates. Thus, ETT returns the best item (0111 or 1110) in just 6 item look-ups.
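
The stopping test used in the three iterations can be sketched as a small function. This is an assumption-laden sketch: the name `ett_should_stop` is hypothetical, MUS is taken to be the sum of the last seen per-tag scores as the slides describe, and stopping on MinK >= MUS (rather than a strict inequality) is an assumption. Summing the rounded scores printed on the slides gives 1.88, 1.80 and 1.73, which differs from the slides' MUS values (1.88, 1.79, 1.74) only in the second decimal.

```python
from typing import Sequence


def ett_should_stop(buffer_scores: Sequence[float],
                    last_seen_scores: Sequence[float], k: int) -> bool:
    """Stop once the k-th best complete score is at least the best score
    any unseen product could still reach (MUS)."""
    if len(buffer_scores) < k:
        return False
    min_k = sorted(buffer_scores, reverse=True)[k - 1]
    mus = sum(last_seen_scores)
    return min_k >= mus


# Buffer contents and last seen scores from the walkthrough (top-1).
iteration_1 = ett_should_stop([1.75, 1.70], [0.95, 0.93], k=1)
iteration_2 = ett_should_stop([1.75, 1.70, 1.77, 1.76], [0.92, 0.88], k=1)
iteration_3 = ett_should_stop([1.75, 1.70, 1.77, 1.76, 1.77, 1.76], [0.89, 0.84], k=1)
print(iteration_1, iteration_2, iteration_3)
```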
  • 18. Approximation Algorithm. The z desirable tags are split into z/z' subgroups of z' tags each. Each subgroup is solved in polynomial time using a PTAS defined for approximation factor ε, which yields the top-k items for that subgroup (O11, O12, ..., O1k; O21, O22, ..., O2k; ...). These are combined into the overall top-k items O1, O2, ..., Ok. ε = 2σm, where σ is the compression factor.
  • 19. PTAS Algorithm Design. Setting: k = 1, z desirable tags in 1 subgroup (z = z'; tags T1, T2, ..., Tz'), and ε > 0. The PTAS should run in polynomial time and return the top-1 item. Invariant for this subgroup: ExactScore(Oa) >= (1 − ε) · ExactScore(Og), where Oa is the item returned by the PTAS and Og is the optimal item.
  • 20. PTAS Algorithm Design. A simple exponential-time exact top-1 algorithm for the sub-problem is created first and then turned into a PTAS. Given m Boolean attributes and z' tags, the exponential-time algorithm makes m iterations. Initial step: it produces the set S0 consisting of the single item {0^m} along with its z' scores, one for each tag. First iteration: it produces the set S1 = {0^m, 10^(m−1)} containing two items, each accompanied by its z' scores, one for each tag. i-th iteration: it produces the set of items Si = {0, 1}^i × 0^(m−i) along with their z' scores, one for each tag. The final set Sm contains all 2^m items along with their exact scores, from which the top-1 item can be returned.
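
The set construction described here can be sketched as below. The sketch covers only the exponential-time enumeration of S0, ..., Sm; the per-tag scores and the step that turns it into a PTAS are left out, and `build_item_sets` is a hypothetical name.

```python
from typing import List, Tuple

Item = Tuple[int, ...]


def build_item_sets(m: int) -> List[List[Item]]:
    """Return [S_0, S_1, ..., S_m], where S_i = {0,1}^i x 0^(m-i)."""
    current: List[Item] = [tuple([0] * m)]
    sets = [current]
    for i in range(m):
        # Copy every item seen so far with attribute i switched on.
        flipped = [item[:i] + (1,) + item[i + 1:] for item in current]
        current = current + flipped
        sets.append(current)
    return sets


sets = build_item_sets(3)
print([len(s) for s in sets])
```

Each iteration doubles the set, so the final set has 2^m items; an approximation scheme has to keep these intermediate sets small.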
  • 21. PTAS Algorithm Design. Example (table not preserved): z = z' = 2, σ = 0.5, m = 4, ε = 2σm = 4. Optimal item Og = {1110} with score 1.77 = 0.89 + 0.88; returned item Oa = {1111} with score 1.75 = 0.82 + 0.93.
  • 22. PTAS Algorithm Design. The exact (underlined) score of a cluster's item should be close to the exact score of the deleted item.
  • 23. Experiment. Synthetic and real datasets are used for quantitative and qualitative analysis of the proposed algorithms. Quantitative performance indicators: the efficiency of the proposed exact and approximation algorithms, and the approximation factor obtained by the results of the approximation algorithm. Qualitative results: an Amazon Mechanical Turk user study assesses the results of the algorithms.
  • 24. Experiment. Real camera dataset: a real dataset of 100 cameras listed at Amazon was crawled. The listed cameras contain technical details (attributes) and the tags customers associate with each camera. The tags are sanitized to remove synonyms and unintelligible or undesirable tags such as Nikon coolpix, quali, bad, etc. Synthetic dataset: a Boolean matrix of dimension 10,000 (items) × 100 (50 attributes + 50 tags). The 50 independently distributed attributes are divided into 4 groups, where the value is set to 1 with probabilities of 0.75, 0.15, 0.10 and 0.05. For the 50 tags, relations are predefined by randomly picking a set of attributes that each tag is correlated with.
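
A generator for a synthetic dataset of this shape might look like the sketch below. Several details are assumptions, since the slide does not give them: attributes are assigned to the 4 probability groups round-robin, each tag is tied to 3 randomly picked attributes, and a tag is set to 1 when any of its picked attributes is 1. The function name and parameters are hypothetical.

```python
import random
from typing import List, Sequence


def make_synthetic(n_items: int = 10_000, n_attrs: int = 50, n_tags: int = 50,
                   probs: Sequence[float] = (0.75, 0.15, 0.10, 0.05),
                   attrs_per_tag: int = 3, seed: int = 0) -> List[List[int]]:
    """Return an n_items x (n_attrs + n_tags) Boolean matrix."""
    rng = random.Random(seed)
    # Assumed relation: each tag depends on a random set of attributes.
    tag_attrs = [rng.sample(range(n_attrs), attrs_per_tag) for _ in range(n_tags)]
    rows = []
    for _ in range(n_items):
        # Assumed grouping: attribute i belongs to group i mod 4.
        attrs = [1 if rng.random() < probs[i % len(probs)] else 0 for i in range(n_attrs)]
        tags = [1 if any(attrs[i] for i in picked) else 0 for picked in tag_attrs]
        rows.append(attrs + tags)
    return rows


data = make_synthetic(n_items=200)
print(len(data), len(data[0]))
```

Fixing the seed keeps runs reproducible, which matters when comparing running times across algorithms.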
  • 25. Quantitative: Performance. Exact algorithm: synthetic dataset with 1000 items, 16 attributes and 8 tags (naïve vs. ETT).
  • 26. Quantitative: Performance. The figure shows that ETT becomes extremely slow beyond m = 16 attributes. The approximation algorithm (PA), with an approximation factor of 0.5, continues to return guaranteed results in reasonable time as the number of attributes m increases.
  • 27. Quantitative: Performance. Execution time and obtained approximation factor. Synthetic dataset: 1000 items, 20 attributes and 8 tags. The top-1 item is considered.
  • 28. Qualitative: User Study. First part of the user study: the PA algorithm, with approximation factor ε = 0.5, is run on the tag sets corresponding to compact cameras and SLR cameras respectively, building 4 new cameras (2 digital compact and 2 digital SLR). These are compared against 4 existing popular cameras; 65% of users choose the new cameras.
  • 29. Qualitative: User Study. Second part of the study: 6 new cameras are built for three groups, with 2 potential new cameras for each: 1. young students, 2. old retired, 3. professional photographers. Users are asked to assign at least five tags. Observation: the majority of the users rightly classify the six cameras into the three groups.
  • 30. Conclusion. The work defines the tag maximization problem and investigates its computational complexity. It proposes 2 novel algorithms and shows their practicability. It is a preliminary look at a very novel area of research and promises exciting directions for future research. Future work: use decision tree, SVM, and regression tree classifiers and conduct the experiments with them.

Editor's notes

  1. Tag maximization problem: how to decide the attribute values of new items (Ox) and return the top-k "best" items that are likely to attract the maximum number of desirable tags.