Personalization:
Techniques and
applications
Krishnan Ramanathan, Geetha Manjunath,
Somnath Banerjee
HP Labs, Bangalore




© 2006 Hewlett-Packard Development Company, L.P.
The information contained herein is subject to change without notice
Topics
• Overview   of Personalization
• User Profile creation
• Personalizing Search
• Document modeling
• Recommender system
• Semantics in Personalization




2   22 January 2008
Overview of
Personalization
Krishnan Ramanathan




Why Personalization?
• Scale of the web is limiting its utility
    • There is too much information
    • Consumer has to do all the work to use the web
    • Search engines and portals provide the same results for
      different personalities, intentions and contexts
• Personalization can be the solution
    • Customize the web for individuals by
      • Filtering out irrelevant information
      • Identifying relevant information




4    22 January 2008
Some quotes from NY bits
• I am married with a house. Why do I see so many
    ads for online dating sites and cheap mortgages?

    Should I be happy that I see those ads? It means
    Internet advertisers still have no idea who I am.




5     22 January 2008
Personalization
• Goal – Provide users what they need without requiring
  them to ask for it explicitly
• Steps
    • Generate useful, actionable knowledge about users
    • Use this knowledge for personalizing an application
• User centric data model – Data must be attributable to
  a specific user
• Two kinds
    • Business Centric: Amazon, eBay
    • Consumer Centric
• Personalization requires User Profiling


6     22 January 2008
Applications of Personalization
• Interface Personalization
    • E.g. Go directly to the web page of interest instead of
      site home page
• Content personalization
    • Filtering (News, blog articles, videos etc)
    • Ratings based recommendations
      • Amazon, Stumbleupon
    • Search
      • Text, images, stories, research papers
    • Ads
• Service Personalization

7    22 January 2008
Why is personalization hard?
• Server side personalization – Sites do not see all
    data
    • E.g. A user might visit Expedia and Orbitz, Expedia
      doesn’t know what the user did on Orbitz
• Difficult to get user context
    • User needs to agree to cookies or login
• Site profiles are not portable
    • Some standards are emerging (Attention Profile Markup
      Language)
• Privacy


8     22 January 2008
Personalization example 1 (Routing
queries)
[Screenshots: Google Alerts; Google News page routing queries]




9   22 January 2008
Personalization example 2 - Amazon




10   22 January 2008
Personalization Example 3 – Google
news




11   22 January 2008
Personalization Example 4 – Yahoo
MyWeb




12   22 January 2008
The future …




13   22 January 2008
User Profile
Creation
Krishnan Ramanathan




Outline
• User  Profile creation
• Profile Privacy
• Evaluating and managing user profiles
• Personalizing search




15   22 January 2008
User profile information
• Two kinds of information
     • Factual (Explicit)
     • Behavioral (Implicit)
• Factual – Geographic, Demographic,
     Psychographic information
     • E.g. Age is 25 years, searched for Lexus, lives in
       Bangalore
• Behavioral – Describes behavioral activities (visits
     finance sites, buys gadgets)


16     22 January 2008
Client side versus Server side profiles
Client side
• No access to clickstreams of multiple users
• See all user data
• Possible for user to aggregate and reuse their attentional information
• Strong privacy model
• Can access the full compute power at the client

Server side
• Have queries, clickstreams from multiple users
• Don’t see all the user data
• No way for users to aggregate and reuse the profiles that different websites (Google, Yahoo, ..) build using their data
• Privacy is a big problem
• Server cycles have to be shared; however, some computations can be done once and reused


17   22 January 2008
Desired profile characteristics
• Represent multiple interests
• Adapt to changing user interests
• Incorporate contextual information




18   22 January 2008
Using User profiles to personalize services

[Diagram: User → Data Collection → (explicit and implicit info) → Profile Constructor → User Profile → Profile to Content Matching → Personalized services. Content (search query, news, video, …) is the second input to Profile to Content Matching.]

Diagram adapted from Gauch et al., Chapter 2, The Adaptive Web, Springer LNCS 4321
19   22 January 2008
User Profiling approaches
• Broadly two approaches
• IR approach
     • User interests derived from text (documents/search
       queries)
• Machine learning approach
     • Model user based on positive and negative examples of
       his interests
     • Problems
       • Getting labeled samples
       • High dimensional feature space


20    22 January 2008
Profile building Steps
• Authenticate the user
• Select information to build profile from and archive
  the information if necessary (e.g. Web pages might
  get flushed from IE cache)
• Build/Refresh/Expand/Prune the profile
• Use it in an application
• Evaluate the profile




21   22 January 2008
Authenticating the user
• Users need to be authenticated in order to attribute
  data to a particular user for profile creation
• Identifying single user
     • Login
     • Cookies
     • IP address (when it is static)
• Identifying different users on same machine
     • Login
     • Biometrics


22    22 January 2008
Explicit user information collection
•    Ask the user for
     • Static information
        • Name, age, residence location, hobbies, interests etc.
        • Google personalization – found explicit information to be noisy
           • People specified literature as one of their interests but did not make
             a single related search
        • Matchmine – presents examples (movies, TV shows, music, blog topics)
          and asks the users to explicitly rate them
     • Ratings
        • Netflix, Stumbleupon (thumbs up/down)
• In general, people do not like to give explicit information
  frequently
• Recent research (Jian Hu, WWW 2007) showed good
  results for gender and age prediction based on users’
  browsing behavior


23     22 January 2008
Explicit information collection:
Matchmine interface




24   22 January 2008
Implicit user information collection
• Data sources
     • Web pages, documents, search queries, location
     • Information from applications (Media players, Games)
• Data collection techniques
     • Desktop based
       • Browser cache
       • Proxy servers
       • Browser plugins
     • Server side
       • Web logs
       • Search logs


25    22 January 2008
How much implicit info to use?
• Teevan (SIGIR 2005) constructed two profiles
     • One with only search queries
     • Other using all information on desktop
• Findings
     • Richer information => better profile
     • All docs > only recent docs > only web pages > only search
       queries > no personalization (“>” means “performs better than”)
• Drawback with implicit info – cannot collect info
     about user dislikes

26     22 January 2008
Stereotypes
• Generalizations from communities of users
     • Characteristics of group of users
• Stereotypes alleviate the bootstrap problem
• Construction of stereotypes
     • Manual – e.g. Bangalore user will be interested in IT
     • Automatic method
       • Clustering – Similar profiles are clustered and common
         characteristics extracted




27    22 January 2008
How Acxiom delivers personalized ads
(source - WSJ)
•    Acxiom has accumulated a database of about 133 million households and
     divided it into 70 demographic and lifestyle clusters based on information
     available from public sources.
•    A person gives one of Acxiom’s Web partners his address by buying
     something, filling out a survey or completing a contest form on one of the
     sites.
•    Acxiom checks the address against its database and places a “cookie,” or
     small piece of tracking software, embedded with a code for that person’s
     demographic and behavioral cluster on his computer hard drive.
•    When the person visits an Acxiom partner site in the future, Acxiom can use
     that code to determine which ads to show
•    Through another cookie, Acxiom tracks what consumers do on partner Web
     sites



28      22 January 2008
Profile representation
•    Bag of words (BOW)
     • Use words in user documents to represent user interests
     • Issues
        •   Words appear independent of page content (“Home”, “page”)
        •   Polysemy (word has multiple meanings e.g. bank)
        •   Synonymy (multiple words have same meanings e.g. joy, happiness)
        •   Large profile sizes
•    Concepts (e.g. DMOZ)
     • Use an existing ontology that is maintained for free
     • Issues
        • Too large (about 600,000, i.e. 6 lakh, DMOZ nodes); the ontology has to be
          drastically pruned for use
        • Need to build classifiers for each DMOZ node



29     22 January 2008
Word based term vector profiles
• Profile is represented as a set of words weighted by tf*idf
• Could use one long profile vector or different
  vectors for different topics (sports, health, finance)
• Documents are converted to the same representation and
  matched with the keyword vectors using cosine
  similarity
• Should take the structure of the document into account
  (ignore html tags, email header vs body)
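
The matching step described above can be sketched roughly as follows. This is only an illustration: the per-topic profile vectors and their weights are made up, the document is weighted by raw term counts rather than tf*idf to keep the sketch short, and the helper names (`term_vector`, `cosine`, `best_topic`) are hypothetical.

```python
import math
from collections import Counter

def term_vector(text):
    """Lower-cased bag of words with raw term counts (no idf, for brevity)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical profile: one keyword vector per topic (term -> weight).
profile = {
    "sports": {"soccer": 2.0, "league": 1.0, "goal": 1.0},
    "finance": {"stock": 2.0, "market": 1.5, "bank": 1.0},
}

def best_topic(doc_text, profile):
    """Return the best-matching topic and the score of every topic."""
    doc = term_vector(doc_text)
    scores = {topic: cosine(vec, doc) for topic, vec in profile.items()}
    return max(scores, key=scores.get), scores

topic, scores = best_topic("Stock market rallies as bank shares rise", profile)
```

Keeping one vector per topic, rather than one long vector, lets a document match a single interest without being diluted by the user's unrelated interests.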



30   22 January 2008
Word based hierarchical profiles


User Profile:10
  Research:5
    IR:3
      Search:2
      ...
    DB:2
  Sports:3.5
    Soccer:2
    Others:1.5
  Sex:1.5

(The number after each interest is its support.)
Support decreases from high to low level, and from left to right
(top to bottom in this listing)


 We are thankful to Yabo Arber-Xu from Simon Fraser University
 for kindly allowing us to use slides numbered 31,37,38,39 from
 his WWW 07 presentation.
31     22 January 2008
Building word based hierarchical
profiles
• Build a (word, document) map for each word
  occurring in the corpus
• Order words by amount of support
     • Support of a word = number of documents in which
       word appears
• For each word
     • Decide whether to merge with another word (using some
       measure of similarity)
     • Decide whether to make one word the child of other



32    22 January 2008
33   22 January 2008
Term similarity and Parent-child terms
• Words that cover the same document sets are similar
• Jaccard measure

          Sim(w1, w2) = |D(w1) ∩ D(w2)| / |D(w1) ∪ D(w2)|


•    Parent child terms
     • A specific term is a child of a more general term if it frequently occurs
       with a general term (but the reverse is not true)
     • Word w2 is taken as child of term w1 if P(w1|w2) > some_threshold
     • e.g. Terms “Soccer” and “Badminton” might co-occur with the term
       “Sport” but not the other way around
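
A minimal sketch of the word map, the support ordering, the similarity measure and the parent-child test described above. The threshold value of 0.8, the tiny corpus and the function names are assumptions chosen for illustration; they are not taken from the original work.

```python
from collections import defaultdict

def build_word_doc_map(docs):
    """Map each word to the set of ids of the documents it occurs in."""
    word_docs = defaultdict(set)
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            word_docs[word].add(doc_id)
    return word_docs

def jaccard(word_docs, w1, w2):
    """Sim(w1, w2) = |D(w1) ∩ D(w2)| / |D(w1) ∪ D(w2)|."""
    d1, d2 = word_docs[w1], word_docs[w2]
    union = d1 | d2
    return len(d1 & d2) / len(union) if union else 0.0

def is_child(word_docs, parent, child, threshold=0.8):
    """True if P(parent | child) > threshold but P(child | parent) is not."""
    d_parent, d_child = word_docs[parent], word_docs[child]
    if not d_parent or not d_child:
        return False
    both = len(d_parent & d_child)
    return both / len(d_child) > threshold and both / len(d_parent) <= threshold

# Hypothetical corpus.
docs = {
    1: "sport soccer match",
    2: "sport badminton final",
    3: "sport soccer league",
    4: "sport news",
}
word_docs = build_word_doc_map(docs)
# Words ordered by support (number of documents containing the word).
by_support = sorted(word_docs, key=lambda w: len(word_docs[w]), reverse=True)
```

With this corpus "soccer" always co-occurs with "sport" but not the other way around, so `is_child(word_docs, "sport", "soccer")` holds while the reverse does not.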




34     22 January 2008
Personalization and Privacy
• Studies have shown that
     • People are comfortable sharing preferences (favourite TV
       show, snack etc.), demographic and lifestyle information
     • People are not comfortable sharing financial and purchase
       related information
        • Facebook fiasco because of reporting “Your friends bought …”
• Financial rewards (even small amounts) encourage
     disclosure
     • People parted with valuable information for Singapore
       $15



35     22 January 2008
Privacy related attitudes
(Teltzrow/Kobsa 2003)




36   22 January 2008
What and How much to Reveal? - 1

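[Diagram: the profile hierarchy User Profile:10 → Research:5 (IR:3 (Search:2, ...), DB:2), Sports:3.5 (Soccer:2, Others:1.5), Sex:1.5. Interests become more specific going down the hierarchy and more sensitive going from left to right.]

      Manual Option – Absolute privacy guarantee, but requires a lot of user
       intervention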




 37      22 January 2008
What and How much to Reveal? - 2

      User Profile U → indicator of a user’s possible interests
      Term t → indicator of a possible interest, P(t) = Sup(t)/|D|

      The amount of information for an interest t
              I(t) = log(1/P(t)) = log(|D|/Sup(t))
      → indication of the specificity and sensitivity of an interest

      H(U) – the amount of information carried by U
              H(U) = ∑_t P(t) × I(t)

      Two privacy parameters:
              minDetail – protect t with P(t) < minDetail
              expRatio – H(U[exp]) / H(U)
      The more detail we expose, the higher expRatio.
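
The quantities above can be computed with a few lines of code. The sketch below assumes a base-2 logarithm and treats the exposed profile U[exp] as simply the set of terms with P(t) ≥ minDetail; both are assumptions, as are the function names and the toy supports. It does not try to reproduce any figures from the presentation.

```python
import math

def information(support, n_docs):
    """I(t) = log(|D| / Sup(t)); the base-2 logarithm is an assumption."""
    return math.log2(n_docs / support)

def profile_entropy(supports, n_docs):
    """H(U) = sum over terms t of P(t) * I(t), with P(t) = Sup(t) / |D|."""
    return sum((s / n_docs) * information(s, n_docs) for s in supports.values())

def exposed_terms(supports, n_docs, min_detail):
    """Terms that are NOT protected, i.e. those with P(t) >= minDetail."""
    return {t: s for t, s in supports.items() if s / n_docs >= min_detail}

def exp_ratio(supports, n_docs, min_detail):
    """expRatio = H(U[exp]) / H(U)."""
    total = profile_entropy(supports, n_docs)
    exposed = profile_entropy(exposed_terms(supports, n_docs, min_detail), n_docs)
    return exposed / total if total else 0.0

# Hypothetical supports over |D| = 10 documents.
supports = {"research": 5, "sports": 3.5, "sex": 1.5}
ratio = exp_ratio(supports, n_docs=10, min_detail=0.3)
```

Lowering `min_detail` exposes rarer (more specific) terms, so the ratio grows towards 1.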


38     22 January 2008
What and How much to Reveal? - 3

[Diagram: the profile hierarchy User Profile:10 → Research:5 (IR:3 (Search:2, ...), DB:2), Sports:3.5 (Soccer:2, Others:1.5), Sex:1.5, annotated with two cut-off levels]
     • minDetail=0.5 → expRatio=44%
     • minDetail=0.3 → expRatio=69%



     The minDetail and expRatio parameters allow a balance between privacy
       and personalization.




39   22 January 2008
Profile portability
• Move the profile to a central server
     • Claria PersonalWeb, Google-Yahoo-Microsoft
     • Provision to delete search queries, visited pages
     • No control over which part of the profile can be used
• Have a client side component that reconstructs the
  profile on the client using server side info
  (Matchmine)
• Attention Profile Markup Language
     • Allows explicit and implicit information to be stored (as
       XML) and provided to web services

40    22 January 2008
Attention Profile Markup
(http://www.apml.org)




41   22 January 2008
Application-independent evaluation of
the profile
•    Stability
     • Number of profile elements that do not change over the evaluation
       cycle
•    Precision
     • How many items in the profile does the user agree with as
       representative of his interests ?
     • Does the user agree with the strength of the interest ?
     • Do interests at deeper levels of the hierarchy have less precision
       compared to interests at higher levels ?
•    Which data sources (bookmarks, search keywords, web
     pages) are better?
     • Bookmarks were not very representative of user interests in our study


42      22 January 2008
Profile evaluation
     Sample evaluation of one profiling algorithm
     • Profiles are stable (fig 1)
     • Profile elements with high support have high precision (fig 2)
     • Profile elements at all levels of the hierarchy have similar precision (fig 3)

     [Figure 1: Stability (Stability_alpha, Stability_date) versus number of web pages in cache]
     [Figure 2: Precision for Support > 5, 3 < Support < 5 and Support < 3]
     [Figure 3: Percentage in profile and precision for hierarchy levels 1 to 6]
43                 22 January 2008
Managing the profile
• Profiles may need to be expanded (bootstrapped) or
  pruned
• Allowing users to manually edit their profiles to add/delete
  topics of interest was found to make performance worse
  (Jae-wook Ahn, WWW 2007)
     • Adding and deleting topics to profile harmed system performance
     • Deleting topics harmed performance four times more compared to
       adding topics
•    Some agents learn short term and long term profiles
     separately using different techniques (K-NN for short term
     interests, Naïve Bayes for long term interests)



44     22 January 2008
Personalizing
Search
Krishnan Ramanathan




Personalized search
• Search can be personalized based on
     • User profile
     • Current working context
     • Past search queries
     • Server side clickstreams
     • Personalized Pagerank
• Determining user intent is hard (e.g. the query “Visa”)




46    22 January 2008
A generic personalized search algorithm
using a user profile
• Inputs – User profile, Search query
• Output – A results vector reordered by the user’s
  preference
• Steps
     • Send the query to a search engine
     • Results[] = A vector of the search engine’s results
     • For each item i in Results[] calculate the preference
        Pref[i] = α × Similarity(Results[i], UserProfile) + (1 − α) × SearchEngineRank[i]
     • Sort Results[] using Pref[i] as the comparator
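
The steps above can be sketched as follows. Several details are assumptions made for illustration only: results are plain snippet strings already returned by some search engine, the engine's rank is turned into a score with a simple linear mapping (first result scores 1), similarity is the cosine between word counts and a {term: weight} profile, and higher Pref values are sorted first.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def personalized_rerank(results, profile, alpha=0.5):
    """Reorder results (best-first list of snippets) by Pref[i]."""
    n = len(results)
    scored = []
    for position, snippet in enumerate(results):
        # Assumption: linear rank-to-score mapping, 1.0 for the top result.
        engine_score = 1.0 - position / n
        doc_vec = Counter(snippet.lower().split())
        pref = alpha * cosine(doc_vec, profile) + (1 - alpha) * engine_score
        scored.append((pref, snippet))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [snippet for _, snippet in scored]

# Hypothetical engine results for the ambiguous query "visa".
results = [
    "visa credit card offers",
    "visa requirements for travel to india",
    "visa payment network",
]
profile = {"travel": 1.0, "india": 1.0}
reranked = personalized_rerank(results, profile, alpha=0.8)
```

With α = 0 the original engine order is returned unchanged; larger α gives the profile more influence.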

47     22 January 2008
Current working context – JIT retrieval
• Context includes time, location, applications
  currently running, documents currently opened, IM
  status
• Use profile and current context to provide relevant
  (and just-in-time) information
     • Blinkx toolbar – provides relevant news, video and
       Wikipedia articles within different applications (Microsoft
       Word, IE browser)
• Intersect interests from the overall profile with
  current context to get the contextual profile
• Context can also be used in query expansion


48    22 January 2008
Personalization based on Search history
 • Use query-to-query similarity to suggest results that
   satisfied past queries
 • Create user profiles from past queries/snippets from
   search results clicked
     • Misearch (Gauch et.al 2004) creates weighted concept
       hierarchies based on ODP as the reference concept hierarchy
     • Compute degree of similarity between search engine result
       snippets (title and text summaries) and user profile as

          $$\mathrm{sim}(user_i, doc_j) = \sum_{k=1}^{n} wp_{ik} \cdot wd_{jk}$$

          where $wp_{ik}$ is the weight of concept $k$ in profile $i$ and
          $wd_{jk}$ is the weight of concept $k$ in document $j$

49   22 January 2008
Personalization by clickthrough data analysis
– CubeSVD (Jian-Tao Sun, WWW 2005)
• Search engine has tuples of the form (User, Query, Visited
  page)
• Multiple tuples constitute a tensor (generalization of matrix
  to higher dimensions)
• Higher order SVD (HOSVD) performs SVD on tensor
• The reconstructed tensor yields tuples of the form (User,
  Query, web page, p)
     • Where p is the probability that the user posing the query will visit
       the web page
     • Recommend pages with highest value of p
     • Computationally intensive but HOSVD can be done offline
        • Need to recompute to account for new clickthrough data
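
To make the tensor idea concrete, here is a small truncated HOSVD written with NumPy. It is a generic sketch of the technique, not the CubeSVD implementation: the tensor shape, the chosen ranks and the function names are illustrative assumptions, and the reconstructed values are smoothed association scores rather than calibrated probabilities.

```python
import numpy as np

def unfold(tensor, mode):
    """Matricize: rows index the given mode, columns all remaining modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def truncated_hosvd(tensor, ranks):
    """Low-rank reconstruction of a 3-way (user, query, page) tensor."""
    factors = []
    for mode, rank in enumerate(ranks):
        # Left singular vectors of each unfolding give that mode's basis.
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    u1, u2, u3 = factors
    # Project onto the truncated bases to get the core tensor ...
    core = np.einsum("uqp,ua,qb,pc->abc", tensor, u1, u2, u3)
    # ... and map the core back to the original index space.
    return np.einsum("abc,ua,qb,pc->uqp", core, u1, u2, u3)

# Hypothetical click tensor: 3 users x 2 queries x 4 pages.
clicks = np.zeros((3, 2, 4))
clicks[0, 0, 1] = clicks[1, 0, 1] = clicks[1, 0, 2] = 1.0
clicks[2, 1, 3] = clicks[0, 1, 3] = 1.0

scores = truncated_hosvd(clicks, ranks=(2, 2, 2))
# Pages for user 0 and query 0, best score first.
ranking = np.argsort(-scores[0, 0])
```

Because the decomposition is computed over the whole tensor, it can be run offline and the reconstructed scores looked up at query time, at the cost of recomputing when new clickthrough data arrives.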


50     22 January 2008
Topic sensitive pagerank (Haveliwala
2002)
• For top 16 ODP categories, create a pagerank vector
   • Each web page/document d has multiple ranks
     depending on what the topic of interest j is
• For a query q, compute P(Cj|q) ∝ P(Cj) × P(q|Cj)
   • Intuition: If a topic is more probable given a query, the
     topic specific rank should have more say in the final
     rank
• Compute the query sensitive rank of document d as

          $$\sum_{j} P(C_j \mid q) \cdot rank_{jd}$$
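
A small numeric sketch of this combination step, assuming the topic-specific PageRank vectors have already been computed offline. The array names and the toy numbers are hypothetical; the posterior is normalised so that the topic weights sum to one.

```python
import numpy as np

def query_sensitive_rank(topic_prior, query_likelihood, topic_ranks):
    """
    topic_prior[j]      = P(Cj)
    query_likelihood[j] = P(q | Cj)
    topic_ranks[j, d]   = topic-specific PageRank of document d for topic j
    Returns one combined score per document.
    """
    posterior = topic_prior * query_likelihood   # proportional to P(Cj | q)
    posterior = posterior / posterior.sum()      # normalise over topics
    return posterior @ topic_ranks

# Two hypothetical topics and two documents.
prior = np.array([0.5, 0.5])
likelihood = np.array([0.9, 0.1])
ranks = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
combined = query_sensitive_rank(prior, likelihood, ranks)
```

Here the query is far more likely under the first topic, so the first topic's ranking dominates the combined score.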

51   22 January 2008
Topics
• Overview   of Personalization
• User Profile creation
• Personalizing Search
• Document  modeling
• Recommender System
• Semantics in Personalization




52   22 January 2008
Document
modeling
Somnath Banerjee




Under this topic
• Document representation

• Document analysis using
     • Latent Semantic Analysis (LSA)
     • Probabilistic Latent Semantic Analysis (PLSA)


• Document Classification
     • Support Vector Machine (SVM): A machine learning
       algorithm


54    22 January 2008
Document representation
•    Term vector
      • Document is represented as vector of terms
      • Each dimension corresponds to a separate term


•    Several methods of computing the weights of the terms
      • Binary weighting: 1 if the word appears in the document, 0 otherwise


      • Most well known is TF*IDF

          $$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

          $$idf_i = \log \frac{|D|}{|\{d_j : t_i \in d_j\}|}$$

          $$tfidf_{i,j} = tf_{i,j} \times idf_i$$
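
The TF*IDF weighting can be written out directly from these definitions. The sketch below uses the natural logarithm and pre-tokenised documents; both choices, and the function name `tfidf`, are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        weights.append({
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in counts.items()
        })
    return weights

docs = [
    ["google", "doubleclick", "deal"],
    ["eu", "aviation", "deal"],
]
weights = tfidf(docs)
```

A term that occurs in every document, such as "deal" here, gets idf = log(1) = 0 and therefore no weight.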
55      22 January 2008
Computing similarity

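     $$\mathrm{sim}(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\|_2 \times \|B\|_2} = \frac{\sum_i A_i \times B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}$$

     $$\text{Jaccard coefficient} = J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

     $$\text{Dice's coefficient} = D(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$$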
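
For reference, the three measures translate into code almost literally. In this sketch cosine similarity takes dense weight vectors, while the Jaccard and Dice coefficients take sets of terms; returning 0 for empty inputs is a convention chosen here, not something stated in the slides.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def dice(a, b):
    """2 |A ∩ B| / (|A| + |B|) for two sets."""
    size = len(a) + len(b)
    return 2 * len(a & b) / size if size else 0.0

doc_a = {"google", "doubleclick", "acquisition"}
doc_b = {"google", "doubleclick", "deal"}
j, d = jaccard(doc_a, doc_b), dice(doc_a, doc_b)
```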

56    22 January 2008
Example
•    g1: Google Gets Green Light from FTC for DoubleClick Acquisition
•    g2: Google Closes In on DoubleClick Acquisition
•    g3: FTC clears Google DoubleClick deal
•    g4: US regulator clears DoubleClick deal
•    g5: DoubleClick deal brings greater focus on privacy


•    e1: EU Agrees to Reduce Aviation Emissions
•    e2: Aviation to be included in EU emissions trading
•    e3: EU wants tougher green aviation laws

•    Underlined words appeared in more than one document




57      22 January 2008
Term Document Matrix (X)
                          g1   g2   g3   g4   g5   e1   e2   e3

google                    1    1    1    0    0    0    0    0
green                     1    0    0    0    0    0    0    1
ftc                       1    0    1    0    0    0    0    0
doubleclick               1    1    1    1    1    0    0    0
acquisition               1    1    0    0    0    0    0    0
clear                     0    0    1    1    0    0    0    0
deal                      0    0    1    1    1    0    0    0
eu                        0    0    0    0    0    1    1    1
aviation                  0    0    0    0    0    1    1    1
emission                  0    0    0    0    0    1    1    0



58      22 January 2008
Retrieval example
•    Query (or Profile) q = “Google Acquisition”


•    Query vector         q = [1 0 0 0 1 0 0 0 0 0]'

•    Cosine similarity of the query to the documents

                           g1    g2     g3    g4   g5   e1   e2   e3
             S=
                          0.634 0.816 0.447   0    0    0    0    0



•    What about the documents g4 and g5?
      • Problem of data sparsity
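
The retrieval step can be re-run on the binary term-document matrix shown earlier. This is a sketch with NumPy; since it uses the unweighted 0/1 matrix exactly as printed, individual values may differ somewhat from the figures on the slide if those were produced with a different weighting.

```python
import numpy as np

terms = ["google", "green", "ftc", "doubleclick", "acquisition",
         "clear", "deal", "eu", "aviation", "emission"]
doc_ids = ["g1", "g2", "g3", "g4", "g5", "e1", "e2", "e3"]

# Rows are terms, columns are documents (same layout as the matrix X above).
X = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0],   # google
    [1, 0, 0, 0, 0, 0, 0, 1],   # green
    [1, 0, 1, 0, 0, 0, 0, 0],   # ftc
    [1, 1, 1, 1, 1, 0, 0, 0],   # doubleclick
    [1, 1, 0, 0, 0, 0, 0, 0],   # acquisition
    [0, 0, 1, 1, 0, 0, 0, 0],   # clear
    [0, 0, 1, 1, 1, 0, 0, 0],   # deal
    [0, 0, 0, 0, 0, 1, 1, 1],   # eu
    [0, 0, 0, 0, 0, 1, 1, 1],   # aviation
    [0, 0, 0, 0, 0, 1, 1, 0],   # emission
], dtype=float)

# Query vector for "Google Acquisition".
q = np.zeros(len(terms))
q[[terms.index("google"), terms.index("acquisition")]] = 1.0

# Cosine similarity between the query and every document column.
sims = (q @ X) / (np.linalg.norm(q) * np.linalg.norm(X, axis=0))
by_doc = dict(zip(doc_ids, sims))
```

Documents g4 and g5 share no term with the query, so their similarity is exactly zero even though they cover the same story; this is the sparsity problem noted above.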


59      22 January 2008
Under this topic
• Document representation


• Document analysis using
     • Latent Semantic Analysis (LSA)
     • Probabilistic Latent Semantic Analysis (PLSA)


• Document Classification
     • Support Vector Machine (SVM): A machine learning
       algorithm


60    22 January 2008
Latent Semantic Analysis (LSA)
•    If you search for “Tata Nano”, aren’t the documents
     containing “People’s Car” also relevant?


•    How can a machine understand that?
     • Analyze the collection of documents


     • Documents that contain “Tata Nano” generally contain “People’s
       Car” as well
        • The covariance of these two dimensions is high


     • LSA finds such correlation using a technique from linear algebra



61     22 January 2008
LSA
•    Transforms the term document matrix into a relation
     between the
     • terms and some concepts,
     • relation between those concepts and the documents


•    Concepts are the dimensions of maximum variance


•    Removes the dimensions with low variance
     • Reduction in feature space
     • Term document matrix becomes denser




62     22 January 2008
Singular Value Decomposition
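      $$X_{t \times d} = T_{t \times m}\; S_{m \times m}\; D'_{m \times d}, \qquad S = \mathrm{diag}(\sigma_1, \sigma_2, \sigma_3, \ldots, \sigma_m)$$

      Rows of X are terms, columns of X are documents
      $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_m > 0$
      m is the rank of the matrix X
      T and D have orthonormal columns
      S is a diagonal matrix of singular values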


 63      22 January 2008
Reduced SVD
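      $$(X_k)_{t \times d} = (T_k)_{t \times k}\; (S_k)_{k \times k}\; (D_k')_{k \times d}, \qquad S_k = \mathrm{diag}(\sigma_1, \ldots, \sigma_k)$$

        - Choose the largest k singular values (σ1 … σk)
        - Choose the corresponding k columns of T and D
        - Then construct Xk
        - Xk is the best rank-k approximation of X in terms of the Frobenius norm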
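
A short NumPy sketch of the reduced SVD. `np.linalg.svd` returns the singular values in decreasing order, so keeping the first k columns and rows implements the selection described above; the function name and the test matrix are illustrative.

```python
import numpy as np

def rank_k_approximation(X, k):
    """Return (Xk, singular values) where Xk = Tk Sk Dk'."""
    T, s, Dt = np.linalg.svd(X, full_matrices=False)
    Tk, Sk, Dkt = T[:, :k], np.diag(s[:k]), Dt[:k, :]
    return Tk @ Sk @ Dkt, s

# Small hypothetical matrix with singular values 4, 2 and 1.
X = np.array([[3.0, 1.0, 0.0],
              [1.0, 3.0, 0.0],
              [0.0, 0.0, 1.0]])
X2, s = rank_k_approximation(X, k=2)

# The Frobenius error equals the root of the sum of squared discarded
# singular values (Eckart-Young theorem).
error = np.linalg.norm(X - X2)
```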
  64      22 January 2008
Example
•    g1: Google Gets Green Light from FTC for DoubleClick Acquisition
•    g2: Google Closes In on DoubleClick Acquisition
•    g3: FTC clears Google DoubleClick deal
•    g4: US regulator clears DoubleClick deal
•    g5: DoubleClick deal brings greater focus on privacy


•    e1: EU Agrees to Reduce Aviation Emissions
•    e2: Aviation to be included in EU emissions trading
•    e3: EU wants tougher green aviation laws


•    Query (or Profile) q = “Google Acquisition”


65      22 January 2008
Term Document Matrix (X)
                          g1   g2   g3   g4   g5   e1   e2   e3

google                    1    1    1    0    0    0    0    0
green                     1    0    0    0    0    0    0    1
ftc                       1    0    1    0    0    0    0    0
doubleclick               1    1    1    1    1    0    0    0
acquisition               1    1    0    0    0    0    0    0
clear                     0    0    1    1    0    0    0    0
deal                      0    0    1    1    1    0    0    0
eu                        0    0    0    0    0    1    1    1
aviation                  0    0    0    0    0    1    1    1
emission                  0    0    0    0    0    1    1    0



66      22 January 2008
LSA Example

[Figure: the matrices T (10×7), S (7×7) and D' (7×8) of the SVD of X]



67     22 January 2008
LSA Example
• Rank 2 approximation of X

[Figure: the rank-2 matrix X2, rows are terms and columns are documents]




68   22 January 2008
LSA Example
•    Query (or Profile) q = “Google Acquisition”
•    Query vector q = [1 0 0 0 1 0 0 0 0 0]'


•    Representation of the query
       Dq = q' T2 S2⁻¹ = [-0.204  0.005]


•    Query to document similarity
       Sim = Dq S2² D2'
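
The query folding and similarity computation can be sketched with NumPy on the same term-document matrix. Two caveats: the signs of singular vectors are not unique, so the entries of Dq may come out with flipped signs on a different SVD implementation, and the variable names are assumptions. The similarity itself does not depend on the sign choice, because Dq S2² D2' simplifies to q' X2.

```python
import numpy as np

terms = ["google", "green", "ftc", "doubleclick", "acquisition",
         "clear", "deal", "eu", "aviation", "emission"]
X = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0],   # google
    [1, 0, 0, 0, 0, 0, 0, 1],   # green
    [1, 0, 1, 0, 0, 0, 0, 0],   # ftc
    [1, 1, 1, 1, 1, 0, 0, 0],   # doubleclick
    [1, 1, 0, 0, 0, 0, 0, 0],   # acquisition
    [0, 0, 1, 1, 0, 0, 0, 0],   # clear
    [0, 0, 1, 1, 1, 0, 0, 0],   # deal
    [0, 0, 0, 0, 0, 1, 1, 1],   # eu
    [0, 0, 0, 0, 0, 1, 1, 1],   # aviation
    [0, 0, 0, 0, 0, 1, 1, 0],   # emission
], dtype=float)

k = 2
T, s, Dt = np.linalg.svd(X, full_matrices=False)
T2, S2, D2t = T[:, :k], np.diag(s[:k]), Dt[:k, :]

# Query vector for "Google Acquisition".
q = np.zeros(len(terms))
q[[terms.index("google"), terms.index("acquisition")]] = 1.0

Dq = q @ T2 @ np.linalg.inv(S2)      # query in the 2-dimensional concept space
sims = Dq @ S2 @ S2 @ D2t            # Sim = Dq S2^2 D2'
```

Unlike the plain term-vector match, g4 and g5 now receive a non-zero similarity, because the reduced space links them to the other DoubleClick stories.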




69     22 January 2008
LSA Example
[Figure: Sim computed as the matrix product Dq × S2² × D2']




70        22 January 2008
Example
•    g1: Google Gets Green Light from FTC for DoubleClick Acquisition [1.284]
•    g2: Google Closes In on DoubleClick Acquisition [0.936]
•    g3: FTC clears Google DoubleClick deal [1.426]
•    g4: US regulator clears DoubleClick deal [0.891]
•    g5: DoubleClick deal brings greater focus on privacy [0.697]


•    e1: EU Agrees to Reduce Aviation Emissions [0.035]
•    e2: Aviation to be included in EU emissions trading [0.035]
•    e3: EU wants tougher green aviation laws [0.152]

•    Words underlined in the original slide appear in more than one document




71      22 January 2008
Under this topic
• Document representation


• Document analysis using
     • Latent Semantic Analysis (LSA)
     • Probabilistic Latent Semantic Analysis (PLSA)


• Document Classification
     • Support Vector Machine (SVM): a machine learning
       algorithm


72    22 January 2008
Probabilistic Latent Semantic Analysis
(PLSA)
• If we know that the document collection contains two
     topics, can we do better?
     • Can we estimate
        • Probability( topic | document) ?
        • Probability( word | topic) ?


     • If we can also estimate Probability( topic | query) then we
       can compute the document to query similarity


•    PLSA is a statistical technique to estimate these probabilities
     from a collection of documents

73     22 January 2008
Probabilistic Latent Semantic Analysis
(PLSA)
• Dyadic      data: Two (abstract) sets of objects, X ={x1,
     ..,xm} and Y ={y1, … ,yn} in which observations are
     made of dyads(x,y)
     • Simplest case: observation of co-occurrence of x and y
     • Other cases may involve scalar weight for each
       observation

• Examples:
     • X = Documents, Y =Words
     • X = Users, Y =Purchased Items
     • X = Pixels, Y =Values


74     22 January 2008
PLSA
•    A document consists of topics, and the words in the document are generated
     based on those topics


•    Generative model (asymmetric): (di, wj) is generated as follows
      • pick a document with probability P(di),
      • pick a topic zk with probability P(zk | di),
      • generate a word wj with probability P(wj | zk)


     Graphical model: D → Z → W, with P(di) on D, P(zk | di) on the edge D → Z and P(wj | zk) on the edge Z → W

     $P(d_i, w_j) = P(d_i)\, P(w_j \mid d_i)$

     $P(w_j \mid d_i) = \sum_{k=1}^{K} P(w_j \mid z_k)\, P(z_k \mid d_i)$




75       22 January 2008
PLSA
•    Parameters P(di), P(zk | di), P(wj | zk)
      • P(di) is proportional to the number of times the document is observed and can be
        computed independently
      • P(zk | di), P(wj | zk) can be estimated using the Expectation Maximization
        (EM) algorithm

      Likelihood of the observed data:
      $P(D, W) = \prod_{i=1}^{M} \prod_{j=1}^{N} P(d_i, w_j)^{n(d_i, w_j)}$

      Log-likelihood:
      $L = \sum_{i=1}^{M} \sum_{j=1}^{N} n(d_i, w_j) \ln P(d_i, w_j)$

      M = number of documents; N = number of distinct words; n(d_i, w_j) = number of times word w_j occurs in document d_i


76      22 January 2008
PLSA: EM steps
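•    E-Step:
       $P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k)\, P(z_k \mid d_i)}{\sum_{l=1}^{K} P(w_j \mid z_l)\, P(z_l \mid d_i)}$

•    M-Step:
       $P(w_j \mid z_k) = \frac{\sum_{i=1}^{M} n(d_i, w_j)\, P(z_k \mid d_i, w_j)}{\sum_{m=1}^{N} \sum_{i=1}^{M} n(d_i, w_m)\, P(z_k \mid d_i, w_m)}$

       $P(z_k \mid d_i) = \frac{\sum_{j=1}^{N} n(d_i, w_j)\, P(z_k \mid d_i, w_j)}{n(d_i)}$

     where $n(d_i) = \sum_{j=1}^{N} n(d_i, w_j)$ is the length of document d_i (M documents, N distinct words)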
77     22 January 2008
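The E-step and M-step above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `plsa` is a hypothetical function name, the random initialisation and seed are arbitrary, and a tiny constant is added to denominators only to avoid division by zero. The count matrix is built from the running news example, with documents as rows and words as columns.

```python
import numpy as np


def plsa(n, K, iters=20, seed=0):
    """EM for PLSA on a count matrix n (M documents x N words).

    Returns P(z|d) as an M x K array, P(w|z) as a K x N array, and the
    value of sum_ij n(d_i, w_j) ln P(w_j | d_i) before each iteration
    (the part of the log-likelihood that depends on these parameters).
    """
    rng = np.random.default_rng(seed)
    M, N = n.shape
    eps = 1e-12  # numerical guard only, an assumption of this sketch
    p_z_d = rng.random((M, K))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((K, N))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    trace = []
    for _ in range(iters):
        # E-step: P(z_k | d_i, w_j) is proportional to P(w_j | z_k) P(z_k | d_i)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]     # M x K x N
        p_w_d = joint.sum(axis=1)                         # P(w_j | d_i), M x N
        trace.append(float((n * np.log(p_w_d + eps)).sum()))
        post = joint / (p_w_d[:, None, :] + eps)
        # M-step: re-estimate both tables from the expected counts
        weighted = n[:, None, :] * post                   # n(d_i, w_j) P(z_k | d_i, w_j)
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=2) / n.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z, trace


# Dyadic (document, word) data of the running example.
docs = {
    "g1": ["google", "green", "ftc", "doubleclick", "acquisition"],
    "g2": ["google", "doubleclick", "acquisition"],
    "g3": ["ftc", "clear", "google", "doubleclick", "deal"],
    "g4": ["clear", "doubleclick", "deal"],
    "g5": ["doubleclick", "deal"],
    "e1": ["eu", "aviation", "emission"],
    "e2": ["aviation", "eu", "emission"],
    "e3": ["eu", "green", "aviation"],
}
vocab = sorted({w for words in docs.values() for w in words})
n = np.array([[words.count(w) for w in vocab] for words in docs.values()],
             dtype=float)

p_z_d, p_w_z, loglik = plsa(n, K=2, iters=20)
for name, row in zip(docs, p_z_d):
    print(name, np.round(row, 3))
```

Which latent topic ends up as Z1 or Z2 depends on the random initialisation, and EM may stop in a local optimum, so several restarts are commonly used.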
PLSA Example
     • Dyadic data in our example: (document, word) pairs

                                 g1   google        e1   eu
                                 g1   green         e1   aviation
                                 g1   ftc           e1   emission
                                 g1   doubleclick   e2   aviation
                                 g1   acquisition   e2   eu
                                 g2   google        e2   emission
                                 g2   doubleclick   e3   eu
                                 g2   acquisition   e3   green
                                 g3   ftc           e3   aviation
                                 g3   clear
                                 g3   google
                                 g3   doubleclick
                                 g3   deal
                                 g4   clear
                                 g4   doubleclick
                                 g4   deal
                                 g5   doubleclick
                                 g5   deal
78     22 January 2008
PLSA Example
• After 20 iterations of the EM algorithm: estimated P(zk |di) and P(wj |zk) (both tables appear only as images in the original slide)

79   22 January 2008
PLSA Example
• Query q = “Google Acquisition”
• Steps
     • Keep P(wj |zk) fixed.
     • Estimate P(zk |q) using EM steps
     • Then compute the cosine similarity of the vector P(Z|q) with each
       document's vector P(Z|d)

     • Resulting P(zk |q) for q: Z1 = 1, Z2 = 0




80     22 January 2008
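The query steps above (keep P(w|z) fixed, estimate P(z|q) by EM, then compare with P(Z|d)) can be sketched in plain Python. The numbers here are assumptions for illustration only: `P_W_Z` is an idealised word distribution obtained by assigning the g documents to one topic and the e documents to the other, and `P_Z_D` holds the idealised document vectors implied by the 1.0 and 0.0 scores in the slides. `fold_in` and `cosine` are hypothetical helper names.

```python
import math

# Idealised P(w | z), assumed for this sketch (word counts of the g and e documents).
P_W_Z = [
    {"google": 3 / 18, "green": 1 / 18, "ftc": 2 / 18, "doubleclick": 5 / 18,
     "acquisition": 2 / 18, "clear": 2 / 18, "deal": 3 / 18},     # topic Z1
    {"eu": 3 / 9, "aviation": 3 / 9, "emission": 2 / 9, "green": 1 / 9},  # topic Z2
]

# Idealised P(Z | d), assumed for this sketch.
P_Z_D = {"g1": [1.0, 0.0], "g5": [1.0, 0.0], "e1": [0.0, 1.0], "e3": [0.0, 1.0]}


def fold_in(query_words, p_w_z, iters=20):
    """Estimate P(z | q) by EM while P(w | z) stays fixed."""
    K = len(p_w_z)
    p_z_q = [1.0 / K] * K
    for _ in range(iters):
        expected = [0.0] * K
        for w in query_words:
            # E-step: posterior over topics for this query word
            weights = [p_w_z[k].get(w, 0.0) * p_z_q[k] for k in range(K)]
            total = sum(weights)
            if total == 0.0:
                continue  # word unknown to every topic carries no evidence
            for k in range(K):
                expected[k] += weights[k] / total
        norm = sum(expected)
        if norm == 0.0:
            break
        # M-step: only P(z | q) is updated
        p_z_q = [v / norm for v in expected]
    return p_z_q


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


p_z_q = fold_in(["google", "acquisition"], P_W_Z)
scores = {doc: cosine(p_z_q, vec) for doc, vec in P_Z_D.items()}
print(p_z_q, scores)
```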
Example
•    g1: Google Gets Green Light from FTC for DoubleClick Acquisition   [1.0]

•    g2: Google Closes In on DoubleClick Acquisition [1.0]
•    g3: FTC clears Google DoubleClick deal [1.0]
•    g4: US regulator clears DoubleClick deal [1.0]
•    g5: DoubleClick deal brings greater focus on privacy [1.0]


•    e1: EU Agrees to Reduce Aviation Emissions [0.0]
•    e2: Aviation to be included in EU emissions trading [0.0]
•    e3: EU wants tougher green aviation laws [0.0]

•    Words underlined in the original slide appear in more than one document




81      22 January 2008
Under this topic
• Document representation

• Document analysis using
     • Latent Semantic Analysis (LSA)
     • Probabilistic Latent Semantic Analysis (PLSA)


• Document Classification
     • Support Vector Machine (SVM): a machine learning
       algorithm


82    22 January 2008
Document Classification




83   22 January 2008
Document classification with SVM
•    We will concentrate on binary classification
     • {sports, not sports}, {interesting, not interesting} etc
     • In general, the labels are {+1, -1}, also called {positive, negative}

•    SVM is a supervised machine learning technique. It learns
     the pattern from a training set

•    Training set
     • A set of documents with labels belonging to {+1, -1}

•    SVM tries to draw a hyperplane that best separates the
     positive and negative data in the training set


84     22 January 2008
Support Vector Machine (SVM)
•    A Machine learning algorithm


•    SVM was introduced in COLT-92 by Boser, Guyon and
     Vapnik.


•    Initially popularized in the NIPS community, now an
     important and active area of machine learning
     research


•    Successful applications in many fields (text, bioinformatics,
     handwriting, image recognition etc.)

85     22 January 2008
SVM – Maximum margin separation




(Figure in the original slide: SVM illustration by Bülent Üstün, Radboud Universiteit)




86   22 January 2008
Mapping to higher dimension for non-separable data
Original points (corners of the unit square) and their labels:
     P1   • (0,0) x {+1}
     P2   • (0,1) x {-1}
     P3   • (1,1) x {+1}
     P4   • (1,0) x {-1}

Feature map:
     $x \rightarrow \varphi(x) = (x_1^2,\; x_2^2,\; x_1 x_2)$

Mapped points:
     P1   • (0,0,0) x {+1}
     P2   • (0,1,0) x {-1}
     P3   • (1,1,1) x {+1}
     P4   • (1,0,0) x {-1}
87        22 January 2008
The XOR example




                       SVM uses the kernel trick to work in a higher
                       dimensional feature space without incurring
                       much computational overhead

88   22 January 2008
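The XOR mapping can be checked with a few lines of plain Python. This is an illustrative sketch, not an SVM: the hyperplane weights `w` and `b` are picked by hand (an assumption of this sketch, not learned), and `phi_scaled` is the usual variant of the feature map with a sqrt(2) factor, used here only to show the kernel identity (x·y)² = φ(x)·φ(y).

```python
import math

# XOR points and labels from the slide.
points = {"P1": ((0, 0), +1), "P2": ((0, 1), -1),
          "P3": ((1, 1), +1), "P4": ((1, 0), -1)}


def phi(x):
    """Feature map from the slide: (x1, x2) -> (x1^2, x2^2, x1*x2)."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, x1 * x2)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


# Hand-picked separating hyperplane in the 3-D feature space (hypothetical values).
w, b = (-2.0, -2.0, 4.0), 1.0


def classify(x):
    return 1 if dot(w, phi(x)) + b > 0 else -1


for name, (x, label) in points.items():
    print(name, phi(x), "label", label, "predicted", classify(x))


def phi_scaled(x):
    """Variant with sqrt(2) so that the feature-space dot product equals (x.y)^2."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)


def poly_kernel(x, y):
    """Computes the feature-space dot product without building the features."""
    return dot(x, y) ** 2
```

The last two functions illustrate the kernel trick: `poly_kernel` works on the original 2-D points but returns the same value as the dot product of the mapped 3-D vectors.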
Recommender
System
Select top N items for a user
-Somnath Banerjee




Classification
• Broadly, three approaches
     • Content Based Recommendation
     • Collaborative Filtering
     • Hybrid approach




92    22 January 2008
Content based recommendation
• Utility   of an item for a user is determined based on
     the items preferred by the user in the past

• Applies techniques similar to those introduced in the
     document modeling part




93     22 January 2008
Basic Approach
•    Create and represent the user profile from the items rated by the user in
     the past
      • A popular choice of profile representation is vector of terms weighted
        based on TF*IDF

•    Represent the item in the same format
      • A news item can be represented using (TF*IDF) term vector
      • For movies and books, one needs sufficient metadata to represent the
        item in vector format

•    Define a similarity measure to compute the similarity between the
     profile and the item
      • Popular choice is cosine similarity
      • Advanced machine learning techniques can also be applied to do the
        matching

•    Recommend most similar items


94      22 January 2008
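The basic approach above (TF*IDF profile, TF*IDF items, cosine similarity, recommend the most similar items) can be sketched in plain Python. This is a minimal illustration under stated assumptions: the items are the headlines of the running news example reduced to the terms of the term-document matrix, the profile is simply the sum of the vectors of the liked items, and `tfidf_vectors`, `cosine` and `recommend` are hypothetical helper names.

```python
import math
from collections import Counter

# Items: running-example headlines, already reduced to index terms.
items = {
    "g1": "google green ftc doubleclick acquisition".split(),
    "g2": "google doubleclick acquisition".split(),
    "g3": "ftc clear google doubleclick deal".split(),
    "g4": "clear doubleclick deal".split(),
    "g5": "doubleclick deal".split(),
    "e1": "eu aviation emission".split(),
    "e2": "aviation eu emission".split(),
    "e3": "eu green aviation".split(),
}


def tfidf_vectors(docs):
    """Return {doc_id: {term: tf * idf}} with idf = ln(N / df)."""
    n_docs = len(docs)
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    return {doc_id: {t: tf * math.log(n_docs / df[t])
                     for t, tf in Counter(tokens).items()}
            for doc_id, tokens in docs.items()}


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(liked_ids, vectors, top_n=3):
    """Build the profile from liked items and rank the remaining items."""
    profile = {}
    for doc_id in liked_ids:
        for term, weight in vectors[doc_id].items():
            profile[term] = profile.get(term, 0.0) + weight
    candidates = [d for d in vectors if d not in liked_ids]
    ranked = sorted(((d, cosine(profile, vectors[d])) for d in candidates),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


vectors = tfidf_vectors(items)
ranking = recommend(["g1", "g2"], vectors, top_n=6)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```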
Problems with content based
recommendation
•    Knowledge engineering problem
      • How do you describe multimedia, graphics, movies,
        songs?


•    Recommendations show limited diversity


•    New user problem
     • It requires a large number of ratings from the user to generate
       quality recommendations




95     22 January 2008
Collaborative filtering
•    Recommends items that were liked in the past by other users
     with similar tastes


•    Quite popular in e-commerce sites, like Amazon, eBay


•    Can recommend various media types: text, video, audio,
     ads, products




96     22 January 2008
97   22 January 2008
Advantages
• Does not have the knowledge engineering
     problem
     • Both users and items can be represented using just IDs


• Recommendations often show a good amount of
     diversity




98     22 January 2008
Let's learn C.F. with an example


            Ran   Casablanca   Ben Hur   Tomb Raider   MI-II   Air Force One
 Jane        5        5          n/r        n/r          ?           2
 Bill       n/r       2          n/r         3           4          n/r
 Tom         2       n/r          2         n/r          5           5
 Cathy       3        3          n/r        n/r          1           1

 (n/r = not rated)

 What rating will Jane possibly give to MI-II?


99      22 January 2008
Normalizing the ratings
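•   Users do not all give the same rating even if they liked/disliked an
    item equally
      • Normalize each rating by subtracting the user's mean rating: $r'_{u,i} = r_{u,i} - \bar{r}_u$


            Ran   Casablanca   Ben Hur   Tomb Raider   MI-II   Air Force One
 Jane        1        1          n/r        n/r         n/r         -2
 Bill       n/r      -1          n/r         0           1          n/r
 Tom       -1.5      n/r        -1.5        n/r         1.5         1.5
 Cathy       1        1          n/r        n/r         -1          -1

 (n/r = not rated)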



100     22 January 2008
Similarity between users
•   Which other users have tastes similar to Jane's?
      • Each row of the matrix is a vector representing the user
      • Compute cosine similarity between the users




                                Bill           Tom           Cathy
             Jane             -0.289         -0.612          0.816




101     22 January 2008
Compute probable rating
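•   The predicted rating is the user's mean rating plus the normalized ratings given by the
    other users, weighted by their similarity to the user
      • Sometimes only the top N similar users are taken

      $r_{u,i} = \bar{r}_u + \frac{\sum_{v \in V} \mathrm{sim}(u, v)\,(r_{v,i} - \bar{r}_v)}{\sum_{v \in V} |\mathrm{sim}(u, v)|}$

      where $\bar{r}_u$ is the mean rating of user u and V is the set of other users who rated item i

•   Jane will rate MI-II as
      $4 + \frac{(-0.289 \times 1) + (-0.612 \times 1.5) + (0.816 \times -1)}{0.289 + 0.612 + 0.816} \approx 2.82$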


102     22 January 2008
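The worked example for Jane can be reproduced in plain Python. This is a minimal sketch of the memory-based computation described above, with a few assumptions made explicit: unrated items count as 0 in the cosine similarity, the denominator sums absolute similarities (as the worked numbers for Jane do), and the function names are hypothetical.

```python
import math

# Ratings table from the example (Jane's MI-II rating is unknown).
ratings = {
    "Jane":  {"Ran": 5, "Casablanca": 5, "Air Force One": 2},
    "Bill":  {"Casablanca": 2, "Tomb Raider": 3, "MI-II": 4},
    "Tom":   {"Ran": 2, "Ben Hur": 2, "MI-II": 5, "Air Force One": 5},
    "Cathy": {"Ran": 3, "Casablanca": 3, "MI-II": 1, "Air Force One": 1},
}


def mean_rating(user):
    values = ratings[user].values()
    return sum(values) / len(values)


def normalized(user):
    """Subtract the user's mean rating from each of their ratings."""
    mu = mean_rating(user)
    return {item: r - mu for item, r in ratings[user].items()}


def similarity(u, v):
    """Cosine similarity of the normalized rating vectors (unrated = 0)."""
    a, b = normalized(u), normalized(v)
    dot = sum(a[item] * b[item] for item in a if item in b)
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b)


def predict(user, item):
    """Mean rating of the user plus the similarity-weighted normalized ratings."""
    neighbours = [v for v in ratings if v != user and item in ratings[v]]
    num = sum(similarity(user, v) * normalized(v)[item] for v in neighbours)
    den = sum(abs(similarity(user, v)) for v in neighbours)
    return mean_rating(user) + num / den


for other in ("Bill", "Tom", "Cathy"):
    print(other, round(similarity("Jane", other), 3))
print("Jane, MI-II:", round(predict("Jane", "MI-II"), 2))
```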
Remarks
•   There is another popular version of the above technique
    in which item-to-item similarity is computed instead of
    user-to-user similarity
      • Rating prediction is based on the similarity to the items rated by the
        user


•   The above mentioned methods are known as memory
    based techniques
     • They have the disadvantage of requiring more online
       computation




103     22 January 2008
Model based technique
•   A model is learnt using the collection of ratings as the training
    set


•   Prediction is done using the model


•   More offline computing and less online computing




104   22 January 2008
Model based technique
•   A simple model



            $r_{u,i} = E(r_{u,i}) = \sum_{r \in R} r \times \Pr(r_{u,i} = r \mid r_{u,s'},\; s' \in I_u)$




105   22 January 2008
Model based technique
•   Recent research tries to model the recommendation process
    with more complex probabilistic models

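       Graphical model (image in the original slide) with nodes u, z, i, r: u → z, and z and i → r

       $P(r \mid u, i) = \sum_{z} P(r \mid z, i) \times P(z \mid u)$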

•   Parameters P(r|z,i) and P(z|u) can be estimated using EM
    algorithm


106   22 January 2008
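The mixture formula P(r | u, i) = Σz P(r | z, i) × P(z | u) and the expected rating Σr r × Pr(r) can be illustrated with a small plain-Python sketch. All probability tables below are hypothetical numbers chosen for illustration (in practice they would be estimated with EM), and the latent class names and function names are assumptions of this sketch.

```python
def rating_distribution(p_z_given_u, p_r_given_z_i):
    """P(r | u, i) = sum over z of P(r | z, i) * P(z | u)."""
    dist = {}
    for z, p_z in p_z_given_u.items():
        for r, p_r in p_r_given_z_i[z].items():
            dist[r] = dist.get(r, 0.0) + p_r * p_z
    return dist


def expected_rating(dist):
    """E(r) = sum over r of r * P(r)."""
    return sum(r * p for r, p in dist.items())


# Hypothetical parameters for one user u and one item i.
p_z_given_u = {"action_fan": 0.8, "drama_fan": 0.2}
p_r_given_z_i = {
    "action_fan": {5: 0.6, 4: 0.3, 3: 0.1},
    "drama_fan": {2: 0.5, 1: 0.5},
}

dist = rating_distribution(p_z_given_u, p_r_given_z_i)
prediction = expected_rating(dist)
print(dist, round(prediction, 2))
```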
Problems of C.F.
• New user problem

• New item problem

• Sparsity problem
      • A user rates only a few items

• Unusual user
      • User whose tastes are unusual compared to the rest of
        the population

107    22 January 2008
Hybrid approaches
- Combining Collaborative and Content based methods

• Combining predictions of the Content based method
  and C.F.
      • Implement separate content based and collaborative
        filtering methods


      • Combine their predictions using
        • Linear combination
        • Voting schemes


      • Alternatively select a prediction method based on some
        confidence measure on the recommendation

108    22 January 2008
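The three ways of combining predictions listed above can be sketched as follows. This is an illustrative outline only: the weight `alpha`, the vote encoding and the confidence values are assumptions of this sketch, and the function names are hypothetical.

```python
from collections import Counter


def linear_combination(content_score, cf_score, alpha=0.5):
    """Weighted average of the two predicted ratings; alpha weights the content based one."""
    return alpha * content_score + (1 - alpha) * cf_score


def majority_vote(votes):
    """Voting scheme: each method votes (e.g. recommend or not); the most common vote wins."""
    return Counter(votes).most_common(1)[0][0]


def select_by_confidence(predictions):
    """predictions: list of (predicted_rating, confidence); keep the most confident one."""
    return max(predictions, key=lambda pair: pair[1])[0]


combined = linear_combination(4.0, 3.0, alpha=0.25)
voted = majority_vote([True, True, False])
selected = select_by_confidence([(4.0, 0.2), (3.0, 0.9)])
print(combined, voted, selected)
```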
Hybrid Approaches
• Adding content based characteristics into a C.F.
  based method
      • Maintain a content based profile for each user
      • Use these content based profiles (not the commonly
        rated items) to compute the similarity between users
      • Then do C.F.
      • Helps to overcome sparsity related problems, as
        generally not many items are rated in common by two users




109    22 January 2008
Hybrid approaches
• Adding C.F. characteristics into a content based
  method
      • The most popular technique in this category is
        dimensionality reduction on a group of content based
        profiles


      • A dimensionality reduction technique like LSA can improve
        prediction quality by producing a compact representation of the
        profiles




110    22 January 2008
Future directions of research (Adomavicius et al.)

•   Incorporating richer user and item profiles in a unified
    framework of different methods

•   Using contextual information in recommendation
      • Example: when recommending a vacation package, the system should
        consider
         •   User
         •   Time of the year
         •   With whom the user plans to travel
         •   Traveling conditions and restrictions at the time


•   Multi-Criteria ratings
      • E.g. three-criteria restaurant ratings: food, décor and service


111     22 January 2008
Future directions of research
• Non-intrusiveness


• Flexibility
      • Enabling end-users to customize recommendations


• Evaluation
      • Empirical evaluation on test data that users choose to
        rate
        • Items that users choose to rate are likely to be biased
      • Economics-oriented measures

112    22 January 2008
References (Recommender System)
•     Adomavicius, G., and Tuzhilin, A., “Toward the Next
      Generation of Recommender Systems: A Survey of the
      State-of-the-Art and Possible Extensions”, IEEE
      Transactions on Knowledge and Data Engineering, 2005




113   22 January 2008
Semantics
in Personalization

Geetha Manjunath
Hewlett Packard Labs India




Topic Outline
• Why use semantic information?
• Introduction to Ontology
• Formal Specification of an Ontology
      • A Quick Overview of Semantic Web
• Techniques and Approaches
      • Word Sense Disambiguation
      • Semantic Profiles
      • Constrained Spreading Activation
      • Semantic Similarity
• Looking Ahead

115    22 January 2008
News Example Revisited
•   g1: Google Gets Green Light from FTC for DoubleClick Acquisition
•   g2: Google Closes In on DoubleClick Acquisition
•   g3: FTC clears Google DoubleClick deal
•   g4: US regulator clears DoubleClick deal
•   g5: DoubleClick deal brings greater focus on privacy


•   e1: EU Agrees to Reduce Aviation Emissions
•   e2: Aviation to be included in EU emissions trading
•   e3: EU wants tougher green aviation laws





116    22 January 2008
News Example Modified
•   g1: Apple Gets Green Light from FTC for TripleClick Acquisition
•   g2: Apple Closes In on TripleClick Acquisition
•   g3: FTC clears Apple TripleClick deal
•   g4: US regulator clears TripleClick deal
•   g5: TripleClick deal brings greater focus on privacy

(Callout labels in the original slide: "IT company", "Google", "Acquisition")

•   e1: EU Agrees to Reduce Aviation Emissions
•   e2: Aviation to be included in EU emissions trading
•   e3: EU wants tougher green aviation laws


•   f1: Apple prices soaring high.
•   f2: Increased apple rates causes concern to doctors.
•   f3: Cost of 10 kg of apple to become Rs 1000 from 1 Feb.
117    22 January 2008
Semantics for Personalization
(Architecture diagram in the original slide; recoverable labels:)
• Boxes: Data Collection, User Documents, Profile Constructor, User Profile, Profile to Content Matching, Personalized services
• Inputs: Explicit and Implicit info (into the Profile Constructor); Content (Search query, news, video, …) into the matching step
• Profile Representation using domain concepts
• Callouts marking where semantics is used:
      • Implicit Info based on domain knowledge
      • Represent Profiles as meaningful concepts
      • Expand the generated profile using domain info
      • Semantics based Matching Function
      • Cluster documents based on better User groups

118   22 January 2008
Techniques and Approaches
1. Implicit Information based on domain knowledge
      • Word Sense Disambiguation
2. Represent Profiles as meaningful concepts
      • Semantic Profiles
3. Semantics based Matching Function
      • Semantic Distance
4. Expand the generated profile using domain info
      • Constrained Spreading Activation
5. Cluster documents based on better User groups
      • Social Semantic Networks

119    22 January 2008
Word Sense disambiguation
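Using Wordnet (diagram in the original slide: two senses of "Jaguar")
• Hyponyms ("type of") chains:
      • Animal > Mammal > Carnivore > Feline > Big cat > Jaguar
      • Transport > Vehicle > Motor Vehicle > Automobile > Car > Jaguar
• Meronyms ("contains"):
      • Animal sense: tail, fur, nail
      • Car sense: Accelerator, Door, Bumper, Wheel
• Synonyms ("same as"): Jaguar, Panther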
120         22 January 2008
Word Sense disambiguation
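(Diagram in the original slide: two senses of "Apple")
• Company sense: Abstract entity > group > organization > institution > company > Apple
      • Additional domain information: employee, Advisory board, stocks, Revenue, Business Acquisition, Sales tax, …
• Fruit sense: entity > substance > solid > food > fruit > Apple
      • Additional domain information: eat, animal, plant, ripe, tree, skin, seed, pulp

                        KEY: Additional domain information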
121   22 January 2008
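The idea of using the additional domain information to pick a sense of "Apple" can be sketched as a simple cue-word overlap count. This is a toy illustration, not WordNet-based disambiguation: the cue sets are the labels from the diagram split into single words, the second test sentence is a made-up example, and `disambiguate` is a hypothetical helper.

```python
# Cue words per sense, taken from the "additional domain information" labels.
SENSE_CUES = {
    "company": {"employee", "advisory", "board", "stocks", "revenue",
                "business", "acquisition", "sales", "tax"},
    "fruit": {"eat", "animal", "plant", "ripe", "tree", "skin", "seed", "pulp"},
}


def disambiguate(text, sense_cues=SENSE_CUES):
    """Return the sense whose cue words overlap most with the context, or None."""
    context = set(text.lower().replace(".", " ").split())
    scores = {sense: len(cues & context) for sense, cues in sense_cues.items()}
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    # No overlap at all, or a tie between senses: stay undecided.
    return winners[0] if best > 0 and len(winners) == 1 else None


print(disambiguate("Apple Closes In on TripleClick Acquisition"))
print(disambiguate("Eat a ripe apple with the skin"))   # made-up sentence
print(disambiguate("Apple prices soaring high."))
```

The third headline shares no cue word with either sense, which is why richer domain knowledge (or the rest of the user profile) is needed in such cases.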
Three level Conceptual Network

                         • Domain Ontology




                         • Co-occurrence
                         • synonyms
                         • hyponyms
                         • ..


                         • Hyperlinks
                         • Order of access
                         • Browsed together
                         •…

122   22 January 2008
Introduction to
Ontologies



Views on Ontologies
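(Diagram in the original slide: uses of Ontologies, arranged from Front-End to Back-End)
• Towards the Front-End: TopicMaps, Thesauri, Navigation, Taxonomies, Information Retrieval, Query Expansion, Sharing of Knowledge
• Towards the Back-End: Queries, Semantic Networks, Consistency Checking, EAI, Mediation, Reasoning, Extended ER-Models, Predicate Logic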

124   22 January 2008
Structure of an Ontology
 Ontologies typically have two components:
 • Names for important concepts in the domain
      • Elephant is a concept whose members are a kind of
        animal
      • Herbivore is a concept whose members are exactly
        those animals that eat only plants or parts of plants
 • Background knowledge/constraints on the
      domain
      • No individual can be both a Herbivore and a
        Carnivore



125    22 January 2008
A Simple Ontology
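(Diagram in the original slide; recoverable labels:)
• Concepts: Object, Person, Topic, Document, Student, Researcher, PhD Student, Semantics, Ontology
• "Is a" links: Person, Topic and Document under Object; Student and Researcher under Person; PhD Student under Student
• Relations: Person knows Topic; Person writes Document; Topic Described in Document; Semantics similar Ontology
• Rules illustrated at the bottom of the slide:
      • Topic Described in Document <-> Document Is about Topic
      • Person writes Document and Document Is about Topic -> Person knows Topic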

126     22 January 2008
Defining Ontology
[Gruber, 1993]
An Ontology is a formal specification of a shared conceptualization of a domain of interest.
• formal specification: Executable
• shared: Group of persons
• conceptualization: About concepts
• domain of interest: Application & “unique truth”


• Formal description of concepts and their relationships
• Strong Basis in the family of First Order Logics (DL)
• Deductive Inference based on ground truth of the domain.



127   22 January 2008
Formal
Specification of
Ontologies
Semantic Web: A quick introduction




The Semantic Web Vision
Semantic web aims to transform WWW into a global database




“The semantic web is a web
for computers”

129   22 January 2008
Semantic web
Make web resources more accessible to automated processes
•   Extend existing rendering markup with semantic markup
      • Metadata annotations that describe content/function of web
        accessible resources
•   Use Ontologies to provide vocabulary for annotations
      • “Formal specification” is accessible to machines
•   A prerequisite is a standard web ontology language
      • Need to agree on a common syntax before we can share semantics
      • Syntactic web based on standards such as HTTP and HTML




130     22 January 2008
Semantic Web Layers




(Layer diagram in the original slide; callout labels: "Context for vocabulary", "Globally Unambiguous Identifiers", "User definable, domain specific markup")




131     22 January 2008
What is RDF ?
• RDF – Resource Description Framework
• RDF is a data model
• Statement-based approach
      • Subject/predicate/object triples – simple powerful unit
      • All resources identified by URIs
      • Triples create a directed labelled graph of
         • object/attribute/value
         • (semantic) relationships between objects
• RDF model is an abstract layer independent of XML
      • XML serialization is supported


132    22 January 2008
RDF Example
(Graph in the original slide; legend: resource, property, value)
      • ../presentation.ppt  dc:creator  people.com/../dave_reynolds
      • ../presentation.ppt  dc:date  2005-09-23
      • ../presentation.ppt  dc:description  Some starter slides…
      • people.com/../dave_reynolds  org:email  mailto:dave.reynolds@hp.com

 <rdf:Description rdf:about="allppt.com/presentation.ppt">
  <dc:creator rdf:resource="people.com/person/dave_reynolds"/>
 </rdf:Description>
 <rdf:Description rdf:about="people.com/person/dave_reynolds">
  <org:email rdf:resource="mailto:dave.reynolds@hp.com"/>
 </rdf:Description>

  Enables easy merge of information
      • Indirect metadata (anyone can say anything about anything)
      • Extensibility (open world assumption, compositional)
133      22 January 2008
RDF Schema
• Defines small vocabulary for RDF:
      • Class, subClassOf, type
      • Property, subPropertyOf
      • domain, range
• Vocabulary can be used to define other
  vocabularies for your application domain

(Class hierarchy diagram in the original slide, linked by rdfs:subClassOf: rdfs:Resource at the top, Veh:MotorVehicle below it, then Veh:Van, Veh:Truck and Veh:PassengerVehicle, with Veh:MiniVan at the bottom)


134     22 January 2008
OWL – Web Ontology Language
• A language to express an ontology
• An OWL ontology is an RDF graph
      • A set of RDF triples
      • Vocabulary Extension
•   Structure
      • Ontology headers
      • Class Axioms
         • Class Descriptions, Enumeration, Membership Restrictions
      • Property Axioms
         • Property Descriptions, Property Restrictions, Functional Spec
      • Facts about individuals
•   Callout labels in the original slide: "Domain Restrictions/Truth", "Important Concepts of the Domain"



OWL Class Constructors




The Syntax
Parent = Person with at least one child


 <owl:Class rdf:ID="Parent">
   <owl:intersectionOf rdf:parseType="Collection">
     <owl:Class rdf:about="#Person"/>
     <owl:Restriction>
       <owl:onProperty rdf:resource="#hasChild"/>
       <owl:minCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#nonNegativeInteger">1</owl:minCardinality>
     </owl:Restriction>
   </owl:intersectionOf>
 </owl:Class>




OWL Axioms




SPARQL
 •     RDF Query Language
      • Triples with unbound variables
 •     Protocol
      • HTTP binding
      • SOAP binding
 •     XML Results Format
      • Easy to transform (XSLT, XQuery)
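To make "triples with unbound variables" concrete, here is a minimal sketch of matching a single triple pattern against an in-memory list of triples. It is not a SPARQL engine; the data, the `ex:` names and the `match` helper are assumptions made for illustration.

```python
# A tiny in-memory triple store; the data is made up for illustration.
TRIPLES = [
    ("ex:presentation", "dc:creator", "ex:dave"),
    ("ex:presentation", "dc:date", "2005-09-23"),
    ("ex:dave", "org:email", "mailto:dave@example.org"),
]


def match(pattern, triples=TRIPLES):
    """Yield variable bindings for one triple pattern.

    Terms starting with '?' are unbound variables; any other term must
    match the corresponding position of the triple exactly.
    """
    for triple in triples:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable repeated in the pattern must bind to one value.
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield bindings


# Roughly: SELECT ?who WHERE { ex:presentation dc:creator ?who }
creators = list(match(("ex:presentation", "dc:creator", "?who")))
# Roughly: SELECT ?p ?o WHERE { ex:presentation ?p ?o }
about_presentation = list(match(("ex:presentation", "?p", "?o")))
```

A real query joins several such patterns on their shared variables; the sketch covers only the single-pattern case.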




Why Ontologies?
•   Enable formalisation of user preferences
      • Common underlying, interoperable representation
      • Public vocabulary agreed & shared between different systems
      • Better content matching & sharing across applications
•   User interests can be matched to content meaning
      • Using conceptual reasoning
• Richer, more precise and less ambiguous than keyword-based profiles
• Provides adequate grounding for hierarchical representation
      • coarse to fine-grained user interests
•   Formal, computer-processable meaning for the concepts


Semantic User
Profiles




Semantic Profiles
• User Profile as concepts
      • Books, Clothes and Soccer
[Figure: web pages visited by the user are mapped onto a concept hierarchy; each concept carries a weight W]
Top
      Shopping (W=2)
            Books (W=1)
            Clothes (W=1)
      Science (W=0)
      Sports (W=1)
            Soccer (W=1)
            Cricket (W=0)
      …
How do we map documents/users to concepts?
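The weights in the figure are consistent with counting one unit per visited page on its leaf concept and adding the same unit to each ancestor. The sketch below reproduces that reading; the parent links in `PARENT` and the function `build_profile` are assumptions made for illustration, not the authors' implementation.

```python
# Parent of each concept in the hierarchy, as read from the slide figure (assumed).
PARENT = {
    "Shopping": "Top", "Science": "Top", "Sports": "Top",
    "Books": "Shopping", "Clothes": "Shopping",
    "Soccer": "Sports", "Cricket": "Sports",
}


def build_profile(page_concepts):
    """Add one unit of weight per page to its concept and to every ancestor below Top."""
    weights = {concept: 0 for concept in PARENT}
    for concept in page_concepts:
        node = concept
        while node != "Top":
            weights[node] += 1
            node = PARENT[node]
    return weights


# One page each about books, clothes and soccer, as in the slide example.
profile = build_profile(["Books", "Clothes", "Soccer"])
```

With these three pages, Shopping ends up with weight 2, Sports with 1, and Science and Cricket with 0, matching the figure.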
Building concept profiles based on ODP
The Machine Learning Approach


Step 1: Build ODP classifier for selected ODP categories
      ODP categories + documents → Training → ODP classifier
Step 2: Use user data and ODP classifier to build the user profile
      User web pages → ODP classifier → ODP concepts → Add to profile
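The two steps can be sketched end to end with a toy classifier. The slides do not say which learning algorithm was used, so multinomial Naive Bayes is an assumption here, and the category names, training texts and function names are all made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train(labelled_docs):
    """Step 1: build per-category term counts from (category, text) pairs."""
    term_counts = {}
    doc_counts = Counter()
    for category, text in labelled_docs:
        term_counts.setdefault(category, Counter()).update(tokenize(text))
        doc_counts[category] += 1
    vocabulary = {term for counts in term_counts.values() for term in counts}
    return term_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the most probable category under multinomial Naive Bayes."""
    term_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category, counts in term_counts.items():
        total_terms = sum(counts.values())
        score = math.log(doc_counts[category] / total_docs)
        for term in tokenize(text):
            # Laplace smoothing keeps unseen terms from zeroing the score.
            score += math.log((counts[term] + 1) / (total_terms + len(vocabulary)))
        if score > best_score:
            best, best_score = category, score
    return best


def build_profile(user_pages, model):
    """Step 2: classify each user page and add its concept to the profile."""
    return Counter(classify(page, model) for page in user_pages)


# Toy training data standing in for ODP categories and their documents.
model = train([
    ("Sports/Soccer", "football goal league striker match"),
    ("Shopping/Books", "book author paperback novel bookstore"),
])
profile = build_profile(["league match goal tonight", "new novel by author"], model)
```

The resulting profile is a count of pages per concept, which can then be weighted or normalised as the application requires.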
Topic Hierarchy from ODP / DMOZ




Using Wikipedia to map documents to concepts
A Search Approach
Item: “Sony to slash PlayStation3 price”
Term vector Representation: <sony:1>, <slash:1>, <playstation3:1>, <price:1>

Item: “Jittery Sony Knocks $100 Off PS3 Price Tag”
Term vector Representation: <jittery:1>, <sony:1>, <knocks:1>, <ps3:1>, <price:1>, <tag:1>

[Figure: the item text “Sony to slash PlayStation3 price” is issued as a query against an index of a Wikipedia dump]

Additional features: titles of the retrieved articles
1.    PlayStation Network Platform
2.    PlayStation 2
3.    Ducks demo
4.    PlayStation 3
5.    PlayStation
6.    Ken Kutaragi
7.    PlayStation Portable
8.    Console manufacturer
9.    Sony Group
10.   Crystal Dynamics
11.   PlayStation 3 accessories
12.   …
13.   …
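A minimal sketch of the search approach follows: the item text is used as a query against an inverted index, and the titles of the best-matching articles become extra features. The three-article `ARTICLES` dictionary is a hypothetical stand-in for a Wikipedia dump, and the ranking by number of matching terms is a simplification chosen for brevity.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for a Wikipedia dump: title -> article text.
ARTICLES = {
    "PlayStation 3": "sony playstation3 ps3 console price launch",
    "Ken Kutaragi": "sony executive playstation console designer",
    "Debit card": "bank card payment account",
}


def build_index(articles):
    """Inverted index: term -> set of article titles containing it."""
    index = defaultdict(set)
    for title, text in articles.items():
        for term in text.lower().split():
            index[term].add(title)
    return index


def concept_features(item, index, top_k=2):
    """Query the index with the item text; rank titles by number of matching terms."""
    hits = Counter()
    for term in set(item.lower().split()):
        for title in index.get(term, ()):
            hits[title] += 1
    ranked = sorted(hits.items(), key=lambda pair: (-pair[1], pair[0]))
    return [title for title, _ in ranked[:top_k]]


index = build_index(ARTICLES)
first = concept_features("Sony to slash PlayStation3 price", index)
second = concept_features("Jittery Sony Knocks $100 Off PS3 Price Tag", index)
```

The two headlines share only the words "sony" and "price", yet in this toy setup both retrieve the same article titles, which is the effect the slide illustrates.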

Profile: Words Vs Concepts
 TF * IDF based user profile               Wikipedia based user profile

 Search                                    Text Retrieval Conference
 Home                                      HTML element
 Help                                      Bank of America
 News                                      Google search
 Privacy                                   ICICI Bank
 Google                                    IDBI Bank
 Terms                                     Bank fraud
 New                                       Artificial neural network
 Page                                      Web crawler
 Use                                       Web design
 Web                                       Debit card
 View                                      Extensible Markup Language
 Results                                   Hewlett-Packard
 Information                               Microsoft
 Account                                   XHTML
                                           Demand account



Semantic Profiles
• Vector of weights representing the intensity of user
  interest for each concept (-1 to 1)
• Content also described by a set of weighted concepts
  (0 to 1)

• Concept Profiles: Can express fine-grained interests
      • Interest in athletes who have won a gold medal
      • Interest in IT companies which have acquired at least 3
        companies in the last one year
      • Only movies with either Amitabh or Shah Rukh

Ontology-based
Profile Spreading




Profile Expansion
• Use       inference mechanism to enhance personalisation
      • Synonym expansion
      • Interest in multiple subclasses implies broader interest
      • Transitive closure (locatedIn, subtopic)
      • Interest in superclass leads to potential interest in subclass
      • Guess changing interest over time




Constrained Spreading

[Figure: linked concepts Artificial Intelligence, Machine Learning and Neural Networks]




Constrained Spreading Activation
• Cannot take ‘all’ related data
• Commonly used SA models
      • Distance Constraint
      • Fan-out Constraint
      • Path Constraints
        • App dependent inference rules
        • Type of relationship
        • Preferential paths
      • Activation Constraint
        • Threshold function at each single node level
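The constraints listed above can be combined in one propagation loop. The following is a sketch under stated assumptions: the graph, the edge weights and the parameter values are invented, path constraints are left out for brevity, and the function name `spread` is hypothetical.

```python
# Hypothetical concept graph: node -> list of (neighbour, relation weight in [0, 1]).
GRAPH = {
    "Artificial Intelligence": [("Machine Learning", 0.8), ("Robotics", 0.5)],
    "Machine Learning": [("Neural Networks", 0.8)],
    "Neural Networks": [("Deep Learning", 0.8)],
}


def spread(seeds, graph, max_distance=2, max_fan_out=3, threshold=0.3):
    """Propagate activation from seed concepts under three constraints."""
    activation = dict(seeds)
    frontier = list(seeds.items())
    for _ in range(max_distance):                  # distance constraint
        next_frontier = []
        for node, value in frontier:
            neighbours = graph.get(node, [])
            if len(neighbours) > max_fan_out:      # fan-out constraint
                continue
            for neighbour, weight in neighbours:
                passed = value * weight
                if passed < threshold:             # activation constraint
                    continue
                if passed > activation.get(neighbour, 0.0):
                    activation[neighbour] = passed
                    next_frontier.append((neighbour, passed))
        frontier = next_frontier
    return activation


result = spread({"Artificial Intelligence": 1.0}, GRAPH)
```

With a distance limit of 2, activation reaches Neural Networks but stops before Deep Learning, so the profile is expanded without pulling in every related concept.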


Learning preferences using semantic links
Two main ways of updating the Concept History Stack
1. Interest Assumption Completion
      • Add more potential user interests
      • Based on hierarchical relationships
        • Threshold on value of pseudo-occurrence for insertion
              • $N_{occ}(C_{supertype}) = \gamma \cdot N_{occ}(C_{subtype})$,
                    where $\gamma < 1$ is determined empirically
      • Based on semantic relationships
              • All related concepts such that $\exists$ property $p$ with $p(C, C_{related})$
              • Pseudo-occurrence $N_{occ}(C_{related}) = \alpha_i \cdot N_{occ}(C)$
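As a worked illustration of the hierarchical case, the sketch below inserts a supertype only when its pseudo-occurrence reaches a threshold. The values of `gamma` and `threshold`, the summing of contributions from several subtypes, and the function name are assumptions; the slide only gives the formula for a single subtype.

```python
def complete_interests(occurrences, supertype_of, gamma=0.5, threshold=1.0):
    """Add supertypes whose pseudo-occurrence gamma * Nocc(subtype) reaches the threshold.

    Contributions from several subtypes of the same supertype are summed,
    which is an assumption of this sketch.
    """
    completed = dict(occurrences)
    for concept, count in occurrences.items():
        parent = supertype_of.get(concept)
        if parent is None:
            continue
        pseudo = gamma * count
        if pseudo >= threshold:
            completed[parent] = completed.get(parent, 0.0) + pseudo
    return completed


completed = complete_interests(
    {"Soccer": 4, "Cricket": 1},
    {"Soccer": "Sports", "Cricket": "Sports"},
)
```

Here Soccer contributes a pseudo-occurrence of 2.0 to Sports, while Cricket's 0.5 falls below the threshold and is ignored.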



Learning preferences using semantic links
(contd)
2. Preference update by expansion
      •     Re-weighting over time
           • $W_{new}(C_{related}) = W_{old}(C_{related}) + \beta_i \cdot W_{new}(C)$
           • $\beta_i$: semantic factor that depends on the level of semantic proximity
              • Directly part of definition (TBox)
              • Related through inferred transitive relation (the number of such links matters)
           • Notion of semantic distance
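The re-weighting rule can be applied directly once each related concept has a semantic factor. In this sketch the factors, the concept names and the clamping of weights to [-1, 1] are assumptions added for illustration; the slide specifies only the update formula.

```python
def update_related(weights, concept, related, clamp=(-1.0, 1.0)):
    """Apply W_new(C_related) = W_old(C_related) + beta_i * W_new(C).

    `related` maps each related concept to its semantic factor beta_i.
    Clamping to the profile range is an assumption of this sketch.
    """
    low, high = clamp
    updated = dict(weights)
    for other, beta in related.items():
        value = updated.get(other, 0.0) + beta * updated[concept]
        updated[other] = max(low, min(high, value))
    return updated


weights = update_related(
    {"Neural Networks": 0.8, "Machine Learning": 0.25},
    "Neural Networks",
    {"Machine Learning": 0.5, "Artificial Intelligence": 0.25},
)
```

A closer concept (larger beta) receives a larger share of the updated weight than a concept reached through a longer chain of links.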




Semantic Similarity




Similarity/Matching
• Cosine similarity
      • U represents user preference
      • D represents content object
      • Dimension: #concepts in the ontology

$$\mathrm{similarity}(U, D) = \cos(U, D) = \frac{U \cdot D}{\lVert U \rVert_2 \, \lVert D \rVert_2} = \frac{\sum_i U_i D_i}{\sqrt{\sum_i U_i^2}\,\sqrt{\sum_i D_i^2}}$$
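The cosine measure is short enough to write out. This sketch stores U and D as sparse dictionaries from concept to weight; the concept names, the example weights and the convention of returning 0.0 for an empty vector are assumptions.

```python
import math


def cosine_similarity(u, d):
    """Cosine of the angle between two sparse concept vectors (concept -> weight)."""
    dot = sum(weight * d.get(concept, 0.0) for concept, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_u == 0.0 or norm_d == 0.0:
        return 0.0  # convention chosen here: an empty vector matches nothing
    return dot / (norm_u * norm_d)


user = {"Soccer": 1.0, "Books": 1.0}
doc = {"Soccer": 1.0}
score = cosine_similarity(user, doc)
```

Because user weights may be negative, the score can range from -1 to 1; a negative score indicates content the user is likely to dislike.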




Semantic distance d(x,y,c)
• Semantic distance between 2 nodes x and y is defined
  with respect to a concept, c
• Example: a black cat and an orange cat
      • very similar as instances of the category Animal, since their
        common catlike properties would be the most significant for
        distinguishing them from other kinds of animals.
      • But in the category Cat, they would share their catlike properties
        with all the other kinds of cats, and the difference in color would be
        more significant.
      • In the category BlackEntity, color would be the most relevant
        property, and the black cat would be closer to a crow or a lump of
        coal than to the orange cat.




Semantic Similarity




                        Mtrl: Material
                        Accm: Accompaniment


Using Wordnet (hypernyms)




Match all nodes




Similarity Formula




Looking Ahead




Contextual Personalisation
• Finer, qualitative, context-sensitive activation of
  user preferences
• Notion of a Semantic Runtime Context
      • Representation: Vector of concept weights
• Fuzzy  semantic intersection between user
  preferences and runtime context
      • Using Constrained spreading activation
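One simple way to read "fuzzy semantic intersection" is the standard fuzzy-set intersection, which takes the minimum of the two membership values. The sketch below assumes that reading, assumes both vectors use weights in [0, 1], and skips the spreading-activation step that would first expand the context; none of this is confirmed by the slides.

```python
def contextualise(preferences, context):
    """Fuzzy intersection (min t-norm) of preference and context weights.

    Both inputs map concept -> weight in [0, 1]; concepts absent from the
    context, or with zero overlap, are left out of the result.
    """
    active = {}
    for concept, weight in preferences.items():
        overlap = min(weight, context.get(concept, 0.0))
        if overlap > 0.0:
            active[concept] = overlap
    return active


active = contextualise(
    {"Cinema": 0.9, "Soccer": 0.6, "Books": 0.4},
    {"Soccer": 0.8, "Cricket": 0.5},
)
```

Only preferences that also appear in the runtime context stay active, so a strong interest in cinema is not applied while the user is browsing sports content.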




Semantic Social Networking
• Identify           hidden links between users
      • Similarity between user preferences
• Collaborative            Recommender systems
      • Comparing users on global preferences alone is misleading
      • Partial but strong similarities are very useful
      • E.g. coinciding interest in cinema but drastically different
        in sports
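The cinema and sports example can be checked numerically by comparing two users per domain instead of over the whole profile. The users, concept names, weights and domain grouping below are invented for illustration.

```python
import math


def cosine(u, v, concepts):
    """Cosine similarity restricted to the given concepts."""
    dot = sum(u.get(c, 0.0) * v.get(c, 0.0) for c in concepts)
    norm_u = math.sqrt(sum(u.get(c, 0.0) ** 2 for c in concepts))
    norm_v = math.sqrt(sum(v.get(c, 0.0) ** 2 for c in concepts))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical grouping of ontology concepts into domains.
DOMAINS = {"cinema": ["Drama", "Thriller"], "sports": ["Soccer", "Cricket"]}

alice = {"Drama": 1.0, "Thriller": 0.5, "Soccer": 1.0, "Cricket": -1.0}
bob = {"Drama": 1.0, "Thriller": 0.5, "Soccer": -1.0, "Cricket": 1.0}

partial = {name: cosine(alice, bob, concepts) for name, concepts in DOMAINS.items()}
overall = cosine(alice, bob, sorted(alice))
```

The two users agree fully on cinema and disagree fully on sports, yet the overall similarity is slightly negative, which would hide the cinema link that a collaborative recommender could exploit.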




Semantic Social Networks




Microformats

[Figure: microformat examples grouped on the slide: Metadata, Social links, Outline, Licensing, tags, Geo, hResume, adr. Source: http://microformats.org]

• Microformats are small bits of HTML that represent things like people, events, tags, etc. in web pages.
• Building blocks that enable users to own, control, move, and share their data on the Web.
• Microformats enable
     • publishing of higher fidelity information on the Web,
     • the fastest and simplest way to support feeds and APIs for your website.
eRDF
• A subset of RDF embedded into XHTML or HTML by using
  common idioms and attributes.
• No new elements or attributes have been invented and the
  usages of the HTML attributes are within normal bounds.
• This scheme is designed to work with CSS and other HTML
  support technologies.
• HTML Embeddable RDF.
• All HTML Embeddable RDF is valid RDF, but not all RDF is
  Embeddable RDF




GRDDL
• Gleaning   Resource Descriptions from Dialects of
  Languages
• Obtaining RDF data from XHTML pages
• Explicitly associated transformation algorithms
  (XSLT)




Acknowledgements
•   Self-tuning Personalized Information Retrieval in an
    Ontology-Based Framework, Pablo Castells, Miriam
    Fernández, David Vallet, et al, OTM Workshop 2005

•   An Approach for Semantic Search by Matching RDF Graphs,
    Haiping Zhu, Jiwei Zhong, Jianming Li and Yong Yu

•   Semantic Web Tutorials




Concluding Remarks
• Personalization: An upcoming area of technology
• Personalization aims at faster access to information to improve
  user productivity
• Server-side Vs Client-side personalization
• Technologies
      • Machine Learning techniques
      • Semantic Web
      • New Markup Languages
•   Challenges
      • Understanding the user behaviour, intentions, likes, …
      • Relating human edited content to the profile


Thank you
Questions?





Personalization Tutorial at ACM Compute 2008

  • 17. Client side versus Server side profiles
      Server side | Client side
      Have queries and clickstreams from multiple users | No access to clickstreams of multiple users
      Don’t see all the user data | See all user data
      No way for users to aggregate and reuse the profiles that different websites (Google, Yahoo, ..) build using their data | Possible for users to aggregate and reuse their attentional information
      Privacy is a big problem | Strong privacy model
      Server cycles have to be shared, however some computations can be done once and reused | Can access the full compute power at the client
  • 18. Desired profile characteristics • Represent multiple interests • Adapt to changing user interests • Incorporate contextual information 18 22 January 2008
  • 19. Using User profiles to personalize services Search query, Content news, video, … Explicit and User Implicit info Profile Profile Data Profile to Content Collection Constructor Matching User Personalized services Diagram adapted from Gauch et.al, Chapter 2, The Adaptive Web, Springer LNCS 4321 19 22 January 2008
  • 20. User Profiling approaches • Broadly two approaches • IR approach • User interests derived from text (documents/search queries) • Machine learning approach • Model user based on positive and negative examples of his interests • Problems • Getting labeled samples • High dimensional feature space 20 22 January 2008
  • 21. Profile building Steps • Authenticate the user • Select information to build profile from and archive the information if necessary (eg. Web pages might get flushed from IE cache) • Build/Refresh/Expand/Prune the profile • Use it in an application • Evaluate the profile 21 22 January 2008
  • 22. Authenticating the user • Users need to be authenticated in order to attribute data to a particular user for profile creation • Identifying single user • Login • Cookies • IP address (when it is static) • Identifying different users on same machine • Login • Biometrics 22 22 January 2008
  • 23. Explicit user information collection • Ask the user for • static information • Name, age, residence location, hobbies, interests etc • Google personalization – found explicit information to be noisy • People specified literature as one of their interests but did not make a single related search • Matchmine – presents examples (movies, TV shows, music, blog topics) and asks the users to explicitly rate them • Ratings • Netflix, Stumbleupon (thumbs up/down) • In general, people do not like to give explicit information frequently • Recent research (Jian Hu WWW 2007) showed good results for gender and age prediction based on users' browsing behavior
  • 24. Explicit information collection: Matchmine interface 24 22 January 2008
  • 25. Implicit user information collection • Data sources • Web pages, documents, search queries, location • Information from applications (Media players, Games) • Data collection techniques • Desktop based • Browser cache • Proxy servers • Browser plugins • Server side • Web logs • Search logs 25 22 January 2008
  • 26. How much implicit info to use ? • Teevan (SIGIR 2005) constructed two profiles • One with only search queries • Other using all information on desktop • Findings • Richer information => better profile • All docs > only recent docs > only web pages > only search queries > no personalization • Drawback with implicit info – cannot collect info about user dislikes
  • 27. Stereotypes • Generalizations from communities of users • Characteristics of group of users • Stereotypes alleviate the bootstrap problem • Construction of stereotypes • Manual – e.g. Bangalore user will be interested in IT • Automatic method • Clustering – Similar profiles are clustered and common characteristics extracted 27 22 January 2008
  • 28. How Acxiom delivers personalized ads (source - WSJ) • Acxiom has accumulated a database of about 133 million households and divided it into 70 demographic and lifestyle clusters based on information available from public sources. • A person gives one of Acxiom’s Web partners his address by buying something, filling out a survey or completing a contest form on one of the sites. • Acxiom checks the address against its database and places a “cookie,” or small piece of tracking software, embedded with a code for that person’s demographic and behavioral cluster on his computer hard drive. • When the person visits an Acxiom partner site in the future, Acxiom can use that code to determine which ads to show • Through another cookie, Acxiom tracks what consumers do on partner Web sites 28 22 January 2008
  • 29. Profile representation • Bag of words (BOW) • Use words in user documents to represent user interests • Issues • Words appear independent of page content (“Home”, “page”) • Polysemy (word has multiple meanings e.g. bank) • Synonymy (multiple words have same meanings e.g. joy, happiness) • Large profile sizes • Concepts (e.g. DMOZ) • Use existing ontology maintained for free • Issues • Too large (about 6 lakh DMOZ nodes), ontology has to be drastically pruned for use • Need to build classifiers for each DMOZ node 29 22 January 2008
  • 30. Word based term vector profiles • Profile represented as sets of words tf*idf weighted • Could use one long profile vector or different vectors for different topics (sports, health, finance) • Documents converted to same representation, matched with keyword vectors using cosine similarity • Should take structure of the document into account (ignore html tags, email header vs body) 30 22 January 2008
  • 31. Word based hierarchical profiles
      Each node is labeled Interest:Support
      User Profile:10
        Research:5
          IR:3
            Search:2 ...
          DB:2
        Sports:3.5
          Soccer:2
          Others:1.5
        Sex:1.5
      Support decreases from high to low level, and from left to right
      We are thankful to Yabo Arber-Xu from Simon Fraser University for kindly allowing us to use slides numbered 31,37,38,39 from his WWW 07 presentation.
  • 32. Building word based hierarchical profiles • Build a (word, document) map for each word occurring in the corpus • Order words by amount of support • Support of a word = number of documents in which the word appears • For each word • Decide whether to merge with another word (using some measure of similarity) • Decide whether to make one word the child of the other
  • 33. 33 22 January 2008
  • 34. Term similarity and Parent-child terms • Words that cover the same document sets are similar • Jaccard measure: $\mathrm{Sim}(w_1, w_2) = |D(w_1) \cap D(w_2)| \,/\, |D(w_1) \cup D(w_2)|$, where $D(w)$ is the set of documents in which word $w$ appears • Parent child terms • A specific term is a child of a more general term if it frequently occurs with the general term (but the reverse is not true) • Word w2 is taken as child of term w1 if P(w1|w2) > some_threshold • e.g. Terms “Soccer” and “Badminton” might co-occur with the term “Sport” but not the other way around
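The two tests on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper names (`build_word_docs`, `jaccard`, `is_child`), the toy documents and the threshold of 0.8 are all assumptions made for the example.

```python
from collections import defaultdict


def build_word_docs(docs):
    """Map each word to the set of document ids that contain it."""
    word_docs = defaultdict(set)
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            word_docs[word].add(doc_id)
    return word_docs


def jaccard(word_docs, w1, w2):
    """Sim(w1, w2) = |D(w1) & D(w2)| / |D(w1) | D(w2)|."""
    d1, d2 = word_docs[w1], word_docs[w2]
    union = d1 | d2
    return len(d1 & d2) / len(union) if union else 0.0


def is_child(word_docs, parent, child, threshold=0.8):
    """Treat `child` as a child of `parent` if P(parent | child) > threshold."""
    d_child = word_docs[child]
    if not d_child:
        return False
    p_parent_given_child = len(word_docs[parent] & d_child) / len(d_child)
    return p_parent_given_child > threshold


# Hypothetical toy corpus: "sport" occurs everywhere, "soccer" only in two documents.
docs = {
    1: "sport soccer league",
    2: "sport soccer cup",
    3: "sport badminton open",
    4: "sport news",
}
word_docs = build_word_docs(docs)
similarity = jaccard(word_docs, "soccer", "sport")
soccer_under_sport = is_child(word_docs, parent="sport", child="soccer")
sport_under_soccer = is_child(word_docs, parent="soccer", child="sport")
```

In this toy corpus "soccer" always co-occurs with "sport" while "sport" appears with "soccer" only half of the time, so only the first parent-child test passes.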
  • 35. Personalization and Privacy • Studies have shown that • People are comfortable sharing preferences (favourite TV show, snack etc.), demographic and lifestyle information • People not comfortable sharing financial and purchase related information • Facebook fiasco because of reporting “Your friends bought …” • Financial rewards (even small amounts) encourage disclosure • People parted with valuable information for Singapore $15 35 22 January 2008
  • 36. Privacy related attitudes (Teltzrow/Kobsa 2003) 36 22 January 2008
  • 37. What and How much to Reveal? - 1 More User Profile:10 Sensitive More Research:5 Sports:3.5 Sex:1.5 specific IR:3 DB:2 Soccer:2 Others:1.5 Search:2 ... Manual Option – Absolute privacy guarantee, but requires a lot of user intervention 37 22 January 2008
  • 38. What and How much to Reveal? - 2
      • User Profile U → indicator of a user’s possible interests
      • Term t → indicator of a possible interest, $P(t) = \mathrm{Sup}(t)/|D|$
      • The amount of information for an interest t: $I(t) = \log(1/P(t)) = \log(|D|/\mathrm{Sup}(t))$ → indication of the specificity and sensitivity of an interest
      • H(U) – the amount of information carried by U: $H(U) = \sum_t P(t) \times I(t)$
      • Two Privacy Parameters:
        • MinDetail – Protect t with P(t) < MinDetail
        • ExpRatio – $H(U_{exp})/H(U)$
      • The more detail we expose, the higher expRatio.
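A flat reading of these definitions can be sketched as follows. The function names, the toy supports and the choice of the natural logarithm are assumptions for illustration; the sketch treats the profile as a plain set of terms, so the ratios it produces are illustrative only and are not meant to reproduce the percentages quoted in the original presentation.

```python
import math


def profile_information(supports, num_docs):
    """H(U) = sum over terms of P(t) * I(t), with P(t) = Sup(t)/|D| and I(t) = log(1/P(t))."""
    total = 0.0
    for support in supports.values():
        if support > 0:
            p = support / num_docs
            total += p * math.log(1.0 / p)
    return total


def exposed_ratio(supports, num_docs, min_detail):
    """ExpRatio = H(U_exp) / H(U); terms with P(t) < min_detail are protected (not exposed)."""
    exposed = {t: s for t, s in supports.items() if s / num_docs >= min_detail}
    full = profile_information(supports, num_docs)
    return profile_information(exposed, num_docs) / full if full else 0.0


# Hypothetical supports over |D| = 10 documents.
supports = {"research": 5, "sports": 3.5, "ir": 3, "soccer": 2}
ratio_strict = exposed_ratio(supports, num_docs=10, min_detail=0.5)
ratio_loose = exposed_ratio(supports, num_docs=10, min_detail=0.3)
ratio_all = exposed_ratio(supports, num_docs=10, min_detail=0.0)
```

Lowering `min_detail` exposes more terms, so the ratio can only grow, which matches the last bullet of the slide.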
  • 39. What and How much to Reveal? - 3
      Profile: User Profile:10; Research:5, Sports:3.5, Sex:1.5; IR:3, DB:2, Soccer:2, Others:1.5; Search:2 ...
      • minDetail=0.5 → expRatio=44%
      • minDetail=0.3 → expRatio=69%
      • The minDetail and expRatio parameters allow a balance between privacy and personalization.
  • 40. Profile portability • Move the profile to a central server • Claria PersonalWeb, Google-Yahoo-Microsoft • Provision to delete search queries, visited pages • No control over which part of the profile can be used • Have a client side component that reconstructs the profile on the client using server side info (Matchmine) • Attention Profile markup language • Allows explicit and implicit information to be stored (as XML) and provided to web services 40 22 January 2008
  • 42. Application-independent evaluation of the profile • Stability • Number of profile elements that do not change over the evaluation cycle • Precision • How many items in the profile does the user agree with as representative of his interests ? • Does the user agree with the strength of the interest ? • Do interests at deeper levels of the hierarchy have less precision compared to interests at higher levels ? • Which data sources (bookmarks, search keywords, web pages) are better ? • Bookmarks were not very representative of user interests in our study
  • 43. Profile evaluation: sample evaluation of one profiling algorithm
      • Profiles are stable (fig 1)
      • Profile elements with high support have high precision (fig 2)
      • Profile elements at all levels of the hierarchy have similar precision (fig 3)
      [Figure 1: Stability, Stability_alpha and Stability_date versus number of web pages in cache]
      [Figure 2: Precision for Support > 5, 3 < Support < 5 and Support < 3]
      [Figure 3: Precision and percentage in profile for hierarchy Level 1 to Level 6]
  • 44. Managing the profile • Profiles may need to be expanded (bootstrapped) or pruned • Allowing users to manually edit their profiles to add/delete topics of interest was found to make performance worse (Jae-wook Ahn, WWW 2007) • Adding and deleting topics to profile harmed system performance • Deleting topics harmed performance four times more compared to adding topics • Some agents learn short term and long term profiles separately using different techniques (K-NN for short term interests, Naïve Bayes for long term interests) 44 22 January 2008
  • 45. Personalizing Search Krishnan Ramanathan © 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice
  • 46. Personalized search • Search can be personalized based on • User profile • Current working context • Past search queries • Server side clickstreams • Personalized Pagerank • Determining user intent is hard (e.g query Visa) 46 22 January 2008
  • 47. A generic personalized search algorithm using a user profile • Inputs- User profile, Search query • Output – A results vector reordered by the user’s preference • Steps • Send the query to a search engine • Results[] = A vector of the search engine’s results • For each item i in Results[] calculate the preference Pref [i] = α *Similarity(Results[i] , User Profile) + (1- α)*SearchEngineRank • Sort Results[] using Pref [i] as the comparator 47 22 January 2008
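The steps above can be sketched as follows. This is only one way to fill in the details the slide leaves open: the bag-of-words cosine used for `Similarity`, the mapping of rank position to a score in (0, 1], the function names and the toy query results are all assumptions for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank(results, profile, alpha=0.5):
    """Reorder search results (given in engine order) by the blended preference score."""
    n = len(results)
    scored = []
    for position, text in enumerate(results):
        # Assumption: convert the engine's rank position into a score in (0, 1].
        engine_score = 1.0 - position / n
        similarity = cosine(Counter(text.lower().split()), profile)
        preference = alpha * similarity + (1 - alpha) * engine_score
        scored.append((preference, text))
    # Stable sort: ties keep the search engine's original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored]


# Hypothetical profile of a user interested in credit cards, and results for the query "visa".
profile = Counter({"credit": 3, "card": 3, "bank": 1})
results = [
    "visa requirements for travel to india",
    "visa credit card offers",
    "visa application form",
]
reranked = rerank(results, profile, alpha=0.8)
unchanged = rerank(results, profile, alpha=0.0)
```

With `alpha=0` the engine order is returned untouched; larger values of `alpha` let the profile pull matching results upward.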
  • 48. Current working context – JIT retrieval • Context includes time, location, applications currently running, documents currently opened, IM status • Use profile and current context to provide relevant (and just-in-time) information • Blinkx toolbar – provides relevant news, video and Wikipedia articles within different applications (Microsoft Word, IE browser) • Intersect interests from the overall profile with current context to get the contextual profile • Context can also be used in query expansion
  • 49. Personalization based on Search history • Use query-to-query similarity to suggest results that satisfied past queries • Create user profiles from past queries/snippets from search results clicked • Misearch (Gauch et.al 2004) creates weighted concept hierarchies based on ODP as the reference concept hierarchy • Compute degree of similarity between search engine result snippets (title and text summaries) and user profile as $\mathrm{sim}(user_i, doc_j) = \sum_{k=1}^{n} wp_{ik} \cdot wd_{jk}$, where $wp_{ik}$ = weight of concept k in profile i and $wd_{jk}$ = weight of concept k in document j
  • 50. Personalization by clickthrough data analysis – CubeSVD (Jian-Tao Sun, WWW 2005) • Search engine has tuples of the form (User, Query, Visited page) • Multiple tuples constitute a tensor (generalization of matrix to higher dimensions) • Higher order SVD (HOSVD) performs SVD on tensor • The reconstructed tensor is a tuple of the form (User, Query, web page, p) • Where p is the probability that the user posing the query will visit the web page • Recommend pages with highest value of p • Computationally intensive but HOSVD can be done offline • Need to recompute to account for new clickthrough data 50 22 January 2008
  • 51. Topic sensitive pagerank (Haveliwala 2002) • For top 16 ODP categories, create a pagerank vector • Each web page/document d has multiple ranks depending on what the topic of interest j is • For a query q, compute $P(C_j \mid q) \propto P(C_j) \cdot P(q \mid C_j)$ • Intuition: If a topic is more probable given a query, the topic specific rank should have more say in the final rank • Compute query sensitive rank as $\sum_j P(C_j \mid q) \cdot rank_{jd}$
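The query-time combination can be sketched as below, assuming the class probability is taken proportional to P(Cj) times the product of per-term probabilities P(term | Cj) and then normalized over topics. The topic names, probabilities, page names and the smoothing constant are all hypothetical values chosen for the example; the per-topic rank vectors are assumed to be precomputed offline.

```python
def topic_posteriors(query_terms, priors, term_probs, smoothing=1e-6):
    """P(C_j | q) proportional to P(C_j) * prod_i P(q_i | C_j), normalized over topics."""
    scores = {}
    for topic, prior in priors.items():
        likelihood = 1.0
        for term in query_terms:
            # Unseen terms get a small smoothing probability (an assumption of this sketch).
            likelihood *= term_probs[topic].get(term, smoothing)
        scores[topic] = prior * likelihood
    total = sum(scores.values())
    return {topic: score / total for topic, score in scores.items()}


def query_sensitive_rank(doc, posteriors, topic_ranks):
    """Blend the topic-specific ranks of `doc` using the topic posteriors as weights."""
    return sum(p * topic_ranks[topic].get(doc, 0.0) for topic, p in posteriors.items())


# Hypothetical two-topic setup.
priors = {"finance": 0.5, "travel": 0.5}
term_probs = {
    "finance": {"visa": 0.02, "card": 0.05},
    "travel": {"visa": 0.03, "passport": 0.04},
}
topic_ranks = {
    "finance": {"bank.example": 0.6, "embassy.example": 0.1},
    "travel": {"bank.example": 0.1, "embassy.example": 0.7},
}

posteriors = topic_posteriors(["visa", "card"], priors, term_probs)
bank_rank = query_sensitive_rank("bank.example", posteriors, topic_ranks)
embassy_rank = query_sensitive_rank("embassy.example", posteriors, topic_ranks)
```

Because the query "visa card" is far more likely under the finance topic, the finance-specific rank dominates and the bank page outranks the embassy page.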
  • 52. Topics • Overview of Personalization • User Profile creation • Personalizing Search • Document modeling • Recommender System • Semantics in Personalization 52 22 January 2008
  • 53. Document modeling Somnath Banerjee © 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice
  • 54. Under this topic • Document representation • Document analysis using • Latent Semantic Analysis (LSA) • Probabilistic Latent Semantic Analysis (PLSA) • Document Classification • Support Vector Machine (SVM): A machine learning algorithm 54 22 January 2008
  • 55. Document representation • Term vector • Document is represented as vector of terms • Each dimension corresponds to a separate term • Several methods of computing the weights of the terms • Binary weighting: 1 if the word appears in the document • Most well known is TF*IDF
      $tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$
      $idf_i = \log \frac{|D|}{|\{d_j : t_i \in d_j\}|}$
      $tfidf_{i,j} = tf_{i,j} \times idf_i$
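The TF*IDF weighting above can be sketched directly from the three formulas. The function name, the use of the natural logarithm and the three toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: tf*idf weight} dict per document."""
    num_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total_terms = sum(counts.values())
        weights.append({
            term: (count / total_terms) * math.log(num_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical tokenized headlines.
docs = [
    ["google", "doubleclick", "acquisition"],
    ["google", "doubleclick", "deal"],
    ["eu", "aviation", "emissions"],
]
weights = tfidf(docs)
```

In the first document "acquisition" occurs in only one of the three documents, so it gets a higher weight than "google", which occurs in two.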
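  • 56. Computing similarity
      Cosine similarity: $\mathrm{sim}(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\|_2 \, \|B\|_2} = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2} \, \sqrt{\sum_i B_i^2}}$
      Jaccard coefficient: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
      Dice's coefficient: $D(A, B) = \frac{2 \, |A \cap B|}{|A| + |B|}$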
  • 57. Example • g1: Google Gets Green Light from FTC for DoubleClick Acquisition • g2: Google Closes In on DoubleClick Acquisition • g3: FTC clears Google DoubleClick deal • g4: US regulator clears DoubleClick deal • g5: DoubleClick deal brings greater focus on privacy • e1: EU Agrees to Reduce Aviation Emissions • e2: Aviation to be included in EU emissions trading • e3: EU wants tougher green aviation laws • Underlined words appeared in more than one document
  • 58. Term Document Matrix (X)
                     g1 g2 g3 g4 g5 e1 e2 e3
      google          1  1  1  0  0  0  0  0
      green           1  0  0  0  0  0  0  1
      ftc             1  0  1  0  0  0  0  0
      doubleclick     1  1  1  1  1  0  0  0
      acquisition     1  1  0  0  0  0  0  0
      clear           0  0  1  1  0  0  0  0
      deal            0  0  1  1  1  0  0  0
      eu              0  0  0  0  0  1  1  1
      aviation        0  0  0  0  0  1  1  1
      emission        0  0  0  0  0  1  1  0
  • 59. Retrieval example • Query (or Profile) q = “Google Acquisition” • Query vector q = [1 0 0 0 1 0 0 0 0 0]' • Cosine similarity of the query to the documents: S = [g1: 0.634, g2: 0.816, g3: 0.447, g4: 0, g5: 0, e1: 0, e2: 0, e3: 0] • What about the documents g4 and g5? • Problem of data sparsity
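The retrieval step can be sketched with NumPy using the binary term-document matrix from the example. This is a minimal illustration assuming plain 0/1 term weights; the exact scores depend on the weighting scheme, so they need not match the figures on the slide digit for digit. What the sketch does show is the sparsity problem: g4 and g5 share no term with the query and score exactly zero.

```python
import numpy as np

terms = ["google", "green", "ftc", "doubleclick", "acquisition",
         "clear", "deal", "eu", "aviation", "emission"]
docs = ["g1", "g2", "g3", "g4", "g5", "e1", "e2", "e3"]

# Rows are terms, columns are documents (binary weights).
X = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0],  # google
    [1, 0, 0, 0, 0, 0, 0, 1],  # green
    [1, 0, 1, 0, 0, 0, 0, 0],  # ftc
    [1, 1, 1, 1, 1, 0, 0, 0],  # doubleclick
    [1, 1, 0, 0, 0, 0, 0, 0],  # acquisition
    [0, 0, 1, 1, 0, 0, 0, 0],  # clear
    [0, 0, 1, 1, 1, 0, 0, 0],  # deal
    [0, 0, 0, 0, 0, 1, 1, 1],  # eu
    [0, 0, 0, 0, 0, 1, 1, 1],  # aviation
    [0, 0, 0, 0, 0, 1, 1, 0],  # emission
], dtype=float)

# Query vector for "Google Acquisition".
q = np.zeros(len(terms))
q[terms.index("google")] = 1.0
q[terms.index("acquisition")] = 1.0

# Cosine similarity between the query and every document column.
scores = (q @ X) / (np.linalg.norm(q) * np.linalg.norm(X, axis=0))
ranking = [docs[i] for i in np.argsort(-scores)]
```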
  • 60. Under this topic • Document representation • Document analysis using • Latent Semantic Analysis (LSA) • Probabilistic Latent Semantic Analysis (PLSA) • Document Classification • Support Vector Machine (SVM): A machine learning algorithm
  • 61. Latent Semantic Analysis (LSA) • If you are searching for “Tata Nano”, are the documents containing “People’s Car” not also relevant? • How can a machine understand that? • Analyze the collection of documents • Documents that contain “Tata Nano” generally contain “People’s Car” as well • Covariance of these two dimensions is high • LSA finds such correlation using a technique from linear algebra
  • 62. LSA • Transforms the term document matrix into a relation between the • terms and some concepts, • relation between those concepts and the documents • Concepts are the dimensions of maximum variance • Removes the dimensions with low variance • Reduction in feature space • Term document matrix becomes denser 62 22 January 2008
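  • 63. Singular Value Decomposition
      $X = T \, S \, D'$, with X (terms × documents) of size t × d, T of size t × m, S of size m × m and D' of size m × d
      $S = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_m)$ with $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_m > 0$
      • m is the rank of the matrix X
      • T and D are orthonormal matrices
      • S is a diagonal matrix of singular values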
  • 64. Reduced SVD
      $X_k = T_k \, S_k \, D_k'$, with X_k of size t × d, T_k of size t × k, S_k of size k × k and D_k' of size k × d
      • Choose the largest k singular values ($\sigma_1 \ldots \sigma_k$)
      • Choose the corresponding k columns of T and D
      • Then construct X_k
      • X_k is the best rank-k approximation of X in terms of the Frobenius norm
  • 65. Example • g1: Google Gets Green Light from FTC for DoubleClick Acquisition • g2: Google Closes In on DoubleClick Acquisition • g3: FTC clears Google DoubleClick deal • g4: US regulator clears DoubleClick deal • g5: DoubleClick deal brings greater focus on privacy • e1: EU Agrees to Reduce Aviation Emissions • e2: Aviation to be included in EU emissions trading • e3: EU wants tougher green aviation laws • Query (or Profile) q = “Google Acquisition” 65 22 January 2008
  • 66. Term Document Matrix (X)
                     g1 g2 g3 g4 g5 e1 e2 e3
      google          1  1  1  0  0  0  0  0
      green           1  0  0  0  0  0  0  1
      ftc             1  0  1  0  0  0  0  0
      doubleclick     1  1  1  1  1  0  0  0
      acquisition     1  1  0  0  0  0  0  0
      clear           0  0  1  1  0  0  0  0
      deal            0  0  1  1  1  0  0  0
      eu              0  0  0  0  0  1  1  1
      aviation        0  0  0  0  0  1  1  1
      emission        0  0  0  0  0  1  1  0
  • 67. LSA Example: SVD factors of X, T (10×7), S (7×7) and D' (7×8) [matrix values shown as images]
  • 68. LSA Example • Rank 2 approximation of X documents terms 68 22 January 2008
  • 69. LSA Example • Query (or Profile) q = “Google Acquisition” • Query vector q = [1 0 0 0 1 0 0 0 0 0]' • Representation of the query: $D_q = q' \, T_2 \, S_2^{-1} = [-0.204 \;\; 0.005]$ • Query to document similarity: $\mathrm{Sim} = D_q \, S_2^2 \, D_2'$
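The rank-2 LSA computation can be sketched with `numpy.linalg.svd` on the binary term-document matrix of the example. This is an illustrative sketch rather than the authors' code: the variable names are assumptions, and the signs of the singular vectors returned by the SVD routine are arbitrary, so the folded-in query coordinates may differ in sign from those on the slide. The similarity scores are unaffected by that sign choice because Sim simplifies to q' X_2.

```python
import numpy as np

terms = ["google", "green", "ftc", "doubleclick", "acquisition",
         "clear", "deal", "eu", "aviation", "emission"]

# Rows are terms, columns are documents g1..g5, e1..e3 (binary weights).
X = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 1],
    [1, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 0],
], dtype=float)

# Full SVD, then keep the k largest singular values and matching vectors.
T, singular_values, D_transposed = np.linalg.svd(X, full_matrices=False)
k = 2
T_k = T[:, :k]
S_k = np.diag(singular_values[:k])
D_k = D_transposed[:k, :].T          # documents x k
X_k = T_k @ S_k @ D_k.T              # rank-k approximation of X

# Fold the query "Google Acquisition" into the concept space.
q = np.zeros(len(terms))
q[terms.index("google")] = 1.0
q[terms.index("acquisition")] = 1.0
D_q = q @ T_k @ np.linalg.inv(S_k)

# Query-to-document similarity: Sim = D_q S_k^2 D_k'.
sim = D_q @ S_k @ S_k @ D_k.T
```

Unlike the plain cosine match, every Google/DoubleClick document now receives a clearly positive score, including g4 and g5, which share no term with the query.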
  • 70. LSA Example: $\mathrm{Sim} = D_q \times S_2^2 \times D_2'$ [numeric matrices shown as images]
  • 71. Example • g1: Google Gets Green Light from FTC for DoubleClick Acquisition [1.284] • g2: Google Closes In on DoubleClick Acquisition [0.936] • g3: FTC clears Google DoubleClick deal [1.426] • g4: US regulator clears DoubleClick deal [0.891] • g5: DoubleClick deal brings greater focus on privacy [0.697] • e1: EU Agrees to Reduce Aviation Emissions [0.035] • e2: Aviation to be included in EU emissions trading [0.035] • e3: EU wants tougher green aviation laws [0.152] • Underlined words appeared in more than one document
  • 72. Under this topic • Document representation • Document analysis using • Latent Semantic Analysis (LSA) • Probabilistic Latent Semantic Analysis (PLSA) • Document Classification • Support Vector Machine (SVM): A machine learning algorithm
  • 73. Probabilistic Latent Semantic Analysis (PLSA) • If we know the document collection contains two topics, can we do better? • Can we estimate • Probability( topic | document) ? • Probability( word | topic) ? • If we can also estimate Probability( topic | query) then we can compute the document to query similarity • PLSA is a statistical technique to estimate those probabilities from a collection of documents
  • 74. Probabilistic Latent Semantic Analysis (PLSA) • Dyadic data: Two (abstract) sets of objects, X ={x1, ..,xm} and Y ={y1, … ,yn} in which observations are made of dyads(x,y) • Simplest case: observation of co-occurrence of x and y • Other cases may involve scalar weight for each observation • Examples: • X = Documents, Y =Words • X = Users, Y =Purchased Items • X = Pixels, Y =Values 74 22 January 2008
  • 75. PLSA
      • A document consists of topics, and the words in the document are generated based on those topics
      • Generative model (asymmetric): $(d_i, w_j)$ is generated as follows
        • pick a document with probability $P(d_i)$,
        • pick a topic $z_k$ with probability $P(z_k \mid d_i)$,
        • generate a word $w_j$ with probability $P(w_j \mid z_k)$
      • Graphical model: D → Z → W, governed by P(d_i), P(z_k | d_i) and P(w_j | z_k)
      $P(d_i, w_j) = P(d_i) \, P(w_j \mid d_i)$
      $P(w_j \mid d_i) = \sum_{k=1}^{K} P(w_j \mid z_k) \, P(z_k \mid d_i)$
  • 76. PLSA
      • Parameters: P(d_i), P(z_k | d_i), P(w_j | z_k)
      • P(d_i) is proportional to the number of times the document is observed and can be computed independently
      • P(z_k | d_i) and P(w_j | z_k) can be estimated using the Expectation Maximization (EM) algorithm
      Likelihood: $P(D, W) = \prod_{i=1}^{M} \prod_{j=1}^{N} P(d_i, w_j)^{n(d_i, w_j)}$
      Log-likelihood: $L = \sum_{i=1}^{M} \sum_{j=1}^{N} n(d_i, w_j) \ln P(d_i, w_j)$
      • M = number of documents; N = number of distinct words
  • 77. PLSA: EM steps
      • E-Step: $P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k) \, P(z_k \mid d_i)}{\sum_{l=1}^{K} P(w_j \mid z_l) \, P(z_l \mid d_i)}$
      • M-Step:
        $P(w_j \mid z_k) = \frac{\sum_{i=1}^{M} n(d_i, w_j) \, P(z_k \mid d_i, w_j)}{\sum_{m=1}^{N} \sum_{i=1}^{M} n(d_i, w_m) \, P(z_k \mid d_i, w_m)}$
        $P(z_k \mid d_i) = \frac{\sum_{j=1}^{N} n(d_i, w_j) \, P(z_k \mid d_i, w_j)}{n(d_i)}$
  • 78. PLSA Example • Dyadic data in our example, as (document, word) pairs grouped by document:
      g1: google, green, ftc, doubleclick, acquisition
      g2: google, doubleclick, acquisition
      g3: ftc, clear, google, doubleclick, deal
      g4: clear, doubleclick, deal
      g5: doubleclick, deal
      e1: eu, aviation, emission
      e2: aviation, eu, emission
      e3: eu, green, aviation
  • 79. PLSA Example • After 20 iterations of EM algorithm P(zk |di) P(wj |zk) 79 22 January 2008
  • 80. PLSA Example • Query q = “Google Acquisition” • Steps • Keep P(wj |zk) fixed. • Estimate P(zk |q) using EM steps • Then compute cosine similarity of the vector P(Z|q) to the P(Z|d) • Result: P(Z1 | q) = 1, P(Z2 | q) = 0
  • 81. Example • g1: Google Gets Green Light from FTC for DoubleClick Acquisition [1.0] • g2: Google Closes In on DoubleClick Acquisition [1.0] • g3: FTC clears Google DoubleClick deal [1.0] • g4: US regulator clears DoubleClick deal [1.0] • g5: DoubleClick deal brings greater focus on privacy [1.0] • e1: EU Agrees to Reduce Aviation Emissions [0.0] • e2: Aviation to be included in EU emissions trading [0.0] • e3: EU wants tougher green aviation laws [0.0] • Underlined words appeared in more than one document
  • 82. Under this topic • Document representation • Document analysis using • Latent Semantic Analysis (LSA) • Probabilistic Latent Semantic Analysis (PLSA) • Document Classification • Support Vector Machine (SVM): A machine learning algorithm
  • 83. Document Classification 83 22 January 2008
  • 84. Document classification with SVM • We will concentrate on binary classification • {sports, not sports}, {interesting, not interesting} etc • In general {+1,-1} also called {positive, negative} • SVM is a supervised machine learning technique. It learns the pattern from a training set • Training set • A set of documents with labels belonging to {+1, -1} • SVM tries to draw a hyperplane that best separates the positive and negative data in the training set 84 22 January 2008
  • 85. Support Vector Machine (SVM) • A Machine learning algorithm • SVM was introduced in COLT-92 by Boser, Guyon and Vapnik. • Initially popularized in the NIPS community, now an important and active area of machine learning research • Successful applications in many fields (text, bioinformatics, handwriting, image recognition etc.)
  • 86. SVM – Maximum margin separation SVM illustration by Bülent Üstün Radboud Universiteit 86 22 January 2008
  • 87. Mapping to higher dimension for non-separable data
      Input space: P1 = (0,0) → {+1}; P2 = (0,1) → {-1}; P3 = (1,1) → {+1}; P4 = (1,0) → {-1}
      Mapping: $x \to \phi(x) = (x_1^2, \; x_2^2, \; x_1 x_2)$
      Feature space: P1 = (0,0,0) → {+1}; P2 = (0,1,0) → {-1}; P3 = (1,1,1) → {+1}; P4 = (1,0,0) → {-1}
  • 88. The XOR example • SVM uses the kernel trick to map data to a higher dimensional feature space without incurring much computational overhead
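The effect of the mapping on the four XOR-style points can be checked with a short sketch. It assumes the explicit feature map φ(x) = (x1², x2², x1·x2), which is consistent with the mapped points listed on the previous slide. The weight vector `w` and bias `b` are hand-picked for illustration; they are not the output of an SVM solver and are not claimed to be the maximum-margin solution.

```python
import numpy as np

# The four points and their labels: not linearly separable in two dimensions.
points = np.array([[0, 0], [0, 1], [1, 1], [1, 0]], dtype=float)
labels = np.array([+1, -1, +1, -1])


def phi(x):
    """Explicit feature map (x1, x2) -> (x1^2, x2^2, x1*x2)."""
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, x1 * x2])


mapped = np.array([phi(p) for p in points])

# Hand-picked separating hyperplane in the three-dimensional feature space (assumption).
w = np.array([-2.0, -2.0, 4.0])
b = 1.0
decision_values = mapped @ w + b
predictions = np.sign(decision_values).astype(int)
```

A kernel-based SVM would obtain the same kind of separation without ever materializing `mapped`, by evaluating inner products in the feature space through a kernel function.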
  • 89. Recommender System Select top N items for a user -Somnath Banerjee © 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice
  • 90. Example 90 22 January 2008
  • 91. Example 91 22 January 2008
  • 92. Classification • Broadly three approaches • Content Based Recommendation • Collaborative Filtering • Hybrid approach 92 22 January 2008
  • 93. Content based recommendation • Utility of an item for a user is determined based on the items preferred by the user in the past • Applies similar techniques as introduced in the document modeling part 93 22 January 2008
  • 94. Basic Approach • Create and represent the user profile from the items rated by the user in the past • A popular choice of profile representation is vector of terms weighted based on TF*IDF • Represent the item in the same format • A news item can be represented using (TF*IDF) term vector • For movies, books one needs to get sufficient metadata to represent the item in vector format • Define a similarity measure to compute the similarity between the profile and the item • Popular choice is cosine similarity • Advanced machine learning techniques can also be applied to do the matching • Recommend most similar items
  • 95. Problems with content based recommendation • Knowledge engineering problem • How do you describe multimedia, graphics, movies, songs • Recommendation shows limited diversity • New user problem • It requires large number of ratings from the user to generate quality recommendation 95 22 January 2008
  • 96. Collaborative filtering • Recommends items that are liked in the past by other users with similar tastes • Quite popular in e-commerce sites, like Amazon, eBay • Can recommend various media types, text, video, audio, Ads, products 96 22 January 2008
  • 97. 97 22 January 2008
  • 98. Advantages • Does not have the knowledge engineering problem • Both user and items can be represented using just ids • Often recommendation shows good amount of diversity 98 22 January 2008
  • 99. Let's learn C.F. with an example
      Movies: Ran, Casablanca, Ben Hur, Tomb Raider, MI-II, Air Force One
      Ratings given by each user (unrated movies omitted; ? marks the rating to predict):
      Jane: 5, 5, ?, 2
      Bill: 2, 3, 4
      Tom: 2, 2, 5, 5
      Cathy: 3, 3, 1, 1
      What rating will Jane possibly give to MI-II?
  • 100. Normalizing the ratings • All users won’t give equal rating even if they all equally liked/disliked an item • Normalize each rating by subtracting the user's mean rating: $r'_{u,i} = r_{u,i} - \bar{r}_u$
      Normalized ratings (unrated movies omitted):
      Jane: 1, 1, -2
      Bill: -1, 0, 1
      Tom: -1.5, -1.5, 1.5, 1.5
      Cathy: 1, 1, -1, -1
  • 101. Similarity between users • Which other users have tastes similar to Jane's? • Each row of the matrix is a vector representing the user • Compute cosine similarity between the users • Similarity to Jane: Bill -0.289, Tom -0.612, Cathy 0.816
  • 102. Compute probable rating • The probable rating is the user's own mean rating plus the normalized ratings given by the other users, weighted by their similarity • Sometimes only the top N similar users are taken
      $r_{u,i} = \bar{r}_u + \frac{\sum_{v \in V} \mathrm{sim}(u, v) \, (r_{v,i} - \bar{r}_v)}{\sum_{v \in V} |\mathrm{sim}(u, v)|}$
      • Jane will rate MI-II as $4 + \frac{(-0.289 \times 1) + (-0.612 \times 1.5) + (0.816 \times -1)}{0.289 + 0.612 + 0.816} \approx 2.82$
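The worked example can be reproduced with a short sketch. The rating values, Jane's mean of 4 and the similarities come from the slides, but the transcript does not preserve which movie each rating belongs to, so the column placement below is an assumption chosen to be consistent with the normalized ratings and similarities shown. The helper names are also illustrative, and the denominator uses absolute similarities, matching the arithmetic of the example.

```python
import numpy as np

movies = ["Ran", "Casablanca", "Ben Hur", "Tomb Raider", "MI-II", "Air Force One"]
nan = np.nan

# Assumed placement of the ratings; nan marks an unrated movie.
ratings = {
    "Jane":  [5, 5, nan, nan, nan, 2],
    "Bill":  [2, nan, 3, nan, 4, nan],
    "Tom":   [nan, 2, 2, nan, 5, 5],
    "Cathy": [3, 3, nan, nan, 1, 1],
}


def centered(row):
    """Subtract the user's mean rating; unrated movies become 0."""
    row = np.array(row, dtype=float)
    return np.nan_to_num(row - np.nanmean(row))


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def predict(user, item, ratings):
    """Mean rating of `user` plus the similarity-weighted normalized ratings of the others."""
    i = movies.index(item)
    u = centered(ratings[user])
    numerator = denominator = 0.0
    for other, row in ratings.items():
        if other == user or np.isnan(row[i]):
            continue  # skip the user and anyone who has not rated the item
        v = centered(row)
        similarity = cosine(u, v)
        numerator += similarity * v[i]
        denominator += abs(similarity)
    return float(np.nanmean(ratings[user]) + numerator / denominator)


similarity_to_cathy = cosine(centered(ratings["Jane"]), centered(ratings["Cathy"]))
prediction = predict("Jane", "MI-II", ratings)
```

Cathy, the user most similar to Jane, disliked MI-II, and the two dissimilar users liked it, so the prediction falls well below Jane's mean rating.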
  • 103. Remarks • There is another popular version of the above technique where instead of user to user similarity item to item similarity is computed • Rating prediction is based on the similarity to the items rated by the user • The above mentioned methods are known as memory based techniques • They have the disadvantage that they require more online computation
  • 104. Model based technique • A model is learnt using the collection of ratings as training set • Prediction is done using the model • More offline computing and less online computing 104 22 January 2008
  • 105. Model based technique • A simple model: $r_{u,i} = E(r_{u,i}) = \sum_{r \in R} r \times \Pr(r_{u,i} = r \mid r_{u,s'}, \; s' \in I_u)$
  • 106. Model based technique • Recent research tries to model the recommendation process with more complex probabilistic models • Graphical model over the variables u, z, i and r: $P(r \mid u, i) = \sum_z P(r \mid z, i) \times P(z \mid u)$ • Parameters P(r|z,i) and P(z|u) can be estimated using EM algorithm
  • 107. Problems of C.F. • New user problem • New Item problem • Sparsity problem • A user rates only a few items • Unusual user • User whose tastes are unusual compared to the rest of the population 107 22 January 2008
  • 108. Hybrid approaches - Combining Collaborative and Content based methods • Combining predictions of Content based method and C.F. • Implement separate content based and collaborative filtering method • Combine their predictions using • Linear combination • Voting schemes • Alternatively select a prediction method based on some confidence measure on the recommendation 108 22 January 2008
  • 109. Hybrid Approaches • Adding content based characteristics into a C.F. based method • Maintain a content based profile for each user • Use these content based profiles (not the commonly rated items) to compute the similarity between users • Then do C.F. • Helps to overcome sparsity related problems as generally not many items are commonly rated by two users
  • 110. Hybrid approaches • Adding C.F. characteristics into a content based method • The most popular technique in this category is dimensionality reduction on a group of content based profiles • A dimensionality reduction technique such as LSA can improve prediction quality by giving a compact representation of the profiles
  • 111. Future directions of research (Adomavicius et al.) • Incorporating richer user and item profiles in a unified framework of different methods • Using contextual information in recommendation • Example: When recommending a vacation package, the system should consider • User • Time of the year • With whom the user plans to travel • Traveling conditions and restrictions at the time • Multi-Criteria ratings • E.g. three-criteria restaurant ratings: food, décor and service
  • 112. Future directions of research • Non-intrusiveness • Flexibility • Enabling end-users to customize recommendations • Evaluation • Empirical evaluation is done on test data that users choose to rate • Items that users choose to rate are likely to be biased • Economics-oriented measures
  • 113. References (Recommender System) • Adomavicius, G., and Tuzhilin, A., “Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions”, IEEE Transactions on Knowledge and Data Engineering, 2005
  • 114. Semantics in Personalization • Geetha Manjunath, Hewlett Packard Labs India
  • 115. Topic Outline • Why use semantic information? • Introduction to Ontology • Formal Specification of an Ontology • A Quick Overview of Semantic Web • Techniques and Approaches • Word Sense Disambiguation • Semantic Profiles • Constrained Spreading Activation • Semantic Similarity • Looking Ahead
  • 116. News Example Revisited • g1: Google Gets Green Light from FTC for DoubleClick Acquisition • g2: Google Closes In on DoubleClick Acquisition • g3: FTC clears Google DoubleClick deal • g4: US regulator clears DoubleClick deal • g5: DoubleClick deal brings greater focus on privacy • e1: EU Agrees to Reduce Aviation Emissions • e2: Aviation to be included in EU emissions trading • e3: EU wants tougher green aviation laws
  • 117. News Example Modified • g1: Apple Gets Green Light from FTC for TripleClick Acquisition • g2: Apple Closes In on TripleClick Acquisition • g3: FTC clears Apple TripleClick deal • g4: US regulator clears TripleClick deal • g5: TripleClick deal brings greater focus on privacy • e1: EU Agrees to Reduce Aviation Emissions • e2: Aviation to be included in EU emissions trading • e3: EU wants tougher green aviation laws • f1: Apple prices soaring high. • f2: Increased apple rates causes concern to doctors. • f3: Cost of 10 kg of apple to become Rs 1000 from 1 Feb. • (Callout labels on the slide: IT company, Google, Acquisition)
  • 118. Semantics for Personalization • (Pipeline diagram) Profile Data Collection (explicit and implicit info) → Profile Constructor (user profile) → Profile to Content Matching (content: search query, news, video, …) → Personalized services • Semantic enhancements shown on the pipeline: • Implicit info based on domain knowledge • Profile representation: represent profiles as meaningful concepts using domain concepts • Semantics based matching function • Expand the generated profile using domain info • Cluster documents based on better user groups
  • 119. Techniques and Approaches 1. Implicit Information based on domain knowledge • Word Sense Disambiguation 2. Represent Profiles as meaningful concepts • Semantic Profiles 3. Semantics based Matching Function • Semantic Distance 4. Expand the generated profile using domain info • Constrained Spreading Activation 5. Cluster documents based on better User groups • Social Semantic Networks
  • 120. Word Sense disambiguation using Wordnet • (Diagram: two senses of "Jaguar") • Animal sense: Animal > Mammal > Carnivore > Feline > Big cat > Jaguar (same as Panther); contains tail, fur, nail • Vehicle sense: Transport > Vehicle > Motor Vehicle > Automobile > Car > Jaguar; contains Accelerator, Door, Bumper, Wheel • Relations shown: Hyponyms (type of), Meronyms (contains), Synonyms (same as)
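  • 121. Word Sense disambiguation • (Diagram: two Wordnet-style hierarchies for the two senses of "Apple") • Company sense, with nodes including abstract entity, group, organization, institution, business, company, and related terms employee, advisory board, stocks, revenue, acquisition, sales tax • Fruit sense, with nodes including entity, substance, solid, food, plant, fruit, and related terms eat, animal, ripe, tree, skin, seed, pulp • KEY: Additional domain information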
  • 122. Three level Conceptual Network • Domain Ontology • Co-occurrence • synonyms • hyponyms • … • Hyperlinks • Order of access • Browsed together • …
  • 123. Introduction to Ontologies
  • 124. Views on Ontologies • (Diagram, arranged from Front-End to Back-End) • Models: TopicMaps, Thesauri, Taxonomies, Ontologies, Semantic Networks, Extended ER-Models, Predicate Logic • Uses: Navigation, Information Retrieval, Query Expansion, Sharing of Knowledge, Queries, Consistency Checking, EAI, Mediation, Reasoning
  • 125. Structure of an Ontology Ontologies typically have two components: • Names for important concepts in the domain • Elephant is a concept whose members are a kind of animal • Herbivore is a concept whose members are exactly those animals who eat only plants or parts of plants • Background knowledge/constraints on the domain • No individual can be both a Herbivore and a Carnivore
  • 126. A Simple Ontology • (Diagram) Classes: Object, Person, Topic, Document, Student, Researcher, PhD Student • Relations: is a, knows, writes, described in, is about, similar • Example topics: Semantics, Ontology
  • 127. Defining Ontology [Gruber, 1993] • An Ontology is a formal specification (executable) of a shared (group of persons) conceptualization (about concepts) of a domain of interest (application & “unique truth”) • Formal description of concepts and their relationships • Strong basis in the family of First Order Logics (DL) • Deductive inference based on ground truth of the domain
  • 128. Formal Specification of Ontologies • Semantic Web: A quick introduction
  • 129. The Semantic Web Vision • Semantic web aims to transform WWW into a global database • “The semantic web is a web for computers”
  • 130. Semantic web • Make web resources more accessible to automated processes • Extend existing rendering markup with semantic markup • Metadata annotations that describe content/function of web accessible resources • Use Ontologies to provide vocabulary for annotations • “Formal specification” is accessible to machines • A prerequisite is a standard web ontology language • Need to agree on a common syntax before we can share semantics • Syntactic web based on standards such as HTTP and HTML
  • 131. Semantic Web Layers • (Layer diagram; callouts: Context for vocabulary • Globally unambiguous identifiers • User definable, domain specific markup)
  • 132. What is RDF ? • RDF – resource description framework • RDF is a data model • Statement-based approach • Subject/predicate/object triples – simple powerful unit • All resources identified by URIs • Triples create a directed labelled graph of • object/attribute/value • (semantic) relationships between objects • RDF model is an abstract layer independent of XML • XML serialization is supported
  • 133. RDF Example • (Graph) resource ../presentation.ppt with properties and values: dc:creator = people.com/../dave_reynolds, dc:date = 2005-09-23, dc:description = "Some starter slides…"; people.com/../dave_reynolds has org:email = mailto:dave.reynolds@hp.com • <rdf:Description rdf:about="allppt.com/presentation.ppt"> <dc:creator resource="people.com/person/dave_reynolds"/> </rdf:Description> <rdf:Description rdf:ID="people.com/person/dave_reynolds"> <org:email resource="mailto:dave.reynolds@hp.com"/> </rdf:Description> • Enables easy merge of information • Indirect metadata (anyone can say anything about anything) • Extensibility (open world assumption, compositional)
  • 134. RDF Schema • Defines small vocabulary for RDF: • Class, subClassOf, type • Property, subPropertyOf • domain, range • Vocabulary can be used to define other vocabularies for your application domain • (Class diagram linked by rdfs:subClassOf: rdfs:Resource, Veh:MotorVehicle, Veh:Van, Veh:Truck, Veh:PassengerVehicle, Veh:MiniVan)
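The practical effect of rdfs:subClassOf is that class membership propagates up the hierarchy. A minimal sketch of that inference follows, written in plain Python rather than with an RDF library. The exact edges are an assumption based on the class names on the slide; in particular, giving Veh:MiniVan the two parents Veh:Van and Veh:PassengerVehicle is assumed for illustration.

```python
# Direct rdfs:subClassOf edges (assumed from the class names on the slide).
SUBCLASS_OF = {
    "Veh:MotorVehicle": {"rdfs:Resource"},
    "Veh:Van": {"Veh:MotorVehicle"},
    "Veh:Truck": {"Veh:MotorVehicle"},
    "Veh:PassengerVehicle": {"Veh:MotorVehicle"},
    "Veh:MiniVan": {"Veh:Van", "Veh:PassengerVehicle"},
}

def superclasses(cls):
    """All classes reachable from cls via rdfs:subClassOf (transitive closure)."""
    seen = set()
    stack = [cls]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

With these edges, anything typed as Veh:MiniVan is also a Veh:MotorVehicle, which is the kind of entailment an RDFS-aware store derives automatically.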
  • 135. OWL – Web Ontology Language • A language to express an ontology • An OWL ontology is an RDF graph • A set of RDF triples • Vocabulary Extension • Structure • Ontology headers • Class Axioms • Class Descriptions, Enumeration, Membership Restrictions • Property Axioms • Property Descriptions, Property Restrictions, Functional Spec • Facts about individuals • (Callouts on the slide: Domain Restrictions/Truth; Important Concepts of the Domain)
  • 136. OWL Class Constructors
  • 137. The Syntax • Parent = Person with at least one child • <owl:Class rdf:ID="Parent"> <owl:intersectionOf> <owl:Class rdf:about="#Person"/> <owl:Restriction> <owl:onProperty rdf:resource="#hasChild"/> <owl:minCardinality>1</owl:minCardinality> </owl:Restriction> </owl:intersectionOf> </owl:Class>
  • 138. OWL Axioms
  • 139. SPARQL • RDF Query Language • Triples with unbound variables • Protocol • HTTP binding • SOAP binding • XML Results Format • Easy to transform (XSLT, XQuery)
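The idea of "triples with unbound variables" can be illustrated without a SPARQL engine. The sketch below is not SPARQL itself, only a toy matcher for a single triple pattern; the `ex:` prefix, the sample triples and the `match` function are hypothetical.

```python
# Sample triples (hypothetical identifiers, loosely based on the RDF example slide).
TRIPLES = [
    ("ex:presentation.ppt", "dc:creator", "ex:dave_reynolds"),
    ("ex:presentation.ppt", "dc:date", "2005-09-23"),
    ("ex:dave_reynolds", "org:email", "mailto:dave.reynolds@hp.com"),
]

def match(pattern, triples=TRIPLES):
    """Return one binding dict per triple matching the pattern.

    Terms starting with '?' are variables; other terms must match exactly.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results
```

The call `match(("?doc", "dc:creator", "?who"))` corresponds roughly to the SPARQL query `SELECT ?doc ?who WHERE { ?doc dc:creator ?who }`, with prefix declarations omitted.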
  • 140. Why Ontologies? • Enable formalisation of user preferences • Common underlying, interoperable representation • Public vocabulary agreed & shared between different systems • Better content matching & sharing across applications • User interests can be matched to content meaning • Using conceptual reasoning • Richer, more precise, less ambiguous than keyword-based • Provides adequate grounding for hierarchical representation • coarse to fine-grained user interests • Formal, computer processable meaning on the concepts
  • 141. Semantic User Profiles
  • 142. Semantic Profiles • User Profile as concepts • Web pages visited by the user: Books, Clothes and Soccer • (Concept tree with weights) Top: Shopping (W=2), Science (W=0), Sports (W=1) • Shopping: Books (W=1), Clothes (W=1) • Sports: Soccer (W=1), Cricket (W=0) • How do we map documents/users to concepts?
  • 143. Building concept profiles based on ODP: The Machine Learning Approach • Step 1: Build ODP classifier for selected ODP categories (ODP categories + documents → Training → ODP classifier) • Step 2: Use user data and ODP classifier to build the user profile (User web pages → ODP classifier → ODP concepts → Add to profile)
  • 144. Topic Hierarchy from ODP / DMOZ
  • 145. Using Wikipedia to map documents to concepts: A Search Approach • Item: “Sony to slash PlayStation3 price” • Term vector representation: <sony:1>, <slash:1>, <playstation3:1>, <price:1> • Item: “Jittery Sony Knocks $100 Off PS3 Price Tag” • Term vector representation: <jittery:1>, <sony:1>, <knocks:1>, <ps3:1>, <price:1>, <tag:1> • The item text (e.g. “Sony to slash PlayStation3 price”) is used as a query against an index of a Wikipedia dump • Additional features: titles of the retrieved articles • 1. PlayStation Network Platform 2. PlayStation 2 3. Ducks demo 4. PlayStation 3 5. PlayStation 6. Ken Kutaragi 7. PlayStation Portable 8. Console manufacturer 9. Sony Group 10. Crystal Dynamics 11. PlayStation 3 accessories 12. … 13. …
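A rough sketch of the feature construction on this slide: build a term vector for an item, then append the titles of retrieved Wikipedia articles as extra features. The retrieval step itself is not shown; the stop list, the `concept:` prefix and both function names are assumptions for illustration.

```python
from collections import Counter

STOPWORDS = {"to", "off", "the", "a", "of"}  # tiny illustrative stop list (assumption)

def term_vector(text):
    """Bag-of-words term counts, lowercased, with stop words removed."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

def add_concepts(vector, article_titles):
    """Append retrieved article titles as extra features.

    Titles get a 'concept:' prefix so they stay distinct from plain terms.
    """
    enriched = Counter(vector)
    for title in article_titles:
        enriched["concept:" + title] += 1
    return enriched

vec = term_vector("Sony to slash PlayStation3 price")
enriched = add_concepts(vec, ["PlayStation 3", "Sony Group"])
```

Two headlines that share few words, such as the two Sony items above, can still overlap on the concept features if the search returns the same articles for both.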
  • 146. Profile: Words Vs Concepts • TF * IDF based user profile: Search, Home, Help, News, Privacy, Google, Terms, New, Page, Use, Web, View, Results, Information, Account • Wikipedia based user profile: Text Retrieval Conference, HTML element, Bank of America, Google search, ICICI Bank, IDBI Bank, Bank fraud, Artificial neural network, Web crawler, Web design, Debit card, Extensible Markup Language, Hewlett-Packard, Microsoft, XHTML, Demand account
  • 147. Semantic Profiles • Vector of weights representing the intensity of user interest for each concept (-1 to 1) • Content also described by a set of weighted concepts (0 to 1) • Concept Profiles: Can express fine grained interests • Interest in athletes who have won a gold medal • Interest in IT companies which have acquired at least 3 companies in the last year • Only movies with either Amitabh or Sharukh
  • 148. Ontology-based Profile Spreading
  • 149. Profile Expansion • Use inference mechanism to enhance personalisation • Synonym expansion • Interest in multiple subclasses implies broader interest • Transitive closure (locatedIn, subtopic) • Interest in superclass leads to potential interest in subclass • Guess changing interest over time
  • 150. Constrained Spreading • (Diagram: Artificial Intelligence, Machine Learning, Neural Networks)
  • 151. Constrained Spreading Activation • Cannot take ‘all’ related data • Commonly used SA models • Distance Constraint • Fan-out Constraint • Path Constraints • App dependent inference rules • Type of relationship • Preferential paths • Activation Constraint • Threshold function at each single node level
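Three of the constraints listed above (distance, fan-out and activation threshold) can be sketched as follows. Path constraints are left out. The `decay` factor, the default parameter values and the sample graph (including the "Perceptron" node) are assumptions, not taken from the slides.

```python
# Sample concept graph: node -> neighbours ("Perceptron" is a hypothetical extra node).
GRAPH = {
    "Artificial Intelligence": ["Machine Learning"],
    "Machine Learning": ["Neural Networks"],
    "Neural Networks": ["Perceptron"],
}

def spread(graph, seeds, decay=0.5, max_distance=2, max_fan_out=3, threshold=0.1):
    """Spread activation from seed concepts under three constraints.

    seeds: node -> initial activation. Returns node -> final activation.
    """
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(max_distance):                  # distance constraint
        next_frontier = {}
        for node, value in frontier.items():
            neighbours = graph.get(node, [])
            if len(neighbours) > max_fan_out:      # fan-out constraint
                continue
            for n in neighbours:
                passed = value * decay
                if passed < threshold:             # activation constraint
                    continue
                if passed > activation.get(n, 0.0):
                    activation[n] = passed
                    next_frontier[n] = passed
        frontier = next_frontier
    return activation

result = spread(GRAPH, {"Artificial Intelligence": 1.0})
```

With the defaults, activation reaches Machine Learning and Neural Networks but stops before Perceptron because of the distance limit; raising `threshold` to 0.3 also cuts off Neural Networks.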
  • 152. Learning preferences using semantic links • Two main ways of updating the Concept History Stack • 1. Interest Assumption Completion • Add more potential user interests • Based on hierarchical relationships • Threshold on the value of pseudo-occurrence for insertion • $N_{occ}(C_{supertype}) = \gamma \times N_{occ}(C_{subtype})$, where $\gamma < 1$ is determined empirically • Based on semantic relationships • All related concepts $C_{related}$ such that $\exists$ property $p$ with $p(C, C_{related})$ • Pseudo-occurrence $N_{occ}(C_{related}) = \alpha_i \times N_{occ}(C)$
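A small sketch of the hierarchical case of interest assumption completion. The value of gamma, the insertion threshold `min_occurrence`, and the decision to sum pseudo-occurrences when several subtypes share a supertype are assumptions; the slide does not specify them.

```python
def complete_interests(occurrences, supertype_of, gamma=0.5, min_occurrence=1.0):
    """Add supertypes whose pseudo-occurrence passes the insertion threshold.

    occurrences: concept -> observed occurrence count N_occ
    supertype_of: concept -> its direct supertype
    Pseudo-occurrence of the supertype is gamma * N_occ(subtype), with gamma < 1.
    """
    completed = dict(occurrences)
    for concept, n_occ in occurrences.items():
        parent = supertype_of.get(concept)
        if parent is None:
            continue
        pseudo = gamma * n_occ
        if pseudo >= min_occurrence:
            completed[parent] = completed.get(parent, 0.0) + pseudo
    return completed

profile = complete_interests(
    {"Soccer": 4, "Cricket": 1},
    {"Soccer": "Sports", "Cricket": "Sports"},
)
```

Here Soccer contributes a pseudo-occurrence of 2.0 to Sports, while Cricket's 0.5 falls below the threshold and is ignored.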
  • 153. Learning preferences using semantic links (contd) • 2. Preference update by expansion • Re-weighting over time • $W_{new}(C_{related}) = W_{old}(C_{related}) + \beta_i \times W_{new}(C)$ • $\beta_i$: semantic factor that depends on the level of semantic proximity • Directly part of the definition (TBox) • Related through an inferred transitive relation (the number of such links matters) • Notion of semantic distance
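The re-weighting rule can be written directly as code. The function name, the dictionary representation and the sample beta value are hypothetical; how beta is derived from semantic proximity is not shown.

```python
def update_weights(weights, concept, new_weight, related, beta):
    """Apply W_new(C_related) = W_old(C_related) + beta_i * W_new(C).

    weights: concept -> current weight
    related: concepts semantically related to `concept`
    beta: related concept -> semantic factor beta_i
    """
    updated = dict(weights)
    updated[concept] = new_weight
    for c_related in related:
        old = updated.get(c_related, 0.0)
        updated[c_related] = old + beta[c_related] * new_weight
    return updated

new_profile = update_weights(
    {"Soccer": 0.2, "Sports": 0.1},
    concept="Soccer",
    new_weight=0.8,
    related=["Sports"],
    beta={"Sports": 0.5},
)
```

A closer semantic relation would get a larger beta, so more of the updated interest in Soccer would carry over to the related concept.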
  • 154. Semantic Similarity
  • 155. Similarity/Matching • Cosine similarity • U represents the user preference • D represents the content object • Dimension: number of concepts in the ontology • $\mathrm{similarity}(U, D) = \cos(U, D) = \frac{U \cdot D}{\|U\|_2 \times \|D\|_2} = \frac{\sum_i U_i \times D_i}{\sqrt{\sum_i U_i^2}\ \sqrt{\sum_i D_i^2}}$
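A minimal sketch of cosine matching between a user profile U and a content object D, both stored as sparse concept-to-weight dictionaries. The sparse representation and the zero-vector convention (return 0.0) are assumptions.

```python
from math import sqrt

def cosine_similarity(u, d):
    """Cosine of the angle between two concept-weight vectors.

    u, d: dicts mapping concept -> weight; missing concepts count as 0.
    """
    dot = sum(w * d.get(c, 0.0) for c, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_d = sqrt(sum(w * w for w in d.values()))
    if norm_u == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_u * norm_d)

score = cosine_similarity({"Books": 1, "Soccer": 1}, {"Books": 1})
```

Because the ontology may contain many concepts while a profile touches only a few, the sparse dictionaries avoid materialising a full vector per user or document.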
  • 156. Semantic distance d(x,y,c) • Semantic distance between 2 nodes x and y is defined with respect to a concept, c • Example: a black cat and an orange cat • They are very similar as instances of the category Animal, since their common catlike properties would be the most significant for distinguishing them from other kinds of animals. • But in the category Cat, they would share their catlike properties with all the other kinds of cats, and the difference in color would be more significant. • In the category BlackEntity, color would be the most relevant property, and the black cat would be closer to a crow or a lump of coal than to the orange cat.
  • 157. Semantic Similarity • (Figure legend: Mtrl = Material, Accm = Accompaniment)
  • 158. Using Wordnet (hypernyms)
  • 159. Match all nodes
  • 160. Similarity Formula
  • 161. Looking Ahead
  • 162. Contextual Personalisation • Finer, qualitative, context sensitive activation of user preferences • Notion of a Semantic Runtime Context • Representation: Vector of concept weights • Fuzzy semantic intersection between user preferences and runtime context • Using Constrained spreading activation
  • 163. Semantic Social Networking • Identify hidden links between users • Similarity between user preferences • Collaborative Recommender systems • Comparing users on global preferences alone is not correct • Partial & strong similarities are very useful • E.g. coinciding interest in cinema but drastically different interests in sports
  • 164. Semantic Social Networks
  • 165. Microformats • (Diagram labels: Metadata, Social links, Geo, Outline, hResume, adr, Licensing, tags) • http://microformats.org • Microformats are small bits of HTML that represent things like people, events, tags, etc. in web pages. • Building blocks that enable users to own, control, move, and share their data on the Web. • Microformats enable • publishing of higher fidelity information on the Web, • the fastest and simplest way to support feeds and APIs for your website.
  • 167. eRDF • HTML Embeddable RDF • A subset of RDF embedded into XHTML or HTML by using common idioms and attributes. • No new elements or attributes have been invented and the usages of the HTML attributes are within normal bounds. • This scheme is designed to work with CSS and other HTML support technologies. • All HTML Embeddable RDF is valid RDF, but not all RDF is Embeddable RDF
  • 168. GRDDL • Gleaning Resource Descriptions from Dialects of Languages • Obtaining RDF data from XHTML pages • Explicitly associated transformation algorithms (XSLT)
  • 169. Acknowledgements • Self-tuning Personalized Information Retrieval in an Ontology-Based Framework, Pablo Castells, Miriam Fernández, David Vallet, et al., OTM Workshop 2005 • An Approach for Semantic Search by Matching RDF Graphs, Haiping Zhu, Jiwei Zhong, Jianming Li and Yong Yu • Semantic Web Tutorials
  • 170. Concluding Remarks • Personalization: An upcoming area of technology • Personalization aims at faster access to information to improve user productivity • Server-side Vs Client-side personalization • Technologies • Machine Learning techniques • Semantic Web • New Markup Languages • Challenges • Understanding the user behaviour, intentions, likes, … • Relating human edited content to the profile
  • 171. Thank you • Questions?