Parallel Algorithms for Mining Large-Scale Data

Edward Chang
Director, Google Research, Beijing
http://infolab.stanford.edu/~echang/
Story of the Photos

• The photos on the first pages are statues of two "bookkeepers"
  displayed at the British Museum. One bookkeeper keeps a list of
  good people, and the other a list of bad. (Who is who, can you
  tell? ☺)
• When I first visited the museum in 1998, I did not take a photo
  of them, to conserve film. During this trip (June 2009), capacity
  is no longer a concern or constraint. In fact, one can see kids
  and grandmas all taking photos, a lot of photos. Ten years apart,
  data volume explodes.
• Data complexity also grows.
• So, can these ancient "bookkeepers" still classify good from bad?
  Is their capacity scalable to the data dimension and data quantity?
Outline

• Motivating Applications
    – Q&A System
    – Social Ads
• Key Subroutines
    – Frequent Itemset Mining [ACM RS 08]
    – Latent Dirichlet Allocation [WWW 09, AAIM 09]
    – Clustering [ECML 08]
    – UserRank [Google TR 09]
    – Support Vector Machines [NIPS 07]
• Distributed Computing Perspectives
Query: What are must-see attractions at Yellowstone

Query: Must-see attractions at Yosemite

Query: Must-see attractions at Copenhagen

Query: Must-see attractions at Beijing
(Hotel ads appear alongside the results)
Query: Who is Yao Ming

Q&A: Yao Ming

Who is Yao Ming / Yao Ming-related Q&As
Application: Google Q&A (Confucius)
Launched in China, Thailand, and Russia

[Diagram: Search (S) and Community (C) connected through CCS/Q&A;
a discussion/question session links Users (U) to Differentiated
Content (DC).]

• Trigger a discussion/question session during search
• Provide labels to a question (semi-automatically)
• Given a question, find similar questions and their answers
  (automatically)
• Evaluate the quality of an answer: relevance and originality
• Evaluate user credentials in a topic-sensitive way
• Route questions to experts
• Provide the most relevant, high-quality content for Search to index
Q&A Uses Machine Learning

• Label suggestion using ML algorithms.
• Real-time topic-to-topic (T2T) recommendation using ML algorithms.
• Gives out related high-quality links to previous questions before
  human answers appear.
Collaborative Filtering

[Figure: a sparse users × books/photos membership matrix with
scattered 1 entries.]

Based on membership so far, and the memberships of others,
predict further membership.
Collaborative Filtering

[Figure: a partially observed users × books/photos matrix; 1s are
observed memberships, "?" marks unobserved entries.]

Based on the partially observed matrix, predict the unobserved
entries:
I.   Will user i enjoy photo j?
II.  Will user i be interesting to user j?
III. Will photo i be related to photo j?
FIM-based Recommendation

[Figure: recommendation results produced by frequent itemset mining.]
FIM Preliminaries

• Observation 1: If an item A is not frequent, any pattern containing
  A won't be frequent [R. Agrawal]
    – use a threshold to eliminate infrequent items
    – e.g., if {A} is infrequent, so is {A,B}
• Observation 2: Patterns containing A are subsets of (or found
  from) transactions containing A [J. Han]
    – divide-and-conquer: select the transactions containing A to
      form a conditional database (CDB), and find the patterns
      containing A from that conditional database
    – e.g., {A,B}, {A,C}, {A} form the CDB of A; {A,B}, {B,C} form
      the CDB of B
• Observation 3: To prevent the same pattern from being found in
  multiple CDBs, all itemsets are sorted in the same manner (e.g.,
  by descending support)
Preprocessing

Transactions   Item supports              Sorted & filtered
facdgimp       f:4  c:4  a:3  b:3  m:3    fcamp
abcflmo        p:3  o:2  d:1  e:1  g:1    fcabm
bfhjo          h:1  i:1  k:1  l:1  n:1    fb
bcksp                                     cbp
afcelpmn                                  fcamp

• According to Observation 1, we count the support of each item by
  scanning the database, and eliminate the infrequent items from the
  transactions (here, items with support below 3).
• According to Observation 3, we sort the items in each transaction
  by descending support value.
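A minimal single-machine sketch of this preprocessing step in Python, assuming transactions are strings of single-character items and a minimum support of 3:

```python
from collections import Counter

MIN_SUPPORT = 3
transactions = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]

# Observation 1: one scan over the database to count item supports.
support = Counter(item for txn in transactions for item in txn)

# Observation 3: fix one global item order (descending support; the
# slides break the f/c tie with f first) and sort every transaction
# by it, so each pattern is later found in exactly one CDB.
rank = {item: r for r, item in enumerate("fcabmp")}

def preprocess(txn):
    kept = [i for i in txn if support[i] >= MIN_SUPPORT]
    return "".join(sorted(kept, key=rank.__getitem__))

print([preprocess(t) for t in transactions])
# -> ['fcamp', 'fcabm', 'fb', 'cbp', 'fcamp']
```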
Parallel Projection

• According to Observation 2, we construct the CDB of item A; then
  from this CDB, we find the patterns containing A
• How to construct the CDB of A?
    – If a transaction contains A, this transaction should appear in
      the CDB of A
    – Given a transaction {B, A, C}, it should appear in the CDB of
      A, the CDB of B, and the CDB of C
• Dedup solution: using the order of items:
    – sort {B,A,C} by the order of items: <A,B,C>
    – Put <> into the CDB of A
    – Put <A> into the CDB of B
    – Put <A,B> into the CDB of C
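A sketch of this dedup rule, assuming each transaction is already sorted by the global item order:

```python
def project(sorted_txn):
    """Emit (item, conditional transaction) pairs: each item's
    conditional transaction is the prefix of items preceding it,
    so a pattern ending in that item is found only in its CDB."""
    for i, item in enumerate(sorted_txn):
        yield item, sorted_txn[:i]

# e.g. list(project("ABC")) -> [('A', ''), ('B', 'A'), ('C', 'AB')]
```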
Example of Projection

fcamp      p: { f c a m / f c a m / c b }
fcabm      m: { f c a / f c a / f c a b }
fb         b: { f c a / f / c }
cbp        a: { f c / f c / f c }
fcamp      c: { f / f / f }

Example of projection of a database into CDBs.
Left: sorted transactions, in the order f, c, a, b, m, p.
Right: conditional databases of frequent items.
Recursive Projections [H. Li, et al., ACM RS 08]

[Figure: search tree of CDBs grown over three MapReduce iterations:
D at the root; D|a, D|b, D|c at level 1; D|ab, D|ac, D|bc at
level 2; D|abc at level 3.]

• Recursive projection forms a search tree
• Each node is a CDB
• The order of items prevents duplicated CDBs
• Each level of breadth-first search of the tree can be done by one
  MapReduce iteration
• Once a CDB is small enough to fit in memory, we can invoke
  FP-growth to mine it, with no further growth of that subtree
Projection using MapReduce

Map inputs        Sorted transactions     Map outputs (conditional
(transactions,    (infrequent items       transactions, key: value)
key="")           eliminated)
facdgimp          fcamp                   p: fcam | m: fca | a: fc | c: f
abcflmo           fcabm                   m: fcab | b: fca | a: fc | c: f
bfhjo             fb                      b: f
bcksp             cbp                     p: cb | b: c
afcelpmn          fcamp                   p: fcam | m: fca | a: fc | c: f

Reduce inputs (conditional databases)   Reduce outputs (patterns : supports)
p: { f c a m / f c a m / c b }          p:3, pc:3
m: { f c a / f c a / f c a b }          m:3, mf:3, mc:3, ma:3, mfc:3,
                                        mfa:3, mca:3, mfca:3
b: { f c a / f / c }                    b:3
a: { f c / f c / f c }                  a:3, af:3, ac:3, afc:3
c: { f / f / f }                        c:3, cf:3
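A single-machine simulation of this iteration in Python (a sketch, not the distributed implementation): the map step applies the projection above, the shuffle groups conditional transactions into CDBs, and a real reducer would then run FP-growth on each in-memory CDB.

```python
from collections import defaultdict

transactions = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]  # preprocessed

# Map + shuffle: group conditional transactions into CDBs per item.
cdbs = defaultdict(list)
for txn in transactions:
    for i, item in enumerate(txn):
        if i > 0:                    # empty prefixes add no information
            cdbs[item].append(txn[:i])

for item in "pmbac":
    print(item, ":", " / ".join(cdbs[item]))
# p : fcam / cb / fcam
# m : fca / fcab / fca
# b : fca / f / c
# a : fc / fc / fc
# c : f / f / f
```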
Collaborative Filtering

[Figure: the same sparse membership matrix, now individuals ×
indicators/diseases.]

Based on membership so far, and the memberships of others,
predict further membership.
Latent Semantic Analysis

• Search
    – Construct a latent layer for better semantic matching
• Example queries:
    – "iPhone crack"
    – "Apple pie"

[Figure: documents ("1 recipe pastry for a 9 inch double crust,
9 apples, 1/2 cup brown sugar" vs. "How to install apps on Apple
mobile phones?") and the user queries "iPhone crack" and "Apple pie"
are each mapped to a topic distribution; matching happens in the
latent topic space over a users/music/ads/questions matrix.]

• Other collaborative filtering apps:
    – Recommend Users to Users
    – Recommend Music to Users
    – Recommend Ads to Users
    – Recommend Answers to Questions
• Predict the "?" in the light-blue cells
Documents, Topics, Words

• A document consists of a number of topics
    – A document is a probabilistic mixture of topics
• Each topic generates a number of words
    – A topic is a distribution over words
    – The probability of the ith word in a document is given below
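The formula itself was an image on the original slide; the standard mixture form this bullet describes, with z_i the topic assignment of word i and K the number of topics, is

    P(w_i) = \sum_{j=1}^{K} P(w_i \mid z_i = j) \, P(z_i = j)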
Latent Dirichlet Allocation [M. Jordan 04]

• α: uniform Dirichlet prior for the per-document topic
  distribution θd (corpus-level parameter)
• β: uniform Dirichlet prior for the per-topic word
  distribution φ (corpus-level parameter)
• θd is the topic distribution of doc d (document level)
• zdj is the topic of the jth word in doc d; wdj is the specific
  word (word level)

[Plate diagram: α → θ → z → w ← φ ← β, with z and w repeated Nm
times per document, φ repeated K times (topics), and θ, z, w
repeated M times (documents).]
LDA Gibbs Sampling: Inputs and Outputs

Inputs:
1. training data: documents as bags of words
2. parameter: the number of topics

Outputs:
1. by-product: a co-occurrence matrix of topics and documents
   (docs × topics)
2. model parameters: a co-occurrence matrix of topics and words
   (topics × words)
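A minimal sketch of one collapsed Gibbs sampling sweep over those two count matrices (single machine; alpha and beta are the Dirichlet hyperparameters from the previous slide, and all names here are illustrative, not the parallel implementation's API):

```python
import numpy as np

def gibbs_sweep(docs, z, n_dk, n_kw, n_k, alpha, beta, rng):
    """docs: list of word-id lists; z: current topic assignments;
    n_dk: docs x topics counts; n_kw: topics x words counts;
    n_k: per-topic totals."""
    K, V = n_kw.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1  # remove word
            # full conditional P(z = k | everything else)
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k = rng.choice(K, p=p / p.sum())                # resample
            z[d][i] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1   # add back
```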
Parallel Gibbs Sampling [AAIM 09]

The inputs and outputs are the same as for sequential LDA Gibbs
sampling above: documents as bags of words and the number of topics
in; the docs × topics and topics × words co-occurrence matrices out.
Outline

• Motivating Applications
    – Q&A System
    – Social Ads
• Key Subroutines
    – Frequent Itemset Mining [ACM RS 08]
    – Latent Dirichlet Allocation [WWW 09, AAIM 09]
    – Clustering [ECML 08]
    – UserRank [Google TR 09]
    – Support Vector Machines [NIPS 07]
• Distributed Computing Perspectives
Open Social APIs

[Diagram: Open Social at the center, connected to (1) Profiles
(who I am), (2) Friends (who I know), (3) Activities (what I do),
and (4) Stuff (what I have).]
Activities, Recommendations, Applications

[Figure: screenshots of activities, recommendations, and
applications.]
Task: Targeting Ads at SNS Users

[Figure: matching a pool of ads to social-network users.]
Mining Profiles, Friends & Activities for Relevance
Consider also User Influence

• Advertisers consider users who are
    – Relevant
    – Influential
• SNS Influence Analysis
    – Centrality
    – Credential
    – Activeness
    – etc.
Spectral Clustering [A. Ng, M. Jordan]

• Important subroutine in machine learning and data mining tasks
    – Exploits pairwise similarity of data instances
    – More effective than traditional methods, e.g., k-means
• Key steps (sketched in code below)
    – Construct the pairwise similarity matrix
        • e.g., using geodesic distance
    – Compute the Laplacian matrix
    – Apply eigendecomposition
    – Perform k-means
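A dense single-machine sketch of these four steps (an illustration, not the parallel ECML 08 implementation; the RBF similarity and the sigma bandwidth are assumptions):

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    # 1. pairwise similarity matrix (RBF kernel here)
    S = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma**2))
    # 2. normalized Laplacian form: D^(-1/2) S D^(-1/2)
    d = S.sum(axis=1)
    L = S / np.sqrt(np.outer(d, d))
    # 3. top-k eigenvectors (ARPACK under the hood, as on the slides)
    _, V = eigsh(L, k=k, which="LA")
    # 4. k-means on the row-normalized eigenvectors
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return KMeans(n_clusters=k, n_init=10).fit_predict(V)
```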
Scalability Problem

• Quadratic computation of the n×n similarity matrix
• Approximation methods for the dense matrix:
    – Sparsification: t-NN, ξ-neighborhood, …
    – Nystrom sampling: random, greedy, …
    – Others
Sparsification vs. Sampling

Sparsification:
• Construct the dense similarity matrix S
• Sparsify S
• Compute the Laplacian matrix L
• Apply ARPACK on L
• Use k-means to cluster the rows of V into k groups

Sampling (Nystrom):
• Randomly sample l points, where l << n
• Construct the dense similarity matrix [A B] between the l points
  and all n points
• Normalize A and B into Laplacian form
• R = A + A^(-1/2) B B^T A^(-1/2); R = U Σ U^T
• k-means
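A minimal numpy sketch of the Nystrom step, following the one-shot orthogonalization formula above (function names and the PSD square-root helper are illustrative):

```python
import numpy as np

def sqrtm_psd(M):
    # symmetric PSD square root via eigendecomposition
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(np.sqrt(np.clip(w, 0, None))) @ Q.T

def nystrom_embedding(A, B, k):
    """A: l x l similarities among sampled points; B: l x (n-l)
    similarities between sampled and remaining points. Returns an
    (n, k) spectral embedding approximating the top-k eigenvectors."""
    Asi = np.linalg.pinv(sqrtm_psd(A))   # A^(-1/2)
    R = A + Asi @ B @ B.T @ Asi          # R = A + A^(-1/2) B B^T A^(-1/2)
    S, U = np.linalg.eigh(R)             # R = U S U^T
    top = np.argsort(S)[::-1][:k]
    V = np.vstack([A, B.T]) @ Asi @ U[:, top] / np.sqrt(S[top])
    return V                             # feed the rows to k-means
```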
Empirical Study [Song, et al., ECML 08]

• Dataset: RCV1 (Reuters Corpus Volume I)
    – A filtered collection of 193,944 documents in 103 categories
• Photo set: PicasaWeb
    – 637,137 photos
• Experiments
    – Clustering quality vs. computational time
        • Measure the similarity between category labels (CAT) and
          cluster assignments (CLS)
        • Metric: Normalized Mutual Information (NMI)
    – Scalability
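For reference, NMI in its standard normalized form (the slides give only the name, so the exact normalization used is an assumption):

    NMI(CAT, CLS) = I(CAT; CLS) / sqrt(H(CAT) H(CLS))

where I is mutual information and H is entropy; NMI approaches 1 when clusters match the categories and 0 when they are independent.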
NMI Comparison (on RCV1)

[Figure: NMI of the Nystrom method vs. sparse matrix approximation.]
Speedup Test on 637,137 Photos

• K = 1000 clusters
• Achieves linear speedup when using up to 32 machines; after that,
  sub-linear speedup because of increasing communication and
  synchronization time
Sparsification vs. Sampling

                   Sparsification            Nystrom (random sampling)
Information        Full n×n similarity       None
                   scores
Pre-processing     O(n²) worst case;         O(nl), l << n
complexity         easily parallelizable
(bottleneck)
Effectiveness      Good                      Not bad (Jitendra M., PAMI)
Outline

• Motivating Applications
    – Q&A System
    – Social Ads
• Key Subroutines
    – Frequent Itemset Mining [ACM RS 08]
    – Latent Dirichlet Allocation [WWW 09, AAIM 09]
    – Clustering [ECML 08]
    – UserRank [Google TR 09]
    – Support Vector Machines [NIPS 07]
• Distributed Computing Perspectives
Matrix Factorization Alternatives

[Figure: a spectrum of factorization methods, from exact to
approximate.]
PSVM [E. Chang, et al., NIPS 07]

• Column-based incomplete Cholesky factorization (ICF)
    – Slower than row-based on a single machine
    – Parallelizable across multiple machines
• Changing the IPM (interior-point method) computation order to
  achieve parallelization
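For intuition, a serial sketch of ICF with greedy pivoting, the kernel-matrix approximation that PSVM distributes column-wise (the callback-based interface is an illustration, not the paper's API):

```python
import numpy as np

def icf(kernel_row, n, p, tol=1e-6):
    """Approximate an n x n kernel matrix Q as H @ H.T with H of
    rank at most p << n. kernel_row(i) must return row i of Q."""
    H = np.zeros((n, p))
    d = np.array([kernel_row(i)[i] for i in range(n)])  # residual diag
    for j in range(p):
        i = int(np.argmax(d))          # greedy pivot: largest residual
        if d[i] < tol:                 # accurate enough; stop early
            return H[:, :j]
        # new column: residual of the pivot's kernel row, scaled
        H[:, j] = (kernel_row(i) - H[:, :j] @ H[i, :j]) / np.sqrt(d[i])
        d -= H[:, j] ** 2              # update residual diagonal
    return H
```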
Speedup

[Figure: PSVM speedup vs. number of machines.]
Overheads

[Figure: breakdown of communication and synchronization overheads.]
Comparison between Parallel Computing Frameworks

                                 MapReduce       Project B       MPI
GFS/IO and task-rescheduling     Yes             No (+1)         No (+1)
overhead between iterations
Flexibility of computation       AllReduce only  AllReduce only  Flexible
model                            (+0.5)          (+0.5)          (+1)
Efficient AllReduce              Yes (+1)        Yes (+1)        Yes (+1)
Recover from faults between      Yes (+1)        Yes (+1)        Apps (+1)
iterations
Recover from faults within       Yes (+1)        Yes (+1)        Apps (+1)
each iteration
Final score for scalable         3.5             4.5             5
machine learning
Concluding Remarks

• Applications demand scalable solutions
• Have parallelized key subroutines for mining massive data sets
    – Spectral Clustering [ECML 08]
    – Frequent Itemset Mining [ACM RS 08]
    – PLSA [KDD 08]
    – LDA [WWW 09]
    – UserRank
    – Support Vector Machines [NIPS 07]
• Relevant papers
    – http://infolab.stanford.edu/~echang/
• Open source: PSVM, PLDA
    – http://code.google.com/p/psvm/
    – http://code.google.com/p/plda/
Collaborators

• Prof. Chih-Jen Lin (NTU)
• Hongjie Bai (Google)
• Wen-Yen Chen (UCSB)
• Jon Chu (MIT)
• Haoyuan Li (PKU)
• Yangqiu Song (Tsinghua)
• Matt Stanton (CMU)
• Yi Wang (Google)
• Dong Zhang (Google)
• Kaihua Zhu (Google)
References

[1] Alexa internet. http://www.alexa.com/.
[2] D. M. Blei and M. I. Jordan. Variational methods for the Dirichlet process. In Proc. of the 21st International Conference on Machine Learning, pages 373-380, 2004.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[4] D. Cohn and H. Chang. Learning to probabilistically identify authoritative documents. In Proc. of the Seventeenth International Conference on Machine Learning, pages 167-174, 2000.
[5] D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems 13, pages 430-436, 2001.
[6] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391-407, 1990.
[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38, 1977.
[8] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721-741, 1984.
[9] T. Hofmann. Probabilistic latent semantic indexing. In Proc. of Uncertainty in Artificial Intelligence, pages 289-296, 1999.
[10] T. Hofmann. Latent semantic models for collaborative filtering. ACM Transactions on Information Systems, 22(1):89-115, 2004.
[11] A. McCallum, A. Corrada-Emmanuel, and X. Wang. The author-recipient-topic model for topic and role discovery in social networks: Experiments with Enron and academic email. Technical report, Computer Science, University of Massachusetts Amherst, 2004.
[12] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 20, 2007.
[13] M. Ramoni, P. Sebastiani, and P. Cohen. Bayesian clustering by dynamics. Machine Learning, 47(1):91-121, 2002.
References (cont.)

[14] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proc. of the 24th International Conference on Machine Learning, pages 791-798, 2007.
[15] E. Spertus, M. Sahami, and O. Buyukkokten. Evaluating similarity measures: a large-scale study in the Orkut social network. In Proc. of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 678-684, 2005.
[16] M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths. Probabilistic author-topic models for information discovery. In Proc. of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 306-315, 2004.
[17] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research (JMLR), 3:583-617, 2002.
[18] T. Zhang and V. S. Iyengar. Recommender systems using linear classifiers. Journal of Machine Learning Research, 2:313-334, 2002.
[19] S. Zhong and J. Ghosh. Generative model-based clustering of documents: a comparative study. Knowledge and Information Systems (KAIS), 8:374-384, 2005.
[20] L. Adamic and E. Adar. How to search a social network. 2004.
[21] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, pages 5228-5235, 2004.
[22] H. Kautz, B. Selman, and M. Shah. Referral Web: Combining social networks and collaborative filtering. Communications of the ACM, 3:63-65, 1997.
[23] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD Rec., 22:207-216, 1993.
[24] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 1998.
[25] M. Deshpande and G. Karypis. Item-based top-N recommendation algorithms. ACM Trans. Inf. Syst., 22(1):143-177, 2004.
References (cont.)

[26] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International World Wide Web Conference, pages 285-295, 2001.
[29] M. Brand. Fast online SVD revisions for lightweight recommender systems. In Proceedings of the 3rd SIAM International Conference on Data Mining, 2003.
[30] D. Goldberg, D. Nichols, B. Oki, and D. Terry. Using collaborative filtering to weave an information tapestry. Communications of the ACM 35, 12:61-70, 1992.
[31] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. GroupLens: An open architecture for collaborative filtering of netnews. In Proceedings of the ACM Conference on Computer Supported Cooperative Work, pages 175-186, 1994.
[32] J. Konstan, et al. GroupLens: Applying collaborative filtering to Usenet news. Communications of the ACM 40, 3:77-87, 1997.
[33] U. Shardanand and P. Maes. Social information filtering: Algorithms for automating "word of mouth". In Proceedings of ACM CHI, 1:210-217, 1995.
[34] G. Linden, B. Smith, and J. York. Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Computing, 7:76-80, 2003.
[35] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning Journal 42, 1:177-196, 2001.
[36] T. Hofmann and J. Puzicha. Latent class models for collaborative filtering. In Proceedings of the International Joint Conference on Artificial Intelligence, 1999.
[37] http://www.cs.carleton.edu/cs_comps/0607/recommend/recommender/collaborativefiltering.html
[38] E. Y. Chang, et al. Parallelizing Support Vector Machines on Distributed Machines. NIPS, 2007.
[39] Wen-Yen Chen, Dong Zhang, and E. Y. Chang. Combinational Collaborative Filtering for personalized community recommendation. ACM KDD, 2008.
[40] Y. Sun, W.-Y. Chen, H. Bai, C.-J. Lin, and E. Y. Chang. Parallel Spectral Clustering. ECML, 2008.

Weitere ähnliche Inhalte

Andere mochten auch

Uv08 Dcii Tema 3 DiseñO Y Gestion Servicios
Uv08 Dcii Tema 3 DiseñO Y Gestion ServiciosUv08 Dcii Tema 3 DiseñO Y Gestion Servicios
Uv08 Dcii Tema 3 DiseñO Y Gestion ServiciosJordi Miro
 
Ati flash cards 09, medications affecting fluid, electrolytes, minerals, and ...
Ati flash cards 09, medications affecting fluid, electrolytes, minerals, and ...Ati flash cards 09, medications affecting fluid, electrolytes, minerals, and ...
Ati flash cards 09, medications affecting fluid, electrolytes, minerals, and ...Mary Elizabeth Francisco
 
Analisis Grafico 3 Ejercicios
Analisis Grafico 3 EjerciciosAnalisis Grafico 3 Ejercicios
Analisis Grafico 3 EjerciciosMarcos A. Fatela
 
Gestión y diseño de los instrumentos de comunicacion de marketing
Gestión y diseño de los instrumentos de comunicacion de marketingGestión y diseño de los instrumentos de comunicacion de marketing
Gestión y diseño de los instrumentos de comunicacion de marketingJordi Miro
 
Simce 2014
Simce 2014Simce 2014
Simce 201415511
 
Salesforce.com Agile Transformation - Agile 2007 Conference
Salesforce.com Agile Transformation - Agile 2007 ConferenceSalesforce.com Agile Transformation - Agile 2007 Conference
Salesforce.com Agile Transformation - Agile 2007 ConferenceSteve Greene
 
Science Interactive Notebook
Science Interactive NotebookScience Interactive Notebook
Science Interactive NotebookEboni DuBose
 
Creating Tribal Ideas for Gen C - By Dan Pankraz for Nokia World '09
Creating Tribal Ideas for Gen C  - By Dan Pankraz for Nokia World '09Creating Tribal Ideas for Gen C  - By Dan Pankraz for Nokia World '09
Creating Tribal Ideas for Gen C - By Dan Pankraz for Nokia World '09guestbcb2a7
 
Gridcomputing
GridcomputingGridcomputing
Gridcomputingpchengi
 
20 Quotes To Turn Your Obstacles Into Opportunities
20 Quotes To Turn Your Obstacles Into Opportunities20 Quotes To Turn Your Obstacles Into Opportunities
20 Quotes To Turn Your Obstacles Into OpportunitiesRyan Holiday
 
Middleware Basics
Middleware BasicsMiddleware Basics
Middleware BasicsVarun Arora
 
Business Model Knowledge Fair, Amsterdam
Business Model Knowledge Fair, AmsterdamBusiness Model Knowledge Fair, Amsterdam
Business Model Knowledge Fair, AmsterdamAlexander Osterwalder
 
Trabajo punto de equilibrio
Trabajo punto  de  equilibrioTrabajo punto  de  equilibrio
Trabajo punto de equilibriohvenegas03
 
The Performance Appraisal System.
The Performance Appraisal System.The Performance Appraisal System.
The Performance Appraisal System.Binit Das
 
Ejercicios del razonamiento cuantitativo
Ejercicios del razonamiento cuantitativoEjercicios del razonamiento cuantitativo
Ejercicios del razonamiento cuantitativoClaudio Paguay
 

Andere mochten auch (20)

Practical Object Oriented Models In Sql
Practical Object Oriented Models In SqlPractical Object Oriented Models In Sql
Practical Object Oriented Models In Sql
 
Uv08 Dcii Tema 3 DiseñO Y Gestion Servicios
Uv08 Dcii Tema 3 DiseñO Y Gestion ServiciosUv08 Dcii Tema 3 DiseñO Y Gestion Servicios
Uv08 Dcii Tema 3 DiseñO Y Gestion Servicios
 
Ati flash cards 09, medications affecting fluid, electrolytes, minerals, and ...
Ati flash cards 09, medications affecting fluid, electrolytes, minerals, and ...Ati flash cards 09, medications affecting fluid, electrolytes, minerals, and ...
Ati flash cards 09, medications affecting fluid, electrolytes, minerals, and ...
 
Analisis Grafico 3 Ejercicios
Analisis Grafico 3 EjerciciosAnalisis Grafico 3 Ejercicios
Analisis Grafico 3 Ejercicios
 
Gestión y diseño de los instrumentos de comunicacion de marketing
Gestión y diseño de los instrumentos de comunicacion de marketingGestión y diseño de los instrumentos de comunicacion de marketing
Gestión y diseño de los instrumentos de comunicacion de marketing
 
Simce 2014
Simce 2014Simce 2014
Simce 2014
 
Salesforce.com Agile Transformation - Agile 2007 Conference
Salesforce.com Agile Transformation - Agile 2007 ConferenceSalesforce.com Agile Transformation - Agile 2007 Conference
Salesforce.com Agile Transformation - Agile 2007 Conference
 
Science Interactive Notebook
Science Interactive NotebookScience Interactive Notebook
Science Interactive Notebook
 
Sports MKT 2011 - Daniel Sá
Sports MKT 2011 - Daniel SáSports MKT 2011 - Daniel Sá
Sports MKT 2011 - Daniel Sá
 
Creating Tribal Ideas for Gen C - By Dan Pankraz for Nokia World '09
Creating Tribal Ideas for Gen C  - By Dan Pankraz for Nokia World '09Creating Tribal Ideas for Gen C  - By Dan Pankraz for Nokia World '09
Creating Tribal Ideas for Gen C - By Dan Pankraz for Nokia World '09
 
Gridcomputing
GridcomputingGridcomputing
Gridcomputing
 
Your JavaScript Library
Your JavaScript LibraryYour JavaScript Library
Your JavaScript Library
 
THE THEORY OF PUBLIC ADMINISTRATION
THE THEORY OF PUBLIC ADMINISTRATIONTHE THEORY OF PUBLIC ADMINISTRATION
THE THEORY OF PUBLIC ADMINISTRATION
 
20 Quotes To Turn Your Obstacles Into Opportunities
20 Quotes To Turn Your Obstacles Into Opportunities20 Quotes To Turn Your Obstacles Into Opportunities
20 Quotes To Turn Your Obstacles Into Opportunities
 
Middleware Basics
Middleware BasicsMiddleware Basics
Middleware Basics
 
Business Model Knowledge Fair, Amsterdam
Business Model Knowledge Fair, AmsterdamBusiness Model Knowledge Fair, Amsterdam
Business Model Knowledge Fair, Amsterdam
 
Trabajo punto de equilibrio
Trabajo punto  de  equilibrioTrabajo punto  de  equilibrio
Trabajo punto de equilibrio
 
C language ppt
C language pptC language ppt
C language ppt
 
The Performance Appraisal System.
The Performance Appraisal System.The Performance Appraisal System.
The Performance Appraisal System.
 
Ejercicios del razonamiento cuantitativo
Ejercicios del razonamiento cuantitativoEjercicios del razonamiento cuantitativo
Ejercicios del razonamiento cuantitativo
 

Ähnlich wie EdChang - Parallel Algorithms For Mining Large Scale Data

2010 09-17-張志威老師-confucius & “its” intelligent disciples
2010 09-17-張志威老師-confucius & “its” intelligent disciples2010 09-17-張志威老師-confucius & “its” intelligent disciples
2010 09-17-張志威老師-confucius & “its” intelligent disciplesnccuscience
 
Implicit Feedback Recommendation via Implicit-to-Explicit Ordinal Logistic Re...
Implicit Feedback Recommendation via Implicit-to-Explicit Ordinal Logistic Re...Implicit Feedback Recommendation via Implicit-to-Explicit Ordinal Logistic Re...
Implicit Feedback Recommendation via Implicit-to-Explicit Ordinal Logistic Re...Denis Parra Santander
 
Machine Learning: Learning with data
Machine Learning: Learning with dataMachine Learning: Learning with data
Machine Learning: Learning with dataONE Talks
 
One talk Machine Learning
One talk Machine LearningOne talk Machine Learning
One talk Machine LearningONE Talks
 
From Learning Networks to Digital Ecosystems
From Learning Networks to Digital EcosystemsFrom Learning Networks to Digital Ecosystems
From Learning Networks to Digital EcosystemsHendrik Drachsler
 
Tag And Tag Based Recommender
Tag And Tag Based RecommenderTag And Tag Based Recommender
Tag And Tag Based Recommendergu wendong
 
Recommendation Engine Demystified
Recommendation Engine DemystifiedRecommendation Engine Demystified
Recommendation Engine DemystifiedDKALab
 
Online Social Networks
Online Social NetworksOnline Social Networks
Online Social Networksoiisdp2010
 

Ähnlich wie EdChang - Parallel Algorithms For Mining Large Scale Data (10)

2010 09-17-張志威老師-confucius & “its” intelligent disciples
2010 09-17-張志威老師-confucius & “its” intelligent disciples2010 09-17-張志威老師-confucius & “its” intelligent disciples
2010 09-17-張志威老師-confucius & “its” intelligent disciples
 
Implicit Feedback Recommendation via Implicit-to-Explicit Ordinal Logistic Re...
Implicit Feedback Recommendation via Implicit-to-Explicit Ordinal Logistic Re...Implicit Feedback Recommendation via Implicit-to-Explicit Ordinal Logistic Re...
Implicit Feedback Recommendation via Implicit-to-Explicit Ordinal Logistic Re...
 
Machine Learning: Learning with data
Machine Learning: Learning with dataMachine Learning: Learning with data
Machine Learning: Learning with data
 
One talk Machine Learning
One talk Machine LearningOne talk Machine Learning
One talk Machine Learning
 
From Learning Networks to Digital Ecosystems
From Learning Networks to Digital EcosystemsFrom Learning Networks to Digital Ecosystems
From Learning Networks to Digital Ecosystems
 
Tag And Tag Based Recommender
Tag And Tag Based RecommenderTag And Tag Based Recommender
Tag And Tag Based Recommender
 
Recommendation Engine Demystified
Recommendation Engine DemystifiedRecommendation Engine Demystified
Recommendation Engine Demystified
 
Recommendation Engine Demystified
Recommendation Engine DemystifiedRecommendation Engine Demystified
Recommendation Engine Demystified
 
Online Social Networks
Online Social NetworksOnline Social Networks
Online Social Networks
 
Promise notes
Promise notesPromise notes
Promise notes
 

Mehr von gu wendong

宜信大数据金融云-CSDN
宜信大数据金融云-CSDN宜信大数据金融云-CSDN
宜信大数据金融云-CSDNgu wendong
 
Social Recommendation
Social RecommendationSocial Recommendation
Social Recommendationgu wendong
 
Pharos Social Map Based Recommendation For Content Centric Social Websites
Pharos Social Map Based Recommendation For Content Centric Social WebsitesPharos Social Map Based Recommendation For Content Centric Social Websites
Pharos Social Map Based Recommendation For Content Centric Social Websitesgu wendong
 
Resys China 创刊号
Resys China 创刊号Resys China 创刊号
Resys China 创刊号gu wendong
 
孙超 - Recommendation Algorithm as a product
孙超 - Recommendation Algorithm as a product孙超 - Recommendation Algorithm as a product
孙超 - Recommendation Algorithm as a productgu wendong
 
王守崑 - 豆瓣在推荐领域的实践和思考
王守崑 - 豆瓣在推荐领域的实践和思考王守崑 - 豆瓣在推荐领域的实践和思考
王守崑 - 豆瓣在推荐领域的实践和思考gu wendong
 
From Search To Discover by Wanght
From Search To Discover by WanghtFrom Search To Discover by Wanght
From Search To Discover by Wanghtgu wendong
 
Understanding Rbm by WangYuanTao
Understanding Rbm by WangYuanTaoUnderstanding Rbm by WangYuanTao
Understanding Rbm by WangYuanTaogu wendong
 
Netflix Prize by Xlvector
Netflix Prize by XlvectorNetflix Prize by Xlvector
Netflix Prize by Xlvectorgu wendong
 

Mehr von gu wendong (9)

宜信大数据金融云-CSDN
宜信大数据金融云-CSDN宜信大数据金融云-CSDN
宜信大数据金融云-CSDN
 
Social Recommendation
Social RecommendationSocial Recommendation
Social Recommendation
 
Pharos Social Map Based Recommendation For Content Centric Social Websites
Pharos Social Map Based Recommendation For Content Centric Social WebsitesPharos Social Map Based Recommendation For Content Centric Social Websites
Pharos Social Map Based Recommendation For Content Centric Social Websites
 
Resys China 创刊号
Resys China 创刊号Resys China 创刊号
Resys China 创刊号
 
孙超 - Recommendation Algorithm as a product
孙超 - Recommendation Algorithm as a product孙超 - Recommendation Algorithm as a product
孙超 - Recommendation Algorithm as a product
 
王守崑 - 豆瓣在推荐领域的实践和思考
王守崑 - 豆瓣在推荐领域的实践和思考王守崑 - 豆瓣在推荐领域的实践和思考
王守崑 - 豆瓣在推荐领域的实践和思考
 
From Search To Discover by Wanght
From Search To Discover by WanghtFrom Search To Discover by Wanght
From Search To Discover by Wanght
 
Understanding Rbm by WangYuanTao
Understanding Rbm by WangYuanTaoUnderstanding Rbm by WangYuanTao
Understanding Rbm by WangYuanTao
 
Netflix Prize by Xlvector
Netflix Prize by XlvectorNetflix Prize by Xlvector
Netflix Prize by Xlvector
 

EdChang - Parallel Algorithms For Mining Large Scale Data

  • 1. Parallel Algorithms for Mining  g g Large‐ Large‐Scale Data Edward Chang Ed d Ch Director, Google Research, Beijing http://infolab.stanford.edu/ echang/ http://infolab stanford edu/~echang/ Ed Chang EMMDS 2009 1
  • 2. Ed Chang EMMDS 2009 2
  • 3. Ed Chang EMMDS 2009 3
  • 4. Story of the Photos Story of the Photos • The photos on the first pages are statues of two “bookkeepers” displayed at the British Museum. One bookkeeper keeps a list of good people, and the other a list of bad. (Who is who, can you tell ? ☺) • When I first visited the museum in 1998, I did not take a photo of them to conserve films. During this trip (June 2009), capacity is no longer a concern or constraint. In fact, one can see kid grandmas all t ki photos, a l t of photos. T kids, d ll taking h t lot f h t Ten years apart, data volume explodes. • Data complexity also grows. • So, can these ancient “bookkeepers” still classify good from bad? Is their capacity scalable to the data dimension and data quantity ? Ed Chang EMMDS 2009 4
  • 5. Ed Chang EMMDS 2009 5
  • 6. Outline • Motivating Applications  – Q&A System – Social Ads • Key Subroutines – Frequent Itemset Mining [ACM RS 08] – Latent Dirichlet Allocation  [WWW 09, AAIM 09] L Di i hl All i – Clustering [ECML 08] – UserRank [Google TR 09] UserRank [Google TR 09] – Support Vector Machines [NIPS 07] • Distributed Computing Perspectives Distributed Computing Perspectives Ed Chang EMMDS 2009 6
  • 11. Query: Must‐see attractions at Beijing Hotel ads Ed Chang EMMDS 2009 11
  • 13. Q&A Yao Ming Q&A Yao Ming Ed Chang EMMDS 2009 13
  • 14. Who is Yao Ming Yao Ming Related Q&As Y Mi R l t d Q&A Ed Chang EMMDS 2009 14
  • 15. Application: Google Q&A (Confucius) launched in China, Thailand, and Russia launched in China Thailand and Russia Differentiated Content DC U S C Search U Community CCS/Q&A Discussion/Question Trigger a discussion/question session during search Provide labels to a Q (semi-automatically) Given a Q, find similar Qs and their As (automatically) Evaluate quality of an answer, relevance and originality Evaluate E l t user credentials i a t i sensitive way d ti l in topic iti Route questions to experts Provide most relevant, high-quality content for Search to index Ed Chang EMMDS 2009 15
  • 16. Q&A Uses Machine Learning Label suggestion using ML algorithms. • Real Time topic-to- topic (T2T) recommendation using ML algorithms. • Gi Gives out related hi h t l t d high quality links to previous questions before human answer appear. Ed Chang EMMDS 2009 16
  • 17. Collaborative Filtering Books/Photos 1 1 1 1 1 1 1 1 1 Based on membership so far, 1 1 1 and memberships of others 1 1 1 1 1 Users 1 1 Predict further membership 1 1 1 1 1 1 1 1 1 1 1 1 1 Ed Chang EMMDS 2009 17
  • 18. Collaborative Filtering Books/Photos Based on partially ? 1 1 1 ? observed matrix 1 ? 1 1 1 1 1 1 ? 1 1 1 1 ? 1 1 Predict unobserved entries 1 ? Users ? 1 1 U 1 1 ? I. Will user i enjoy photo j? 1 1 ? 1 ? 1 II. Will user i be interesting to user j? 1 ? 1 III. Will photo i be related to photo j? 1 1 1 1 1 ? Ed Chang EMMDS 2009 18
  • 19. FIM based Recommendation FIM‐based Recommendation Ed Chang EMMDS 2009 19
  • 20. FIM Preliminaries • Observation 1: If an item A is not frequent, any pattern contains  A won’t be frequent [R. Agrawal] q use a threshold to eliminate infrequent items  {A}   {A,B} • Observation 2: Patterns containing A are subsets of (or found  from) transactions containing A [J. Han] from) transactions containing A [J. Han] divide‐and‐conquer: select transactions containing A to form a  conditional database (CDB), and find patterns containing A from  that conditional database {A, B}, {A, C}, {A}   CDB A {A, B}, {B, C}  CDB B • Observation 3: To prevent the same pattern from being found in Observation 3: To prevent the same pattern from being found in  multiple CDBs, all itemsets are sorted by the same manner (e.g.,  by descending support) Ed Chang EMMDS 2009 20
  • 21. Preprocessing • According to  f: 4 Observation 1, we  c: 4 count the support of  h f a: 3 facdgimp b: 3 fcamp each item by  m: 3 scanning the  abcflmo p: 3 fcabm database, and  database and eliminate those  bfhjo o: 2 fb infrequent items  d: 1 from the  from the bcksp e: 1 cbp transactions. g: 1 • According to  afcelpmn h: 1 fcamp i: 1 Observation 3, we  Observation 3, we k: 1 sort items in each  l:1 transaction by the  n: 1 order of descending g support value. Ed Chang EMMDS 2009 21
  • 22. Parallel Projection • According to Observation 2, we construct CDB of item  A; then from this CDB, we find those patterns  ; , p containing A • How to construct the CDB of A?  – If t If a transaction contains A, this transaction should appear in  ti t i A thi t ti h ld i the CDB of A – Given a transaction {B, A, C}, it should appear in the CDB of  A, the CDB of B, and the CDB of C A the CDB of B and the CDB of C • Dedup solution: using the order of items: – sort {B,A,C} by the order of items  <A,B,C> – Put <> into the CDB of A – Put <A> into the CDB of B – Put <A,B> into the CDB of C Put <A B> into the CDB of C Ed Chang EMMDS 2009 22
  • 23. Example of Projection fcamp p: { f c a m / f c a m / c b } fcabm m: { f c a / f c a / f c a b } fb b: { f c a / f / c } cbp a: { f c / f c / f c } fcamp c: { f / f / f } Example of Projection of a database into CDBs. Left: sorted transactions in order of f, c, a, b, m, p Right: conditional databases of frequent items Ed Chang EMMDS 2009 23
  • 24. Example of Projection p j fcamp p: { f c a m / f c a m / c b } fcabm m: { f c a / f c a / f c a b } fb b: { f c a / f / c } cbp a: { f c / f c / f c } fcamp c: { f / f / f } Example of Projection of a database into CDBs. Left: sorted transactions; Right: conditional databases of frequent items Ed Chang EMMDS 2009 24
  • 25. Example of Projection Example of Projection fcamp p: { f c a m / f c a m / c b } fcabm m: { f c a / f c a / f c a b } fb b: { f c a / f / c } cbp a: { f c / f c / f c } fcamp c: { f / f / f } Example of Projection of a database into CDBs. Left: sorted transactions; Right: conditional databases of frequent items Ed Chang EMMDS 2009 25
  • 26. Recursive Projections [H. Li, et al. ACM RS 08] Recursive Projections [H Li et al ACM RS 08] MapReduce MapReduce MapReduce • Recursive projection form  Iteration 1 Iteration 2 Iteration 3 a search tree a search tree • Each node is a CDB b D|ab c D|abc • Using the order of items to  a D|a prevent duplicated CDBs. t d li t d CDB c D|ac • Each level of breath‐first  search of the tree can be  D c D|bc done by a MapReduce  done by a MapReduce b D|b iteration. • Once a CDB is small  enough to fit in memory,  enough to fit in memory c D|c we can invoke FP‐growth  to mine this CDB, and no  more growth of the sub more growth of the sub‐ tree. Ed Chang EMMDS 2009 26
• 27. Projection Using MapReduce
Map inputs are the raw transactions (key=""); each is sorted with infrequent items eliminated, and the mapper emits (item: conditional transaction) pairs:
– facdgimp → fcamp → p: fcam; m: fca; a: fc; c: f
– abcflmo → fcabm → m: fcab; b: fca; a: fc; c: f
– bfhjo → fb → b: f
– bcksp → cbp → p: cb; b: c
– afcelpmn → fcamp → p: fcam; m: fca; a: fc; c: f
Reduce inputs are the conditional databases; reduce outputs are the patterns and their supports:
– p: { f c a m / f c a m / c b } → p:3, pc:3
– m: { f c a / f c a / f c a b } → m:3, mf:3, mc:3, ma:3, mfc:3, mfa:3, mca:3, mfca:3
– b: { f c a / f / c } → b:3
– a: { f c / f c / f c } → a:3, af:3, ac:3, afc:3
– c: { f / f / f } → c:3, cf:3
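A single-machine simulation of one such iteration (illustrative only; the brute-force subset counting in the reducer stands in for the in-memory FP-growth step, and all names are mine):

    from collections import defaultdict
    from itertools import combinations

    ORDER = "fcabmp"  # frequent items in descending support, as above

    def map_transaction(txn):
        # Emit (item, prefix) pairs: each item receives the prefix of the
        # sorted transaction that precedes it (the dedup rule of slide 22).
        t = sorted((x for x in txn if x in ORDER), key=ORDER.index)
        for i, item in enumerate(t):
            yield item, tuple(t[:i])

    def reduce_item(item, prefixes, min_support=3):
        # Mine the CDB of `item` in memory: enumerate the subsets of each
        # conditional transaction and count them (FP-growth stand-in).
        counts = defaultdict(int)
        for prefix in prefixes:
            for k in range(len(prefix) + 1):
                for subset in combinations(prefix, k):
                    counts[subset + (item,)] += 1
        return {p: c for p, c in counts.items() if c >= min_support}

    cdbs = defaultdict(list)  # the shuffle phase, simulated locally
    for txn in ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]:
        for item, prefix in map_transaction(txn):
            cdbs[item].append(prefix)
    for item, prefixes in sorted(cdbs.items()):
        print(item, reduce_item(item, prefixes))
    # e.g. p -> {('p',): 3, ('c', 'p'): 3}, matching the reduce outputs above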
• 28. Collaborative Filtering
(figure: a sparse individuals-by-indicators/diseases matrix of 1s)
– Based on the memberships observed so far, and the memberships of other individuals, predict further memberships.
• 29. Latent Semantic Analysis
– Search over users/music/ads/questions-and-answers: construct a latent topic layer between queries and items for better semantic matching.
– Example: the query "iPhone crack" should match the document "How to install apps on Apple mobile phones?" rather than an apple-pie recipe ("recipe pastry for a 9 inch double crust, 9 apples, 1/2 cup brown sugar"), even though the latter also mentions apples; queries and documents are compared through their topic distributions rather than raw keywords.
– Other collaborative filtering applications: recommend users, music, and ads to users, and answers to questions.
– Task: predict the ? entries (the light-blue cells) of the user-item matrix.
• 30. Documents, Topics, Words
– A document consists of a number of topics: a document is a probabilistic mixture of topics.
– Each topic generates a number of words: a topic is a distribution over words.
– The probability of the ith word in a document is given below.
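The formula on the original slide was an image; in the standard LDA notation (e.g., Griffiths and Steyvers [21]) it reads

    P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j) \, P(z_i = j)

where T is the number of topics and z_i is the topic assignment of the ith word.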
• 31. Latent Dirichlet Allocation [M. Jordan 04]
– α: uniform Dirichlet prior on the per-document topic distribution θ (corpus-level parameter)
– β: uniform Dirichlet prior on the per-topic word distribution φ (corpus-level parameter)
– θd: the topic distribution of document d (document level)
– zdj: the topic of the jth word in d; wdj: the specific word (word level)
(plate diagram: α → θ → z → w ← φ ← β, with Nm words per document, K topics, M documents)
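For reference, the joint distribution this plate diagram encodes is, in the notation of [3] (reconstructed from the standard model, not transcribed from the slide):

    p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{j=1}^{N_m} p(z_j \mid \theta) \, p(w_j \mid z_j, \beta)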
• 32. LDA Gibbs Sampling: Inputs and Outputs
– Inputs: (1) training data: documents as bags of words; (2) parameter: the number of topics.
– Outputs: (1) by-product: a co-occurrence matrix of topics and documents; (2) model parameters: a co-occurrence matrix of topics and words.
• 33. Parallel Gibbs Sampling [AAIM 09]
– Inputs and outputs are the same as for the serial sampler on the previous slide: documents as bags of words and the number of topics in; the topic-document and topic-word co-occurrence matrices out.
– In the parallel version, the documents are partitioned across machines; each machine samples topics for its local documents, and the topic-word counts are synchronized across machines after each iteration (cf. [12]).
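A serial collapsed Gibbs sampler fitting this input/output description (a minimal NumPy sketch under my own choice of hyperparameters, not the deck's implementation; the parallel version would partition docs across machines and merge nw after each sweep):

    import numpy as np

    def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=100, seed=0):
        # docs: list of documents, each a list of word ids in [0, V).
        # Returns the doc-topic and topic-word co-occurrence matrices
        # named as the two outputs on this slide.
        rng = np.random.default_rng(seed)
        nd = np.zeros((len(docs), K))   # doc-topic counts
        nw = np.zeros((K, V))           # topic-word counts
        nk = np.zeros(K)                # total words per topic
        z = [rng.integers(K, size=len(d)) for d in docs]
        for d, doc in enumerate(docs):  # initialize the counts
            for j, w in enumerate(doc):
                nd[d, z[d][j]] += 1; nw[z[d][j], w] += 1; nk[z[d][j]] += 1
        for _ in range(iters):
            for d, doc in enumerate(docs):
                for j, w in enumerate(doc):
                    k = z[d][j]         # remove the current assignment
                    nd[d, k] -= 1; nw[k, w] -= 1; nk[k] -= 1
                    p = (nd[d] + alpha) * (nw[:, w] + beta) / (nk + V * beta)
                    k = rng.choice(K, p=p / p.sum())
                    z[d][j] = k         # record the new assignment
                    nd[d, k] += 1; nw[k, w] += 1; nk[k] += 1
        return nd, nw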
• 36. Outline
– Motivating Applications: Q&A System; Social Ads
– Key Subroutines: Frequent Itemset Mining [ACM RS 08]; Latent Dirichlet Allocation [WWW 09, AAIM 09]; Clustering [ECML 08]; UserRank [Google TR 09]; Support Vector Machines [NIPS 07]
– Distributed Computing Perspectives
• 39. Open Social APIs
1. Profiles (who I am)
2. Stuff (what I have)
3. Friends (who I know)
4. Activities (what I do)
• 40. Activities, Recommendations, Applications (figure)
• 45. Task: Targeting Ads at SNS Users (figure: matching users to ads)
• 46. Mining Profiles, Friends & Activities for Relevance
• 47. Consider Also User Influence
– Advertisers target users who are both relevant and influential.
– SNS influence analysis: centrality, credential, activeness, etc.
• 48. Spectral Clustering [A. Ng, M. Jordan]
– An important subroutine in machine learning and data mining tasks: it exploits the pairwise similarity of data instances, and is more effective than traditional methods such as k-means.
– Key steps (see the sketch below): construct the pairwise similarity matrix (e.g., using geodesic distance); compute the Laplacian matrix; apply eigendecomposition; run k-means on the rows of the eigenvector matrix.
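The four key steps map directly onto standard library calls (a single-machine sketch with my own choice of libraries; the ECML 08 system parallelizes these steps rather than calling SciPy/scikit-learn):

    import numpy as np
    from scipy.sparse.linalg import eigsh   # ARPACK, the solver named on slide 50
    from sklearn.cluster import KMeans

    def spectral_clustering(S, k):
        # S: symmetric pairwise similarity matrix with positive degrees.
        d = np.asarray(S.sum(axis=1)).ravel()
        Dis = np.diag(1.0 / np.sqrt(d))
        L = Dis @ S @ Dis                    # normalized affinity D^-1/2 S D^-1/2
        _, U = eigsh(L, k=k, which="LA")     # top-k eigenvectors
        U = U / np.linalg.norm(U, axis=1, keepdims=True)  # row-normalize
        return KMeans(n_clusters=k, n_init=10).fit_predict(U)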
• 49. Scalability Problem
– Quadratic computation and storage for the n x n dense similarity matrix.
– Approximation methods: sparsification (t-NN, ξ-neighborhood, …), Nystrom sampling (random, greedy, …), and others.
• 50. Sparsification vs. Sampling
– Sparsification: construct the dense similarity matrix S; sparsify S; compute the Laplacian matrix L; apply ARPACK to L; use k-means to cluster the rows of the eigenvector matrix V into k groups.
– Sampling (Nystrom): randomly sample l points, l << n; construct the dense similarity matrix [A B] between the l samples and all n points; normalize A and B into Laplacian form; compute R = A + A^{-1/2} B B^T A^{-1/2} and its eigendecomposition R = U Σ U^T; run k-means.
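A sketch of the sampling column's core computation (following the one-shot Nystrom formula of Fowlkes et al., the "Jitendra M., PAMI" work cited on slide 54; the Laplacian normalization of A and B is omitted here for brevity, and all names are mine):

    import numpy as np

    def nystrom_embedding(A, B, k):
        # A: l x l similarities among the sampled points;
        # B: l x (n-l) similarities to the remaining points.
        evals, evecs = np.linalg.eigh(A)
        A_isqrt = evecs @ np.diag(1.0 / np.sqrt(np.maximum(evals, 1e-12))) @ evecs.T
        R = A + A_isqrt @ B @ B.T @ A_isqrt      # R = A + A^-1/2 B B^T A^-1/2
        eval_R, U = np.linalg.eigh(R)            # R = U Sigma U^T
        idx = np.argsort(eval_R)[::-1][:k]       # keep the top-k eigenpairs
        V = (np.vstack([A, B.T]) @ A_isqrt @ U[:, idx]
             / np.sqrt(np.maximum(eval_R[idx], 1e-12)))
        return V                                 # cluster the rows of V with k-means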
• 51. Empirical Study [Song, et al., ECML 08]
– Dataset: RCV1 (Reuters Corpus Volume I), a filtered collection of 193,944 documents in 103 categories.
– Photo set: PicasaWeb, 637,137 photos.
– Experiments: clustering quality vs. computational time, measuring the agreement between categories (CAT) and clusters (CLS) by Normalized Mutual Information (NMI); scalability.
• 52. NMI Comparison (on RCV1)
(figure: NMI results for the Nystrom method and for sparse matrix approximation)
• 53. Speedup Test on 637,137 Photos
– K = 1000 clusters.
– Achieves linear speedup when using up to 32 machines; beyond that, speedup is sub-linear because of increasing communication and synchronization time.
• 54. Sparsification vs. Sampling
                            Sparsification                       Nystrom (random sampling)
Information used            full n x n similarity scores         none
Pre-processing complexity   O(n^2) worst case (the bottleneck);  O(nl), l << n
                            easily parallelizable
Effectiveness               good                                 not bad (Jitendra M., PAMI)
• 55. Outline
– Motivating Applications: Q&A System; Social Ads
– Key Subroutines: Frequent Itemset Mining [ACM RS 08]; Latent Dirichlet Allocation [WWW 09, AAIM 09]; Clustering [ECML 08]; UserRank [Google TR 09]; Support Vector Machines [NIPS 07]
– Distributed Computing Perspectives
• 56. Matrix Factorization Alternatives (figure: alternatives ranging from exact to approximate)
• 57. PSVM [E. Chang, et al., NIPS 07]
– Column-based incomplete Cholesky factorization (ICF): slower than row-based on a single machine, but parallelizable across multiple machines.
– Changing the interior-point method (IPM) computation order to achieve parallelization.
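A serial sketch of the ICF step (illustration only; PSVM's contribution is distributing the rows of H across machines and reordering the IPM computations, which a single-machine sketch cannot show):

    import numpy as np

    def icf(K, p, tol=1e-6):
        # Approximate the n x n kernel matrix K by H @ H.T with rank <= p.
        n = K.shape[0]
        H = np.zeros((n, p))
        d = np.diag(K).copy()             # diagonal of the residual
        for j in range(p):
            i = int(np.argmax(d))         # pivot: largest residual entry
            if d[i] < tol:                # early exit once well approximated
                return H[:, :j]
            H[:, j] = (K[:, i] - H @ H[i]) / np.sqrt(d[i])
            d -= H[:, j] ** 2
        return H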
• 58. Speedup (figure)
• 59. Overheads (figure)
• 60. Comparison Between Parallel Computing Frameworks
Criterion                                    MapReduce              Project B              MPI
GFS/IO and task-rescheduling overhead
between iterations                           Yes                    No (+1)                No (+1)
Flexibility of computation model             AllReduce only (+0.5)  AllReduce only (+0.5)  Flexible (+1)
Efficient AllReduce                          Yes (+1)               Yes (+1)               Yes (+1)
Recover from faults between iterations       Yes (+1)               Yes (+1)               Apps (+1)
Recover from faults within each iteration    Yes (+1)               Yes (+1)               Apps (+1)
Final score for scalable machine learning    3.5                    4.5                    5
• 61. Concluding Remarks
– Applications demand scalable solutions.
– We have parallelized key subroutines for mining massive data sets: Spectral Clustering [ECML 08], Frequent Itemset Mining [ACM RS 08], PLSA [KDD 08], LDA [WWW 09], UserRank, and Support Vector Machines [NIPS 07].
– Relevant papers: http://infolab.stanford.edu/~echang/
– Open-source PSVM and PLDA: http://code.google.com/p/psvm/ and http://code.google.com/p/plda/
• 62. Collaborators
– Prof. Chih-Jen Lin (NTU)
– Hongjie Bai (Google)
– Wen-Yen Chen (UCSB)
– Jon Chu (MIT)
– Haoyuan Li (PKU)
– Yangqiu Song (Tsinghua)
– Matt Stanton (CMU)
– Yi Wang (Google)
– Dong Zhang (Google)
– Kaihua Zhu (Google)
• 63. References
[1] Alexa internet. http://www.alexa.com/.
[2] D. M. Blei and M. I. Jordan. Variational methods for the Dirichlet process. In Proc. of the 21st International Conference on Machine Learning, pages 373-380, 2004.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[4] D. Cohn and H. Chang. Learning to probabilistically identify authoritative documents. In Proc. of the Seventeenth International Conference on Machine Learning, pages 167-174, 2000.
[5] D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems 13, pages 430-436, 2001.
[6] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391-407, 1990.
[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38, 1977.
[8] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721-741, 1984.
[9] T. Hofmann. Probabilistic latent semantic indexing. In Proc. of Uncertainty in Artificial Intelligence, pages 289-296, 1999.
[10] T. Hofmann. Latent semantic models for collaborative filtering. ACM Transactions on Information Systems, 22(1):89-115, 2004.
[11] A. McCallum, A. Corrada-Emmanuel, and X. Wang. The author-recipient-topic model for topic and role discovery in social networks: Experiments with Enron and academic email. Technical report, Computer Science, University of Massachusetts Amherst, 2004.
[12] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 20, 2007.
[13] M. Ramoni, P. Sebastiani, and P. Cohen. Bayesian clustering by dynamics. Machine Learning, 47(1):91-121, 2002.
• 64. References (cont.)
[14] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proc. of the 24th International Conference on Machine Learning, pages 791-798, 2007.
[15] E. Spertus, M. Sahami, and O. Buyukkokten. Evaluating similarity measures: a large-scale study in the Orkut social network. In Proc. of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 678-684, 2005.
[16] M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths. Probabilistic author-topic models for information discovery. In Proc. of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 306-315, 2004.
[17] A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research (JMLR), 3:583-617, 2002.
[18] T. Zhang and V. S. Iyengar. Recommender systems using linear classifiers. Journal of Machine Learning Research, 2:313-334, 2002.
[19] S. Zhong and J. Ghosh. Generative model-based clustering of documents: a comparative study. Knowledge and Information Systems (KAIS), 8:374-384, 2005.
[20] L. Adamic and E. Adar. How to search a social network. 2004.
[21] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, pages 5228-5235, 2004.
[22] H. Kautz, B. Selman, and M. Shah. ReferralWeb: Combining social networks and collaborative filtering. Communications of the ACM, 3:63-65, 1997.
[23] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD Rec., 22:207-216, 1993.
[24] J. S. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 1998.
[25] M. Deshpande and G. Karypis. Item-based top-N recommendation algorithms. ACM Trans. Inf. Syst., 22(1):143-177, 2004.
• 65. References (cont.)
[26] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International World Wide Web Conference, pages 285-295, 2001.
[27] M. Deshpande and G. Karypis. Item-based top-N recommendation algorithms. ACM Trans. Inf. Syst., 22(1):143-177, 2004.
[28] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International World Wide Web Conference, pages 285-295, 2001.
[29] M. Brand. Fast online SVD revisions for lightweight recommender systems. In Proceedings of the 3rd SIAM International Conference on Data Mining, 2003.
[30] D. Goldberg, D. Nichols, B. Oki, and D. Terry. Using collaborative filtering to weave an information tapestry. Communications of the ACM 35, 12:61-70, 1992.
[31] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. GroupLens: An open architecture for collaborative filtering of netnews. In Proceedings of the ACM Conference on Computer Supported Cooperative Work, pages 175-186, 1994.
[32] J. Konstan, et al. GroupLens: Applying collaborative filtering to Usenet news. Communications of the ACM 40, 3:77-87, 1997.
[33] U. Shardanand and P. Maes. Social information filtering: Algorithms for automating "word of mouth". In Proceedings of ACM CHI, 1:210-217, 1995.
[34] G. Linden, B. Smith, and J. York. Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Computing, 7:76-80, 2003.
[35] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning Journal 42, 1:177-196, 2001.
[36] T. Hofmann and J. Puzicha. Latent class models for collaborative filtering. In Proceedings of the International Joint Conference on Artificial Intelligence, 1999.
[37] http://www.cs.carleton.edu/cs_comps/0607/recommend/recommender/collaborativefiltering.html
[38] E. Y. Chang, et al. Parallelizing Support Vector Machines on Distributed Computers. NIPS, 2007.
[39] W.-Y. Chen, D. Zhang, and E. Y. Chang. Combinational Collaborative Filtering for personalized community recommendation. ACM KDD, 2008.
[40] Y. Sun, W.-Y. Chen, H. Bai, C.-J. Lin, and E. Y. Chang. Parallel Spectral Clustering. ECML, 2008.