Modeling Science

           David M. Blei

     Department of Computer Science
          Princeton University


           April 17, 2008




Joint work with John Lafferty (CMU)

             D. Blei   Modeling Science   1 / 53
Modeling Science
           Science, August 13, 1886                    Science, June 24, 1994

  water       acid          disease       evolution      rna          disease
  milk        water         blood         evolutionary   mrna         host
  food        solution      cholera       species        site         bacteria
  dry         experiments   bacteria      organisms      splicing     diseases
  fed         liquid        found         biology        rnas         new
  cows        chemical      bacillus      phylogenetic   nuclear      bacterial
  houses      action        experiments   life           sequence     resistance
  butter      copper        organisms     origin         introns      control
  fat         crystals      bacilli       diversity      messenger    strains
  found       carbon        cases         groups         cleavage     infectious
  made        alcohol       diseases      molecular      two          malaria
  contained   made          germs         animals        splice       parasites
  wells       obtained      animal        two            sequences    parasite
  produced    substances    koch          new            polymerase   tuberculosis
  poisonous   nitrogen      made          living         intron       health

  • On-line archives of document collections require better
    organization. Manual organization is not practical.
  • Our goal: To discover the hidden thematic structure with
    hierarchical probabilistic models called topic models.
  • Use this structure for browsing, search, and similarity.

  • Our data are the pages of Science from 1880–2002 (from JSTOR).
  • No reliable punctuation, meta-data, or references.
  • Note: this is just a subset of JSTOR’s archive.




Discover topics from a corpus

  “Genetics”     “Evolution”     “Disease”       “Computers”
  human          evolution       disease         computer
  genome         evolutionary    host            models
  dna            species         bacteria        information
  genetic        organisms       diseases        data
  genes          life            resistance      computers
  sequence       origin          bacterial       system
  gene           biology         new             network
  molecular      groups          strains         systems
  sequencing     phylogenetic    control         model
  map            living          infectious      parallel
  information    diversity       malaria         methods
  genetics       group           parasite        networks
  mapping        new             parasites       software
  project        two             united          new
  sequences      common          tuberculosis    simulations




Model the evolution of topics over time

  [Figure: two panels, "Theoretical Physics" and "Neuroscience", each tracing
  selected words across the years 1880–2000: FORCE, RELATIVITY, and LASER in
  the first panel; OXYGEN, NERVE, and NEURON in the second.]




Model connections between topics
  [Figure: a graph of topics, with each node listing a topic's top words.
  Word groups visible in the figure include:
    • neurons, stimulus, motor, visual, cortical
    • brain, memory, subjects, left, task
    • synapses, ltp, glutamate, synaptic, neurons
    • virus, hiv, aids, infection, viruses
    • mice, antigen, t cells, antigens, immune response
    • stars, astronomers, universe, galaxies, galaxy
    • earthquake, earthquakes, fault, images, data
    • co2, carbon, carbon dioxide, methane, water
    • climate, ocean, ice, changes, climate change]



Outline



1 Introduction


2 Latent Dirichlet allocation


3 Dynamic topic models


4 Correlated topic models




Probabilistic modeling



  1   Treat data as observations that arise from a generative
      probabilistic process that includes hidden variables
        • For documents, the hidden variables reflect the thematic
          structure of the collection.
  2   Infer the hidden structure using posterior inference
         • What are the topics that describe this collection?
  3   Situate new data into the estimated model.
         • How does this query or new document fit into the estimated
           topic structure?




Intuition behind LDA




       Simple intuition: Documents exhibit multiple topics.

Generative process




  • Cast these intuitions into a generative probabilistic process
  • Each document is a random mixture of corpus-wide topics
  • Each word is drawn from one of those topics






  • In reality, we only observe the documents
  • Our goal is to infer the underlying topic structure
      • What are the topics?
      • How are the documents divided according to those topics?

Graphical models (Aside)


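  [Figure: left, a node Y with an edge to each of X1, X2, ..., XN; right, the
  equivalent plate notation, Y with an edge to Xn inside a plate replicated
  N times.]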

  • Nodes are random variables
  • Edges denote possible dependence
  • Observed variables are shaded
  • Plates denote replicated structure





  • Structure of the graph defines the pattern of conditional
    dependence between the ensemble of random variables
  • E.g., this graph corresponds to

      \[ p(y, x_1, \ldots, x_N) = p(y) \prod_{n=1}^{N} p(x_n \mid y) \]
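
As a small illustration of this factorization, the sketch below evaluates the joint probability for a binary Y with three conditionally independent binary observations. The probability tables and the function name `joint` are made up for the example and do not come from the talk.

```python
from itertools import product

# Hypothetical tables for a binary Y and binary X_n (illustrative values only).
p_y = {0: 0.6, 1: 0.4}
p_x_given_y = {
    0: {0: 0.9, 1: 0.1},  # p(x | y = 0)
    1: {0: 0.2, 1: 0.8},  # p(x | y = 1)
}

def joint(y, xs):
    """p(y, x_1, ..., x_N) = p(y) * prod_n p(x_n | y)."""
    prob = p_y[y]
    for x in xs:
        prob *= p_x_given_y[y][x]
    return prob

# Summing the joint over every configuration should give 1.
N = 3
total = sum(joint(y, xs) for y in p_y for xs in product([0, 1], repeat=N))
print(round(total, 10))
```

Because each X_n depends only on Y, the table for p(x | y) is shared across all N observations, which is what the plate expresses.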


Latent Dirichlet allocation

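  [Figure: the LDA graphical model, α → θd → Zd,n → Wd,n ← βk ← η, where
     α      Dirichlet parameter
     θd     per-document topic proportions
     Zd,n   per-word topic assignment
     Wd,n   observed word
     βk     topics
     η      topic hyperparameter
  Plates: N words within each of D documents; K topics.]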


            Each piece of the structure is a random variable.



  1   Draw each topic βk ∼ Dir(η), for k ∈ {1, ..., K}.
  2   For each document:
        1 Draw topic proportions θd ∼ Dir(α).
        2 For each word:
            1 Draw Zd,n ∼ Mult(θd).
            2 Draw Wd,n ∼ Mult(β_{z_{d,n}}).
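
The steps above can be sketched as a short simulation. This is a minimal sketch under stated assumptions: symmetric Dirichlet parameters, a fixed document length, and integer term ids, none of which the slides specify. The names `dirichlet` and `generate_corpus` are hypothetical.

```python
import random

def dirichlet(alpha, rng):
    """Sample from Dir(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(num_docs, doc_len, K, V, alpha, eta, seed=0):
    rng = random.Random(seed)
    # Step 1: draw each topic, a distribution over the V vocabulary terms.
    beta = [dirichlet([eta] * V, rng) for _ in range(K)]
    thetas, docs = [], []
    for _ in range(num_docs):
        # Step 2.1: draw this document's topic proportions.
        theta = dirichlet([alpha] * K, rng)
        words = []
        for _ in range(doc_len):  # assumed fixed length per document
            # Step 2.2.1: draw a topic assignment from theta.
            z = rng.choices(range(K), weights=theta)[0]
            # Step 2.2.2: draw the word from the assigned topic.
            words.append(rng.choices(range(V), weights=beta[z])[0])
        thetas.append(theta)
        docs.append(words)
    return beta, thetas, docs

beta, thetas, docs = generate_corpus(
    num_docs=5, doc_len=20, K=3, V=10, alpha=0.5, eta=0.5
)
```

Inference runs this process in reverse: only `docs` is observed, and `beta`, `thetas`, and the per-word assignments must be inferred.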




  • From a collection of documents, infer
      • Per-word topic assignment zd,n
      • Per-document topic proportions θd
      • Per-corpus topic distributions βk
  • Use posterior expectations to perform the task at hand, e.g.,
    information retrieval, document similarity, etc.
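
One way posterior expectations could feed a similarity task is to compare documents by a distance between their expected topic proportions. The Hellinger distance used here is an assumption for illustration; the slides do not name a specific measure, and the proportions below are invented.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    return math.sqrt(
        0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    )

# Hypothetical posterior expected topic proportions for three documents.
theta_hat = {
    "doc_a": [0.70, 0.20, 0.10],
    "doc_b": [0.65, 0.25, 0.10],
    "doc_c": [0.05, 0.15, 0.80],
}

# Rank the other documents by distance to the query document.
query = "doc_a"
ranked = sorted(
    (name for name in theta_hat if name != query),
    key=lambda name: hellinger(theta_hat[query], theta_hat[name]),
)
print(ranked)  # → ['doc_b', 'doc_c']
```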



  • Computing the posterior is intractable:

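      \[
        \frac{p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta_{1:K})}
             {\int_{\theta} p(\theta \mid \alpha) \prod_{n=1}^{N} \sum_{z_n=1}^{K} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta_{1:K}) \, d\theta}
      \]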

  • Several approximation techniques have been developed.




  • Mean field variational methods (Blei et al., 2001, 2003)
  • Expectation propagation (Minka and Lafferty, 2002)
  • Collapsed Gibbs sampling (Griffiths and Steyvers, 2002)
  • Collapsed variational inference (Teh et al., 2006)
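
To make one of these methods concrete, here is a minimal sketch of collapsed Gibbs sampling for LDA, assuming symmetric Dirichlet priors and documents given as lists of integer term ids. The function name, hyperparameter values, and toy corpus are illustrative and not taken from the talk or the cited papers.

```python
import random

def collapsed_gibbs(docs, K, V, alpha=0.1, eta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of term ids in [0, V)."""
    rng = random.Random(seed)
    n_dk = [[0] * K for _ in docs]      # words in document d assigned to topic k
    n_kw = [[0] * V for _ in range(K)]  # times term w is assigned to topic k
    n_k = [0] * K                       # total words assigned to topic k

    def update(d, k, w, delta):
        n_dk[d][k] += delta
        n_kw[k][w] += delta
        n_k[k] += delta

    # Random initial topic assignments.
    z = [[rng.randrange(K) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            update(d, z[d][n], w, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                update(d, z[d][n], w, -1)  # hold out the current word
                # Conditional p(z = k | all other assignments), up to a constant.
                weights = [
                    (n_dk[d][k] + alpha) * (n_kw[k][w] + eta) / (n_k[k] + V * eta)
                    for k in range(K)
                ]
                z[d][n] = rng.choices(range(K), weights=weights)[0]
                update(d, z[d][n], w, +1)

    # Point estimates of theta and beta from the final sample.
    theta = [
        [(n_dk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
        for d, doc in enumerate(docs)
    ]
    beta = [
        [(n_kw[k][w] + eta) / (n_k[k] + V * eta) for w in range(V)]
        for k in range(K)
    ]
    return theta, beta, z

# Toy corpus: documents 0-1 use terms 0-2, documents 2-3 use terms 3-5.
docs = [[0, 1, 2, 0, 1], [1, 2, 0, 2], [3, 4, 5, 3], [4, 5, 3, 5, 4]]
theta, beta, z = collapsed_gibbs(docs, K=2, V=6)
```

The sampler integrates out θ and β and resamples only the assignments, so each step needs just the three count tables.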




Example inference




  • Data: The OCR’ed collection of Science from 1990–2000
      • 17K documents
      • 11M words
      • 20K unique terms (stop words and rare words removed)
  • Model: 100-topic LDA model using variational inference.






  [Figure: bar plot of probability (axis from 0.0 to 0.4) against topics
  (axis ticks from 1 to 96).]




Example topics

  “Genetics”     “Evolution”     “Disease”       “Computers”
  human          evolution       disease         computer
  genome         evolutionary    host            models
  dna            species         bacteria        information
  genetic        organisms       diseases        data
  genes          life            resistance      computers
  sequence       origin          bacterial       system
  gene           biology         new             network
  molecular      groups          strains         systems
  sequencing     phylogenetic    control         model
  map            living          infectious      parallel
  information    diversity       malaria         methods
  genetics       group           parasite        networks
  mapping        new             parasites       software
  project        two             united          new
  sequences      common          tuberculosis    simulations




LDA summary




  • LDA is a powerful model for
      • Visualizing the hidden thematic structure in large corpora
      • Generalizing to new data by fitting it into that structure
  • LDA is a mixed membership model (Erosheva, 2004) that builds
    on the work of Deerwester et al. (1990) and Hofmann (1999).
  • For document collections and other grouped data, a mixed membership
    model can be more appropriate than a simple finite mixture.




LDA summary


  • Modular: It can be embedded in more complicated models.
      •   E.g., syntax and semantics; authorship; word sense
  • General: The data generating distribution can be changed.
      •   E.g., images; social networks; population genetics data
  • Variational inference is fast; it lets us analyze large data sets.



  • See Blei et al., 2003 for details and a quantitative comparison.
  • Code to play with LDA is freely available on my website,
    http://www.cs.princeton.edu/~blei.




LDA summary




  • But, LDA makes certain assumptions about the data.
  • When are they appropriate?




Outline



1 Introduction


2 Latent Dirichlet allocation


3 Dynamic topic models


4 Correlated topic models




LDA and exchangeability




  [Figure: LDA graphical model: α → θd → Zd,n → Wd,n ← βk ← η, with plates N, D, and K]


  • LDA assumes that documents are exchangeable.
  • I.e., their joint probability is invariant to permutation.
  • This is too restrictive.
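
Written out, the exchangeability assumption says the following. This is a sketch in notation assumed here rather than taken from the slide: $w_d$ denotes the words of document $d$ and $\pi$ is any permutation of the document indices $1, \ldots, D$.

```latex
\[
  p(w_1, \ldots, w_D) = p(w_{\pi(1)}, \ldots, w_{\pi(D)})
  \quad \text{for every permutation } \pi \text{ of } \{1, \ldots, D\}.
\]
```

In particular, reordering the documents by publication date, or shuffling them, leaves the probability of the collection unchanged, so the model cannot use the order in which documents were written.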




Documents are not exchangeable
          "Instantaneous Photography" (1890)
          "Infrared Reflectance in Leaf-Sitting Neotropical Frogs" (1977)




  • Documents about the same topic are not exchangeable.
  • Topics evolve over time.

Dynamic topic model




  • Divide corpus into sequential slices (e.g., by year).
  • Assume each slice’s documents are exchangeable.
      •   Drawn from an LDA model.
  • Allow topic distributions to evolve from slice to slice.
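
The first step above, dividing the corpus into sequential slices, can be sketched in a few lines of Python. The input format (a list of `(year, tokens)` pairs) and the function name `slice_by_year` are assumptions made for illustration only.

```python
from collections import defaultdict

def slice_by_year(documents):
    """Group (year, tokens) pairs into time slices ordered by year."""
    slices = defaultdict(list)
    for year, tokens in documents:
        slices[year].append(tokens)
    # Return the slices in chronological order; order within a slice is irrelevant
    # to the model because documents inside a slice are treated as exchangeable.
    return [(year, slices[year]) for year in sorted(slices)]

corpus = [
    (1890, ["electric", "power"]),
    (1880, ["steam", "engine"]),
    (1890, ["motor"]),
]
slices = slice_by_year(corpus)
```

Coarser slices (for example by decade) would follow the same pattern with a different grouping key.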




Dynamic topic models

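  [Figure: dynamic topic model. Each time slice has its own LDA model
   (α → θd → Zd,n → Wd,n, with plates N and D); the topics are chained
   across slices, βk,1 → βk,2 → ... → βk,T, with plate K.]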

Modeling evolving topics


           βk,1 → βk,2 → ... → βk,T

  • Use a logistic normal distribution to model evolving topics
    (Aitchison, 1980)
  • A state-space model on the natural parameter of the topic
    multinomial (West and Harrison, 1997)

         \[ \beta_{t,k} \mid \beta_{t-1,k} \sim \mathcal{N}(\beta_{t-1,k}, \sigma^2 I) \]
         \[ p(w \mid \beta_{t,k}) = \exp\{\beta_{t,k} - \log(1 + \sum_{v=1}^{V-1} \exp\{\beta_{t,k,v}\})\} \]
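
The two equations above can be simulated directly. The sketch below draws one topic's natural parameters as a Gaussian random walk and maps each state to word probabilities. It rests on assumptions not stated on the slide: the exponent is read component-wise (word w uses its own coordinate of the natural parameter), the V-th word is the reference with natural parameter 0, and the initial state is drawn from a standard normal. The names `to_word_probs` and `simulate_topic` are hypothetical.

```python
import math
import random

def to_word_probs(beta):
    """Map V-1 natural parameters to V word probabilities.

    Uses p(w) = exp{beta_w - log(1 + sum_v exp{beta_v})}; the last word is
    the reference word, whose natural parameter is fixed at 0.
    """
    log_norm = math.log(1.0 + sum(math.exp(b) for b in beta))
    return [math.exp(b - log_norm) for b in beta] + [math.exp(-log_norm)]

def simulate_topic(V, T, sigma, seed=0):
    """Simulate one topic over T slices under the Gaussian random walk."""
    rng = random.Random(seed)
    beta = [rng.gauss(0.0, 1.0) for _ in range(V - 1)]  # assumed initial state
    path = [to_word_probs(beta)]
    for _ in range(T - 1):
        # beta_t | beta_{t-1} ~ N(beta_{t-1}, sigma^2 I)
        beta = [b + rng.gauss(0.0, sigma) for b in beta]
        path.append(to_word_probs(beta))
    return path

topic_over_time = simulate_topic(V=5, T=4, sigma=0.1)
```

A small `sigma` keeps consecutive slices of the topic close to each other, which is the smoothness the state-space model is meant to encode.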




Posterior inference



  • Our goal is to compute the posterior distribution,

         \[ p(\beta_{1:T,1:K}, \theta_{1:T,1:D}, z_{1:T,1:D} \mid w_{1:T,1:D}). \]

  • Exact inference is intractable
      •   Per-document mixed-membership model
      •   Non-conjugacy between p(w | βt,k ) and p(βt,k )
  • MCMC is not practical for the amount of data.
  • Solution: Variational inference




Science data

       TECHVIEW: DNA SEQUENCING

       Sequencing the Genome, Fast

       James C. Mullikin and Amanda A. McMurray

       Genome sequencing projects reveal the genetic makeup of an organism
       by reading off the sequence of the DNA bases, which encodes all of the
       information necessary for the life of the organism. The base sequence
       contains four nucleotides -- adenine, thymidine, guanosine, and
       cytosine -- which are linked together into long double-helical chains.
       Over the last two decades, automated DNA sequencers have made the
       process of obtaining the base-by-base sequence of DNA...




  • Analyze JSTOR’s entire collection from Science (1880-2002)
  • Restrict to 30K terms that occur more than ten times
  • The data are 76M words in 130K documents
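
The vocabulary restriction above can be sketched as a simple count threshold. This is an illustration only: the slide does not say how the counting was done (for example, whether other filters were applied first), and the function name `restrict_vocabulary` and its `min_count` parameter are assumptions.

```python
from collections import Counter

def restrict_vocabulary(docs, min_count=10):
    """Keep only terms that occur more than `min_count` times in the corpus."""
    counts = Counter(token for doc in docs for token in doc)
    vocab = {term for term, c in counts.items() if c > min_count}
    # Drop out-of-vocabulary tokens from every document.
    filtered = [[t for t in doc if t in vocab] for doc in docs]
    return filtered, sorted(vocab)

docs = [["dna"] * 8 + ["rare"], ["dna"] * 5 + ["genome"] * 11]
filtered, vocab = restrict_vocabulary(docs)
```

Here `vocab` keeps "dna" (13 occurrences) and "genome" (11) and drops "rare" (1).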


Analyzing a document

      [Figure: original article (left) and its topic proportions (right)]




Analyzing a document

      Original article             Most likely words from top topics


                                   sequence      devices       data
                                   genome        device        information
                                   genes         materials     network
                                   sequences     current       web
                                   human         high          computer
                                   gene          gate          language
                                   dna           light         networks
                                   sequencing    silicon       time
                                   chromosome    material      software
                                   regions       technology    system
                                   analysis      electrical    words
                                   data          fiber         algorithm
                                   genomic       power         number
                                   number        based         internet




Analyzing a topic

  1880          1890          1900          1910          1920          1930          1940
  electric      electric      apparatus     air           apparatus     tube          air
  machine       power         steam         water         tube          apparatus     tube
  power         company       power         engineering   air           glass         apparatus
  engine        steam         engine        apparatus     pressure      air           glass
  steam         electrical    engineering   room          water         mercury       laboratory
  two           machine       water         laboratory    glass         laboratory    rubber
  machines      two           construction  engineer      gas           pressure      pressure
  iron          system        engineer      made          made          made          small
  battery       motor         room          gas           laboratory    gas           mercury
  wire          engine        feet          tube          mercury       small         gas

  1950          1960          1970          1980          1990          2000
  tube          tube          air           high          materials     devices
  apparatus     system        heat          power         high          device
  glass         temperature   power         design        power         materials
  air           air           system        heat          current       current
  chamber       heat          temperature   system        applications  gate
  instrument    chamber       chamber       systems       technology    high
  small         power         high          devices       devices       light
  laboratory    high          flow          instruments   design        silicon
  pressure      instrument    tube          control       device        material
  rubber        control       design        large         heat          technology




Visualizing trends within a topic

  [Figure: two panels tracing individual words within a topic over time (x-axis, 1880 to 2000).
   "Theoretical Physics": FORCE, LASER, RELATIVITY.
   "Neuroscience": OXYGEN, NERVE, NEURON.]




Time-corrected document similarity


  • Consider the expected Hellinger distance between the topic
    proportions of two documents,
         \[ d_{ij} = E\left[ \sum_{k=1}^{K} \left( \sqrt{\theta_{i,k}} - \sqrt{\theta_{j,k}} \right)^2 \,\middle|\, w_i, w_j \right] \]

  • Uses the latent structure to define similarity
  • Time has been factored out because the topics associated with the
    components are different from year to year.
  • Similarity based only on topic proportions
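
The distance above can be sketched as follows. The first function evaluates the summed squared difference of square roots for two fixed topic-proportion vectors; the second approximates the expectation by averaging over paired samples of the two documents' proportions. Where those samples come from (for example, draws from an approximate posterior) is an assumption left open here, and the function names are hypothetical.

```python
import math

def hellinger_sq(theta_i, theta_j):
    """Sum over topics of (sqrt(theta_i_k) - sqrt(theta_j_k))^2."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(theta_i, theta_j))

def expected_distance(samples_i, samples_j):
    """Monte Carlo estimate: average the distance over paired samples."""
    pairs = list(zip(samples_i, samples_j))
    return sum(hellinger_sq(a, b) for a, b in pairs) / len(pairs)

same = hellinger_sq([0.7, 0.2, 0.1], [0.7, 0.2, 0.1])   # identical proportions
disjoint = hellinger_sq([1.0, 0.0], [0.0, 1.0])         # no shared topics
```

Identical proportions give 0 and proportions with no shared topics give 2, the largest value this quantity can take. Because only topic indices are compared, an 1880 article and a 1976 article can be close even though the words of their shared topic have changed.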




Time-corrected document similarity

               The Brain of the Orang (1880)




Time-corrected document similarity

     Representation of the Visual Field on the Medial Wall of
       Occipital-Parietal Cortex in the Owl Monkey (1976)




Browser of Science




Quantitative comparison




  • Compute the probability of each year’s documents conditional on
    all previous years’ documents,

         \[ p(w_t \mid w_1, \ldots, w_{t-1}) \]

  • Compare exchangeable and dynamic topic models
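
Predictive probabilities of this kind are usually reported per word so that years with different amounts of text are comparable. A minimal sketch, assuming each held-out document's log probability under the fitted model is already available (how to compute it is model-specific and not shown); the function name `per_word_nll` is hypothetical.

```python
def per_word_nll(doc_log_probs, doc_lengths):
    """Negative log likelihood of held-out documents divided by their total word count."""
    return -sum(doc_log_probs) / sum(doc_lengths)

# Two held-out documents with 100 and 150 words.
score = per_word_nll([-700.0, -1300.0], [100, 150])
```

Lower values mean the model assigned higher probability to the held-out year.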




Quantitative comparison


  [Figure: per-word negative log likelihood (y-axis, ticks at 10 to 25) by year
   (x-axis, 1920 to 2000) for LDA and DTM]


Outline



1 Introduction


2 Latent Dirichlet allocation


3 Dynamic topic models


4 Correlated topic models




The hidden assumptions of the Dirichlet distribution




  • The Dirichlet is an exponential family distribution on the simplex,
    positive vectors that sum to one.
  • However, the near independence of components makes it a poor
    choice for modeling topic proportions.
  • An article about fossil fuels is more likely to also be about
    geology than about genetics.
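
The "near independence" can be made concrete with the standard covariance of the Dirichlet. The notation is assumed here, not taken from the slide: $\theta \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_K)$ and $\alpha_0 = \sum_k \alpha_k$.

```latex
\[
  \mathrm{Cov}(\theta_i, \theta_j)
    = -\frac{\alpha_i \alpha_j}{\alpha_0^2 (\alpha_0 + 1)}
  \qquad (i \neq j).
\]
```

Every pair of components is negatively correlated, and only because the vector must sum to one. No setting of the parameters can make two topics, such as fossil fuels and geology, tend to appear together.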

The logistic normal distribution




  • The logistic normal is a distribution on the simplex that can
    model dependence between components.
  • The natural parameters of the multinomial are drawn from a
    multivariate Gaussian distribution.
         \[ X \sim \mathcal{N}_{K-1}(\mu, \Sigma) \]
         \[ \theta_i = \exp\{x_i - \log(1 + \sum_{j=1}^{K-1} \exp\{x_j\})\} \]
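
A draw from the logistic normal can be sketched with the standard library alone. The assumptions are labeled here and in the comments: the K-th component is the reference with natural parameter 0, the covariance matrix is small and positive definite, and the helper names (`cholesky`, `sample_logistic_normal`) and the example values of `mu` and `Sigma` are made up for illustration.

```python
import math
import random

def cholesky(S):
    """Lower-triangular L with L L^T = S, for a small positive-definite matrix."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(S[i][i] - s)
            else:
                L[i][j] = (S[i][j] - s) / L[j][j]
    return L

def sample_logistic_normal(mu, Sigma, rng):
    """Draw x ~ N(mu, Sigma), then map it to K proportions on the simplex."""
    n = len(mu)
    L = cholesky(Sigma)
    e = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x = [mu[i] + sum(L[i][k] * e[k] for k in range(i + 1)) for i in range(n)]
    # theta_i = exp{x_i - log(1 + sum_j exp{x_j})}; the last component is the reference.
    log_norm = math.log(1.0 + sum(math.exp(v) for v in x))
    return [math.exp(v - log_norm) for v in x] + [math.exp(-log_norm)]

rng = random.Random(0)
mu = [0.0, 0.0]                      # illustrative values, not from the slides
Sigma = [[1.0, 0.8], [0.8, 1.0]]     # positive covariance between components 1 and 2
theta = sample_logistic_normal(mu, Sigma, rng)
```

With a positive off-diagonal entry in `Sigma`, the first two components tend to be large or small together, which is the kind of dependence the Dirichlet cannot express.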
Correlated topic model (CTM)




  [Figure: CTM graphical model: (µ, Σ) → ηd → Zd,n → Wd,n ← βk, with plates N, D, and K]

  • Draw topic proportions from a logistic normal, where topic
    occurrences can exhibit correlation.
  • Use for:
      • Providing a “map” of topics and how they are related
      • Better prediction via correlated topics



  [Figure: map of topics, each shown by its most likely words, for example
   "stars, astronomers, universe, galaxies, galaxy",
   "virus, hiv, aids, infection, viruses", and
   "ozone, atmospheric, measurements, stratosphere, concentrations"]




Summary


 • Topic models provide useful descriptive statistics for analyzing
   and understanding the latent structure of large text collections.
 • Probabilistic graphical models are a useful way to express
   assumptions about the hidden structure of complicated data.
 • Variational methods allow us to perform posterior inference to
   automatically infer that structure from large data sets.
 • Current research
     • Choosing the number of topics
     • Continuous time dynamic topic models
     • Topic models for prediction
     • Inferring the impact of a document




“We should seek out unfamiliar summaries of observational material,
and establish their useful properties... And still more novelty can
come from finding, and evading, still deeper lying constraints.”
(John Tukey, The Future of Data Analysis, 1962)




Supervised topic models (with Jon McAuliffe)



  • Most topic models are unsupervised. They are fit by maximizing
    the likelihood of a collection of documents.
  • Consider documents paired with response variables.
    For example:
      • Movie reviews paired with a number of stars
      • Web pages paired with a number of “diggs”
  • We develop supervised topic models, models of documents and
    responses that are fit to find topics predictive of the response.




Supervised LDA



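  [Figure: graphical model for sLDA with nodes α, θ_d, Z_{d,n}, W_{d,n}, β_k, Y_d and parameters η, σ²; plates over the N words, the D documents, and the K topics.]

  1   Draw topic proportions $\theta \mid \alpha \sim \mathrm{Dir}(\alpha)$.
  2   For each word
        1 Draw topic assignment $z_n \mid \theta \sim \mathrm{Mult}(\theta)$.
        2 Draw word $w_n \mid z_n, \beta_{1:K} \sim \mathrm{Mult}(\beta_{z_n})$.
  3   Draw response variable $y \mid z_{1:N}, \eta, \sigma^2 \sim \mathcal{N}(\eta^\top \bar{z}, \sigma^2)$, where
      $\bar{z} = (1/N) \sum_{n=1}^{N} z_n$.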
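
The three generative steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the toy values of α, β, η and σ², and the use of the standard-library `random` module are all assumptions made for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_slda_document(alpha, beta, eta, sigma2, n_words, rng):
    """Sample one document and its response from the sLDA generative process."""
    num_topics = len(alpha)
    vocab = range(len(beta[0]))
    # Step 1: topic proportions theta ~ Dir(alpha).
    theta = sample_dirichlet(alpha, rng)
    # Step 2.1: one topic assignment z_n ~ Mult(theta) per word.
    z = rng.choices(range(num_topics), weights=theta, k=n_words)
    # Step 2.2: each word w_n ~ Mult(beta[z_n]).
    words = [rng.choices(vocab, weights=beta[k])[0] for k in z]
    # Step 3: response y ~ N(eta . z_bar, sigma2), z_bar = empirical topic frequencies.
    z_bar = [z.count(k) / n_words for k in range(num_topics)]
    mean = sum(e * zb for e, zb in zip(eta, z_bar))
    y = rng.gauss(mean, sigma2 ** 0.5)
    return words, z, z_bar, y

# Hypothetical toy setting: 3 topics over a 20-word vocabulary.
rng = random.Random(0)
alpha = [0.5, 0.5, 0.5]
beta = [sample_dirichlet([1.0] * 20, rng) for _ in range(3)]
eta = [-2.0, 0.0, 2.0]
words, z, z_bar, y = generate_slda_document(alpha, beta, eta, 0.25, 50, rng)
```

Note that the response depends on the document only through the realized topic frequencies `z_bar`, not through θ directly.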
Comments


 • SLDA is used as follows.
     • Fit coefficients and topics from a collection of
       document-response pairs.
     • Use the fitted model to predict the responses of previously
       unseen documents,

              $E[Y \mid w_{1:N}, \alpha, \beta_{1:K}, \eta, \sigma^2] = \eta^\top E[\bar{Z} \mid w_{1:N}, \alpha, \beta_{1:K}]$.

 • The process enforces that the document is generated first,
   followed by the response. The response is generated from the
   particular topics that were realized in generating the document.
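
The prediction rule is a dot product between the fitted coefficients and the expected topic frequencies of the new document. The sketch below assumes that the expectation of Z̄ is approximated by averaging per-word topic responsibilities `phi` (one row per word, one column per topic), as a variational approximation would supply; `phi`, `predict_response`, and the numbers are hypothetical and only illustrate the arithmetic.

```python
def predict_response(phi, eta):
    """Return eta . E[z_bar], with E[z_bar] approximated by the column means of phi."""
    n_words = len(phi)
    num_topics = len(eta)
    # Expected topic frequencies: average responsibility of each topic over the words.
    z_bar_expected = [sum(row[k] for row in phi) / n_words for k in range(num_topics)]
    return sum(e * zb for e, zb in zip(eta, z_bar_expected))

# Hypothetical 3-word document with 2 topics.
phi = [[0.9, 0.1],
       [0.5, 0.5],
       [0.1, 0.9]]
eta = [-1.0, 3.0]
prediction = predict_response(phi, eta)
```

Here the expected topic frequencies are (0.5, 0.5), so the prediction is the average of the two coefficients.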




Example: Movie reviews
  [Figure: top words of the 10 fitted topics, arranged along an axis with ticks at −30, −20, −10, 0, 10, 20; each topic is marked by a point on the axis.]
      • least, problem, unfortunately, supposed, worse, flat, dull
      • bad, guys, watchable, its, not, one, movie
      • more, has, than, films, director, will, characters
      • awful, featuring, routine, dry, offered, charlie, paris
      • his, their, character, many, while, performance, between
      • both, motion, simple, perfect, fascinating, power, complex
      • have, like, you, was, just, some, out
      • not, about, movie, all, would, they, its
      • one, from, there, which, who, much, what
      • however, cinematography, screenplay, performances, pictures, effective, picture




  • We fit a 10-topic sLDA model to movie review data (Pang and
    Lee, 2005).
      • The documents are the words of the reviews.
      • The responses are the number of stars associated with
        each review (modeled as continuous).
  • Each component of coefficient vector η is associated with a topic.



Simulations
  [Figure: two rows of plots comparing sLDA and LDA as the number of topics varies.
   Movie corpus: predictive R² (axis 0.0 to 0.5) and per-word held-out log likelihood (axis −6.42 to −6.37), for 5 to 50 topics.
   Digg corpus: predictive R² (axis 0.00 to 0.12) and per-word held-out log likelihood (axis −8.6 to −8.0), for 2 to 30 topics.]
Diversion: Variational inference




  • Let x1:N be observations and z1:M be latent variables
  • Our goal is to compute the posterior distribution

      $$p(z_{1:M} \mid x_{1:N}) = \frac{p(z_{1:M}, x_{1:N})}{\int p(z_{1:M}, x_{1:N}) \, dz_{1:M}}$$

  • For many interesting distributions, the marginal likelihood of the
    observations is difficult to efficiently compute




Variational inference



  • Use Jensen’s inequality to bound the log prob of the
    observations:

      $$\log p(x_{1:N}) \geq E_{q_\nu}[\log p(z_{1:M}, x_{1:N})] - E_{q_\nu}[\log q_\nu(z_{1:M})].$$

  • We have introduced a distribution of the latent variables with free
    variational parameters ν.
  • We optimize those parameters to tighten this bound.
  • This is the same as finding the member of the family qν that is
    closest in KL divergence to p(z1:M | x1:N ).
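
The equivalence between tightening the bound and minimizing the KL divergence can be checked numerically on a tiny discrete model, because the gap between log p(x) and the bound equals KL(q ‖ p(z | x)). The sketch below is only an illustration: the joint probabilities and the candidate q are made-up numbers, and a single discrete latent variable stands in for z_{1:M}.

```python
import math

def elbo(joint, q):
    """E_q[log p(z, x)] - E_q[log q(z)] for one discrete latent z and a fixed x."""
    return sum(qz * (math.log(pz) - math.log(qz))
               for pz, qz in zip(joint, q) if qz > 0)

def kl(q, p):
    """KL(q || p) between two discrete distributions on the same support."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

# Hypothetical values of p(z = k, x) for the observed x, k = 0, 1, 2.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                       # p(x), the marginal likelihood
posterior = [p / evidence for p in joint]   # p(z | x)

q = [0.2, 0.5, 0.3]                         # an arbitrary variational distribution
gap = math.log(evidence) - elbo(joint, q)   # slack in the Jensen bound
```

The slack `gap` matches `kl(q, posterior)`, and the bound is tight when q is the exact posterior.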




Mean-field variational inference


  • Complexity of optimization is determined by factorization of qν
  • In mean field variational inference qν is fully factored

      $$q_\nu(z_{1:M}) = \prod_{m=1}^{M} q_{\nu_m}(z_m).$$

  • The latent variables are independent.
      •   Each is governed by its own variational parameter νm .
  • In the true posterior they can exhibit dependence
    (often, this is what makes exact inference difficult).




MFVI and conditional exponential families



  • Suppose the distribution of each latent variable conditional on
    the observations and other latent variables is in the exponential
    family:

      $$p(z_m \mid z_{-m}, x) = h_m(z_m) \exp\{g_m(z_{-m}, x)^\top z_m - a_m(g_m(z_{-m}, x))\}$$

  • Assume qν is fully factorized and each factor is in the same
    exponential family:

      $$q_{\nu_m}(z_m) = h_m(z_m) \exp\{\nu_m^\top z_m - a_m(\nu_m)\}$$




  • Variational inference is the following coordinate ascent algorithm

      $$\nu_m = E_{q_\nu}[g_m(Z_{-m}, x)]$$

  • Notice the relationship to Gibbs sampling
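
A small worked case makes the update concrete. The sketch below assumes a toy target that is not from the talk: a bivariate Gaussian over (z1, z2) with mean `mu` and precision matrix `prec`. Each conditional p(z_m | z_{-m}) is Gaussian with a natural parameter that is linear in the other variable, so taking its expectation under q amounts to plugging in the other variable's variational mean. A Gibbs sampler would use the same conditional but draw a sample from it instead of taking the expectation.

```python
def cavi_gaussian(mu, prec, n_iters=50):
    """Coordinate ascent for a fully factored q on a 2-D Gaussian target.

    Returns the variational means; each update sets one coordinate to the
    expected conditional mean given the current value of the other.
    """
    m = [0.0, 0.0]  # arbitrary initialization of the variational means
    for _ in range(n_iters):
        # E_q[ mean of p(z1 | z2) ] with z2 replaced by its variational mean.
        m[0] = mu[0] - prec[0][1] / prec[0][0] * (m[1] - mu[1])
        # E_q[ mean of p(z2 | z1) ] with z1 replaced by its variational mean.
        m[1] = mu[1] - prec[1][0] / prec[1][1] * (m[0] - mu[0])
    return m

# Hypothetical target: correlated Gaussian with a positive-definite precision.
mu = [1.0, -2.0]
prec = [[2.0, 0.8],
        [0.8, 1.0]]
means = cavi_gaussian(mu, prec)
```

For this target the iteration recovers the true mean, while the factored q cannot represent the correlation between z1 and z2.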




Variational family for the DTM

  [Figure: for topic k, the chain β_{k,1}, β_{k,2}, ..., β_{k,T}, with a variational observation β̂_{k,t} attached to each β_{k,t}.]


  • Distribution of θ and z is fully-factorized (Blei et al., 2003)
  • Distribution of $\{\beta_{k,1}, \ldots, \beta_{k,T}\}$ is a variational Kalman filter
  • Gaussian state-space model with free observations $\hat{\beta}_{k,t}$.
  • Fit observations such that the corresponding posterior over the
    chain is close to the true posterior.

  • Given a document collection, use coordinate ascent on all the
    variational parameters until the KL converges.
  • This yields a distribution close to the true posterior of interest.
  • Take expectations with respect to the simpler variational distribution.
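
The idea of a Gaussian state-space model with free observations can be illustrated with a scalar forward Kalman filter. This is a minimal sketch under stated assumptions, not the talk's fitting procedure: the state is taken to follow a random walk with variance `state_var`, each pseudo-observation is the state plus Gaussian noise with variance `obs_var`, and those names, the initial moments `m0` and `v0`, and the toy inputs are hypothetical. A full treatment would also need a backward (smoothing) pass and an update of the pseudo-observations themselves.

```python
def kalman_filter(observations, state_var, obs_var, m0=0.0, v0=1.0):
    """Forward pass for a scalar random-walk state observed with Gaussian noise.

    Returns the filtered means and variances of the state at each time step.
    """
    means, variances = [], []
    m, v = m0, v0
    for y in observations:
        predicted_var = v + state_var                  # random-walk prediction step
        gain = predicted_var / (predicted_var + obs_var)
        m = m + gain * (y - m)                         # pull the mean toward the observation
        v = (1.0 - gain) * predicted_var               # the posterior variance shrinks
        means.append(m)
        variances.append(v)
    return means, variances

# Hypothetical pseudo-observations for one coordinate of one topic over 3 time slices.
means, variances = kalman_filter([1.0, 1.0, 1.0], state_var=0.1, obs_var=0.5, m0=1.0)
```

Changing the pseudo-observations changes the resulting Gaussian over the chain, which is what makes them usable as free variational parameters.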




Weitere ähnliche Inhalte

Was ist angesagt?

Pace.indoor air2011
Pace.indoor air2011Pace.indoor air2011
Pace.indoor air2011nrpace
 
Talk by J. Eisen at ASBMB on "Phylogeny driven genomic encyclopedia" project
Talk by J. Eisen at ASBMB on "Phylogeny driven genomic encyclopedia" projectTalk by J. Eisen at ASBMB on "Phylogeny driven genomic encyclopedia" project
Talk by J. Eisen at ASBMB on "Phylogeny driven genomic encyclopedia" projectJonathan Eisen
 
Classification of microorganism
Classification of microorganismClassification of microorganism
Classification of microorganismSujit Kakade
 
Classification of Enterobacteriaceae family
Classification of Enterobacteriaceae familyClassification of Enterobacteriaceae family
Classification of Enterobacteriaceae familyAbhijit Chaudhury
 
Mici 1100 sept_08_lectures_1-5
Mici 1100 sept_08_lectures_1-5Mici 1100 sept_08_lectures_1-5
Mici 1100 sept_08_lectures_1-5Star Reddy
 
Criteria for classification of microbes
Criteria for classification of microbesCriteria for classification of microbes
Criteria for classification of microbesDr. sreeremya S
 
Uncovering the impacts of circumcision on the penis microbiome, Translational...
Uncovering the impacts of circumcision on the penis microbiome, Translational...Uncovering the impacts of circumcision on the penis microbiome, Translational...
Uncovering the impacts of circumcision on the penis microbiome, Translational...Copenhagenomics
 
[Micro] classification of prokaryotes
[Micro] classification of prokaryotes[Micro] classification of prokaryotes
[Micro] classification of prokaryotesMuhammad Ahmad
 
Module 7 bacilli
Module 7   bacilliModule 7   bacilli
Module 7 bacilliEhsan Lee
 
A phylogeny driven genomic encyclopedia of bacteria and archaea
A phylogeny driven genomic encyclopedia of bacteria and archaeaA phylogeny driven genomic encyclopedia of bacteria and archaea
A phylogeny driven genomic encyclopedia of bacteria and archaeaJonathan Eisen
 
Culture independent methods for detection & enumeration of gut microflora
Culture independent methods for detection & enumeration of gut microfloraCulture independent methods for detection & enumeration of gut microflora
Culture independent methods for detection & enumeration of gut microfloraAmna Jalil
 
Bacteriology - Microbiology
Bacteriology - MicrobiologyBacteriology - Microbiology
Bacteriology - MicrobiologyMBBS Help
 
Hépatite B.pdf
Hépatite B.pdfHépatite B.pdf
Hépatite B.pdfodeckmyn
 

Was ist angesagt? (18)

Pace.indoor air2011
Pace.indoor air2011Pace.indoor air2011
Pace.indoor air2011
 
Virology techniques
Virology techniquesVirology techniques
Virology techniques
 
Talk by J. Eisen at ASBMB on "Phylogeny driven genomic encyclopedia" project
Talk by J. Eisen at ASBMB on "Phylogeny driven genomic encyclopedia" projectTalk by J. Eisen at ASBMB on "Phylogeny driven genomic encyclopedia" project
Talk by J. Eisen at ASBMB on "Phylogeny driven genomic encyclopedia" project
 
Classification of microorganism
Classification of microorganismClassification of microorganism
Classification of microorganism
 
Classification of Enterobacteriaceae family
Classification of Enterobacteriaceae familyClassification of Enterobacteriaceae family
Classification of Enterobacteriaceae family
 
Micro organisms
Micro organismsMicro organisms
Micro organisms
 
Mici 1100 sept_08_lectures_1-5
Mici 1100 sept_08_lectures_1-5Mici 1100 sept_08_lectures_1-5
Mici 1100 sept_08_lectures_1-5
 
Criteria for classification of microbes
Criteria for classification of microbesCriteria for classification of microbes
Criteria for classification of microbes
 
Uncovering the impacts of circumcision on the penis microbiome, Translational...
Uncovering the impacts of circumcision on the penis microbiome, Translational...Uncovering the impacts of circumcision on the penis microbiome, Translational...
Uncovering the impacts of circumcision on the penis microbiome, Translational...
 
[Micro] classification of prokaryotes
[Micro] classification of prokaryotes[Micro] classification of prokaryotes
[Micro] classification of prokaryotes
 
Module 7 bacilli
Module 7   bacilliModule 7   bacilli
Module 7 bacilli
 
A phylogeny driven genomic encyclopedia of bacteria and archaea
A phylogeny driven genomic encyclopedia of bacteria and archaeaA phylogeny driven genomic encyclopedia of bacteria and archaea
A phylogeny driven genomic encyclopedia of bacteria and archaea
 
Culture independent methods for detection & enumeration of gut microflora
Culture independent methods for detection & enumeration of gut microfloraCulture independent methods for detection & enumeration of gut microflora
Culture independent methods for detection & enumeration of gut microflora
 
Bacteriology - Microbiology
Bacteriology - MicrobiologyBacteriology - Microbiology
Bacteriology - Microbiology
 
Bacillus1
Bacillus1Bacillus1
Bacillus1
 
Discovery of bacteriophage
Discovery of bacteriophageDiscovery of bacteriophage
Discovery of bacteriophage
 
bacteriophage
bacteriophage bacteriophage
bacteriophage
 
Hépatite B.pdf
Hépatite B.pdfHépatite B.pdf
Hépatite B.pdf
 

Ähnlich wie Modeling science

1a 1. science done
1a 1. science   done1a 1. science   done
1a 1. science donebiobuddy
 
C:\Fakepath\ Start Here Ch01 Lecture
C:\Fakepath\ Start Here Ch01 LectureC:\Fakepath\ Start Here Ch01 Lecture
C:\Fakepath\ Start Here Ch01 LectureDebra Costa-Nino
 
1 introduction to microbiology
1 introduction to microbiology1 introduction to microbiology
1 introduction to microbiologyUmair hanif
 
45 ch48immunity2009
45 ch48immunity200945 ch48immunity2009
45 ch48immunity2009sbarkanic
 
Microbio.pptx
Microbio.pptxMicrobio.pptx
Microbio.pptxrnath286
 
Introduction to Genetics
Introduction to GeneticsIntroduction to Genetics
Introduction to GeneticsCEU
 
Introduction to Genetics
Introduction to GeneticsIntroduction to Genetics
Introduction to GeneticsCEU
 
History of genetics
History of geneticsHistory of genetics
History of geneticsannaaquino21
 
60 ch14dn ahistory2008
60 ch14dn ahistory200860 ch14dn ahistory2008
60 ch14dn ahistory2008sbarkanic
 
Introduction to microbiology 081210 fv
Introduction to microbiology 081210 fvIntroduction to microbiology 081210 fv
Introduction to microbiology 081210 fvMuhammedibrahim48
 
Eisen Talk for MBL Microbial Diversity Course
Eisen Talk for MBL Microbial Diversity CourseEisen Talk for MBL Microbial Diversity Course
Eisen Talk for MBL Microbial Diversity CourseJonathan Eisen
 
67 biotechnology2008 3
67 biotechnology2008 367 biotechnology2008 3
67 biotechnology2008 3sbarkanic
 
Lab Report Of The Experiment Of Conjugation Of E. Coli
Lab Report Of The Experiment Of Conjugation Of E. ColiLab Report Of The Experiment Of Conjugation Of E. Coli
Lab Report Of The Experiment Of Conjugation Of E. ColiRenee Wardowski
 

Ähnlich wie Modeling science (20)

1a 1. science done
1a 1. science   done1a 1. science   done
1a 1. science done
 
C:\Fakepath\ Start Here Ch01 Lecture
C:\Fakepath\ Start Here Ch01 LectureC:\Fakepath\ Start Here Ch01 Lecture
C:\Fakepath\ Start Here Ch01 Lecture
 
1 introduction to microbiology
1 introduction to microbiology1 introduction to microbiology
1 introduction to microbiology
 
45 ch48immunity2009
45 ch48immunity200945 ch48immunity2009
45 ch48immunity2009
 
Microbio.pptx
Microbio.pptxMicrobio.pptx
Microbio.pptx
 
Molecular homology
Molecular homologyMolecular homology
Molecular homology
 
Introduction to Genetics
Introduction to GeneticsIntroduction to Genetics
Introduction to Genetics
 
On The Origin Of Immune System
On The  Origin Of  Immune  SystemOn The  Origin Of  Immune  System
On The Origin Of Immune System
 
Bio presentation
Bio presentationBio presentation
Bio presentation
 
Introduction to Genetics
Introduction to GeneticsIntroduction to Genetics
Introduction to Genetics
 
Nwabr ethics jan 11
Nwabr ethics jan 11Nwabr ethics jan 11
Nwabr ethics jan 11
 
History of genetics
History of geneticsHistory of genetics
History of genetics
 
12.1 notes
12.1 notes12.1 notes
12.1 notes
 
60 ch14dn ahistory2008
60 ch14dn ahistory200860 ch14dn ahistory2008
60 ch14dn ahistory2008
 
Introduction to microbiology 081210 fv
Introduction to microbiology 081210 fvIntroduction to microbiology 081210 fv
Introduction to microbiology 081210 fv
 
Miledna All
Miledna AllMiledna All
Miledna All
 
Eisen Talk for MBL Microbial Diversity Course
Eisen Talk for MBL Microbial Diversity CourseEisen Talk for MBL Microbial Diversity Course
Eisen Talk for MBL Microbial Diversity Course
 
Lecture 1,2
Lecture 1,2Lecture 1,2
Lecture 1,2
 
67 biotechnology2008 3
67 biotechnology2008 367 biotechnology2008 3
67 biotechnology2008 3
 
Lab Report Of The Experiment Of Conjugation Of E. Coli
Lab Report Of The Experiment Of Conjugation Of E. ColiLab Report Of The Experiment Of Conjugation Of E. Coli
Lab Report Of The Experiment Of Conjugation Of E. Coli
 

Modeling science

  • 1. Modeling Science David M. Blei Department of Computer Science Princeton University April 17, 2008 Joint work with John Lafferty (CMU) D. Blei Modeling Science 1 / 53
  • 2. Modeling Science
      [Figure: top words of three topics from each of two issues of Science]
      Science, August 13, 1886:
        • water, milk, food, dry, fed, cows, houses, butter, fat, found, made, contained, wells, produced, poisonous
        • acid, water, solution, experiments, liquid, chemical, action, copper, crystals, carbon, alcohol, made, obtained, substances, nitrogen
        • disease, blood, cholera, bacteria, found, bacillus, experiments, organisms, bacilli, cases, diseases, germs, animal, koch, made
      Science, June 24, 1994:
        • evolution, evolutionary, species, organisms, biology, phylogenetic, life, origin, diversity, groups, molecular, animals, two, new, living
        • rna, mrna, site, splicing, rnas, nuclear, sequence, introns, messenger, cleavage, two, splice, sequences, polymerase, intron
        • disease, host, bacteria, diseases, new, bacterial, resistance, control, strains, infectious, malaria, parasites, parasite, tuberculosis, health
      • On-line archives of document collections require better organization. Manual organization is not practical.
      • Our goal: To discover the hidden thematic structure with hierarchical probabilistic models called topic models.
      • Use this structure for browsing, search, and similarity.
  • 3. Modeling Science
      [Figure: the same topic word lists from Science, August 13, 1886 and Science, June 24, 1994 as on slide 2]
      • Our data are the pages of Science from 1880-2002 (from JSTOR).
      • No reliable punctuation, meta-data, or references.
      • Note: this is just a subset of JSTOR's archive.
  • 4. Discover topics from a corpus
      • "Genetics": human, genome, dna, genetic, genes, sequence, gene, molecular, sequencing, map, information, genetics, mapping, project, sequences
      • "Evolution": evolution, evolutionary, species, organisms, life, origin, biology, groups, phylogenetic, living, diversity, group, new, two, common
      • "Disease": disease, host, bacteria, diseases, resistance, bacterial, new, strains, control, infectious, malaria, parasite, parasites, united, tuberculosis
      • "Computers": computer, models, information, data, computers, system, network, systems, model, parallel, methods, networks, software, new, simulations
  • 5. Model the evolution of topics over time
      [Figure: two panels, "Theoretical Physics" (terms FORCE, LASER, RELATIVITY) and "Neuroscience" (terms OXYGEN, NERVE, NEURON), each plotting the terms over the years 1880-2000]
  • 6. Model connections between topics
      [Figure: graph of topics; each node lists a topic's top words, e.g. "neurons", "galaxies", "earthquake", "climate change"]
  • 7. Outline
      1 Introduction
      2 Latent Dirichlet allocation
      3 Dynamic topic models
      4 Correlated topic models
  • 8. Outline
      1 Introduction
      2 Latent Dirichlet allocation
      3 Dynamic topic models
      4 Correlated topic models
  • 9. Probabilistic modeling
      1 Treat data as observations that arise from a generative probabilistic process that includes hidden variables.
          • For documents, the hidden variables reflect the thematic structure of the collection.
      2 Infer the hidden structure using posterior inference.
          • What are the topics that describe this collection?
      3 Situate new data into the estimated model.
          • How does this query or new document fit into the estimated topic structure?
  • 10. Intuition behind LDA
      Simple intuition: Documents exhibit multiple topics.
  • 11. Generative process
      • Cast these intuitions into a generative probabilistic process.
      • Each document is a random mixture of corpus-wide topics.
      • Each word is drawn from one of those topics.
  • 12. Generative process
      • In reality, we only observe the documents.
      • Our goal is to infer the underlying topic structure.
          • What are the topics?
          • How are the documents divided according to those topics?
  • 13. Graphical models (Aside)
      [Figure: a node Y with child X_n inside a plate of size N, shown as equivalent to Y with children X_1, X_2, ..., X_N]
      • Nodes are random variables.
      • Edges denote possible dependence.
      • Observed variables are shaded.
      • Plates denote replicated structure.
  • 14. Graphical models (Aside)
      [Figure: the same plate diagram of Y and X_1, ..., X_N as on slide 13]
      • The structure of the graph defines the pattern of conditional dependence between the ensemble of random variables.
      • E.g., this graph corresponds to
          $p(y, x_1, \ldots, x_N) = p(y) \prod_{n=1}^{N} p(x_n \mid y)$
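The factorization on this slide can be checked numerically. Below is a minimal sketch, assuming a hypothetical toy model in which y takes two labels and each x_n is binary; the probability tables are illustrative values, not from the talk.

```python
import math

def joint_log_prob(y, xs, log_p_y, log_p_x_given_y):
    """log p(y, x_1..x_N) = log p(y) + sum_n log p(x_n | y)."""
    return log_p_y[y] + sum(log_p_x_given_y[y][x] for x in xs)

# Hypothetical tables: a prior over y and a conditional p(x | y) for binary x.
log_p_y = {"a": math.log(0.5), "b": math.log(0.5)}
log_p_x_given_y = {
    "a": {0: math.log(0.9), 1: math.log(0.1)},
    "b": {0: math.log(0.2), 1: math.log(0.8)},
}

# Joint probability of y = "a" together with observations x = (0, 0, 1).
lp = joint_log_prob("a", [0, 0, 1], log_p_y, log_p_x_given_y)
```

Working in log space keeps the product over n from underflowing when N is large.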
  • 15. Latent Dirichlet allocation
      [Figure: graphical model with the following variables and plates N, D, K]
      • α: Dirichlet parameter
      • θ_d: per-document topic proportions
      • Z_{d,n}: per-word topic assignment
      • W_{d,n}: observed word
      • β_k: topics
      • η: topic hyperparameter
      Each piece of the structure is a random variable.
  • 16. Latent Dirichlet allocation
      1 Draw each topic $\beta_i \sim \mathrm{Dir}(\eta)$, for $i \in \{1, \ldots, K\}$.
      2 For each document:
          1 Draw topic proportions $\theta_d \sim \mathrm{Dir}(\alpha)$.
          2 For each word:
              1 Draw $Z_{d,n} \sim \mathrm{Mult}(\theta_d)$.
              2 Draw $W_{d,n} \sim \mathrm{Mult}(\beta_{z_{d,n}})$.
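The generative steps above can be sketched in a few lines of Python. This is an illustrative simulation only, assuming symmetric Dirichlet parameters and a fixed document length; the function names and the toy sizes are hypothetical and not part of the talk.

```python
import random

def dirichlet(alphas, rng):
    """Sample from Dir(alphas) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(num_docs, doc_len, K, V, alpha, eta, seed=0):
    rng = random.Random(seed)
    # Step 1: draw each topic beta_k ~ Dir(eta), a distribution over V terms.
    beta = [dirichlet([eta] * V, rng) for _ in range(K)]
    docs, thetas = [], []
    for _ in range(num_docs):
        # Step 2.1: draw topic proportions theta_d ~ Dir(alpha).
        theta = dirichlet([alpha] * K, rng)
        words = []
        for _ in range(doc_len):
            # Step 2.2.1: draw a topic assignment z ~ Mult(theta_d).
            z = rng.choices(range(K), weights=theta)[0]
            # Step 2.2.2: draw a word w ~ Mult(beta_z).
            words.append(rng.choices(range(V), weights=beta[z])[0])
        docs.append(words)
        thetas.append(theta)
    return beta, thetas, docs

beta, thetas, docs = generate_corpus(num_docs=5, doc_len=20, K=3, V=10,
                                     alpha=0.5, eta=0.1)
```

Only `docs` would be observed in practice; `beta` and `thetas` are the hidden structure that posterior inference tries to recover.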
  • 17. Latent Dirichlet allocation
      • From a collection of documents, infer
          • Per-word topic assignment $z_{d,n}$
          • Per-document topic proportions $\theta_d$
          • Per-corpus topic distributions $\beta_k$
      • Use posterior expectations to perform the task at hand, e.g., information retrieval, document similarity, etc.
  • 18. Latent Dirichlet allocation
      • Computing the posterior is intractable:
          $p(\theta, z_{1:N} \mid w_{1:N}, \alpha, \beta_{1:K}) = \frac{p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta_{1:K})}{\int_{\theta} p(\theta \mid \alpha) \prod_{n=1}^{N} \sum_{z_n=1}^{K} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta_{1:K})\, d\theta}$
      • Several approximation techniques have been developed.
  • 19. Latent Dirichlet allocation
      • Mean field variational methods (Blei et al., 2001, 2003)
      • Expectation propagation (Minka and Lafferty, 2002)
      • Collapsed Gibbs sampling (Griffiths and Steyvers, 2002)
      • Collapsed variational inference (Teh et al., 2006)
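As one concrete instance of these approximations, here is a minimal sketch of collapsed Gibbs sampling for LDA. It is not the variational method used for the results in this talk; the function name, the toy corpus, and the hyperparameter values are assumptions made for illustration.

```python
import random

def collapsed_gibbs(docs, K, V, alpha, eta, iters=50, seed=0):
    rng = random.Random(seed)
    ndk = [[0] * K for _ in docs]       # topic counts per document
    nkw = [[0] * V for _ in range(K)]   # term counts per topic
    nk = [0] * K                        # total word count per topic
    z = []
    for d, doc in enumerate(docs):      # random initial assignments
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]             # remove the current assignment
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                # p(z = j | everything else) is proportional to
                # (n_dj + alpha) * (n_jw + eta) / (n_j + V * eta)
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + eta) / (nk[j] + V * eta)
                           for j in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][n] = k
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1
    # Point estimates of theta_d and beta_k from the final counts.
    theta = [[(c + alpha) / (len(doc) + K * alpha) for c in ndk[d]]
             for d, doc in enumerate(docs)]
    beta = [[(c + eta) / (nk[k] + V * eta) for c in nkw[k]] for k in range(K)]
    return theta, beta

# Hypothetical toy corpus of term ids over a vocabulary of size 6.
toy_docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 1, 3, 4]]
theta, beta = collapsed_gibbs(toy_docs, K=2, V=6, alpha=0.5, eta=0.1, iters=20)
```

The sampler integrates out θ and β and resamples only the assignments z, which is why the update depends on counts alone.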
  • 20. Example inference
      • Data: The OCR'ed collection of Science from 1990–2000
          • 17K documents
          • 11M words
          • 20K unique terms (stop words and rare words removed)
      • Model: 100-topic LDA model using variational inference.
  • 21. Example inference
      [Figure: bar chart; x-axis: Topics, y-axis: Probability (0.0 to 0.4)]
  • 22. Example topics
      • "Genetics": human, genome, dna, genetic, genes, sequence, gene, molecular, sequencing, map, information, genetics, mapping, project, sequences
      • "Evolution": evolution, evolutionary, species, organisms, life, origin, biology, groups, phylogenetic, living, diversity, group, new, two, common
      • "Disease": disease, host, bacteria, diseases, resistance, bacterial, new, strains, control, infectious, malaria, parasite, parasites, united, tuberculosis
      • "Computers": computer, models, information, data, computers, system, network, systems, model, parallel, methods, networks, software, new, simulations
  • 23. LDA summary
      • LDA is a powerful model for
          • Visualizing the hidden thematic structure in large corpora
          • Generalizing new data to fit into that structure
      • LDA is a mixed membership model (Erosheva, 2004) that builds on the work of Deerwester et al. (1990) and Hofmann (1999).
      • For document collections and other grouped data, this might be more appropriate than a simple finite mixture.
  • 24. LDA summary
      • Modular: It can be embedded in more complicated models.
          • E.g., syntax and semantics; authorship; word sense
      • General: The data generating distribution can be changed.
          • E.g., images; social networks; population genetics data
      • Variational inference is fast; it lets us analyze large data sets.
      • See Blei et al., 2003 for details and a quantitative comparison.
      • Code to play with LDA is freely available on my web-site, http://www.cs.princeton.edu/~blei.
  • 25. LDA summary
      • But LDA makes certain assumptions about the data.
      • When are they appropriate?
  • 26. Outline
      1 Introduction
      2 Latent Dirichlet allocation
      3 Dynamic topic models
      4 Correlated topic models
  • 27. LDA and exchangeability
      • LDA assumes that documents are exchangeable.
      • I.e., their joint probability is invariant to permutation.
      • This is too restrictive.
  • 28. Documents are not exchangeable
      [Figure: two articles, "Instantaneous Photography" (1890) and "Infrared Reflectance in Leaf-Sitting Neotropical Frogs" (1977)]
      • Documents about the same topic are not exchangeable.
      • Topics evolve over time.
  • 29. Dynamic topic model
      • Divide the corpus into sequential slices (e.g., by year).
      • Assume each slice's documents are exchangeable.
          • They are drawn from an LDA model.
      • Allow topic distributions to evolve from slice to slice.
  • 30. Dynamic topic models
      [Figure: graphical model of the dynamic topic model. Each time slice has its own LDA structure (α, θ_d, Z_{d,n}, W_{d,n}; plates N and D), with topics β_{k,1}, β_{k,2}, ..., β_{k,T} in a plate of size K.]
  • 31. Modeling evolving topics
      [Figure: chain of topics β_{k,1}, β_{k,2}, ..., β_{k,T}]
      • Use a logistic normal distribution to model evolving topics (Aitchison, 1980).
      • A state-space model on the natural parameter of the topic multinomial (West and Harrison, 1997):
          $\beta_{t,k} \mid \beta_{t-1,k} \sim \mathcal{N}(\beta_{t-1,k}, \sigma^2 I)$
          $p(w \mid \beta_{t,k}) = \exp\{\beta_{t,k,w} - \log(1 + \sum_{v=1}^{V-1} \exp\{\beta_{t,k,v}\})\}$
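The state-space model on this slide can be simulated directly. The sketch below is an illustration under assumptions: a single topic with V - 1 natural parameters (the V-th term acts as the reference with parameter 0), and hypothetical starting values and σ.

```python
import math
import random

def natural_to_probs(beta):
    """Map V-1 natural parameters to V word probabilities.

    The last term is the reference term, with natural parameter 0.
    """
    log_norm = math.log(1.0 + sum(math.exp(b) for b in beta))
    probs = [math.exp(b - log_norm) for b in beta]
    probs.append(math.exp(-log_norm))
    return probs

def evolve_topic(beta0, T, sigma, seed=0):
    """Gaussian random walk: beta_t | beta_{t-1} ~ N(beta_{t-1}, sigma^2 I)."""
    rng = random.Random(seed)
    chain = [list(beta0)]
    for _ in range(T - 1):
        chain.append([rng.gauss(b, sigma) for b in chain[-1]])
    return chain

# Hypothetical topic over a 4-term vocabulary, followed across 5 time slices.
chain = evolve_topic([0.0, 1.0, -1.0], T=5, sigma=0.1)
probs_by_slice = [natural_to_probs(b) for b in chain]
```

Because the walk lives on unconstrained natural parameters, every slice maps back to a valid distribution over words, which a random walk on the probabilities themselves would not guarantee.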
  • 32. Posterior inference
      • Our goal is to compute the posterior distribution,
          $p(\beta_{1:T,1:K}, \theta_{1:T,1:D}, z_{1:T,1:D} \mid w_{1:T,1:D})$.
      • Exact inference is impossible:
          • Per-document mixed-membership model
          • Non-conjugacy between $p(w \mid \beta_{t,k})$ and $p(\beta_{t,k})$
      • MCMC is not practical for the amount of data.
      • Solution: Variational inference
  • 33. Science data
      [Example article] TECHVIEW: DNA SEQUENCING. "Sequencing the Genome, Fast", James C. Mullikin and Amanda A. McMurray. Genome sequencing projects reveal the genetic makeup of an organism by reading off the sequence of the DNA bases, which encodes all of the information necessary for the life of the organism. The base sequence contains four nucleotides - adenine, thymidine, guanosine, and cytosine - which are linked together into long double-helical chains. Over the last two decades, automated DNA sequencers have made the process of obtaining the base-by-base sequence of DNA...
      • Analyze JSTOR's entire collection from Science (1880-2002).
      • Restrict to 30K terms that occur more than ten times.
      • The data are 76M words in 130K documents.
  • 34. Analyzing a document
      [Figure: the original article alongside its topic proportions]
  • 35. Analyzing a document
      [Figure: the original article alongside the most likely words from its top topics]
      • sequence, genome, genes, sequences, human, gene, dna, sequencing, chromosome, regions, analysis, data, genomic, number
      • devices, device, materials, current, high, gate, light, silicon, material, technology, electrical, fiber, power, based
      • data, information, network, web, computer, language, networks, time, software, system, words, algorithm, number, internet
  • 36. Analyzing a topic
      Top words of one topic, by decade:
      • 1880: electric, machine, power, engine, steam, two, machines, iron, battery, wire
      • 1890: electric, power, company, steam, electrical, machine, two, system, motor, engine
      • 1900: apparatus, steam, power, engine, engineering, water, construction, engineer, room, feet
      • 1910: air, water, engineering, apparatus, room, laboratory, engineer, made, gas, tube
      • 1920: apparatus, tube, air, pressure, water, glass, gas, made, laboratory, mercury
      • 1930: tube, apparatus, glass, air, mercury, laboratory, pressure, made, gas, small
      • 1940: air, tube, apparatus, glass, laboratory, rubber, pressure, small, mercury, gas
      • 1950: tube, apparatus, glass, air, chamber, instrument, small, laboratory, pressure, rubber
      • 1960: tube, system, temperature, air, heat, chamber, power, high, instrument, control
      • 1970: air, heat, power, system, temperature, chamber, high, flow, tube, design
      • 1980: high, power, design, heat, system, systems, devices, instruments, control, large
      • 1990: materials, high, power, current, applications, technology, devices, design, device, heat
      • 2000: devices, device, materials, current, gate, high, light, silicon, material, technology
  • 37. Visualizing trends within a topic
      [Figure: two panels, "Theoretical Physics" (terms FORCE, LASER, RELATIVITY) and "Neuroscience" (terms OXYGEN, NERVE, NEURON), each plotting the terms over the years 1880-2000]
  • 38. Time-corrected document similarity
      • Consider the expected Hellinger distance between the topic proportions of two documents,
          $d_{ij} = \mathrm{E}\left[ \sum_{k=1}^{K} \left( \sqrt{\theta_{i,k}} - \sqrt{\theta_{j,k}} \right)^2 \,\middle|\, w_i, w_j \right]$
      • Uses the latent structure to define similarity.
      • Time has been factored out because the topics associated with the components are different from year to year.
      • Similarity is based only on topic proportions.
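The quantity inside the expectation is straightforward to compute for given topic proportions. The sketch below is a simplification, assuming the expectation is approximated by averaging over paired samples of the two documents' proportions; the function names and the sample values are hypothetical.

```python
import math

def hellinger_sum(theta_i, theta_j):
    """Sum over topics of (sqrt(theta_i_k) - sqrt(theta_j_k))^2."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2
               for a, b in zip(theta_i, theta_j))

def expected_hellinger(samples_i, samples_j):
    """Monte Carlo estimate of d_ij from paired samples of theta_i and theta_j."""
    pairs = list(zip(samples_i, samples_j))
    return sum(hellinger_sum(a, b) for a, b in pairs) / len(pairs)

# Hypothetical samples of topic proportions for two documents (K = 3).
samples_i = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
samples_j = [[0.1, 0.2, 0.7], [0.2, 0.2, 0.6]]
d_ij = expected_hellinger(samples_i, samples_j)
```

The value is 0 for identical proportions and reaches 2 when the two documents place all their mass on different topics.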
  • 39. Time-corrected document similarity
      [Figure: the article "The Brain of the Orang" (1880)]
  • 40. Time-corrected document similarity
      [Figure: the article "Representation of the Visual Field on the Medial Wall of Occipital-Parietal Cortex in the Owl Monkey" (1976)]
  • 41. Browser of Science
  • 42. Quantitative comparison
      • Compute the probability of each year's documents conditional on all the previous years' documents,
          $p(w_t \mid w_1, \ldots, w_{t-1})$
      • Compare exchangeable and dynamic topic models.
  • 43. Quantitative comparison
      [Figure: per-word negative log likelihood (y-axis, ticks from 10 to 25) by year (x-axis, 1920 to 2000) for LDA and DTM]
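The per-word negative log likelihood used for this comparison is the average of -log p(word) over held-out tokens. Here is a minimal sketch, assuming a single fixed predictive distribution over terms; in the talk the predictive distribution comes from a model fit to previous years, which this toy example does not attempt to reproduce.

```python
import math

def per_word_nll(docs, word_probs):
    """Average negative log probability per word token.

    docs: list of documents, each a list of term ids.
    word_probs: hypothetical predictive distribution, term id -> probability.
    """
    num_tokens = sum(len(doc) for doc in docs)
    log_prob = sum(math.log(word_probs[w]) for doc in docs for w in doc)
    return -log_prob / num_tokens

# A uniform distribution over 4 terms as a stand-in predictive distribution.
uniform = {w: 0.25 for w in range(4)}
score = per_word_nll([[0, 1, 2], [3, 3]], uniform)
```

Lower values mean the model assigns higher probability to the held-out words.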
  • 44. Outline
      1 Introduction
      2 Latent Dirichlet allocation
      3 Dynamic topic models
      4 Correlated topic models
  • 45. The hidden assumptions of the Dirichlet distribution
      • The Dirichlet is an exponential family distribution on the simplex, positive vectors that sum to one.
      • However, the near independence of components makes it a poor choice for modeling topic proportions.
      • An article about fossil fuels is more likely to also be about geology than about genetics.
  • 46. The logistic normal distribution
      • The logistic normal is a distribution on the simplex that can model dependence between components.
      • The natural parameters of the multinomial are drawn from a multivariate Gaussian distribution:
          $X \sim \mathcal{N}_{K-1}(\mu, \Sigma)$
          $\theta_i = \exp\{x_i - \log(1 + \sum_{j=1}^{K-1} \exp\{x_j\})\}$
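To make the construction concrete, the sketch below draws topic proportions for K = 3 from a logistic normal. It assumes a two-dimensional Gaussian specified by hypothetical means, standard deviations, and a correlation rho, built by hand from two independent standard normal draws.

```python
import math
import random

def sample_logistic_normal_3(mu, sd, rho, rng):
    """Draw theta on the 3-simplex from a logistic normal with K - 1 = 2."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    # Correlated Gaussian draw: corr(x1, x2) = rho.
    x1 = mu[0] + sd[0] * z1
    x2 = mu[1] + sd[1] * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    # theta_i = exp{x_i - log(1 + sum_j exp{x_j})}; the third component
    # corresponds to the reference parameter 0.
    log_norm = math.log(1.0 + math.exp(x1) + math.exp(x2))
    return [math.exp(x1 - log_norm), math.exp(x2 - log_norm), math.exp(-log_norm)]

rng = random.Random(0)
draws = [sample_logistic_normal_3(mu=[0.0, 0.0], sd=[1.0, 1.0], rho=0.8, rng=rng)
         for _ in range(100)]
```

With a positive rho the first two components tend to rise and fall together, a kind of dependence between components that the Dirichlet cannot express.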
  • 47. Correlated topic model (CTM)
      [Figure: graphical model with µ, Σ, η_d, Z_{d,n}, W_{d,n}, β_k; plates N, D, K]
      • Draw topic proportions from a logistic normal, where topic occurrences can exhibit correlation.
      • Use for:
          • Providing a "map" of topics and how they are related
          • Better prediction via correlated topics
  • 48. [Figure: graph of topics; each node lists a topic's top words, e.g. "neurons", "galaxies", "earthquake", "climate change"]
  • 49. Summary • Topic models provide useful descriptive statistics for analyzing and understanding the latent structure of large text collections. • Probabilistic graphical models are a useful way to express assumptions about the hidden structure of complicated data. • Variational methods allow us to perform posterior inference to automatically infer that structure from large data sets. • Current research • Choosing the number of topics • Continuous time dynamic topic models • Topic models for prediction • Inferring the impact of a document
  • 50. “We should seek out unfamiliar summaries of observational material, and establish their useful properties... And still more novelty can come from finding, and evading, still deeper lying constraints.” (John Tukey, The Future of Data Analysis, 1962)
  • 51. Supervised topic models (with Jon McAuliffe) • Most topic models are unsupervised. They are fit by maximizing the likelihood of a collection of documents. • Consider documents paired with response variables. For example: • Movie reviews paired with a number of stars • Web pages paired with a number of “diggs” • We develop supervised topic models, models of documents and responses that are fit to find topics predictive of the response.
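  • 52. Supervised LDA [Graphical model: $\alpha \to \theta_d \to Z_{d,n} \to W_{d,n} \leftarrow \beta_k$; $Z_{d,n} \to Y_d \leftarrow \eta, \sigma^2$; plates $N$, $D$, $K$] 1. Draw topic proportions $\theta \mid \alpha \sim \mathrm{Dir}(\alpha)$. 2. For each word: (a) draw topic assignment $z_n \mid \theta \sim \mathrm{Mult}(\theta)$; (b) draw word $w_n \mid z_n, \beta_{1:K} \sim \mathrm{Mult}(\beta_{z_n})$. 3. Draw response variable $y \mid z_{1:N}, \eta, \sigma^2 \sim \mathcal{N}(\eta^\top \bar{z}, \sigma^2)$, where $\bar{z} = (1/N) \sum_{n=1}^{N} z_n$.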
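The generative process above can be sketched as a small sampler. This is an illustration under stated assumptions: the helper names, the 2-topic, 4-word vocabulary, and all parameter values are hypothetical, and topic assignments are stored as integer indices, so $\bar{z}$ is the vector of empirical topic frequencies.

```python
import random

def draw_dirichlet(alpha, rng):
    """Dirichlet draw via normalized Gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def draw_categorical(probs, rng):
    """Draw an index i with probability probs[i]."""
    u = rng.random()
    c = 0.0
    for i, p in enumerate(probs):
        c += p
        if u < c:
            return i
    return len(probs) - 1  # guard against floating-point round-off

def generate_document(alpha, beta, eta, sigma2, n_words, rng):
    """One document and its response under the sLDA generative process."""
    K = len(alpha)
    theta = draw_dirichlet(alpha, rng)                           # step 1
    z = [draw_categorical(theta, rng) for _ in range(n_words)]   # step 2(a)
    w = [draw_categorical(beta[k], rng) for k in z]              # step 2(b)
    z_bar = [z.count(k) / n_words for k in range(K)]             # empirical topic frequencies
    mean = sum(e * zb for e, zb in zip(eta, z_bar))
    y = rng.gauss(mean, sigma2 ** 0.5)                           # step 3
    return w, z, y

# Hypothetical parameters: 2 topics over a 4-word vocabulary.
rng = random.Random(1)
alpha = [0.5, 0.5]
beta = [[0.7, 0.2, 0.05, 0.05],
        [0.05, 0.05, 0.2, 0.7]]
eta = [-1.0, 2.0]
w, z, y = generate_document(alpha, beta, eta, sigma2=0.25, n_words=20, rng=rng)
```

Note that the response depends on the realized assignments `z`, not on `theta`, which matches the ordering the slides emphasize: the document is generated first, then the response.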
  • 53. Comments • sLDA is used as follows. • Fit coefficients and topics from a collection of document-response pairs. • Use the fitted model to predict the responses of previously unseen documents, $\mathrm{E}[Y \mid w_{1:N}, \alpha, \beta_{1:K}, \eta, \sigma^2] = \eta^\top \mathrm{E}[\bar{Z} \mid w_{1:N}, \alpha, \beta_{1:K}]$. • The process enforces that the document is generated first, followed by the response. The response is generated from the particular topics that were realized in generating the document.
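As a small worked instance of the prediction rule, the sketch below assumes that some inference procedure has already produced per-word topic probabilities `phi` (a hypothetical name, one row per word), so that $\mathrm{E}[\bar{Z}]$ can be approximated by averaging the rows. How `phi` is obtained is not shown here and the numbers are made up.

```python
def predict_response(phi, eta):
    """Approximate eta^T E[z_bar], with E[z_bar] taken as the row average of phi."""
    n = len(phi)
    k = len(eta)
    z_bar = [sum(row[j] for row in phi) / n for j in range(k)]
    return sum(e * zb for e, zb in zip(eta, z_bar))

# Hypothetical 3-word document with 2 topics.
phi = [[0.9, 0.1],
       [0.5, 0.5],
       [0.1, 0.9]]
eta = [-2.0, 3.0]
y_hat = predict_response(phi, eta)  # expected topic frequencies are (0.5, 0.5)
```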
  • 54. Example: Movie reviews [Figure: word lists for the fitted topics, arranged along a horizontal axis with ticks from −30 to 20.] • We fit a 10-topic sLDA model to movie review data (Pang and Lee, 2005). • The documents are the words of the reviews. • The responses are the number of stars associated with each review (modeled as continuous). • Each component of coefficient vector $\eta$ is associated with a topic.
  • 55. Simulations [Figure: four panels, for the movie corpus and the Digg corpus: per-word held-out log likelihood and predictive R² versus number of topics; legend: sLDA, LDA.]
  • 56. Diversion: Variational inference • Let $x_{1:N}$ be observations and $z_{1:M}$ be latent variables. • Our goal is to compute the posterior distribution $p(z_{1:M} \mid x_{1:N}) = \frac{p(z_{1:M}, x_{1:N})}{\int p(z_{1:M}, x_{1:N}) \, dz_{1:M}}$. • For many interesting distributions, the marginal likelihood of the observations is difficult to compute efficiently.
  • 57. Variational inference • Use Jensen’s inequality to bound the log probability of the observations: $\log p(x_{1:N}) \geq \mathrm{E}_{q_\nu}[\log p(z_{1:M}, x_{1:N})] - \mathrm{E}_{q_\nu}[\log q_\nu(z_{1:M})]$. • We have introduced a distribution over the latent variables with free variational parameters $\nu$. • We optimize those parameters to tighten this bound. • This is the same as finding the member of the family $q_\nu$ that is closest in KL divergence to $p(z_{1:M} \mid x_{1:N})$.
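The bound and its link to KL divergence can be checked numerically on a toy model. The sketch below assumes a single binary latent variable and a made-up joint table for one fixed observation; the names `joint`, `elbo`, and `kl` are illustrative only.

```python
import math

# Hypothetical joint p(z, x) for z in {0, 1} at a fixed observed x.
joint = [0.12, 0.28]

def elbo(q, joint):
    """E_q[log p(z, x)] - E_q[log q(z)] for a discrete latent variable."""
    return sum(qz * (math.log(pz) - math.log(qz)) for qz, pz in zip(q, joint) if qz > 0)

def kl(q, p):
    """KL(q || p) for discrete distributions on the same support."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

log_evidence = math.log(sum(joint))            # log p(x), tractable only because z is tiny
posterior = [p / sum(joint) for p in joint]    # exact p(z | x)

q = [0.5, 0.5]                                 # an arbitrary variational distribution
gap = log_evidence - elbo(q, joint)            # slack in the bound
```

Here `gap` equals `kl(q, posterior)`, so raising the bound over `q` is the same as shrinking the KL divergence to the exact posterior, and the bound is tight when `q` is the posterior itself.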
  • 58. Mean-field variational inference • Complexity of optimization is determined by the factorization of $q_\nu$. • In mean-field variational inference, $q_\nu$ is fully factored: $q_\nu(z_{1:M}) = \prod_{m=1}^{M} q_{\nu_m}(z_m)$. • The latent variables are independent. • Each is governed by its own variational parameter $\nu_m$. • In the true posterior they can exhibit dependence (often, this is what makes exact inference difficult).
  • 59. MFVI and conditional exponential families • Suppose the distribution of each latent variable conditional on the observations and other latent variables is in the exponential family: $p(z_m \mid z_{-m}, x) = h_m(z_m) \exp\{g_m(z_{-m}, x)^\top z_m - a_m(g_m(z_{-m}, x))\}$ • Assume $q_\nu$ is fully factorized and each factor is in the same exponential family: $q_{\nu_m}(z_m) = h_m(z_m) \exp\{\nu_m^\top z_m - a_m(\nu_m)\}$
  • 60. MFVI and conditional exponential families • Variational inference is the following coordinate ascent algorithm: $\nu_m = \mathrm{E}_{q_\nu}[g_m(Z_{-m}, x)]$ • Notice the relationship to Gibbs sampling.
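A toy instance of this coordinate ascent update, under stated assumptions: two binary latent variables with unnormalized joint $\exp\{a z_1 + b z_2 + c z_1 z_2\}$ (observations absorbed into the constants). Each complete conditional is Bernoulli with natural parameter $g_1 = a + c z_2$ or $g_2 = b + c z_1$, so the update sets each $\nu_m$ to the expected natural parameter under the other factor. All names and constants are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def entropy(m):
    """Entropy of a Bernoulli variable with mean m."""
    return -sum(p * math.log(p) for p in (m, 1.0 - m) if p > 0)

def bound(nu, a, b, c):
    """Mean-field lower bound on the log normalizer of exp{a z1 + b z2 + c z1 z2}."""
    m1, m2 = sigmoid(nu[0]), sigmoid(nu[1])
    return a * m1 + b * m2 + c * m1 * m2 + entropy(m1) + entropy(m2)

def coordinate_ascent(a, b, c, sweeps=50):
    nu = [0.0, 0.0]
    history = [bound(nu, a, b, c)]
    for _ in range(sweeps):
        nu[0] = a + c * sigmoid(nu[1])   # nu_1 = E_q[g_1(Z_2)] = a + c E_q[Z_2]
        nu[1] = b + c * sigmoid(nu[0])   # nu_2 = E_q[g_2(Z_1)] = b + c E_q[Z_1]
        history.append(bound(nu, a, b, c))
    return nu, history

a, b, c = 0.5, -1.0, 2.0
nu, history = coordinate_ascent(a, b, c)

# Exact log normalizer by enumerating the four states, for comparison with the bound.
log_z = math.log(1.0 + math.exp(a) + math.exp(b) + math.exp(a + b + c))
```

A Gibbs sampler would draw $z_1$ from its conditional given the current sample of $z_2$; the update here has the same form but plugs in the expectation of $z_2$ instead of a sample.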
  • 61. Variational family for the DTM [Diagram: chain $\beta_{k,1} \to \beta_{k,2} \to \cdots \to \beta_{k,T}$ with variational observations $\hat{\beta}_{k,1}, \hat{\beta}_{k,2}, \ldots, \hat{\beta}_{k,T}$] • Distribution of $\theta$ and $z$ is fully factorized (Blei et al., 2003). • Distribution of $\{\beta_{k,1}, \ldots, \beta_{k,T}\}$ is a variational Kalman filter. • Gaussian state-space model with free observations $\hat{\beta}_{k,t}$. • Fit observations such that the corresponding posterior over the chain is close to the true posterior.
  • 62. Variational family for the DTM [Diagram: chain $\beta_{k,1} \to \beta_{k,2} \to \cdots \to \beta_{k,T}$ with variational observations $\hat{\beta}_{k,1}, \hat{\beta}_{k,2}, \ldots, \hat{\beta}_{k,T}$] • Given a document collection, use coordinate ascent on all the variational parameters until the KL divergence converges. • This yields a distribution close to the true posterior of interest. • Take expectations with respect to the simpler variational distribution.