Building Mini-Google in Ruby


                                                                              Ilya Grigorik
                                                                                          @igrigorik


Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank   @igrigorik #railsconf
postrank.com/topic/ruby




                               The slides…                           Twitter                           My blog


Ruby + Math
    PageRank, Optimization, Examples, Indexing, Misc Fun




PageRank + Ruby
    PageRank, Tools + Optimization, Examples, Indexing

Consume with care…
                     everything that follows is based on released / public domain info




Search-engine graveyard
                                                                  Google did pretty well…




Search pipeline
    50,000-foot view

    Query: Ruby  ->  1. Crawl  ->  2. Index  ->  3. Rank  ->  Results



    1. Crawl: Bah
    2. Index: Interesting
    3. Rank: Fun




          CPU speed                                     333 MHz
          RAM                                           32-64 MB
          Index                                         27,000,000 documents
          Index refresh                                 once a month~ish
          PageRank computation                          several days

          Laptop CPU                                    2.1 GHz
          VM RAM                                        1 GB
          1-million-page web                            ~10 minutes

                                                                  circa 1997-1998



Creating & Maintaining an Inverted Index
                                                                    DIY and the gotchas within




    require 'set'

    pages = {
      "1" => "it is what it is",
      "2" => "what is it",
      "3" => "it is a banana"
    }

    index = {}

    # map every word to the set of pages that contain it
    pages.each do |page, content|
      content.split(/\s/).each do |word|
        if index[word]
          index[word] << page
        else
          index[word] = Set.new([page])
        end
      end
    end

    # resulting index:
    {
      "it"=>#<Set: {"1", "2", "3"}>,
      "a"=>#<Set: {"3"}>,
      "banana"=>#<Set: {"3"}>,
      "what"=>#<Set: {"1", "2"}>,
      "is"=>#<Set: {"1", "2", "3"}>
    }

                                                Building an Inverted Index

    Word => [Document]
    Each word in the index maps to the set of documents that contain it.
 # query: "what is banana"
 p index["what"] & index["is"] & index["banana"]
 # > #<Set: {}>

 # query: "a banana"
 p index["a"] & index["banana"]
 # > #<Set: {"3"}>

 # query: "what is"
 p index["what"] & index["is"]
 # > #<Set: {"1", "2"}>

 # the index being queried:
 {
   "it"=>#<Set: {"1", "2", "3"}>,
   "a"=>#<Set: {"3"}>,
   "banana"=>#<Set: {"3"}>,
   "what"=>#<Set: {"1", "2"}>,
   "is"=>#<Set: {"1", "2", "3"}>
 }

                                                                  Querying the index
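The three queries above all follow one pattern: look up each word, then intersect the sets. A minimal sketch of that pattern as a reusable AND-query; `build_index` and `search` are hypothetical helper names, not something from the talk:

```ruby
require 'set'

# Hypothetical helper: same index as on the slides, built with a default block.
def build_index(pages)
  index = Hash.new { |hash, word| hash[word] = Set.new }
  pages.each do |page, content|
    content.split(/\s+/).each { |word| index[word] << page }
  end
  index
end

# Hypothetical helper: AND-query, i.e. pages that contain every query word.
def search(index, query)
  words = query.split(/\s+/)
  return Set.new if words.empty?
  # fetch avoids creating index entries for unknown words
  words.map { |word| index.fetch(word, Set.new) }.reduce(:&)
end

pages = {
  "1" => "it is what it is",
  "2" => "what is it",
  "3" => "it is a banana"
}
index = build_index(pages)

p search(index, "what is").to_a          # pages matching both words
p search(index, "a banana").to_a
p search(index, "what is banana").to_a   # no page contains all three words
p search(index, "missing word").to_a     # unknown words match nothing
```

The result is still an unordered set, which is exactly the gap that ranking has to fill.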



    What order?
    The query "what is" matches documents 1 and 2, but a set carries no ranking: [1, 2] or [2, 1]?



    Hmmm? Open questions for the DIY index:

      PDF, HTML, RSS?
      Lowercase / Upcase?
      Compact Index?
      Stop words?
      Persistence?

                                                Building an Inverted Index
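Two of those questions (case and stop words) can be answered in the tokenizer. A rough sketch, assuming a tiny made-up stop-word list and a hypothetical `tokenize` helper; input formats, compaction and persistence are left open:

```ruby
require 'set'

# Hypothetical tokenizer: lowercase, strip punctuation, drop stop words.
def tokenize(text, stop_words)
  text.downcase.scan(/[a-z0-9]+/).reject { |word| stop_words.include?(word) }
end

def build_index(pages, stop_words)
  index = Hash.new { |hash, word| hash[word] = Set.new }
  pages.each do |page, content|
    tokenize(content, stop_words).each { |word| index[word] << page }
  end
  index
end

# Assumed stop-word list, for illustration only.
stop_words = Set.new(%w[a an the is it])

pages = {
  "1" => "It is what it is.",
  "2" => "What is it?",
  "3" => "It is a BANANA!"
}
index = build_index(pages, stop_words)

p index.keys.sort        # only the non-stop words survive
p index["what"].to_a
p index["banana"].to_a
```

Dropping stop words shrinks the index, at the cost of no longer being able to match them in queries.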

Ferret is a high-performance, full-featured text search engine library written for Ruby



    require 'ferret'
    include Ferret

    index = Index::Index.new()

    index << {:title => "1", :content => "it is what it is"}
    index << {:title => "2", :content => "what is it"}
    index << {:title => "3", :content => "it is a banana"}

    index.search_each('content:"banana"') do |id, score|
      puts "Score: #{score}, #{index[id][:title]} "
    end


    > Score: 1.0, 3




                               Hmmm?




class Ferret::Analysis::Analyzer                                        class Ferret::Search::BooleanQuery
   class Ferret::Analysis::AsciiLetterAnalyzer                             class Ferret::Search::ConstantScoreQuery
   class Ferret::Analysis::AsciiLetterTokenizer                            class Ferret::Search::Explanation
   class Ferret::Analysis::AsciiLowerCaseFilter                            class Ferret::Search::Filter
   class Ferret::Analysis::AsciiStandardAnalyzer                           class Ferret::Search::FilteredQuery
   class Ferret::Analysis::AsciiStandardTokenizer                          class Ferret::Search::FuzzyQuery
   class Ferret::Analysis::AsciiWhiteSpaceAnalyzer                         class Ferret::Search::Hit
   class Ferret::Analysis::AsciiWhiteSpaceTokenizer                        class Ferret::Search::MatchAllQuery
   class Ferret::Analysis::HyphenFilter                                    class Ferret::Search::MultiSearcher
   class Ferret::Analysis::LetterAnalyzer                                  class Ferret::Search::MultiTermQuery
   class Ferret::Analysis::LetterTokenizer                                 class Ferret::Search::PhraseQuery
   class Ferret::Analysis::LowerCaseFilter                                 class Ferret::Search::PrefixQuery
   class Ferret::Analysis::MappingFilter                                   class Ferret::Search::Query
   class Ferret::Analysis::PerFieldAnalyzer                                class Ferret::Search::QueryFilter
   class Ferret::Analysis::RegExpAnalyzer                                  class Ferret::Search::RangeFilter
   class Ferret::Analysis::RegExpTokenizer                                 class Ferret::Search::RangeQuery
   class Ferret::Analysis::StandardAnalyzer                                class Ferret::Search::Searcher
   class Ferret::Analysis::StandardTokenizer                               class Ferret::Search::Sort
   class Ferret::Analysis::StemFilter                                      class Ferret::Search::SortField
   class Ferret::Analysis::StopFilter                                      class Ferret::Search::TermQuery
   class Ferret::Analysis::Token                                           class Ferret::Search::TopDocs
   class Ferret::Analysis::TokenStream                                     class Ferret::Search::TypedRangeFilter
   class Ferret::Analysis::WhiteSpaceAnalyzer                              class Ferret::Search::TypedRangeQuery
   class Ferret::Analysis::WhiteSpaceTokenizer                             class Ferret::Search::WildcardQuery



ferret.davebalmain.com/trac




Ranking Results
                                                                      0-60 with PageRank…




    index.search_each('content:"the brown cow"') do |id, score|
      puts "Score: #{score}, #{index[id][:title]} "
    end

    > Score: 0.827, 3
    > Score: 0.523, 5                                                 Relevance?
    > Score: 0.125, 4

    Occurrences of each query term per document:

                     Doc 3     Doc 5     Doc 4
      the              4         3         5
      brown            1         3         1
      cow              1         4         1
      Score            6        10         7

                                                              Naïve: Term Frequency
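The naïve score is just a sum of raw counts. A quick sketch that reproduces the bottom row of the table; the counts come from the slide, the variable names are made up:

```ruby
# Term counts taken from the slide's table (documents 3, 5 and 4).
counts = {
  3 => { "the" => 4, "brown" => 1, "cow" => 1 },
  5 => { "the" => 3, "brown" => 3, "cow" => 4 },
  4 => { "the" => 5, "brown" => 1, "cow" => 1 }
}

query = %w[the brown cow]

# Naive score: add up the raw occurrences of every query term.
scores = counts.transform_values do |terms|
  query.sum { |term| terms.fetch(term, 0) }
end

# Highest score first.
ranking = scores.sort_by { |_doc, score| -score }
ranking.each { |doc, score| puts "Doc #{doc}: #{score}" }
```

Note that this ordering differs from the Ferret scores above, and that a frequent filler word like "the" contributes as much as any other term.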

    Skew: the count for "the" makes up a large share of every score (4 of 6, 3 of 10, 5 of 7).
    Occurrences of each query term per document:

                     Doc 3     Doc 5     Doc 4
      the              4         3         5        <- Skew
      brown            1         3         1
      cow              1         4         1

    # of docs containing each term:

      the              6
      brown            3
      cow              4

    Total # of documents:   10

    Score = TF * IDF
    TF    = # occurrences / # words
    IDF   = ln(# docs / # docs with W)

                                                                                                      TF-IDF
                                           Term Frequency * Inverse Document Frequency


    Doc # 3 contains 'the' 4 times, 'brown' once and 'cow' once.
    # of docs containing each term: the 6, brown 3, cow 4

    Total # of documents:   10
    # words in document:    10

    Doc # 3 score for 'the':     4/10 * ln(10/6) = 0.204
    Doc # 3 score for 'brown':   1/10 * ln(10/3) = 0.120
    Doc # 3 score for 'cow':     1/10 * ln(10/4) = 0.092

                                                                                                         TF-IDF
                               Score = 0.204 + 0.120 + 0.092 = 0.416
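The same arithmetic as a small Ruby sketch, assuming the natural-log form of IDF used in the worked numbers above. The helper names (`tf`, `idf`, `tf_idf_score`) are illustrative, not Ferret's API:

```ruby
# TF = occurrences / words in document
def tf(occurrences, words_in_doc)
  occurrences.to_f / words_in_doc
end

# IDF = ln(total docs / docs containing the term)
def idf(total_docs, docs_with_term)
  Math.log(total_docs.to_f / docs_with_term)
end

# Sum of TF * IDF over the query terms found in one document.
def tf_idf_score(term_counts, words_in_doc, doc_frequency, total_docs)
  term_counts.sum do |term, count|
    tf(count, words_in_doc) * idf(total_docs, doc_frequency.fetch(term))
  end
end

# Values from the slides: 10 documents in total, document 3 has 10 words.
total_docs    = 10
doc_frequency = { "the" => 6, "brown" => 3, "cow" => 4 }
doc3          = { "the" => 4, "brown" => 1, "cow" => 1 }

score = tf_idf_score(doc3, 10, doc_frequency, total_docs)
puts score.round(3)
```

A term that appears in most documents gets an IDF close to zero, which is what tames the skew from words like "the".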



                      W1        W2   …           …             …        WK

         Doc 1       15        23   …
         Doc 2       24        12   …
         …            …        …    …
         …
         Doc N

         Size = N * K * size of Ruby object
                                                                                   Ouch.
          Pages = N = 10,000
          Words = K = 2,000
          Ruby Object = 20+ bytes

                                                                       Frequency Matrix
          Footprint ≈ 384 MB
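A back-of-the-envelope sketch of that footprint. The 20 bytes per cell is the slide's lower bound for a Ruby object; the 4-byte and 1-byte cell sizes are assumptions added here to show what fixed-width numeric cells would cost instead:

```ruby
# Bytes needed for a dense pages x words matrix with a given cell size.
def footprint_bytes(pages, words, bytes_per_cell)
  pages * words * bytes_per_cell
end

def to_mib(bytes)
  (bytes / (1024.0 * 1024.0)).round(1)
end

pages = 10_000   # N, from the slide
words = 2_000    # K, from the slide

cell_sizes = {
  "Ruby object (20 B, slide's lower bound)" => 20,
  "4-byte integer (assumed)"                => 4,
  "1-byte integer (assumed)"                => 1
}

cell_sizes.each do |label, size|
  puts "#{label}: #{to_mib(footprint_bytes(pages, words, size))} MiB"
end
```

At the 20-byte lower bound this lands in the same ballpark as the figure on the slide; the cell size is the only lever, since the matrix is dense.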



NArray is a Numerical N-dimensional Array class (implemented in C)

       NArray.new(typecode, size, ...)    # create new NArray. initialize with 0.
       NArray.byte(size,...)              # 1 byte unsigned integer
       NArray.sint(size,...)              # 2 byte signed integer
       NArray.int(size,...)               # 4 byte signed integer
       NArray.sfloat(size,...)            # single precision float
       NArray.float(size,...)             # double precision float
       NArray.scomplex(size,...)          # single precision complex
       NArray.complex(size,...)           # double precision complex
       NArray.object(size,...)            # Ruby object

                                                                                              NArray
                                                                    http://narray.rubyforge.org/



Links as votes




                                                                                        PageRank
                                                                                     the google juice
               Problem: link gaming




    P = 0.85     Follow link from page he/she is currently on.
    P = 0.15     Teleport to a random location on the web.

                                                                               Random Surfer
                                                                                    powerful abstraction




    Follow link from page he/she is currently on.
    Teleport to a random location on the web.

    [Diagram: Page N, Page M, Page K]

                                                                                                    Surfin’
                                                                          rinse & repeat, ad nauseam




    On Page P, clicks on link to K          P = 0.85
    On Page K, clicks on link to M          P = 0.85
    On Page M, teleports to X               P = 0.15
    …

                                                                                                    Surfin’
                                                                          rinse & repeat, ad nauseam
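The surfer is easy to simulate. A minimal sketch, assuming a made-up four-page web (the link structure below is not from the slides) and a hypothetical `surf` helper; the share of steps spent on each page approximates its PageRank:

```ruby
# Simulate a random surfer and return the share of steps spent on each page.
def surf(links, steps, follow_probability, rng)
  pages   = links.keys
  visits  = Hash.new(0)
  current = pages.sample(random: rng)
  steps.times do
    visits[current] += 1
    out_links = links[current]
    if !out_links.empty? && rng.rand < follow_probability
      current = out_links.sample(random: rng)   # follow a link (P = 0.85)
    else
      current = pages.sample(random: rng)       # teleport anywhere (P = 0.15)
    end
  end
  visits.transform_values { |count| count.to_f / steps }
end

# Hypothetical link structure, for illustration only.
links = {
  "N" => ["K"],
  "K" => ["M"],
  "M" => ["K", "X"],
  "X" => ["K"]
}

rank = surf(links, 100_000, 0.85, Random.new(42))
rank.sort_by { |_page, share| -share }.each do |page, share|
  puts format("%s: %.3f", page, share)
end
```

In this toy graph nothing links to N, so the surfer only ever lands there by teleporting, and N ends up with the smallest share.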




    [Diagram: web graph of pages N, K, M and X, labeled with the probabilities P = 0.05, P = 0.20, P = 0.15 and P = 0.6]

                                                       Analyzing the Web Graph
                                                                               extracting PageRank




What is PageRank?
                                                                                             It’s a scalar!

                                                                      What is PageRank?
                                                                                       it’s a probability!




          Higher Pr, Higher Importance?




Teleportation?
                                                                                           sci-fi fans, … ?




    1. No in-links!
    2. No out-links!
    3. Isolated Web

    [Diagram: pages N, K, X and M, plus a separate page M cut off from the rest]

                                                  Reasons for teleportation
                                                                      enumerating edge cases
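All three edge cases can be spotted directly in a link dictionary. A small sketch, assuming a hypothetical six-page web (page names and links invented for illustration) and a made-up `reachable` helper:

```ruby
# Pages reachable from a start page by following links only (no teleporting).
def reachable(links, start)
  seen  = [start]
  queue = [start]
  until queue.empty?
    links.fetch(queue.shift, []).each do |page|
      next if seen.include?(page)
      seen << page
      queue << page
    end
  end
  seen
end

# Hypothetical link dictionary: page => pages it links to.
links = {
  "N" => ["K"],       # nothing links to N: no in-links
  "K" => ["X"],
  "X" => ["M"],
  "M" => [],          # no out-links: a link-following surfer gets stuck here
  "A" => ["B"],       # A and B only link to each other: an isolated web
  "B" => ["A"]
}

linked_to    = links.values.flatten.uniq
no_in_links  = links.keys.reject { |page| linked_to.include?(page) }
no_out_links = links.select { |_page, out| out.empty? }.keys

p no_in_links
p no_out_links
p reachable(links, "N")   # never reaches A or B
p reachable(links, "A")   # never leaves the A/B island
```

Without teleportation the surfer would never reach N, never leave M, and never cross between the two islands.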



                               • Breadth First Search
                               • Depth First Search
                               • A* Search
                               • Lexicographic Search
                               • Dijkstra’s Algorithm
                               • Floyd-Warshall
                               • Triangulation and Comparability detection

require 'gratr/import'

dg = Digraph[1,2, 2,3, 2,4, 4,5, 6,4, 1,6]

dg.directed? # true
dg.vertex?(4) # true
dg.edge?(2,4) # true
dg.vertices # [5, 6, 1, 2, 3, 4]

Graph[1,2,1,3,1,4,2,5].bfs # [1, 2, 3, 4, 5]
Graph[1,2,1,3,1,4,2,5].dfs # [1, 2, 5, 3, 4]

                                                                       Exploring Graphs
                                                                              gratr.rubyforge.com



    P(T) = 0.15 / # of pages

    Every page gets the same teleportation probability, here P(T) = 0.03 (0.15 spread over 5 pages).

                                                                             Teleportation
                                                                                                probabilities



    Assume the web is N pages big
    Assume that the probability of teleportation (t) is 0.15, and of following a link (s) is 0.85
    Assume that the teleportation probability (E) is uniform
    Assume that you start on any random page (uniform distribution L), then

        $L = E = \begin{bmatrix} 1/N \\ \vdots \\ 1/N \end{bmatrix}$

    Then after one step, the probability you're on page X is:

        $L \cdot (sG + tE)$

        $L \cdot (0.85 \cdot G + 0.15 \cdot E)$


                     PageRank: Simplified Mathematical Def’n
                                                                    cause that’s how we roll
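One step of that formula on a toy web, as a sketch. Assumptions not on the slide: a made-up 3-page web, G row-normalized so that `g[i][j]` is the probability of following a link from page i to page j, and E written out as an N x N matrix so it can be added to G:

```ruby
# One step of L * (s*G + t*E), with l treated as a row vector.
def step(l, g, e, s, t)
  n = l.size
  (0...n).map do |j|
    (0...n).sum { |i| l[i] * (s * g[i][j] + t * e[i][j]) }
  end
end

# Hypothetical 3-page web, rows sum to 1.
g = [
  [0.0, 0.5, 0.5],   # page 0 links to pages 1 and 2
  [0.0, 0.0, 1.0],   # page 1 links to page 2
  [1.0, 0.0, 0.0]    # page 2 links to page 0
]
n = g.size
s = 0.85                                      # follow a link
t = 0.15                                      # teleport
e = Array.new(n) { Array.new(n, 1.0 / n) }    # uniform teleportation

l0 = Array.new(n, 1.0 / n)   # start on a uniformly random page
l1 = step(l0, g, e, s, t)

p l1.map { |x| x.round(4) }
p l1.sum.round(10)           # still a probability distribution
```

Repeating the step is the "rinse & repeat" part: each application moves the distribution closer to the PageRank values.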



    Link Graph

                     1     2     …     …     N
              1      1     0     …     …     0        <- No link from 1 to N
              2      0     1     …     …     1
              …      …     …     …     …     …
              …      …     …     …     …     …
              N      0     1     …     …     1

                                                                  G = The Link Graph
                     Huge!
                                                                        ginormous and sparse



Building Mini-Google in Ruby   http://bit.ly/railsconf-pagerank       @igrigorik #railsconf
    Page => Links to…

    {
      "1" => [25, 26],
      "2" => [1],
      "5" => [123, 2],
      "6" => [67, 1]
    }


                     G as a dictionary
                     more compact…
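
Going from the dictionary back to a matrix is mechanical. The sketch below is one way to do it, not necessarily how the talk's code does it: `link_matrix` is a hypothetical helper, pages are assumed to be numbered 1..n with integer keys (the slide uses string keys), and a page with no outgoing links is treated as linking to every page, which is a common convention but an assumption here.

```ruby
# Build a column-stochastic link matrix from a "page => links to" hash.
# Pages are assumed to be numbered 1..n. A page with no outgoing links
# is treated as linking to every page (an assumption of this sketch).
def link_matrix(links, n)
  g = Array.new(n) { Array.new(n, 0.0) }
  (1..n).each do |page|
    targets = links.fetch(page, [])
    targets = (1..n).to_a if targets.empty?
    # Column (page - 1) holds the probabilities of leaving `page`
    targets.each { |to| g[to - 1][page - 1] += 1.0 / targets.size }
  end
  g
end

links = { 1 => [2, 3], 2 => [3], 3 => [3] }   # hypothetical 3-page web
link_matrix(links, 3).each { |row| p row }
# [0.0, 0.0, 0.0]
# [0.5, 0.0, 0.0]
# [0.5, 1.0, 1.0]
```

Each column sums to 1, so column j reads as "where a surfer on page j goes next".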



Follow link from page he/she is currently on.
                                                                                                       Page K

                               Teleport to a random location on the web.




                                                                          Computing PageRank
                                                                                              the tedious way
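
The tedious way can be sketched as a loop: keep applying "follow a link or teleport" until the numbers stop moving. This is a plain-Ruby sketch under the same assumptions as the slides (s = 0.85, uniform teleportation, column-stochastic G); the method name, tolerance and step limit are made up for illustration.

```ruby
# Power iteration: repeat "follow a link or teleport" until the
# probabilities stop changing. g is assumed column-stochastic.
def pagerank_iterative(g, s: 0.85, tolerance: 1e-10, max_steps: 1_000)
  n = g.size
  t = 1.0 - s
  rank = Array.new(n, 1.0 / n)          # start on a uniformly random page
  max_steps.times do
    nxt = g.map do |row|
      s * row.zip(rank).sum { |gij, rj| gij * rj } + t / n
    end
    change = nxt.zip(rank).sum { |a, b| (a - b).abs }
    rank = nxt
    break if change < tolerance
  end
  rank
end

# The "all roads lead to K" matrix used in the examples later in the deck
p pagerank_iterative([[0, 0, 0], [0.5, 0, 0], [0.5, 1, 1]])
```

The result should land close to 0.05, 0.07 and 0.88 for the three pages.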



Don’t trust me! Verify it yourself!

    $$R = t\,(I - s\,G)^{-1}\,E, \qquad R = \begin{bmatrix} r_1 \\ \vdots \\ r_N \end{bmatrix}$$

    where I is the identity matrix


                     Computing PageRank
                     in one swoop
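
One way to verify it: the steady state is the vector that no longer changes after one more "follow a link or teleport" step. Writing that vector as R (a symbol chosen here for the PageRank scores) and rearranging gives the closed form. The inverse exists because s < 1 and the columns of G sum to at most 1.

```latex
\begin{aligned}
R &= s\,G\,R + t\,E \\
R - s\,G\,R &= t\,E \\
(I - s\,G)\,R &= t\,E \\
R &= t\,(I - s\,G)^{-1}\,E
\end{aligned}
```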



Enough hand-waving, dammit!
                                                                                  show me the code




Hot, Fast, Awesome

                                                                  Birth of EM-Proxy
                                                                            flash of the obvious




http://rb-gsl.rubyforge.org/




                                                                        Hot, Fast, Awesome




                       Click there! … Give yourself a weekend.


http://ruby-gsl.sourceforge.net/
                       Click there! … Give yourself a weekend.


require "gsl"
include GSL

# INPUT: link structure matrix (NxN)
# OUTPUT: pagerank scores
def pagerank(g)
  # Verify NxN
  raise ArgumentError, "link matrix must be NxN" if g.size1 != g.size2

  # Constants…
  i = Matrix.I(g.size1)                          # identity matrix
  p = (1.0 / g.size1) * Matrix.ones(g.size1, 1)  # teleportation vector

  s = 0.85   # probability of following a link
  t = 1 - s  # probability of teleportation

  # PageRank!
  t * ((i - s * g).invert) * p
end


                     PageRank in Ruby
                     6 lines, or less
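
If rb-gsl is not installed, the same closed form can be checked with plain arrays. The sketch below is not the code from the talk: it solves (I - sG) r = t p by Gauss-Jordan elimination rather than forming the inverse, and `pagerank_direct` is a hypothetical name. It assumes a uniform teleportation vector, as above.

```ruby
# Solve (I - s*G) * r = t * p by Gauss-Jordan elimination instead of
# forming the inverse. Plain arrays only, so no rb-gsl is needed.
def pagerank_direct(g, s = 0.85)
  n = g.size
  raise ArgumentError, "matrix must be NxN" unless g.all? { |row| row.size == n }
  t = 1.0 - s
  # Augmented matrix [I - s*G | t*p] with p uniform (1/N per page)
  a = Array.new(n) do |i|
    Array.new(n) { |j| (i == j ? 1.0 : 0.0) - s * g[i][j] } << t / n
  end
  n.times do |col|
    pivot = (col...n).max_by { |row| a[row][col].abs }   # partial pivoting
    a[col], a[pivot] = a[pivot], a[col]
    n.times do |row|
      next if row == col
      factor = a[row][col] / a[col][col]
      (col..n).each { |k| a[row][k] -= factor * a[col][k] }
    end
  end
  # The left block is now diagonal, so each score is rhs / diagonal
  Array.new(n) { |i| a[i][n] / a[i][i] }
end

p pagerank_direct([[0, 0, 0], [0.5, 0, 0], [0.5, 1, 1]])
```

For large, sparse link graphs neither inversion nor dense elimination scales, so treat this as a way to check small examples by hand.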



[Figure: three pages X, N and K linked in a circle, each with P = 0.33]

        pagerank(Matrix[[0,0,1], [1,0,0], [0,1,0]])
        > [0.33, 0.33, 0.33]


                     Ex: Circular Web
                     testing intuition…




[Figure: three-page graph with nodes X, N and K; scores P = 0.05, P = 0.07 and P = 0.87 (K)]

        pagerank(Matrix[[0,0,0], [0.5,0,0], [0.5,1,1]])
        > [0.05, 0.07, 0.87]


                     Ex: All roads lead to K
                     testing intuition…




PageRank + Ferret
                                                                             awesome search, ftw!




[Figure: three-page graph; page 1: P = 0.05, page 2: P = 0.07, page 3: P = 0.87]

require 'ferret'
include Ferret

index = Index::Index.new()

index << {:title => "1", :content => "it is what it is", :pr => 0.05 }
index << {:title => "2", :content => "what is it", :pr => 0.07 }
index << {:title => "3", :content => "it is a banana", :pr => 0.87 }


                     Store PageRank




# TF-IDF Search
index.search_each('content:"world"') do |id, score|
  puts "Score: #{score}, #{index[id][:title]} (PR: #{index[id][:pr]})"
end

puts "*" * 50

# PageRank FTW!
sf_pr = Search::SortField.new(:pr, :type => :float, :reverse => true)

index.search_each('content:"world"', :sort => sf_pr) do |id, score|
  puts "Score: #{score}, #{index[id][:title]}, (PR: #{index[id][:pr]})"
end

# Others:
#   Score: 0.267119228839874, 3 (PR: 0.87)
#   Score: 0.17807948589325, 1 (PR: 0.05)
#   Score: 0.17807948589325, 2 (PR: 0.07)
#   ***********************************
# Google:
#   Score: 0.267119228839874, 3, (PR: 0.87)
#   Score: 0.17807948589325, 2, (PR: 0.07)
#   Score: 0.17807948589325, 1, (PR: 0.05)
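
The same idea without Ferret, as a rough sketch: find the matching pages with an inverted index, then order them by their stored PageRank instead of by text score. `PAGES`, `INDEX` and `search_by_pagerank` are names made up for this example, and the matching is a plain "contains every word" lookup rather than Ferret's TF-IDF scoring.

```ruby
require "set"

# Hypothetical stand-in for the Ferret index: page contents plus the
# PageRank scores stored alongside each page.
PAGES = {
  "1" => { content: "it is what it is", pr: 0.05 },
  "2" => { content: "what is it",       pr: 0.07 },
  "3" => { content: "it is a banana",   pr: 0.87 }
}

# Inverted index: word => set of page titles containing it
INDEX = Hash.new { |hash, word| hash[word] = Set.new }
PAGES.each do |title, page|
  page[:content].split(/\s+/).each { |word| INDEX[word] << title }
end

# Pages containing every query word, highest PageRank first
def search_by_pagerank(query)
  hits = query.split(/\s+/).map { |word| INDEX.fetch(word, Set.new) }.reduce(:&)
  (hits || Set.new).sort_by { |title| -PAGES[title][:pr] }
end

p search_by_pagerank("what is")   # => ["2", "1"]
p search_by_pagerank("it is")     # => ["3", "2", "1"]
```

A real engine would blend the text score and PageRank rather than sort on PageRank alone; this only shows the reordering step.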



Search*: Graphs are ubiquitous!
                                                 PageRank is a general purpose hammer




    Username               GitCred
    ==============================
    37signals              10.00
    imbriaco               9.76
    why                    8.74
    rails                  8.56
    defunkt                8.17
    technoweenie           7.83
    jeresig                7.60
    mojombo                7.51
    yui                    7.34
    drnic                  7.34
    pjhyett                6.91
    wycats                 6.85
    dhh                    6.84

    http://bit.ly/3YQPU

                     PageRank + Social Graph
                     GitHub




Hmm…




                                                                  Analyze the social graph:
                                                                  - Filter messages by ‘TwitterRank’
                                                                  - Suggest users by ‘TwitterRank’
                                                                  -…
                                                     PageRank + Social Graph
                                                                                                   Twitter




PageRank + Product Graph
                                                                                          E-commerce

                                  Link items purchased in same cart… Run PR on it.
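
A minimal sketch of the "link items purchased in same cart" step, assuming carts arrive as arrays of item names. `cart_graph` and the sample carts are made up for illustration; the output is a column-stochastic matrix that can be handed to any PageRank routine.

```ruby
# Build a column-stochastic co-purchase matrix: items bought in the same
# cart link to each other, and each item splits its weight across the
# items it was bought with.
def cart_graph(carts)
  items = carts.flatten.uniq.sort
  pos = items.each_with_index.to_h       # item name => matrix index
  n = items.size
  counts = Array.new(n) { Array.new(n, 0.0) }
  carts.each do |cart|
    cart.uniq.permutation(2) { |a, b| counts[pos[b]][pos[a]] += 1.0 }
  end
  # Normalise each column so it sums to 1 (items never co-purchased
  # keep an all-zero column in this sketch)
  n.times do |j|
    total = counts.sum { |row| row[j] }
    counts.each { |row| row[j] /= total } if total > 0
  end
  [items, counts]
end

carts = [%w[book lamp], %w[book pen], %w[book lamp pen]]   # made-up data
items, g = cart_graph(carts)
p items
g.each { |row| p row }
```

Here "book" appears in every cart, so both "lamp" and "pen" send most of their weight to it, and it would come out on top once PageRank is run on `g`.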



PageRank = Powerful Hammer
                                                                                          use it!




Personalization
                                                                      how would you do it?




    $$E = \begin{bmatrix} 1/N \\ \vdots \\ 1/N \end{bmatrix}$$

    Teleportation distribution doesn’t have to be uniform!

    yahoo.com is my homepage!


                     PageRank + Personalization
                     customize the teleportation vector
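
To see what a custom teleportation vector does, here is a small sketch that runs the iterative update with any distribution `e` in place of the uniform one. The method name, the fixed step count and the 3-page circular web are assumptions for illustration; `e` must sum to 1.

```ruby
# PageRank with a custom teleportation vector e (must sum to 1).
# g is assumed column-stochastic.
def personalized_pagerank(g, e, s: 0.85, steps: 200)
  t = 1.0 - s
  rank = e.dup
  steps.times do
    rank = g.each_with_index.map do |row, i|
      s * row.zip(rank).sum { |gij, rj| gij * rj } + t * e[i]
    end
  end
  rank
end

g = [[0, 0, 1],
     [1, 0, 0],
     [0, 1, 0]]                         # hypothetical circular web
uniform  = [1.0 / 3] * 3
homepage = [1.0, 0.0, 0.0]              # always teleport back to page 1
p personalized_pagerank(g, uniform)     # roughly a third each
p personalized_pagerank(g, homepage)    # page 1 highest, then 2, then 3
```

With the uniform vector every page in the circle scores the same; pinning teleportation to one "homepage" tilts the scores toward that page and the pages it leads to.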




Make pages with links!




                                                                      Gaming PageRank
       http://bit.ly/pagerank-spam                       for fun and profit (I don’t endorse it)




Slides: http://bit.ly/railsconf-pagerank

    Ferret: http://bit.ly/ferret
    RB-GSL: http://bit.ly/rb-gsl

    PageRank on Wikipedia: http://bit.ly/wp-pagerank
    Gaming PageRank: http://bit.ly/pagerank-spam

    Michael Nielsen’s lectures on PageRank:
    http://michaelnielsen.org/blog


                                                                                     Questions?

                               The slides…                           Twitter                           My blog


Building Mini Google in Ruby

  • 1. Building Mini-Google in Ruby Ilya Grigorik @igrigorik Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 2. postrank.com/topic/ruby The slides… Twitter My blog Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 3. Ruby + Math PageRank Optimization Examples Indexing Misc Fun Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 4. PageRank PageRank + Ruby Tools + Examples Indexing Optimization Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 5. Consume with care… everything that follows is based on released / public domain info Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 6. Search-engine graveyard Google did pretty well… Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 7. Query: Ruby Results 1. Crawl 2. Index 3. Rank Search pipeline 50,000-foot view Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 8. Query: Ruby Results 1. Crawl 2. Index 3. Rank Bah Interesting Fun Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 9. CPU Speed 333Mhz RAM 32-64MB Index 27,000,000 documents Index refresh once a month~ish PageRank computation several days Laptop CPU 2.1Ghz VM RAM 1GB 1-Million page web ~10 minutes circa 1997-1998 Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 10. Creating & Maintaining an Inverted Index DIY and the gotchas within Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 11. require 'set' { quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>, pages = { quot;aquot;=>#<Set: {quot;3quot;}>, quot;1quot; => quot;it is what it isquot;, quot;bananaquot;=>#<Set: {quot;3quot;}>, quot;2quot; => quot;what is itquot;, quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>, quot;3quot; => quot;it is a bananaquot; quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>} } } index = {} pages.each do |page, content| content.split(/s/).each do |word| if index[word] index[word] << page else index[word] = Set.new(page) end end end Building an Inverted Index Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 12. require 'set' { quot;itquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>, pages = { quot;aquot;=>#<Set: {quot;3quot;}>, quot;1quot; => quot;it is what it isquot;, quot;bananaquot;=>#<Set: {quot;3quot;}>, quot;2quot; => quot;what is itquot;, quot;whatquot;=>#<Set: {quot;1quot;, quot;2quot;}>, quot;3quot; => quot;it is a bananaquot; quot;isquot;=>#<Set: {quot;1quot;, quot;2quot;, quot;3quot;}>} } } index = {} pages.each do |page, content| content.split(/s/).each do |word| if index[word] index[word] << page else index[word] = Set.new(page) end end end Building an Inverted Index Building Mini-Google in Ruby http://bit.ly/railsconf-pagerank @igrigorik #railsconf
  • 13. Building an Inverted Index (same code as slide 11), annotated: Word => [Document]
  • 14. Querying the index
        # query: "what is banana"
        p index["what"] & index["is"] & index["banana"]
        # > #<Set: {}>

        # query: "a banana"
        p index["a"] & index["banana"]
        # > #<Set: {"3"}>

        # query: "what is"
        p index["what"] & index["is"]
        # > #<Set: {"1", "2"}>

        # index being queried:
        # { "it"=>#<Set: {"1", "2", "3"}>,
        #   "a"=>#<Set: {"3"}>,
        #   "banana"=>#<Set: {"3"}>,
        #   "what"=>#<Set: {"1", "2"}>,
        #   "is"=>#<Set: {"1", "2", "3"}> }
  • 17. Querying the index (same queries as slide 14). For "what is" the result is #<Set: {"1", "2"}>. What order? [1, 2] or [2, 1]
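The intersections on the querying slides can be wrapped in a small helper. This is a minimal sketch, not code from the deck: the `search` method name and its handling of unknown words are assumptions made for illustration.

```ruby
require 'set'

# Hypothetical helper (not from the slides): intersect the posting sets
# of every word in the query. A word missing from the index contributes
# an empty set, so the whole query then matches nothing.
def search(index, query)
  words = query.split(/\s+/)
  return Set.new if words.empty?
  words.map { |word| index.fetch(word, Set.new) }.reduce(:&)
end

# The index built on the earlier slides, written out by hand.
index = {
  "it"     => Set["1", "2", "3"],
  "a"      => Set["3"],
  "banana" => Set["3"],
  "what"   => Set["1", "2"],
  "is"     => Set["1", "2", "3"]
}

p search(index, "what is banana")
p search(index, "a banana")
p search(index, "what is")
```

The result is a set, so it carries no ordering: deciding whether document 1 or 2 comes first is a ranking question that the index alone cannot answer.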
  • 18. Building an Inverted Index (same code as slide 11), now with open questions: PDF, HTML, RSS? Lowercase / Upcase? Compact Index? Stop words? Persistence? Hmmm?
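Two of the gotchas above, case folding and stop words, can be handled in the tokenizer. The sketch below is one possible approach under stated assumptions: the `tokenize` and `build_index` names and the three-word stop list are illustrative, not part of the deck.

```ruby
require 'set'

# Lowercase the text, keep only alphanumeric runs, and drop stop words.
# The default stop list is a tiny illustrative assumption.
def tokenize(text, stop_words = Set["a", "is", "it"])
  text.downcase.scan(/[a-z0-9]+/).reject { |word| stop_words.include?(word) }
end

# Word => Set of page ids, created on first use.
def build_index(pages)
  index = Hash.new { |hash, word| hash[word] = Set.new }
  pages.each do |page, content|
    tokenize(content).each { |word| index[word] << page }
  end
  index
end

pages = {
  "1" => "It is what it is",
  "2" => "What is it?",
  "3" => "It is a BANANA"
}

index = build_index(pages)
p index
```

With this stop list only "what" and "banana" survive, which also shrinks the index. Format handling (PDF, HTML, RSS) and persistence are left out of the sketch.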
  • 20. Ferret is a high-performance, full-featured text search engine library written for Ruby
  • 21. require 'ferret'
        include Ferret

        index = Index::Index.new()

        index << {:title => "1", :content => "it is what it is"}
        index << {:title => "2", :content => "what is it"}
        index << {:title => "3", :content => "it is a banana"}

        index.search_each('content:"banana"') do |id, score|
          puts "Score: #{score}, #{index[id][:title]} "
        end

        # > Score: 1.0, 3
  • 22. Same Ferret example as slide 21 (> Score: 1.0, 3), annotated: Hmmm?
  • 23. Ferret classes
        Ferret::Analysis: Analyzer, AsciiLetterAnalyzer, AsciiLetterTokenizer, AsciiLowerCaseFilter, AsciiStandardAnalyzer, AsciiStandardTokenizer, AsciiWhiteSpaceAnalyzer, AsciiWhiteSpaceTokenizer, HyphenFilter, LetterAnalyzer, LetterTokenizer, LowerCaseFilter, MappingFilter, PerFieldAnalyzer, RegExpAnalyzer, RegExpTokenizer, StandardAnalyzer, StandardTokenizer, StemFilter, StopFilter, Token, TokenStream, WhiteSpaceAnalyzer, WhiteSpaceTokenizer
        Ferret::Search: BooleanQuery, ConstantScoreQuery, Explanation, Filter, FilteredQuery, FuzzyQuery, Hit, MatchAllQuery, MultiSearcher, MultiTermQuery, PhraseQuery, PrefixQuery, Query, QueryFilter, RangeFilter, RangeQuery, Searcher, Sort, SortField, TermQuery, TopDocs, TypedRangeFilter, TypedRangeQuery, WildcardQuery
  • 24. ferret.davebalmain.com/trac
  • 25. Ranking Results: 0-60 with PageRank…
  • 26. Naïve: Term Frequency
        index.search_each('content:"the brown cow"') do |id, score|
          puts "Score: #{score}, #{index[id][:title]} "
        end

        # > Score: 0.827, 3
        # > Score: 0.523, 5
        # > Score: 0.125, 4
        Relevance?

        Term occurrences per document:
                   Doc 3   Doc 5   Doc 4
          the        4       3       5
          brown      1       3       1
          cow        1       4       1
          Score      6      10       7
  • 27. Naïve: Term Frequency (same query and table as slide 26), with the row for "the" (4, 3, 5) annotated: Skew
  • 28. TF-IDF: Term Frequency * Inverse Document Frequency
        Score = TF * IDF
        TF  = # occurrences / # words
        IDF = ln(# docs / # docs with W)

        Term occurrences per document:
                   Doc 3   Doc 5   Doc 4
          the        4       3       5      (Skew)
          brown      1       3       1
          cow        1       4       1

        # of docs containing each term: the 6, brown 3, cow 4
        Total # of documents: 10
  • 29. TF-IDF, worked for Doc # 3
        Term occurrences per document:
                   Doc 3   Doc 5   Doc 4
          the        4       3       5
          brown      1       3       1
          cow        1       4       1

        # of docs containing each term: the 6, brown 3, cow 4
        Total # of documents: 10
        # words in document: 10

        Doc # 3 score for ‘the’:   4/10 * ln(10/6) = 0.204
        Doc # 3 score for ‘brown’: 1/10 * ln(10/3) = 0.120
        Doc # 3 score for ‘cow’:   1/10 * ln(10/4) = 0.092
        Score = 0.204 + 0.120 + 0.092 = 0.416
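The Doc # 3 arithmetic can be reproduced in a few lines of Ruby. This is a sketch of the slide's formula only; the `tf_idf` method name and its argument order are assumptions, and the counts are the ones shown on the slide.

```ruby
# tf  = occurrences of the term in the document / words in the document
# idf = ln(total documents / documents containing the term)
def tf_idf(occurrences, words_in_doc, docs_total, docs_with_term)
  tf  = occurrences.to_f / words_in_doc
  idf = Math.log(docs_total.to_f / docs_with_term)
  tf * idf
end

doc3_counts = { "the" => 4, "brown" => 1, "cow" => 1 }  # occurrences in Doc # 3
doc_freq    = { "the" => 6, "brown" => 3, "cow" => 4 }  # documents containing each term

# Doc # 3 has 10 words and the collection has 10 documents.
score = doc3_counts.sum do |term, count|
  tf_idf(count, 10, 10, doc_freq[term])
end

puts score.round(3)  # prints 0.416
```

Although "the" occurs four times in the document, its low IDF keeps it from swamping the rarer terms the way it does in the raw term-frequency score.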
  • 30. Frequency Matrix
                   W1   W2   …   WN
          Doc 1    15   23   …
          Doc 2    24   12   …
          …        …    …    …
          Doc K

        Size = N * K * size of Ruby object
        Pages = N = 10,000; Words = K = 2,000; Ruby Object = 20+ bytes
        Footprint = 384 MB. Ouch.
  • 31. NArray is a Numerical N-dimensional Array class (implemented in C). http://narray.rubyforge.org/
        # create new NArray. initialize with 0.
        NArray.new(typecode, size, ...)

        NArray.byte(size,...)      # 1 byte unsigned integer
        NArray.sint(size,...)      # 2 byte signed integer
        NArray.int(size,...)       # 4 byte signed integer
        NArray.sfloat(size,...)    # single precision float
        NArray.float(size,...)     # double precision float
        NArray.scomplex(size,...)  # single precision complex
        NArray.complex(size,...)   # double precision complex
        NArray.object(size,...)    # Ruby object
  • 32. NArray is a Numerical N-dimensional Array class (implemented in C). http://narray.rubyforge.org/
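A back-of-the-envelope comparison shows why a typed array helps. This sketch only does the arithmetic and does not call NArray; the 20-byte object size is the lower bound quoted on the frequency-matrix slide, and the 1-byte cell is an assumption that every count fits in an unsigned byte (the `NArray.byte` case).

```ruby
pages = 10_000
words = 2_000

bytes_per_object = 20  # lower bound from the slide ("20+ bytes")
bytes_per_cell   = 1   # assumption: each count fits in one unsigned byte

object_matrix = pages * words * bytes_per_object
packed_matrix = pages * words * bytes_per_cell

# Convert a byte count to mebibytes.
def to_mib(bytes)
  bytes / (1024.0 * 1024.0)
end

puts "Ruby objects: #{to_mib(object_matrix).round(1)} MiB"
puts "1-byte cells: #{to_mib(packed_matrix).round(1)} MiB"
```

At exactly 20 bytes per object the matrix comes to 400,000,000 bytes, in the same range as the 384 MB on the slide; the exact figure depends on the per-object size assumed. One byte per cell cuts that by a factor of 20.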
  • 33. PageRank, the Google juice: links as votes. Problem: link gaming
  • 34. Random Surfer, a powerful abstraction. P = 0.85: follow a link from the page he/she is currently on. P = 0.15: teleport to a random location on the web.
  • 35. Surfin’: rinse & repeat, ad nauseam. Follow a link from the page he/she is currently on, or teleport to a random location on the web. [Figure: Page K, Page N, Page M]
  • 36. Surfin’: rinse & repeat, ad nauseam. On Page P, clicks on link to K (P = 0.85). On Page K, clicks on link to M (P = 0.85). On Page M, teleports to X (P = 0.15). …
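The random surfer can be simulated directly. The sketch below is an illustration under assumptions: the three-page `links` graph, the `surf` method, and the fixed seed are all made up for the example. At each step the surfer follows a random out-link with probability 0.85 and teleports to a uniformly random page otherwise.

```ruby
# Hypothetical toy web: page => pages it links to.
links = {
  "X" => ["N", "K"],
  "N" => ["K"],
  "K" => ["K"]
}

# Walk the graph for `steps` steps and return the fraction of steps
# spent on each page. The seeded generator keeps runs repeatable.
def surf(links, steps, follow_prob = 0.85, rng = Random.new(42))
  pages  = links.keys
  visits = Hash.new(0)
  page   = pages.sample(random: rng)
  steps.times do
    out = links[page]
    page = if !out.empty? && rng.rand < follow_prob
             out.sample(random: rng)    # follow a link
           else
             pages.sample(random: rng)  # teleport
           end
    visits[page] += 1
  end
  visits.transform_values { |count| count.to_f / steps }
end

freq = surf(links, 100_000)
freq.sort.each { |page, share| puts "#{page}: #{share.round(3)}" }
```

The visit frequencies are an estimate of each page's PageRank: the page that everything links to ends up with most of the surfer's time.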
  • 37. Analyzing the Web Graph: extracting PageRank. [Figure: web graph with pages X, N, M, K labeled P = 0.05, P = 0.20, P = 0.15, P = 0.6]
  • 38. What is PageRank? It’s a scalar!
  • 39. What is PageRank? It’s a probability! [Figure: web graph with pages X, N, M, K labeled P = 0.05, P = 0.20, P = 0.15, P = 0.6]
  • 40. What is PageRank? It’s a probability! Higher Pr, Higher Importance? [Figure: web graph with pages X, N, M, K labeled P = 0.05, P = 0.20, P = 0.15, P = 0.6]
  • 41. Teleportation? sci-fi fans, … ?
  • 42. Reasons for teleportation: enumerating edge cases. 1. No in-links! 2. No out-links! 3. Isolated Web. [Figure: pages X, N, K, M]
  • 43. Exploring Graphs (gratr.rubyforge.com)
        Breadth First Search; Depth First Search; A* Search; Lexicographic Search; Dijkstra’s Algorithm; Floyd-Warshall; Triangulation and Comparability detection

        require 'gratr/import'

        dg = Digraph[1,2, 2,3, 2,4, 4,5, 6,4, 1,6]
        dg.directed?   # true
        dg.vertex?(4)  # true
        dg.edge?(2,4)  # true
        dg.vertices    # [5, 6, 1, 2, 3, 4]

        Graph[1,2,1,3,1,4,2,5].bfs  # [1, 2, 3, 4, 5]
        Graph[1,2,1,3,1,4,2,5].dfs  # [1, 2, 5, 3, 4]
  • 44. Teleportation probabilities: P(T) = 0.15 / # of pages. [Figure: pages X, N, K, M, each labeled P(T) = 0.03]
  • 45. PageRank: Simplified Mathematical Def’n (cause that’s how we roll)
        Assume the web is N pages big.
        Assume that the probability of teleportation (t) is 0.15, and of following a link (s) is 0.85.
        Assume that the teleportation probability (E) is uniform.
        Assume that you start on any random page (uniform distribution L), then
          $L = E = \begin{bmatrix} 0.15/N \\ \vdots \\ 0.15/N \end{bmatrix}$
        Then after one step, the probability you're on page X is:
          $L \cdot (sG + tE)$, that is, $L \cdot (0.85 \cdot G + 0.15 \cdot E)$
  • 46. G = The Link Graph: ginormous and sparse. Huge!
                 1   2   …   …   N
            1    1   0   …   …   0     (No link from 1 to N)
            2    0   1   …   …   1
            …    …   …   …   …   …
            …    …   …   …   …   …
            N    0   1   …   …   1
  • 47. G as a dictionary: more compact…
        # Page => Links to…
        {
          "1" => [25, 26],
          "2" => [1],
          "5" => [123, 2],
          "6" => [67, 1]
        }
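A dictionary like this can be expanded into the dense matrix form when a matrix routine needs it. The sketch below is an assumption-laden illustration: the four-page `links` hash and the `link_matrix` method are hypothetical, page ids are 0-based, and each column holds one source page's out-link probabilities (the layout in which every column sums to 1).

```ruby
# Hypothetical 4-page web: page id => page ids it links to.
links = { 0 => [1, 2], 1 => [2], 2 => [0], 3 => [2] }

# Build an n x n array of arrays where g[to][from] is the probability
# of moving from page `from` to page `to` by following a random link.
def link_matrix(links, n)
  g = Array.new(n) { Array.new(n, 0.0) }
  links.each do |from, targets|
    targets.each { |to| g[to][from] = 1.0 / targets.size }
  end
  g
end

g = link_matrix(links, 4)
g.each { |row| p row }
```

The hash stores only the links that exist, while the matrix stores N * N cells that are mostly zero, which is why the dictionary form is the more compact one for a sparse web.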
  • 48. Computing PageRank the tedious way: follow a link from the page he/she is currently on, or teleport to a random location on the web. [Figure: Page K]
  • 49. Computing PageRank in one swoop. Don’t trust me! Verify it yourself!
          $R = \begin{bmatrix} r_1 \\ \vdots \\ r_N \end{bmatrix} = t\,(I - sG)^{-1} E$, where $I$ is the identity matrix
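For readers who do want to verify it, here is one way to arrive at the closed form, sketched under assumptions that are not spelled out on the slide: $G$ is the link matrix with each column summing to 1, $p$ is the teleportation vector (named after `p` in the deck's Ruby code, with uniform entries $1/N$), and $r$ is the steady-state rank vector, which by definition is unchanged by one more surfing step.

```latex
\begin{aligned}
r &= s\,G\,r + t\,p            && \text{one more step leaves the steady state unchanged} \\
(I - sG)\,r &= t\,p            && \text{collect the terms in } r \\
r &= t\,(I - sG)^{-1}\,p       && \text{solve for } r
\end{aligned}
```

The inverse exists because $s < 1$ and no column of $G$ sums to more than 1, so $sG$ cannot have an eigenvalue equal to 1.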
  • 50. Enough hand-waving, dammit! show me the code
  • 51. Hot, Fast, Awesome Birth of EM-Proxy flash of the obvious
  • 52. http://rb-gsl.rubyforge.org/ Hot, Fast, Awesome. Click there! … Give yourself a weekend.
  • 53. http://ruby-gsl.sourceforge.net/ Click there! … Give yourself a weekend.
  • 54. PageRank in Ruby: 6 lines, or less
        require "gsl"
        include GSL

        # INPUT: link structure matrix (NxN)
        # OUTPUT: pagerank scores
        def pagerank(g)
          raise if g.size1 != g.size2  # verify NxN

          i = Matrix.I(g.size1)                       # identity matrix
          p = (1.0/g.size1) * Matrix.ones(g.size1,1)  # teleportation vector

          s = 0.85  # probability of following a link
          t = 1-s   # probability of teleportation

          t*((i-s*g).invert)*p
        end
  • 55. PageRank in Ruby: 6 lines, or less (same code as slide 54), with callout: Constants…
  • 56. PageRank in Ruby: 6 lines, or less (same code as slide 54), with callout on the last line, t*((i-s*g).invert)*p: PageRank!
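When GSL is not available, the same scores can be approximated without a matrix inverse. The sketch below swaps in power iteration, a different technique from the deck's closed-form solve: it repeats the update r = s*G*r + t*p until the vector settles. The `pagerank_iterative` name, the plain array-of-arrays input, and the fixed 200 iterations are assumptions for illustration.

```ruby
# g is an NxN array of arrays; column j holds the out-link
# probabilities of page j, so g[i][j] is the chance of moving j -> i.
def pagerank_iterative(g, s = 0.85, iterations = 200)
  n = g.size
  t = 1.0 - s
  teleport = Array.new(n, 1.0 / n)  # uniform teleportation vector
  r = teleport.dup                  # start from the uniform distribution
  iterations.times do
    r = (0...n).map do |row|
      s * (0...n).sum { |col| g[row][col] * r[col] } + t * teleport[row]
    end
  end
  r
end

# Link matrix from the "All roads lead to K" example in the deck.
g = [[0, 0, 0], [0.5, 0, 0], [0.5, 1, 1]]
ranks = pagerank_iterative(g)
puts ranks.map { |x| x.round(2) }.inspect
```

Each pass shrinks the remaining error by roughly a factor of s, so 200 passes are far more than this small example needs. For a large sparse web, iterating like this avoids building and inverting a dense NxN matrix.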
  • 57. Ex: Circular Web (testing intuition…)
        pagerank(Matrix[[0,0,1], [0,0,1], [1,0,0]])
        > [0.33, 0.33, 0.33]
        [Figure: pages X, N, K, each with P = 0.33]
  • 58. Ex: All roads lead to K (testing intuition…)
        pagerank(Matrix[[0,0,0], [0.5,0,0], [0.5,1,1]])
        > [0.05, 0.07, 0.87]
        [Figure: page X with P = 0.05, page N with P = 0.07, page K with P = 0.87]
  • 59. PageRank + Ferret: awesome search, ftw!
  • 60. Store PageRank
        require 'ferret'
        include Ferret

        index = Index::Index.new()
        index << {:title => "1", :content => "it is what it is", :pr => 0.05 }
        index << {:title => "2", :content => "what is it", :pr => 0.07 }
        index << {:title => "3", :content => "it is a banana", :pr => 0.87 }
        [Figure: pages 1, 2, 3 labeled P = 0.05, P = 0.07, P = 0.87]
  • 61. TF-IDF Search
        index.search_each('content:"world"') do |id, score|
          puts "Score: #{score}, #{index[id][:title]} (PR: #{index[id][:pr]})"
        end

        puts "*" * 50

        sf_pr = Search::SortField.new(:pr, :type => :float, :reverse => true)
        index.search_each('content:"world"', :sort => sf_pr) do |id, score|
          puts "Score: #{score}, #{index[id][:title]}, (PR: #{index[id][:pr]})"
        end

        # Score: 0.267119228839874, 3 (PR: 0.87)
        # Score: 0.17807948589325, 1 (PR: 0.05)
        # Score: 0.17807948589325, 2 (PR: 0.07)
        # ***********************************
        # Score: 0.267119228839874, 3, (PR: 0.87)
        # Score: 0.17807948589325, 2, (PR: 0.07)
        # Score: 0.17807948589325, 1, (PR: 0.05)
  • 62. Same search code and output as slide 61, with the :pr SortField search annotated: PageRank FTW!
  • 63. Same search code and output as slide 61. The first listing (documents 3, 1, 2, TF-IDF order) is labeled Others; the second (documents 3, 2, 1, sorted by PR) is labeled Google.
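The effect of the PR sort can be reproduced without Ferret. This sketch uses plain Ruby hashes filled with the scores printed on the slide; the `hits` structure is an assumption made for illustration and is not how Ferret represents results.

```ruby
# Hits as printed on the slide: TF-IDF score plus stored PageRank.
hits = [
  { title: "3", score: 0.267119228839874, pr: 0.87 },
  { title: "1", score: 0.17807948589325,  pr: 0.05 },
  { title: "2", score: 0.17807948589325,  pr: 0.07 }
]

# Highest PageRank first, mirroring the :reverse => true sort field.
by_pr = hits.sort_by { |hit| -hit[:pr] }

by_pr.each do |hit|
  puts "Score: #{hit[:score]}, #{hit[:title]}, (PR: #{hit[:pr]})"
end
```

Documents 1 and 2 tie on TF-IDF score, so text relevance alone cannot order them; the stored PageRank breaks the tie and puts document 2 ahead.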
  • 64. Search*: Graphs are ubiquitous! PageRank is a general purpose hammer
  • 65. PageRank + Social Graph: GitHub (http://bit.ly/3YQPU)
        Username        GitCred
        ==============================
        37signals       10.00
        imbriaco         9.76
        why              8.74
        rails            8.56
        defunkt          8.17
        technoweenie     7.83
        jeresig          7.60
        mojombo          7.51
        yui              7.34
        drnic            7.34
        pjhyett          6.91
        wycats           6.85
        dhh              6.84
  • 66. PageRank + Social Graph: Twitter. Hmm… Analyze the social graph: filter messages by ‘TwitterRank’; suggest users by ‘TwitterRank’; …
  • 67. PageRank + Product Graph: E-commerce. Link items purchased in same cart… Run PR on it.
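Building the product graph is the only new step here. The sketch below links every pair of items that appear in the same cart; the `carts` data and the `co_purchase_graph` method are hypothetical, and the links are treated as two-way, which is one reasonable reading of "purchased in same cart".

```ruby
require 'set'

# Hypothetical order history: each cart is a list of item ids.
carts = [
  ["book", "lamp"],
  ["book", "pen", "lamp"],
  ["pen", "ink"]
]

# item => Set of items bought together with it at least once.
def co_purchase_graph(carts)
  graph = Hash.new { |hash, item| hash[item] = Set.new }
  carts.each do |cart|
    cart.uniq.combination(2) do |a, b|
      graph[a] << b
      graph[b] << a
    end
  end
  graph
end

graph = co_purchase_graph(carts)
graph.each { |item, neighbours| puts "#{item}: #{neighbours.to_a.sort.join(', ')}" }
```

The result has the same page => links shape as the link dictionary for web pages, so it can be fed to a PageRank routine unchanged.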
  • 68. PageRank = Powerful Hammer: use it!
  • 69. Personalization: how would you do it?
  • 70. PageRank + Personalization: customize the teleportation vector. Teleportation distribution doesn’t have to be uniform! yahoo.com is my homepage!
          $E = \begin{bmatrix} 0.15/N \\ \vdots \\ 0.15/N \end{bmatrix}$
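A non-uniform teleportation vector is a one-argument change to an iterative PageRank. The sketch below is an illustration under assumptions: it uses power iteration rather than the deck's GSL inverse, the `personalized_pagerank` name and the three-page ring are made up, and `teleport` must be a probability distribution over pages (entries summing to 1).

```ruby
# g[i][j] is the probability of moving from page j to page i via a link;
# teleport[i] is the probability that a teleport lands on page i.
def personalized_pagerank(g, teleport, s = 0.85, iterations = 200)
  n = g.size
  t = 1.0 - s
  r = Array.new(n, 1.0 / n)
  iterations.times do
    r = (0...n).map do |row|
      s * (0...n).sum { |col| g[row][col] * r[col] } + t * teleport[row]
    end
  end
  r
end

# Ring of three pages: 0 -> 1 -> 2 -> 0 (column = source page).
g = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]

uniform  = personalized_pagerank(g, [1.0 / 3, 1.0 / 3, 1.0 / 3])
homepage = personalized_pagerank(g, [1.0, 0.0, 0.0])  # every teleport lands on page 0

puts uniform.map  { |x| x.round(3) }.inspect
puts homepage.map { |x| x.round(3) }.inspect
```

With uniform teleportation the ring is perfectly symmetric and every page gets the same rank. Pinning the teleport to page 0, the "my homepage" case, lifts page 0 and lets the boost decay along the ring.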
  • 71. Gaming PageRank for fun and profit (I don’t endorse it): Make pages with links! http://bit.ly/pagerank-spam
  • 72. Questions?
        Slides: http://bit.ly/railsconf-pagerank
        Ferret: http://bit.ly/ferret
        RB-GSL: http://bit.ly/rb-gsl
        PageRank on Wikipedia: http://bit.ly/wp-pagerank
        Gaming PageRank: http://bit.ly/pagerank-spam
        Michael Nielsen’s lectures on PageRank: http://michaelnielsen.org/blog