Importance of Web Pages

          PageRank
   Topic-Specific PageRank
     Hubs and Authorities
    Combatting Link Spam
                             1
PageRank
 Intuition: solve the recursive equation:
“a page is important if important pages
link to it.”
 In high-falutin’ terms: importance =
the principal eigenvector of the
transition matrix of the Web.
  A few fixups needed.

                                            2
Transition Matrix of the Web
Enumerate pages.
Page i corresponds to row and column i.
M [i, j ] = 1/n if page j links to n
pages, including page i ; 0 if j does not
link to i.
  M [i, j ] is the probability we’ll next be at
  page i if we are now at page j.

                                                  3
Example: Transition Matrix
Suppose page j links to 3 pages, including i but not x.
[Figure: column j of M. The entry in row i is 1/3; the entry in row x is 0.]

                                                          4
Random Walks on the Web
Suppose v is a vector whose i-th
component is the probability that a
random walker is at page i at a certain
time.
If each walker follows a link from i at
random, the probability distribution for
walkers is then given by the vector M v.

                                       5
Random Walks – (2)
Starting from any vector v, the limit
M (M (…M (M v ) …)) is the long-term
distribution of walkers.
Intuition: pages are important in
proportion to how likely a walker is to
be there.
The math: limiting distribution =
principal eigenvector of M = PageRank.
                                          6
Example: The Web in 1839
[Figure: link graph of Yahoo (y), Amazon (a), and M’soft (m).]

         y    a    m
    y   1/2  1/2   0
    a   1/2   0    1
    m    0   1/2   0

                                            7
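A minimal sketch of how the matrix above could be built from out-link lists. The function name `transition_matrix` and the `links` dictionary are illustrative assumptions; the link lists are read off the columns of the example matrix rather than stated in the slides.

```python
import numpy as np

def transition_matrix(links, pages):
    """Build the column-stochastic matrix M from out-link lists.

    links maps each page to the list of distinct pages it links to.
    M[i, j] = 1/n when page j has n out-links and one of them goes to page i.
    """
    index = {p: k for k, p in enumerate(pages)}
    M = np.zeros((len(pages), len(pages)))
    for j, targets in links.items():
        for i in targets:
            M[index[i], index[j]] = 1.0 / len(targets)
    return M

pages = ["y", "a", "m"]
# Assumed out-links, inferred from the columns of the example matrix.
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = transition_matrix(links, pages)
print(M)
```

Each column of M sums to 1 as long as every page has at least one out-link.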
Solving The Equations
Because there are no constant terms,
the equations v = M v do not have a
unique solution: any multiple of a
solution is also a solution.
In Web-sized examples, we cannot
solve by Gaussian elimination anyway;
we need to use relaxation (= iterative
solution).
Iteration can work if you start with a
fixed v, which pins down the scale.
                                         8
Simulating a Random Walk
Start with the vector v = [1, 1,…, 1]
representing the idea that each Web
page is given one unit of importance.
Repeatedly apply the matrix M to v,
allowing the importance to flow like a
random walk.
About 50 iterations is sufficient to
estimate the limiting solution.
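The simulation described above can be sketched as follows, using the three-page Yahoo/Amazon/M’soft matrix from the earlier example. The function name `iterate` and the default of 50 steps are choices made for this sketch.

```python
import numpy as np

# Transition matrix of the three-page example (columns: y, a, m).
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

def iterate(M, steps=50):
    """Repeatedly apply M, starting from one unit of importance per page."""
    v = np.ones(M.shape[0])
    for _ in range(steps):
        v = M @ v
    return v

v = iterate(M)
print(v)  # close to [6/5, 6/5, 3/5]
```

Because every column of M sums to 1, the total importance stays at 3 throughout the iteration.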
                                         9
Example: Iterating Equations
Equations v = M v :
  y = y /2 + a /2
  a = y /2 + m
  m = a /2
Note: “=” is really “assignment.”

 y       1    1     5/4     9/8           6/5
 a =     1    3/2    1      11/8   ...    6/5
 m       1    1/2   3/4     1/2           3/5

                                                 10
The Walkers
[Figure sequence over four slides: walkers on the graph of Yahoo, Amazon, and M’soft.]
                          11-14
In the Limit …
[Figure: the limiting distribution of walkers over Yahoo, Amazon, and M’soft.]
                          15
Real-World Problems
 Some pages are “dead ends” (have no
links out).
  Such a page causes importance to leak out.
Other groups of pages are spider traps
(all out-links are within the group).
  Eventually spider traps absorb all importance.
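Both problems can be checked for directly on the link lists. The following is a small sketch; the helper names `dead_ends` and `is_spider_trap` and the two link dictionaries are hypothetical, chosen only to illustrate the definitions above.

```python
def dead_ends(links):
    """Return the set of pages that have no out-links."""
    return {page for page, targets in links.items() if not targets}

def is_spider_trap(group, links):
    """True if every page in the group has out-links and all of them stay inside the group."""
    group = set(group)
    return bool(group) and all(
        bool(links[p]) and set(links[p]) <= group for p in group
    )

# Hypothetical link lists for illustration.
links_dead = {"y": ["y", "a"], "a": ["y", "m"], "m": []}
links_trap = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}

print(dead_ends(links_dead))              # {'m'}
print(is_spider_trap({"m"}, links_trap))  # True
```

The check is only informative for proper subsets of the pages, since the set of all pages trivially keeps its out-links inside itself.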



                                               16
Microsoft Becomes Dead End
[Figure: link graph of Yahoo (y), Amazon (a), and M’soft (m), where M’soft has no out-links.]

         y    a    m
    y   1/2  1/2   0
    a   1/2   0    0
    m    0   1/2   0

                                         17
Example: Effect of Dead Ends
 Equations v = M v :
 y = y /2 + a /2
 a = y /2
 m = a /2

  y       1    1     3/4   5/8         0
  a =     1    1/2   1/2   3/8   ...   0
  m       1    1/2   1/4   1/4         0
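The leak can be watched numerically. This sketch iterates the dead-end matrix and records the total importance after each step; the variable names are illustrative.

```python
import numpy as np

# M'soft is a dead end, so its column is all zeros (columns: y, a, m).
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])

v = np.ones(3)
totals = []
for _ in range(50):
    v = M @ v
    totals.append(v.sum())

print(totals[:3])  # [2.0, 1.5, 1.25]
print(v)           # every component is close to 0
```

The total drops at every step because whatever importance reaches M’soft is not passed on to any page.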

                                           18
Microsoft Becomes a Dead End
[Figure sequence over four slides: walkers on the graph of Yahoo, Amazon, and M’soft, where M’soft has no out-links.]
                                19-22
In the Limit …
[Figure: the limiting distribution of walkers when M’soft is a dead end.]
                          23
M’soft Becomes Spider Trap
[Figure: link graph of Yahoo (y), Amazon (a), and M’soft (m), where M’soft links only to itself.]

         y    a    m
    y   1/2  1/2   0
    a   1/2   0    0
    m    0   1/2   1

                                         24
Example: Effect of Spider Trap
 Equations v = M v :
 y = y /2 + a /2
 a = y /2
 m = a /2 + m

  y       1    1     3/4   5/8         0
  a =     1    1/2   1/2   3/8   ...   0
  m       1    3/2   7/4   2           3

                                           25
Microsoft Becomes a Spider Trap
[Figure sequence over three slides: walkers on the graph of Yahoo, Amazon, and M’soft, where M’soft links only to itself.]
                                 26-28
In the Limit …
[Figure: the limiting distribution of walkers when M’soft is a spider trap.]
                          29
PageRank Solution to Traps, Etc.
   “Tax” each page a fixed percentage at
  each iteration.
   Add a fixed constant to all pages.
   Models a random walk with a fixed
  probability of leaving the system, and a
  fixed number of new walkers injected
  into the system at each step.

                                         30
Example: Microsoft is a Spider Trap; 20% Tax
 Equations v = 0.8(M v ) + 0.2:
  y = 0.8(y /2 + a/2) + 0.2
  a = 0.8(y /2) + 0.2
  m = 0.8(a /2 + m) + 0.2

   y       1   1.00 0.84   0.776          7/11
   a =     1   0.60 0.60   0.536 . . .    5/11
   m       1   1.40 1.56   1.688         21/11
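The taxed iteration in this example can be sketched as below. The function name `taxed_iteration` and the step count are assumptions of the sketch; `beta` is the fraction kept after the 20% tax.

```python
import numpy as np

# M'soft as a spider trap: it links only to itself (columns: y, a, m).
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])

def taxed_iteration(M, beta=0.8, steps=100):
    """Iterate v = beta * (M v) + (1 - beta), starting from all ones."""
    v = np.ones(M.shape[0])
    for _ in range(steps):
        v = beta * (M @ v) + (1.0 - beta)
    return v

v = taxed_iteration(M)
print(v * 11)  # close to [7, 5, 21]
```

M’soft still ends up with the largest share, but Yahoo and Amazon no longer drop to zero.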

                                                 31
General Case
 In this example, because there are no
dead-ends, the total importance
remains at 3.
 In examples with dead-ends, some
importance leaks out, but total remains
finite.


                                          32
Solving the Equations
Because there are constant terms, we
can expect to solve small examples by
Gaussian elimination.
Web-sized examples still need to be
solved by relaxation.
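For a small example, the taxed equations can be rearranged into a linear system and solved directly. This sketch assumes the spider-trap matrix and the 20% tax used in the earlier example, and relies on `numpy.linalg.solve`.

```python
import numpy as np

# Spider-trap example (columns: y, a, m) with a 20% tax.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
beta = 0.8

# v = beta * M v + (1 - beta) * 1   is the same as   (I - beta * M) v = (1 - beta) * 1
lhs = np.eye(3) - beta * M
rhs = (1.0 - beta) * np.ones(3)
v = np.linalg.solve(lhs, rhs)
print(v * 11)  # close to [7, 5, 21]
```

The constant term on the right-hand side is what makes the system have a single solution.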



                                        33
Finding a Good Starting Vector
1. Newton-like prediction of where
   components of the principal
   eigenvector are heading.
2. Take advantage of locality in the Web.
   Each technique can reduce the
   number of iterations by 50%.
     Important – PageRank takes time!

                                        34
Predicting Component Values
 Three consecutive values for the
importance of a page suggest where
the limit might be.
  Example: the values 1.0, 0.7, 0.6 lead to
  a guess of 0.55 for the next round.

                                          35
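One way to turn three consecutive values into a guess is Aitken's delta-squared extrapolation. The slides do not name the method they have in mind, so treat this as an assumed stand-in; the function name `aitken_guess` is hypothetical.

```python
def aitken_guess(x0, x1, x2):
    """Extrapolate a limit from three consecutive iterates (Aitken's delta-squared)."""
    d1 = x1 - x0
    d2 = x2 - x1
    denom = d2 - d1
    if denom == 0:
        return x2  # differences are constant; no extrapolation is possible
    return x2 - d2 * d2 / denom

guess = aitken_guess(1.0, 0.7, 0.6)
print(round(guess, 6))  # 0.55
```

For the values 1.0, 0.7, 0.6 the differences shrink by a factor of three each round, and the extrapolation lands on 0.55.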
Exploiting Substructure
Pages from particular domains, hosts,
or directories, like stanford.edu or
infolab.stanford.edu/~ullman
tend to have many internal links.
Initialize PageRank by computing ranks
within each local cluster, then ranking
the clusters themselves.

                                         36
Strategy
 Compute local PageRanks (in parallel?).
 Use local weights to establish weights on
edges between clusters.
 Compute PageRank on graph of clusters.
 Initial rank of a page is the product of its
local rank and the rank of its cluster.
 “Clusters” are appropriately sized regions
with common domain or lower-level detail.
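The last step of the strategy, combining local ranks with cluster ranks, can be sketched as follows. All names and numbers here are hypothetical, and the local and cluster ranks are assumed to have been computed already.

```python
# Hypothetical local ranks (per cluster) and cluster ranks, for illustration only.
local_rank = {
    "cluster_a": {"p1": 3.0, "p2": 2.0},
    "cluster_b": {"p3": 1.0},
}
cluster_rank = {"cluster_a": 0.05, "cluster_b": 0.2}

def initial_ranks(local_rank, cluster_rank):
    """Initial rank of a page = its local rank times the rank of its cluster."""
    return {
        page: rank * cluster_rank[cluster]
        for cluster, pages in local_rank.items()
        for page, rank in pages.items()
    }

start = initial_ranks(local_rank, cluster_rank)
print(start)
```

The resulting vector is only a starting point; the full PageRank iteration still has to run from it.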
                                          37
In Pictures
[Figure: clusters of pages annotated with local ranks, intercluster weights, ranks of clusters, and the resulting initial eigenvector.]
                                        38
Topic-Specific Page Rank
Goal: Evaluate Web pages not just
according to their popularity, but by
how close they are to a particular topic,
e.g. “sports” or “history.”
Allows search queries to be answered
based on interests of the user.
  Example: Query Maccabi wants different
  pages depending on whether you are
  interested in sports or history.
                                           39
Teleport Sets
Assume each walker has a small
probability of “teleporting” at any tick.
Teleport can go to:
1. Any page with equal probability.
     As in the “taxation” scheme.
2. A topic-specific set of “relevant” pages
   (teleport set ).
     For topic-specific PageRank.

                                              40
Example: Topic = Software
Only Microsoft is in the teleport set.
Assume 20% “tax.”
  I.e., probability of a teleport is 20%.
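A sketch of this example follows. Two assumptions are made that the slide does not state: the link graph is the original three-page Yahoo/Amazon/M’soft graph, and the vector is normalized to sum to 1 rather than 3. The function name `topic_pagerank` is hypothetical.

```python
import numpy as np

# Assumed graph: the original three-page example (columns: y, a, m).
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

def topic_pagerank(M, teleport_set, beta=0.8, steps=100):
    """Iterate v = beta * M v + (1 - beta) * e, with e uniform over the teleport set."""
    n = M.shape[0]
    e = np.zeros(n)
    e[list(teleport_set)] = 1.0 / len(teleport_set)
    v = np.full(n, 1.0 / n)
    for _ in range(steps):
        v = beta * (M @ v) + (1.0 - beta) * e
    return v

topic = topic_pagerank(M, teleport_set=[2])          # only M'soft
uniform = topic_pagerank(M, teleport_set=[0, 1, 2])  # every page
print(topic * 31)  # close to [8, 12, 11]
```

Under these assumptions M’soft receives a larger share when it is the only teleport destination than when teleports are spread over all pages.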




                                            41
Only Microsoft in Teleport Set
[Figure sequence over seven slides: the Yahoo, Amazon, and M’soft graph with M’soft as the only teleport destination; the first slide carries the label “Dr. Who’s phone booth.”]
                                42-48
Picking the Teleport Set
1. Choose the pages belonging to the
   topic in Open Directory.
2. “Learn” from examples the typical
   words in pages belonging to the topic;
   use pages heavy in those words as the
   teleport set.


                                        49
Application: Link Spam
Spam farmers today create networks of
millions of pages designed to focus
PageRank on a few undeserving pages.
To minimize their influence, use a
teleport set consisting of trusted pages
only.
  Example: home pages of universities.

                                         50
Hubs and Authorities
Mutually recursive definition:
  A hub links to many authorities;
  An authority is linked to by many hubs.
 Authorities turn out to be places where
information can be found.
  Example: course home pages.
Hubs tell where the authorities are.
  Example: CS Dept. course-listing page.
                                            51
Transition Matrix A
 H&A uses a matrix A [i, j ] = 1 if page i
links to page j, 0 if not.
 A^T, the transpose of A, is similar to the
PageRank matrix M, but A^T has 1’s
where M has fractions.



                                         52
Example: H&A Transition Matrix
[Figure: link graph of Yahoo (y), Amazon (a), and M’soft (m).]

             y   a   m
         y   1   1   1
    A =  a   1   0   1
         m   0   1   0

                                         53
Using Matrix A for H&A
Powers of A and A^T have elements of
exponential size, so we need scale factors.
Let h and a be vectors measuring the
“hubbiness” and authority of each page.
Equations: h = λAa; a = µA^T h.
  Hubbiness = scaled sum of authorities of
  successor pages (out-links).
  Authority = scaled sum of hubbiness of
  predecessor pages (in-links).
                                             54
Consequences of Basic Equations
  From h = λAa; a = µA^T h we can
  derive:
    h = λµAA^T h
    a = λµA^T A a
  Compute h and a by iteration,
  assuming initially each page has one
  unit of hubbiness and one unit of
  authority.
    Pick an appropriate value of λµ.
                                         55
Example: Iterating H&A

      1 1 1         1 1 0           3 2 1            2 1 2
A  =  1 0 1   A^T = 1 0 1   AA^T =  2 2 0   A^T A =  1 2 1
      0 1 0         1 1 0           1 0 1            2 1 2


 a(yahoo) =    1   5   24   114 . . .     1+√3
 a(amazon) =   1   4   18    84 . . .     2
 a(m’soft) =   1   5   24   114 . . .     1+√3

 h(yahoo)  =   1   6   28   132 . . .     1.000
 h(amazon) =   1   4   20    96 . . .     0.732
 h(m’soft) =   1   2    8    36 . . .     0.268
                                             56
Solving H&A in Practice
Iterate as for PageRank; don’t try to
solve equations.
But keep components within bounds.
  Example: scale to keep the largest
  component of the vector at 1.
Trick: start with h = [1,1,…,1]; multiply
by A^T to get first a; scale, then multiply
by A to get next h,…
                                         57
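The alternating iteration with scaling can be sketched as follows for the three-page example. The function name `hits` and the step count are assumptions of this sketch.

```python
import numpy as np

# A[i, j] = 1 if page i links to page j (order: yahoo, amazon, m'soft).
A = np.array([[1.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

def hits(A, steps=50):
    """Alternate a = A^T h and h = A a, scaling so the largest component is 1."""
    h = np.ones(A.shape[0])
    for _ in range(steps):
        a = A.T @ h
        a /= a.max()
        h = A @ a
        h /= h.max()
    return h, a

h, a = hits(A)
print(np.round(h, 3))  # approximately 1, 0.732, 0.268
print(np.round(a, 3))  # approximately 1, 0.732, 1
```

Only A and its transpose are ever multiplied by a vector, so the products of the two matrices never need to be formed.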
Solving H&A – (2)
You may be tempted to compute AA^T
and A^T A first, then iterate these
matrices as for PageRank.
Bad, because these matrices are not
nearly as sparse as A and A^T.



                                      58
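The loss of sparsity is easy to see on a toy graph. This sketch assumes a hypothetical web in which one hub page links to n other pages and nothing else links anywhere.

```python
import numpy as np

n = 100  # hypothetical: page 0 links to pages 1..n
A = np.zeros((n + 1, n + 1))
A[0, 1:] = 1.0

nnz_A = np.count_nonzero(A)
nnz_AtA = np.count_nonzero(A.T @ A)
nnz_AAt = np.count_nonzero(A @ A.T)
print(nnz_A, nnz_AtA, nnz_AAt)  # 100 10000 1
```

Here A has n nonzero entries while A^T A has n squared, because every pair of pages that share the hub as a predecessor produces an entry. In this particular graph AA^T stays small; reversing the links would make AA^T the dense one instead.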
H&A Versus PageRank
If you talk to someone from IBM, they
may tell you “IBM invented PageRank.”
  What they mean is that H&A was invented
  by Jon Kleinberg when he was at IBM.
But these are not the same.
H&A does not appear to be a substitute
for PageRank.
But may be used by Ask.com.
                                        59
Spam on the Web
Search has become the default
gateway to the web.
Very high premium to appear on the
first page of search results.




                                     60
What is Web Spam?
Spamming = any action whose purpose
is to boost a web page’s position in
search engine results, without providing
additional value.
Spam = Web pages used for spamming.
Approximately 10-15% of Web pages
are spam.

                                      61
Web-Spam Taxonomy
Boosting techniques :
 Techniques for increasing the probability a
 Web page will be a highly ranked answer
 to a search query.
Hiding techniques :
 Techniques to hide the use of boosting
 from humans and Web crawlers.


                                           62
Hiding techniques
Content hiding.
  Use same color for text and page background.
Cloaking.
  Return different page to crawlers and browsers.
Redirection.
  Redirects are followed by browsers but not
  crawlers.



                                                    63
Boosting Techniques
Term spamming :
 Manipulating the text of Web pages in
 order to appear relevant to queries.
  • Why? You can run ads that are relevant to the
    query.
Link spamming :
 Creating link structures that boost
 PageRank.

                                                64
Term Spamming – (1)
Repetition of one or a few specific
terms, e.g., “free,” “cheap,” “Viagra.”
Dumping of a large number of
unrelated terms.
  E.g., copy entire dictionaries.




                                         65
Term Spamming – (2)
Weaving :
  Copy legitimate pages and insert spam
  terms at random positions.
Phrase Stitching :
  Glue together sentences and phrases
  from different sources.
  • E.g., use the top-ranked pages on the topic
    you want to look like.

                                                  66
The Google Solution to Term Spamming
 In addition to PageRank, the original
Google engine had another innovation:
it trusted what people said about you in
preference to what you said about
yourself.
 Give more weight to words that appear
in or near anchor text than to words
that appear in the page itself.
                                       67
The Google Solution – (2)
Today, the Google formula for
matching terms to documents involves
over 250 factors.
  E.g., does the word appear in a header?
As closely guarded as the formula for
Coke.


                                            68
Link Spam
Three kinds of Web pages from a
spammer’s point of view:
1. Own pages.
  •   Completely controlled by spammer.
2. Accessible pages.
  •   E.g., Web-log comment pages: spammer can
      post links to his pages.
3. Inaccessible pages.

                                             69
Spam Farms – (1)
Spammer’s goal:
   Maximize the PageRank of target page t.
Technique:
1. Get as many links from accessible pages
   as possible to target page t.
2. Construct “link farm” to get PageRank
   multiplier effect.

                                             70
Spam Farms – (2)
[Figure: three regions of the Web, Inaccessible, Accessible, and Own. The Own region holds the target page t and the farm pages 1, 2, …, M.]

 Goal: boost PageRank of page t.
 One of the most common and effective
 organizations for a spam farm.
                                           71
Analysis – (1)
[Figure: the spam farm, with target page t and farm pages 1, 2, …, M.]

Suppose rank from accessible pages = x.
PageRank of target page = y.
Taxation rate = 1-β.
Rank of each “farm” page = βy/M + (1-β)/N.
  βy/M is the share from t; (1-β)/N is the share of “tax.”
  N is the size of the Web.
                                                           72
Analysis – (2)
[Figure: the spam farm, with target page t and farm pages 1, 2, …, M.]

y = x + βM[βy/M + (1-β)/N] + (1-β)/N
  The bracketed term is the PageRank of each “farm” page.
  The final (1-β)/N is the tax share for t. Very small; ignore.
y = x + β^2 y + β(1-β)M/N
y = x/(1-β^2) + cM/N where c = β/(1+β)
                                                       73
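The closed form can be checked against a direct iteration of the two equations it came from. This is a sketch under the same approximation as the derivation (the tax share of t itself is ignored); the function names and the sample values of x, M, N, and β are hypothetical.

```python
import math

def target_rank_closed_form(x, M, N, beta):
    """y = x / (1 - beta^2) + c * M / N, with c = beta / (1 + beta)."""
    c = beta / (1.0 + beta)
    return x / (1.0 - beta ** 2) + c * M / N

def target_rank_iterated(x, M, N, beta, steps=200):
    """Iterate the farm-page and target-page equations until they settle."""
    y = 0.0
    for _ in range(steps):
        farm = beta * y / M + (1.0 - beta) / N  # rank of each farm page
        y = x + beta * M * farm                 # tax share of t itself is ignored
    return y

# Hypothetical sample values.
x, M, N, beta = 0.001, 1000, 1_000_000, 0.85
closed = target_rank_closed_form(x, M, N, beta)
iterated = target_rank_iterated(x, M, N, beta)
print(closed, iterated)
print(1.0 / (1.0 - beta ** 2))  # multiplier on x, about 3.6
```

The two values agree because each round multiplies the remaining error by β^2, which is below 1.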
Analysis – (3)
[Figure: the spam farm, with target page t and farm pages 1, 2, …, M.]

y = x/(1-β^2) + cM/N where c = β/(1+β).
For β = 0.85, 1/(1-β^2) = 3.6.
  Multiplier effect for “acquired” page rank.
 By making M large, we can make y as
large as we want.
                                                74
Detecting Link-Spam
Topic-specific PageRank, with a set of
“trusted” pages as the teleport set, is
called TrustRank.
Spam Mass =
    (PageRank – TrustRank)/PageRank.
  High spam mass means most of your
  PageRank comes from untrusted sources –
  you may be link-spam.
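The spam-mass formula is a one-liner once both scores are available. The function name `spam_mass` and the sample scores below are hypothetical.

```python
def spam_mass(pagerank, trustrank):
    """Fraction of a page's PageRank that is not accounted for by TrustRank."""
    if pagerank <= 0:
        raise ValueError("PageRank must be positive")
    return (pagerank - trustrank) / pagerank

# Hypothetical scores for two pages with the same PageRank.
mostly_trusted = spam_mass(0.010, 0.009)
mostly_untrusted = spam_mass(0.010, 0.001)
print(mostly_trusted, mostly_untrusted)
```

The second page gets almost none of its PageRank from trusted sources, so its spam mass is close to 1.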
                                         75
Picking the Trusted Set
Two conflicting considerations:
 Human has to inspect each seed page, so
 seed set must be as small as possible.
 Must ensure every “good page” gets
 adequate TrustRank, so all good pages
 should be reachable from the trusted set
 by short paths.


                                            76
Approaches to Picking the Trusted Set
1. Pick the top k pages by PageRank.
     It is almost impossible to get a spam
     page to the very top of the PageRank
     order.
2. Pick the home pages of universities.
     Domains like .edu are controlled.


                                             77

How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Slides

  • 1. Importance of Web Pages. Topics: PageRank; Topic-Specific PageRank; Hubs and Authorities; Combatting Link Spam.
  • 2. PageRank. Intuition: solve the recursive equation “a page is important if important pages link to it.” In high-falutin’ terms: importance = the principal eigenvector of the transition matrix of the Web. A few fixups are needed.
  • 3. Transition Matrix of the Web. Enumerate pages. Page i corresponds to row and column i. M[i, j] = 1/n if page j links to n pages, including page i; 0 if j does not link to i. M[i, j] is the probability we’ll next be at page i if we are now at page j.
  • 4. Example: Transition Matrix. Suppose page j links to 3 pages, including i but not x. Then M[i, j] = 1/3 and M[x, j] = 0.
  • 5. Random Walks on the Web. Suppose v is a vector whose i-th component is the probability that a random walker is at page i at a certain time. If each walker follows a link from its current page at random, the probability distribution for walkers at the next step is given by the vector M v.
  • 6. Random Walks – (2). Starting from any vector v, the limit M (M (…M (M v) …)) is the long-term distribution of walkers. Intuition: pages are important in proportion to how likely a walker is to be there. The math: limiting distribution = principal eigenvector of M = PageRank.
  • 7. Example: The Web in 1839. Three pages: Yahoo (y), Amazon (a), and M’soft (m). Yahoo links to itself and Amazon, Amazon links to Yahoo and M’soft, and M’soft links to Amazon. With rows and columns in the order y, a, m: $M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}$
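
The definition of M can be sketched in a few lines of Python. This is only an illustration: the function name transition_matrix and the dictionary-of-out-links input format are assumptions, not part of the slides, and each page's link list is assumed to contain no duplicates.

```python
from fractions import Fraction

def transition_matrix(links, pages):
    """Return M as a list of rows: M[i][j] = 1/n if page j links to n pages
    including page i, and 0 otherwise. `links` maps a page to its out-links."""
    index = {page: k for k, page in enumerate(pages)}
    size = len(pages)
    M = [[Fraction(0)] * size for _ in range(size)]
    for source, targets in links.items():
        j = index[source]
        for target in targets:
            M[index[target]][j] = Fraction(1, len(targets))
    return M

# The three-page Web of slide 7, read off the columns of its matrix.
pages = ["y", "a", "m"]
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = transition_matrix(links, pages)
for row in M:
    print([str(entry) for entry in row])
```

Exact fractions are used here only so the result can be compared entry by entry with the matrix on the slide.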
  • 8. Solving the Equations. Because there are no constant terms, the equations v = M v do not have a unique solution. In Web-sized examples, we cannot solve by Gaussian elimination anyway; we need to use relaxation (= iterative solution). Can work if you start with a fixed v.
  • 9. Simulating a Random Walk. Start with the vector v = [1, 1, …, 1], representing the idea that each Web page is given one unit of importance. Repeatedly apply the matrix M to v, allowing the importance to flow like a random walk. About 50 iterations are sufficient to estimate the limiting solution.
  • 10. Example: Iterating Equations. Equations v = M v (note: “=” is really “assignment”): y = y/2 + a/2; a = y/2 + m; m = a/2. Successive values, starting from [1, 1, 1]:
      y: 1, 1, 5/4, 9/8, …, 6/5
      a: 1, 3/2, 1, 11/8, …, 6/5
      m: 1, 1/2, 3/4, 1/2, …, 3/5
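
The relaxation on slides 9 and 10 can be reproduced with a short sketch. The helper name step and the use of exact fractions are choices made for this illustration; the matrix is the three-page example with rows and columns ordered y, a, m.

```python
from fractions import Fraction

def step(M, v):
    """One relaxation step: return the matrix-vector product M v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

half = Fraction(1, 2)
M = [[half, half, 0],
     [half, 0, 1],
     [0, half, 0]]

v = [Fraction(1)] * 3      # one unit of importance per page
history = [v]
for _ in range(50):        # "about 50 iterations", as on slide 9
    v = step(M, v)
    history.append(v)

# The first few rounds match the columns on slide 10.
for round_number in range(4):
    print(round_number, [str(x) for x in history[round_number]])
print("after 50 rounds:", [float(x) for x in v])
```

After 50 rounds the vector is close to, but not exactly, the limit [6/5, 6/5, 3/5].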
  • 11-15. The Walkers; In the Limit … [Animation frames: random walkers on the Yahoo, Amazon, M’soft graph at successive steps and in the limit.]
  • 16. Real-World Problems. Some pages are “dead ends” (have no links out). Such a page causes importance to leak out. Other groups of pages are spider traps (all out-links are within the group). Eventually spider traps absorb all importance.
  • 17. Microsoft Becomes Dead End. M’soft no longer links anywhere, so its column is all zeros. With rows and columns in the order y, a, m: $M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 0 \end{pmatrix}$
  • 18. Example: Effect of Dead Ends. Equations v = M v: y = y/2 + a/2; a = y/2; m = a/2. Successive values, starting from [1, 1, 1]:
      y: 1, 1, 3/4, 5/8, …, 0
      a: 1, 1/2, 1/2, 3/8, …, 0
      m: 1, 1/2, 1/4, 1/4, …, 0
  • 19-23. Microsoft Becomes a Dead End; In the Limit … [Animation frames: walkers on the Yahoo, Amazon, M’soft graph with M’soft as a dead end.]
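  • 24. M’soft Becomes Spider Trap. M’soft now links only to itself. With rows and columns in the order y, a, m: $M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix}$
  • 25. Example: Effect of Spider Trap. Equations v = M v: y = y/2 + a/2; a = y/2; m = a/2 + m. Successive values, starting from [1, 1, 1]:
      y: 1, 1, 3/4, 5/8, …, 0
      a: 1, 1/2, 1/2, 3/8, …, 0
      m: 1, 3/2, 7/4, 2, …, 3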
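
Both failure modes can be checked numerically with the two matrices from slides 17 and 24. The helper names and the round count of 200 are arbitrary choices for this sketch.

```python
def step(M, v):
    """Return the matrix-vector product M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def iterate(M, v, rounds):
    """Apply M to v repeatedly."""
    for _ in range(rounds):
        v = step(M, v)
    return v

# Rows and columns are ordered y, a, m.
dead_end = [[0.5, 0.5, 0.0],
            [0.5, 0.0, 0.0],
            [0.0, 0.5, 0.0]]     # column m is all zeros: M'soft has no out-links
spider_trap = [[0.5, 0.5, 0.0],
               [0.5, 0.0, 0.0],
               [0.0, 0.5, 1.0]]  # M'soft links only to itself

start = [1.0, 1.0, 1.0]
leaked = iterate(dead_end, start, 200)
trapped = iterate(spider_trap, start, 200)
print("dead end total importance:", sum(leaked))
print("spider trap distribution:", trapped)
```

With the dead end the total importance drains toward 0; with the spider trap the total stays at 3 but ends up entirely at M’soft.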
  • 26-29. Microsoft Becomes a Spider Trap; In the Limit … [Animation frames: walkers on the Yahoo, Amazon, M’soft graph with M’soft as a spider trap.]
  • 30. PageRank Solution to Traps, Etc. “Tax” each page a fixed percentage at each iteration. Add a fixed constant to all pages. This models a random walk with a fixed probability of leaving the system, and a fixed number of new walkers injected into the system at each step.
  • 31. Example: Microsoft is a Spider Trap; 20% Tax. Equations v = 0.8(M v) + 0.2: y = 0.8(y/2 + a/2) + 0.2; a = 0.8(y/2) + 0.2; m = 0.8(a/2 + m) + 0.2. Successive values, starting from [1, 1, 1]:
      y: 1, 1.00, 0.84, 0.776, …, 7/11
      a: 1, 0.60, 0.60, 0.536, …, 5/11
      m: 1, 1.40, 1.56, 1.688, …, 21/11
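
The taxed iteration of slide 31 can be sketched as follows. The function name taxed_step and the parameter name beta (the fraction kept after the tax, 0.8 here) are naming choices for this illustration.

```python
def taxed_step(M, v, beta=0.8):
    """One step of v <- beta * (M v) + (1 - beta), i.e. a 20% tax when beta = 0.8."""
    return [beta * sum(m * x for m, x in zip(row, v)) + (1 - beta) for row in M]

# Spider-trap matrix from slide 24, rows and columns ordered y, a, m.
spider_trap = [[0.5, 0.5, 0.0],
               [0.5, 0.0, 0.0],
               [0.0, 0.5, 1.0]]

v = [1.0, 1.0, 1.0]
history = [v]
for _ in range(200):
    v = taxed_step(spider_trap, v)
    history.append(v)

print("round 3:", history[3])
print("limit:  ", v, "vs", [7 / 11, 5 / 11, 21 / 11])
```

M’soft still ends up with the largest share, but Yahoo and Amazon keep a nonzero rank.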
  • 32. General Case. In this example, because there are no dead ends, the total importance remains at 3. In examples with dead ends, some importance leaks out, but the total remains finite.
  • 33. Solving the Equations. Because there are constant terms, we can expect to solve small examples by Gaussian elimination. Web-sized examples still need to be solved by relaxation.
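
For a small example, the taxed equations v = β(M v) + (1 - β) can be rearranged to (I - βM) v = (1 - β)·[1, …, 1] and solved directly. The sketch below does this for the spider-trap example with β = 0.8; the solve helper is a plain Gauss-Jordan elimination written for this illustration, and it assumes the system has a unique solution.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions.
    Assumes A is square and nonsingular."""
    n = len(A)
    aug = [[Fraction(x) for x in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [x / pivot_value for x in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

beta = Fraction(4, 5)
half = Fraction(1, 2)
M = [[half, half, 0],
     [half, 0, 0],
     [0, half, 1]]   # spider-trap example, order y, a, m

n = len(M)
A = [[(1 if i == j else 0) - beta * M[i][j] for j in range(n)] for i in range(n)]
b = [1 - beta] * n
v = solve(A, b)
print([str(x) for x in v])
```

The exact solution agrees with the limit column on slide 31.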
  • 34. Finding a Good Starting Vector. 1. Newton-like prediction of where components of the principal eigenvector are heading. 2. Take advantage of locality in the Web. Each technique can reduce the number of iterations by 50%. Important: PageRank takes time!
  • 35. Predicting Component Values. Three consecutive values for the importance of a page suggest where the limit might be. [Figure: successive values 1.0, 0.7, 0.6, with 0.55 as the guess for the next round.]
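
The slides do not give a formula for the prediction. One standard way to extrapolate from three consecutive values is Aitken's delta-squared method, used below purely as an assumption; it happens to turn the values 1.0, 0.7, 0.6 from the figure into the guess 0.55. The function name aitken_guess is hypothetical.

```python
from fractions import Fraction

def aitken_guess(x0, x1, x2):
    """Extrapolate a limit from three consecutive iterates using
    Aitken's delta-squared formula (an assumed stand-in for the
    "Newton-like prediction" the slides mention)."""
    d1 = x1 - x0
    d2 = x2 - x1
    denominator = d2 - d1
    if denominator == 0:
        return x2          # differences are constant; nothing to extrapolate
    return x2 - d2 * d2 / denominator

guess = aitken_guess(Fraction(1), Fraction(7, 10), Fraction(6, 10))
print(guess, float(guess))
```

The formula is exact when the error shrinks by a constant factor each round, which is roughly how power iteration behaves once one eigenvalue dominates the error.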
  • 36. Exploiting Substructure. Pages from particular domains, hosts, or directories, like stanford.edu or infolab.stanford.edu/~ullman, tend to have many internal links. Initialize PageRank using ranks within your local cluster, then rank the clusters themselves.
  • 37. Strategy. Compute local PageRanks (in parallel?). Use local weights to establish weights on edges between clusters. Compute PageRank on the graph of clusters. The initial rank of a page is the product of its local rank and the rank of its cluster. “Clusters” are appropriately sized regions with a common domain or lower-level detail.
  • 38. In Pictures. [Figure: a clustered graph annotated with local ranks, intercluster weights, ranks of clusters, and the initial eigenvector; numeric labels 1.5, 2.05, 3.0, 2.0, 0.15, 0.1, 0.05.]
  • 39. Topic-Specific PageRank. Goal: evaluate Web pages not just according to their popularity, but by how close they are to a particular topic, e.g. “sports” or “history.” This allows search queries to be answered based on the interests of the user. Example: the query Maccabi wants different pages depending on whether you are interested in sports or history.
  • 40. Teleport Sets. Assume each walker has a small probability of “teleporting” at any tick. A teleport can go to: 1. Any page with equal probability, as in the “taxation” scheme. 2. A topic-specific set of “relevant” pages (the teleport set), for topic-specific PageRank.
  • 41. Example: Topic = Software. Only Microsoft is in the teleport set. Assume a 20% “tax,” i.e., the probability of a teleport is 20%.
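
A sketch of the difference between the two teleport choices, under two assumptions not stated on the slide: the graph is the original three-page Web of slide 7 (M’soft links to Amazon), and the vector is normalized to sum to 1 rather than 3. The function name pagerank and its parameters are illustrative.

```python
def pagerank(M, teleport, beta=0.8, rounds=200):
    """Iterate v <- beta * (M v) + (1 - beta) * teleport, where `teleport`
    is a probability distribution over pages (uniform for ordinary
    taxation, concentrated on the teleport set for a topic)."""
    n = len(M)
    v = [1.0 / n] * n
    for _ in range(rounds):
        v = [beta * sum(m * x for m, x in zip(row, v)) + (1 - beta) * t
             for row, t in zip(M, teleport)]
    return v

# Assumed graph: the slide-7 Web, rows and columns ordered y, a, m.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]

uniform = pagerank(M, [1 / 3, 1 / 3, 1 / 3])
software = pagerank(M, [0.0, 0.0, 1.0])   # only M'soft in the teleport set
print("uniform teleport: ", uniform)
print("software teleport:", software)
```

Restricting the teleport set to M’soft raises its rank and also lifts Amazon, the page it links to.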
  • 42-48. Only Microsoft in Teleport Set. [Animation frames: walkers on the Yahoo, Amazon, M’soft graph; the first frame carries the label “Dr. Who’s phone booth.”]
  • 49. Picking the Teleport Set. 1. Choose the pages belonging to the topic in Open Directory. 2. “Learn” from examples the typical words in pages belonging to the topic; use pages heavy in those words as the teleport set.
  • 50. Application: Link Spam. Spam farmers today create networks of millions of pages designed to focus PageRank on a few undeserving pages. To minimize their influence, use a teleport set consisting of trusted pages only. Example: home pages of universities.
  • 51. Hubs and Authorities. Mutually recursive definition: a hub links to many authorities; an authority is linked to by many hubs. Authorities turn out to be places where information can be found. Example: course home pages. Hubs tell where the authorities are. Example: a CS Dept. course-listing page.
  • 52. Transition Matrix A. H&A uses a matrix A[i, j] = 1 if page i links to page j, 0 if not. $A^T$, the transpose of A, is similar to the PageRank matrix M, but $A^T$ has 1’s where M has fractions.
  • 53. Example: H&A Transition Matrix. For the Yahoo, Amazon, M’soft graph, with rows and columns in the order y, a, m: $A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}$
  • 54. Using Matrix A for H&A. Powers of A and $A^T$ have elements of exponential size, so we need scale factors. Let h and a be vectors measuring the “hubbiness” and authority of each page. Equations: $h = \lambda A a$; $a = \mu A^T h$. Hubbiness = scaled sum of authorities of successor pages (out-links). Authority = scaled sum of hubbiness of predecessor pages (in-links).
  • 55. Consequences of Basic Equations. From $h = \lambda A a$ and $a = \mu A^T h$ we can derive: $h = \lambda\mu A A^T h$ and $a = \lambda\mu A^T A a$. Compute h and a by iteration, assuming initially each page has one unit of hubbiness and one unit of authority. Pick an appropriate value of $\lambda\mu$.
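  • 56. Example: Iterating H&A. $A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}$, $A^T = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$, $A A^T = \begin{pmatrix} 3 & 2 & 1 \\ 2 & 2 & 0 \\ 1 & 0 & 1 \end{pmatrix}$, $A^T A = \begin{pmatrix} 2 & 1 & 2 \\ 1 & 2 & 1 \\ 2 & 1 & 2 \end{pmatrix}$. Successive unscaled values, followed by the limit up to scale:
      a(yahoo): 1, 5, 24, 114, …, 1+√3
      a(amazon): 1, 4, 18, 84, …, 2
      a(m’soft): 1, 5, 24, 114, …, 1+√3
      h(yahoo): 1, 6, 28, 132, …, 1.000
      h(amazon): 1, 4, 20, 96, …, 0.732
      h(m’soft): 1, 2, 8, 36, …, 0.268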
  • 57. Solving H&A in Practice. Iterate as for PageRank; don’t try to solve equations. But keep components within bounds. Example: scale to keep the largest component of the vector at 1. Trick: start with h = [1, 1, …, 1]; multiply by $A^T$ to get the first a; scale, then multiply by A to get the next h, …
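
The alternating iteration from slide 57 can be sketched as below, using the matrix from slide 53. The function name hits and the fixed round count are choices for this illustration, and the sketch assumes the graph has at least one link so that the scaling never divides by zero.

```python
def hits(A, rounds=50):
    """Alternate a = A^T h and h = A a, rescaling each vector so that
    its largest component is 1. A[i][j] = 1 if page i links to page j."""
    n = len(A)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(rounds):
        # Authority: sum of hubbiness over predecessors (in-links).
        a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]
        largest = max(a)
        a = [x / largest for x in a]
        # Hubbiness: sum of authority over successors (out-links).
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        largest = max(h)
        h = [x / largest for x in h]
    return h, a

# Yahoo, Amazon, M'soft, in that order.
A = [[1, 1, 1],
     [1, 0, 1],
     [0, 1, 0]]
h, a = hits(A)
print("hubbiness:", [round(x, 3) for x in h])
print("authority:", [round(x, 3) for x in a])
```

Only A is ever multiplied here, row by row or column by column, so the products $A A^T$ and $A^T A$ are never formed.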
  • 58. Solving H&A – (2). You may be tempted to compute $A A^T$ and $A^T A$ first, then iterate these matrices as for PageRank. This is a bad idea, because these matrices are not nearly as sparse as A and $A^T$.
  • 59. H&A Versus PageRank. If you talk to someone from IBM, they may tell you “IBM invented PageRank.” What they mean is that H&A was invented by Jon Kleinberg when he was at IBM. But these are not the same. H&A does not appear to be a substitute for PageRank, but it may be used by Ask.com.
  • 60. Spam on the Web. Search has become the default gateway to the Web. There is a very high premium on appearing on the first page of search results.
  • 61. What is Web Spam? Spamming = any action whose purpose is to boost a Web page’s position in search engine results, without providing additional value. Spam = Web pages used for spamming. Approximately 10-15% of Web pages are spam.
  • 62. Web-Spam Taxonomy. Boosting techniques: techniques for increasing the probability a Web page will be a highly ranked answer to a search query. Hiding techniques: techniques to hide the use of boosting from humans and Web crawlers.
  • 63. Hiding Techniques. Content hiding: use the same color for text and page background. Cloaking: return different pages to crawlers and browsers. Redirection: redirects are followed by browsers but not crawlers.
  • 64. Boosting Techniques. Term spamming: manipulating the text of Web pages in order to appear relevant to queries. Why? You can run ads that are relevant to the query. Link spamming: creating link structures that boost PageRank.
  • 65. Term Spamming – (1). Repetition of one or a few specific terms, e.g., “free,” “cheap,” “Viagra.” Dumping of a large number of unrelated terms, e.g., copying entire dictionaries.
  • 66. Term Spamming – (2). Weaving: copy legitimate pages and insert spam terms at random positions. Phrase stitching: glue together sentences and phrases from different sources, e.g., use the top-ranked pages on the topic you want to look like.
  • 67. The Google Solution to Term Spamming. In addition to PageRank, the original Google engine had another innovation: it trusted what people said about you in preference to what you said about yourself. Give more weight to words that appear in or near anchor text than to words that appear in the page itself.
  • 68. The Google Solution – (2). Today, the Google formula for matching terms to documents involves over 250 factors. E.g., does the word appear in a header? The formula is as closely guarded as the formula for Coke.
  • 69. Link Spam. Three kinds of Web pages from a spammer’s point of view: 1. Own pages: completely controlled by the spammer. 2. Accessible pages: e.g., Web-log comment pages, where the spammer can post links to his pages. 3. Inaccessible pages.
  • 70. Spam Farms – (1). Spammer’s goal: maximize the PageRank of target page t. Technique: 1. Get as many links from accessible pages as possible to target page t. 2. Construct a “link farm” to get a PageRank multiplier effect.
  • 71. Spam Farms – (2). [Figure: inaccessible pages, accessible pages, and the spammer’s own pages, consisting of target page t and farm pages 1, 2, …, M.] Goal: boost the PageRank of page t. This is one of the most common and effective organizations for a spam farm.
  • 72. Analysis – (1). [Figure as on slide 71.] Suppose rank from accessible pages = x and PageRank of target page = y. Taxation rate = 1-β. Rank of each “farm” page = βy/M + (1-β)/N, where βy/M is the rank received from t, (1-β)/N is the page’s share of the “tax,” and N is the size of the Web.
  • 73. Analysis – (2). [Figure as on slide 71.] $y = x + \beta M\,[\beta y/M + (1-\beta)/N] + (1-\beta)/N$, where the bracketed term is the PageRank of each “farm” page and the final $(1-\beta)/N$ is the tax share for t, which is very small and can be ignored. Then $y = x + \beta^2 y + \beta(1-\beta)M/N$, so $y = x/(1-\beta^2) + cM/N$ where $c = \beta/(1+\beta)$.
  • 74. Analysis – (3). [Figure as on slide 71.] $y = x/(1-\beta^2) + cM/N$ where $c = \beta/(1+\beta)$. For β = 0.85, $1/(1-\beta^2) = 3.6$: a multiplier effect for “acquired” PageRank. By making M large, we can make y as large as we want.
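
The closed form can be checked against the equations it was derived from. In this sketch the function names and the sample values of x, M, and N are made up for illustration; as on slide 73, the target's own tax share (1-β)/N is ignored.

```python
def target_rank_closed_form(x, M, N, beta=0.85):
    """y = x / (1 - beta^2) + c * M / N with c = beta / (1 + beta)."""
    return x / (1 - beta ** 2) + (beta / (1 + beta)) * M / N

def target_rank_iterated(x, M, N, beta=0.85, rounds=200):
    """Iterate the two slide equations until they settle:
    farm page rank = beta * y / M + (1 - beta) / N,
    y = x + beta * M * (farm page rank)."""
    y = 0.0
    for _ in range(rounds):
        farm_page = beta * y / M + (1 - beta) / N
        y = x + beta * M * farm_page
    return y

# Hypothetical numbers: rank x from accessible pages, M farm pages, Web of N pages.
x, M, N = 0.001, 1000, 1_000_000
closed = target_rank_closed_form(x, M, N)
iterated = target_rank_iterated(x, M, N)
multiplier = 1 / (1 - 0.85 ** 2)
print(closed, iterated, round(multiplier, 1))
```

Both routes give the same y, and the multiplier on x comes out at about 3.6 for β = 0.85.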
  • 75. Detecting Link-Spam. Topic-specific PageRank with a set of “trusted” pages as the teleport set is called TrustRank. Spam Mass = (PageRank – TrustRank)/PageRank. High spam mass means most of your PageRank comes from untrusted sources, so you may be link-spam.
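
The spam-mass formula is a one-liner once both ranks are available. In this sketch the page names and rank values are invented for illustration, and the function assumes a positive PageRank.

```python
def spam_mass(pagerank, trustrank):
    """Spam mass = (PageRank - TrustRank) / PageRank. Assumes pagerank > 0."""
    return (pagerank - trustrank) / pagerank

# Hypothetical (PageRank, TrustRank) pairs.
ranks = {
    "well-linked-from-trusted-pages": (0.010, 0.009),
    "mostly-farm-links": (0.010, 0.001),
}
masses = {page: spam_mass(pr, tr) for page, (pr, tr) in ranks.items()}
for page, mass in masses.items():
    print(page, round(mass, 2))
```

The second page gets most of its PageRank from sources the trusted set does not reach, so its spam mass is close to 1.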
  • 76. Picking the Trusted Set. Two conflicting considerations: a human has to inspect each seed page, so the seed set must be as small as possible; and every “good page” must get adequate TrustRank, so all good pages should be reachable from the trusted set by short paths.
  • 77. Approaches to Picking the Trusted Set. 1. Pick the top k pages by PageRank. It is almost impossible to get a spam page to the very top of the PageRank order. 2. Pick the home pages of universities. Domains like .edu are controlled.