Contents

I  Theory
   0.1  Why SEO is important
   0.2  Different needs from SEO

1  What is a Search Engine?
   1.1  History of Search Engines
   1.2  Important Issues
        1.2.1  Performance
        1.2.2  Dynamic Data
        1.2.3  Scalability
        1.2.4  Spam and Manipulation
   1.3  How a Search Engine works
        1.3.1  Text acquisition
        1.3.2  Duplicate Content Detection
        1.3.3  Text transformation
        1.3.4  Index Creation
        1.3.5  User Interaction
        1.3.6  Ranking
        1.3.7  Evaluation

2  How good can a search engine be?
   2.1  NP Hard Problems
   2.2  AI Hard Problems
   2.3  Competitors

3  Ranking Factors
   3.1  On Page Factors
   3.2  Off Page Factors
   3.3  Google PageRank Notes
        3.3.1  Short Description
        3.3.2  Mathematical Description
        3.3.3  Interesting Notes on the Original Implementation of PageRank
        3.3.4  Optimal Linking Strategies
        3.3.5  Implementations to make computing PageRank faster
        3.3.6  HITS
        3.3.7  Is linking out a good thing?
        3.3.8  TrustRank / Bad Page Rank
        3.3.9  Improvements to Google's ranking algorithms

4  Detecting Spam and Manipulation
   4.1  Google Webmaster Guidelines
   4.2  Penalties
   4.3  Detecting Manipulation in Content
   4.4  Detecting Manipulation in Links
   4.5  Other Methods

II  Practice

5  An Example Campaign
   5.1  Company Profile
   5.2  Goals
   5.3  Competitor Research
   5.4  Keyword Research
   5.5  Content Creation
   5.6  Website Check
   5.7  Link Building
   5.8  Analysis
Preface
This book aims to provide a general overview of how search engines rank doc-
uments in practice, the core of which will remain true even as search engines'
algorithms are refined.




Part I
Theory
0.1 Why SEO is important
   • A higher search engine result will receive exponentially more
     clicks than a lower one

For example, if a search was repeated 1000 times by different users, this is
typically how many clicks each result would get.


                               Position         Clicks
                                      1          222
                                      2           63
                                      3           45
                                      4           32
                                      5           26
                                      6           21
                                      7           18
                                      8           16
                                      9           15
                                      10          16


   Source: Leaked AOL Click Data

   • Paid adverts have low click-through rates, and get expensive
     quickly

 Search Engine        % Organic Click Through Rate           % Paid Result Click Through Rate
      Google                               72                                      28
      Yahoo                                61                                      39
       MSN                                 71                                      29
       AOL                                 50                                      50
     Average                               63                                      37
   88% of online search dollars are spent on paid results, even though 85% of
searchers click on organic results.
    Vanessa Fox, Marketing in the Age of Google, May 3, 2010


0.2 Different needs from SEO
There are many different reasons you may wish to engage in optimising your
search results, including

   • Money - Sales for e-commerce sites are directly correlated with traffic.

   • Reputation - Some companies go to the extent of pushing negative arti-
     cles down in the rankings.



   • Branding - Coming up top in the results pages is impressive to customers,
     and is particularly important in industries where reputation is extremely
     important.




1    What is a Search Engine?
1.1 History of Search Engines
The first mechanised information retrieval systems were built by the US military
to analyse the mass of documents being captured from the Germans. Research
was boosted when the UK and US governments funded research to reduce a
perceived science gap with the USSR. By the time the internet was becoming
commonplace in the early 1990s information retrieval was at an advanced stage.
Complicated methods, primarily statistical, had been developed, and archives of
thousands of documents could be searched in seconds.
    Web search engines are a special case of information retrieval systems, ap-
plied to the massive collection of documents available on the internet. A typical
search engine in 1990 was split into two parts: a web spider that traverses the
web following links and creating a local index of the pages, then traditional in-
formation retrieval methods to search the index for pages relevant to the user's
query and order the pages by some ranking function. Many factors influence a
person's decision about what is relevant, such as the current task, context and
freshness.
    In 1998 pages were primarily ranked by their textual content. Since this
is entirely controlled by the owner of the page, results were easy to manipulate,
and as the Internet became ever more commercialised the noise from spam in
SERPs (search engine results pages) made search a frustrating activity. It was
also hard to discern websites which more people would want to visit, for example
a celebrity's official home page, from less wanted websites with similar content.
For these reasons directory sites such as Yahoo were still popular, despite being
out of date and making the user work out the relevance themselves.
    The PageRank innovation of Google's founders Larry Page and Sergey Brin
(named after Larry Page), and that of a similar algorithm also released in 1998
called Hyperlink-Induced Topic Search (HITS) by Jon Kleinberg, was to use the
additional meta information from the link structure of the Internet. A more
detailed description of PageRank will follow in a later chapter, but for now
Google's own description will suffice.
    PageRank relies on the uniquely democratic nature of the web by using its
vast link structure as an indicator of an individual page's value.    In essence,
Google interprets a link from page A to page B as a vote, by page A, for page
B. But, Google looks at more than the sheer volume of votes, or links a page
receives; it also analyzes the page that casts the vote. Votes cast by pages that
are themselves important weigh more heavily and help to make other pages
important.
    Whilst it is impossible to know how Google has evolved their algorithms
since the 1998 paper that launched PageRank, and how real-world efficient
implementation differs from the theory, as Google themselves say the PageRank
algorithm remains the heart of Google's software ... and continues to provide
the basis for all of [their] web search tools. The search engines continue to
evolve at a blistering pace, improving their ranking algorithms (Google says




there are now over 200 ranking factors considered for each search [1]), and indexing
a growing Internet more rapidly.


1.2 Important Issues
The building of a system as complex as a modern search engine is all about
balancing different positive qualities. For example, you could effectively prevent
low quality spam by paying humans to review every document on the web,
but the cost would be immense. Or you could speed up your search engine by
considering only every other document your spider encounters, but the relevance
of results would suffer. Some things, such as getting a computer to analyse a
document with the same quality as a human, are theoretically impossible
today, but Google in particular is pushing boundaries and getting ever closer.
   Search engines have some particular considerations:


1.2.1     Performance

The response time to a user's query must be lightning fast.


1.2.2     Dynamic Data

Unlike a traditional information retrieval system in a library the pages on the
Internet are constantly changing.


1.2.3     Scalability

Search engines need to work with billions of users searching through trillions of
documents, distributed across the Earth.


1.2.4     Spam and Manipulation

Actively engaging against other humans to maintain the relevancy of results is
relatively unique to search engines. In a library system you may have an author
that creates a long title packed with words their readers may be interested in,
but that's about the worst of it. When designing your search engine you are
in a constant battle with adversaries who will attempt to reverse engineer your
algorithm to find the easiest ways to affect your results. A common term
for this relationship is Adversarial Information Retrieval. The relationship
between the owner of a Web site trying to rank high on a search engine and the
search engine designer is an adversarial relationship in a zero-sum game. That
is, assuming the results were better before, every gain for the web site owner is a
loss for the search engine designer. Classifying where your efforts cross from helping
a search engine be aware of your web site's content and popularity, which should
help to improve a search engine's results, into instead ranking beyond
your means and decreasing the quality of a search engine's results, can be

  1 See                       http://googlewebmastercentral.blogspot.com/2008/10/
good-times-with-inbound-links.html


somewhat tricky. The practicalities of what search engines consider to be spam,
and as importantly what they can detect and fix, will be discussed later.
    According to Web Spam Taxonomy [2], approximately 10-15% of indexed
content on the web is spam. What is considered spam and duplicate content
varies, which makes this statistic hard to verify. There is a core of about 56
million pages [3] that are highly interlinked at the center of the Internet, and are
less likely to be spam. Documents further away (in link steps) from this core
are more likely to be spam.
    Deciding the quality of a document well (say whether it is a page written
by an expert in the field, or generated by a computer program using natural
language processing) is an AI-Complete problem, that is it won't be possible
until we have artificial intelligence that can match that of a human.
    However, search engines hope to get spam under control by lessening the
financial incentive of spam. This quote from a Microsoft Research paper [4]
expresses this nicely:


        Effectively detecting web spam is essentially an arms race be-
    tween search engines and site operators. It is almost certain that
    we will have to adapt our methods over time, to accommodate for
    new spam methods that the spammers use. It is our hope that our
    work will help the users enjoy a better search experience on the
    web. Victory does not require perfection, just a rate of detection that
    alters the economic balance for a would-be spammer. It is our hope
    that continued research on this front can make effective spam more
    expensive than genuine content.

Google developers for their part describe web spam as the following [5], citing the
detrimental impact it has upon users:
   These manipulated documents can be referred to as spam.                   When a user
receives a manipulated document in the search results and clicks on the link to
go to the manipulated document, the document is very often an advertisement
for goods or services unrelated to the search query or a pornography website
or the manipulated document automatically forwards the user on to a website
unrelated to the user's query.


1.3 How a Search Engine works
A typical search engine can be split into two parts: indexing, where the Internet is
transformed into an internal representation that can be efficiently searched, and
the query process, where the index is searched for the user's query and documents
are ranked and returned to the user in a list.
    Indexing

  2 Zoltán Gyöngyi and Hector Garcia-Molina,   Stanford University. First International Work-
shop on Adversarial Information Retrieval on the Web, May 2005
  3 See   On Determining Communities in the Web by K Verbeurg
  4 See   Detecting Spam Web Pages through Content Analysis by A Ntoulas
  5 See   patent 7302645: Methods and systems for identifying manipulated articles




1.3.1     Text acquisition

A crawler starts at a seed site such as the DMOZ directory, then repeatedly
follows links to find documents across the web, storing the content of the pages
and associated meta data (such as the date of indexing, and which page linked to
the site). In a modern search engine the crawler is constantly running, downloading
thousands of pages simultaneously, to continuously update and expand the in-
dex. A good crawler will cover a large percentage of the pages on the Internet,
and visit popular pages frequently to keep its index fresh. A crawler will connect
to the web server and use an HTTP request to retrieve the document, if it has
changed. On average, Web page updates follow the Poisson distribution - that is,
the crawler can expect the time until the web page next updates to follow
an exponential distribution. Crawlers are now also indexing near real time data
through varying sources such as access to RSS feeds and the Twitter API, and
are able to index a range of formats such as PDFs and Flash. These formats
are converted into a common intermediate format such as XML. A crawler can
also be asked to update its copy of a page via methods such as a ping or XML
sitemap, but the update time will still be up to the crawler. The document data
store stores the text and meta data the crawler retrieves; it must allow for very
fast access to a large number of documents. Text can be compressed relatively
easily, and pages are typically indexed by a hash of their URL. Google's original
patent used a system called BigTable; Google now keeps documents in sections
called shards distributed over a range of data centres (this offers performance,
redundancy and security benefits).
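    As a sketch of the retrieve the document only if it has changed step above, the
snippet below issues a conditional HTTP request. The URL and timestamp are illustrative
assumptions, and a real crawler would also handle robots.txt, politeness delays and many
failure modes.

    import time
    from email.utils import formatdate
    from urllib import request, error

    def fetch_if_modified(url, last_crawl_time):
        """Return the page body if it changed since last_crawl_time, otherwise None."""
        req = request.Request(url)
        # Ask the server to send the page only if it changed since our last visit.
        req.add_header("If-Modified-Since", formatdate(last_crawl_time, usegmt=True))
        try:
            with request.urlopen(req, timeout=10) as resp:
                return resp.read()   # 200 OK: fresh copy (or the server ignored the header)
        except error.HTTPError as e:
            if e.code == 304:        # 304 Not Modified: keep the cached copy
                return None
            raise

    # Hypothetical usage: re-crawl a page last fetched a day ago.
    # body = fetch_if_modified("http://example.com/", time.time() - 86400)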


1.3.2     Duplicate Content Detection

Detecting exact duplicates is easy: remove the boilerplate content (menus etc.)
then compare the core text through checksums. Detecting near duplicates is
harder, particularly if you want to build an algorithm that is fast enough to
compare a document against every other document in the index. To perform
faster duplicate detection, fingerprints of a document are taken.
    A simple fingerprinting algorithm for this is outlined here:


  1. Parse the document into words, and remove formatting content such as
        punctuation and HTML tags.


  2. The words are grouped into groups of words (called n-grams, a 3-gram
        being 3 words, 4-gram 4 words etc.)


  3. Some of these n-grams are selected to represent a document


  4. The selected n-grams are hashed to create a shorter description


  5. The hash values are stored in a quick look up database


  6. The documents are compared by looking at overlaps of fingerprints.
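    A minimal Python sketch of steps 1-6, under simplifying assumptions: every n-gram is
kept rather than a sample, and Python's built-in hash stands in for a proper fingerprint
function.

    import re

    def fingerprints(text, n=3):
        """Return a set of hashed n-gram fingerprints for a document."""
        # 1. Parse into words, stripping HTML tags and punctuation.
        words = re.findall(r"[a-z0-9]+", re.sub(r"<[^>]+>", " ", text.lower()))
        # 2-5. Build n-grams and hash each one (real systems select a sample and
        # store the hashes in a fast look-up database).
        return {hash(" ".join(words[i:i + n])) for i in range(len(words) - n + 1)}

    def similarity(doc_a, doc_b):
        """6. Compare two documents by the overlap (Jaccard) of their fingerprint sets."""
        a, b = fingerprints(doc_a), fingerprints(doc_b)
        return len(a & b) / len(a | b) if a | b else 0.0

    # Hypothetical usage:
    # similarity("<p>All rights reserved to us.</p>", "<p>all rights reserved to them</p>")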




Fingerprinting in action

    A paper [6] by four Google employees found the following statistics across their
index of the web.
   Number of tokens: 1,024,908,267,229
   Number of sentences: 95,119,665,584
   Number of unigrams: 13,588,391
   Number of bigrams: 314,843,401
   Number of trigrams: 977,069,902
   Number of fourgrams: 1,313,818,354
   Number of fivegrams: 1,176,470,663
   Most common trigram in English: all rights reserved
   Detecting unusual patterns of n-grams can also be used to detect low qual-
ity/spam documents [7].


1.3.3      Text transformation

Tokenization is the process of splitting a series of characters up into separate
words. These tokens are then parsed to look for tags such as <a> and </a> to
find which parts of the text are plain text, links and such.

   • Identifying Content
Sections of documents that are just content are found, in an attempt to ignore
boilerplate content such as navigation menus. A simple way is to look for
sections where there are few HTML tags; more complicated methods consider
the visual layout of the page.

   • Stopping
Common words such as the and and are removed to increase the efficiency
of the search engine, resulting in a slight loss in accuracy. In general, the more
unusual a word the better it is at determining if a document is relevant.
  6 See   N-gram Statistics in English and Chinese: Similarities and Dierences
  7 See   http://www.seobythesea.com/?p=5108




   • Stemming
Stemming reduces words to just their stem, for example computer and com-
puting become comput. Typically around a 10% improvement is seen in
relevance in English, and up to 50% in Arabic. The classic stemming algorithm
is the Porter Stemmer, which works through a series of rules such as replace
sses with ss, so stresses becomes stress.
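    A toy sketch of the tokenise / stop / stem pipeline described in this section; the stop
list and the two suffix rules are illustrative assumptions, not the real Porter algorithm.

    import re

    STOP_WORDS = {"the", "and", "a", "of", "to", "in"}   # illustrative stop list

    def crude_stem(word):
        """Apply two Porter-style rules; the real Porter stemmer has many more."""
        if word.endswith("sses"):
            return word[:-2]                 # stresses -> stress
        if word.endswith("ing") and len(word) > 5:
            return word[:-3]                 # computing -> comput
        return word

    def transform(text):
        """Tokenise, remove stop words, then stem each remaining token."""
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

    # Hypothetical usage:
    # transform("The computing of stresses")  ->  ['comput', 'stress']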


   • Information Extraction
Trying to determine the meaning of text is very difficult in general, but certain
words can give clues. For example the phrase x has worked at y is useful
when building an index of employees.


1.3.4       Index Creation

Document statistics such as the count of words are stored for use in ranking
algorithms. An inverted index [8] is created to allow for fast full text searches.
The index is distributed across multiple data centres across the globe [9].
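    A minimal sketch of building and querying an inverted index. It is simplified: only
document IDs are stored, while real indexes also keep positions, counts and other statistics,
and the example documents are invented.

    import re
    from collections import defaultdict

    def build_inverted_index(docs):
        """Map each term to the set of document IDs that contain it."""
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in re.findall(r"[a-z0-9]+", text.lower()):
                index[term].add(doc_id)
        return index

    def search(index, query):
        """Return the document IDs containing every term of the query (boolean AND)."""
        term_sets = [index.get(t, set()) for t in re.findall(r"[a-z0-9]+", query.lower())]
        return set.intersection(*term_sets) if term_sets else set()

    # Hypothetical usage with made-up documents:
    # index = build_inverted_index({1: "buy red bricks", 2: "how bricks are made", 3: "red paint"})
    # search(index, "red bricks")  ->  {1}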


1.3.5       User Interaction

The user is provided with an interface in which to give their query. The query
is then transformed, using similar techniques to those applied to documents, such
as stemming, as well as spell checking and expanding the query to find other
queries synonymous with the user's query. After ranking the document set, a top
set of results is displayed together with snippets to show how they were matched.


1.3.6       Ranking

A scoring function calculates scores for documents. Some parts of the scoring
can be performed at query time, others at document processing time.


1.3.7       Evaluation

Users' queries and their actions are logged in detail to improve results. For
example, if a user clicks on a result then quickly performs the same search
again, it is likely that they clicked a poor result.
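    A toy sketch of that signal: given an invented click log, it estimates how often a click
on each result was followed by the user repeating the search.

    from collections import defaultdict

    def poor_result_rate(events):
        """events: invented (user, query, clicked_url, searched_again_soon) tuples."""
        clicks, repeats = defaultdict(int), defaultdict(int)
        for user, query, url, searched_again_soon in events:
            clicks[url] += 1
            if searched_again_soon:
                repeats[url] += 1
        # A high ratio suggests the page was a poor match for the queries that led to it.
        return {url: repeats[url] / clicks[url] for url in clicks}

    # Hypothetical log: clicks on b.com were quickly followed by the same search again.
    log = [("u1", "seo", "a.com", False), ("u2", "seo", "b.com", True), ("u3", "seo", "b.com", True)]
    # poor_result_rate(log)  ->  {'a.com': 0.0, 'b.com': 1.0}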

   8 An inverted index is an index data structure storing a mapping from content, such as
words or numbers, to its document in a set of documents. The purpose of an inverted index
is to allow fast full text searches, at a cost of increased processing when a document is added
to the database. http://en.wikipedia.org/wiki/Inverted_index
   9 A good overview of Google's shard approach is at http://highscalability.com/google-architecture




2       How good can a search engine be?
There are some very specific limits in computer science as to what a computer
program is capable of doing, and these have direct consequences for how search
engines can index and rank your web pages. The two core sets of problems
are NP-Complete problems, which for large sets of data take too long to solve
perfectly, and AI-Complete problems, which can't be done perfectly until we
have computers that are as intelligent as people. That doesn't mean search engines
can't make approximations; for example finding the shortest route on a map is
an NP-Complete problem yet Google Maps still manages to plot pretty good
routes [10].


2.1 NP Hard Problems
Polynomial (P) problems can be solved in polynomial time, that is relatively
quickly. NP-hard problems, by contrast, are not known to be solvable in polynomial
time; in practice they can't be solved exactly for any reasonably large set of inputs,
such as a web-scale number of pages.




    The time taken to solve an NP-hard problem grows extremely quickly as
the size of the problem grows.

    These concepts become complex quickly, but the key thing to pick up is
that if a problem is NP Hard there is no way it can ever be solved perfectly for
something as large as a search engine's index, and approximations will have to
be used. There are some NP Hard problems that are of particular interest to
SEO:

   • The Hamiltonian Path Problem - Detecting a greedy network (i.e. if you
     interlink your web pages to hoard PageRank) in the structure of a Hamil-
     tonian path [11] is an NP hard problem

   • Detecting Page Farms (the set of pages that link to a page) is NP hard [12]

 10 http://www.youtube.com/watch?v=-0ErpE8tQbw
 11 http://en.wikipedia.org/wiki/Hamiltonian_path
 12 See Sketching Landscapes of Page Farms by Bin Zhou   and Jian Pei




   • Detecting Phrase Level Duplication in a Search Engine's Index [13]


2.2 AI Hard Problems
AI Hard problems require intelligence matching that of a human being to be
solved. Examples include the Turing Test (tricking a human into thinking they
are talking to a human, not a computer), recognising difficult CAPTCHAs and
translating text as well as an expert (who wouldn't be perfect either).
    During a question-and-answer session after a presentation at his alma mater,
Stanford University, in May 2002, Page said that Google would fulfil its mis-
sion only when its search engine was AI-complete, and said something similar
in an interview with Newsweek and then Playboy.
    I think we're pretty far along compared to 10 years ago, he said. At the
same time, where can you go? Certainly if you had all the world's information
directly attached to your brain, or an artificial brain that was smarter than your
brain, you'd be better. Between that and today, there's plenty of space to cover.
What would a perfect search engine look like? we asked. It would be the mind
of God [14]
    And, actually, the ultimate search engine, which would understand, you
know, exactly what you wanted when you typed in a query, and it would give
you the exact right thing back, in computer science we call that artificial
intelligence. That means it would be smart, and we're a long way from having
smart computers. [15]
    Of particular interest to SEO is that fully understanding the meaning of
human text is an AI complete problem, and even getting close to understanding
words in context is very difficult [16]. This means automatically judging the quality
of reasonable computer generated text against that of a human expert is
tricky. It's not unusual to see websites packed with decent computer generated
text (detecting which automatically is an AI complete problem) and
single phrases stitched together from a variety of sources (which is an NP com-
plete problem) ranking for Google Trends results. This is particularly hard to
stop as for new news items there are fewer fresh sources available to choose from;
this results in search engine poisoning [17]. Any site that receives a large amount
of traffic from this will eventually be visited by a Google employee,
and penalised manually [18].
   Google's solution to the very similar machine translation problem is inter-
esting; rather than attempting to build AI they use their massive resources and
data stored from web pages and user queries to build a reliable statistical engine

 13 See   Detecting phrase-level duplication on the world wide web by Microsoft Research
employees
 14 http://searchenginewatch.com/2156601
 15 http://tech.fortune.cnn.com/2011/02/17/is-something-wrong-with-google/
 16 http://en.wikipedia.org/wiki/Natural_language_understanding
 17 http://igniteresearch.net/spam-in-poisoned-world-cup-results/
 18 http://www.google.co.uk/search?q=Google+Spam+Recognition+Guide+for+Quality+
Rater




- their approach isn't necessarily far smarter than their competitors but their
resources make them the best translator out there.



2.3 Competitors
Although not a classic computer science problem, a big limit to how search
engines can treat possible spam is that competitors could attempt to make your
website look like it was spamming to lower your ranking, increasing theirs. For
example, if your website suddenly receives an influx of low quality links from
sites known to link to spam, how would Google know if you naively ordered this
or a competitor did?
    This is an unsolvable problem, short of non-stop surveillance of all website
owners. This is what Google has to say on the matter [19]:
    There's almost nothing a competitor can do to harm your ranking or have
your site removed from our index. If you're concerned about another site linking
to yours, we suggest contacting the webmaster of the site in question. Google
aggregates and organizes information published on the web; we don't control the
content of these pages.
    I can say from experience that Google bowling most certainly does happen,
and there are a couple of experiments written up on the web [20], though it would
be very difficult to Google bowl a popular website. Essentially, if a small per-
centage of links to a site are most likely spam they are just ignored; if a large
percentage are likely spam then the links may result in a penalty rather than
just being ignored.
    It seems likely that poor quality links are increasingly being ignored. The
paper Link Spam Alliances from Stanford, the Google founders' alma mater,
discusses both dated methods of detecting and punishing potential link spam.
Note that link spam isn't the only way that sites can potentially be Google
bowled: if your competitor fills your comment section with duplicate content
about organ enlargement and links to known phishing sites it is unlikely to help
your rankings. Google now also takes into account users choosing to block sites
from results [21], presumably with a negative effect.



3     Ranking Factors
Google engineers update their algorithms daily [22]. They then run many tests to
check they have the right balance between all these factors.
    The following is from an interview with Google's Udi Manber.
    Q: How do you determine that a change actually improves a set of results?
    A: We ran over 5,000 experiments last year. Probably 10 experiments for ev-
ery successful launch. We launch on the order of 100 to 120 a quarter. We have
 19 http://www.google.com/support/webmasters/bin/answer.py?answer=34449
 20 http://bit.ly/jEKzMa
 21 http://googlewebmastercentral.blogspot.com/2011/04/high-quality-sites-algorithm-goes.
html
  22 http://www.nytimes.com/2007/06/03/business/yourmoney/03google



dozens of people working just on the measurement part. We have statisticians
who know how to analyze data, we have engineers to build the tools. We have at
least 5 or 10 tools where I can go and see here are 5 bad things that happened.
Like this particular query got bad results because it didn't find something or the
pages were slow or we didn't get some spell correction.
    I have created a spreadsheet that shows how a search engine may cal-
culate the ranking of a trivial set of documents for a particular query; you
can view it and try changing things yourself at http://igniteresearch.net/
poodle-a-simple-emulation-of-search-engine-ranking-factors/.
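    In the same spirit as that spreadsheet, here is a toy scoring sketch; the factor names,
weights and documents are invented for illustration and bear no relation to any real
engine's values.

    # Invented weights for a handful of the factors discussed in this chapter.
    WEIGHTS = {"title_keyword": 3.0, "body_keyword": 1.0, "incoming_links": 2.0, "spam_penalty": -5.0}

    def score(doc, query_terms):
        """Combine a few on-page and off-page signals into a single toy score."""
        title_hits = sum(t in doc["title"].lower() for t in query_terms)
        body_hits = sum(doc["body"].lower().count(t) for t in query_terms)
        return (WEIGHTS["title_keyword"] * title_hits
                + WEIGHTS["body_keyword"] * body_hits
                + WEIGHTS["incoming_links"] * doc["links"]
                + WEIGHTS["spam_penalty"] * doc["spam_flags"])

    # Hypothetical documents ranked for the query "buy bricks":
    docs = [
        {"title": "Buy bricks online", "body": "bricks delivered fast", "links": 4, "spam_flags": 0},
        {"title": "Brick history", "body": "how bricks are made and fired", "links": 3, "spam_flags": 0},
    ]
    ranked = sorted(docs, key=lambda d: score(d, ["buy", "bricks"]), reverse=True)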

3.1 On Page Factors
   • Keywords
Repetitions of the words in the query in the document, particularly in key areas
such as the title and headers, are positive signals of relevance. The proximity
of the words together is important, particularly having the exact query in the
document. A very large repetition, particularly in non-grammatical sentences,
can be a negative signal of spam. Presence of the query words in the domain
and URL is a useful signal of relevance. Related phrases to the query are
also positive signals of relevance (see Latent Semantic Indexing). The meta
keywords HTML tag, <meta name="keywords" content="my, keywords">, is
largely ignored by modern search engines [23].

   • Quality
A number of different authors on a website, good grammar, spelling and long
pages written at reasonable time intervals are positive signs of high quality
content [24].

   • Geographical Locality
Mentions of an address close to the user show the document may be geographically
relevant to the user, particularly for geographically sensitive queries such as
plumbers in London.

   • Freshness
For time-dependent queries, such as news events, recent pages are more likely
to be helpful to the user. See Google's Quality Deserves Freshness drive, of
which Google's faster indexing Caffeine update was a part.

   • Duplicate Content
Large percentages of content duplicated either from the same site or others are
an indicator of poor quality content, and users will only want to see the canonical
copy.

  23 See                        http://googlewebmastercentral.blogspot.com/2009/09/
google-does-not-use-keywords-meta-tag.html
  24 See http://www.seobythesea.com/?p=541



   • Adverts
A very large number of adverts can reduce the user experience, and affiliate
links are often associated with heavily SEO-manipulated websites.

   • Outbound Links
Links to spammy or phishing websites, or an unusually large number of outbound
links on a number of pages, are common indicators of a page that users will not
want to visit [25].

   • Spam
An unusual repetition of keywords, particularly outside of sentences, is a sign
of spam. Techniques such as hidden text and sneaky JavaScript redirects are
relatively easy to detect and punish.
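    A crude illustration of the keyword repetition signal; the threshold is an invented
number, and real spam classifiers combine many richer features.

    def keyword_density(text, keyword):
        """Fraction of tokens equal to the keyword; unusually high values suggest stuffing."""
        tokens = text.lower().split()
        return tokens.count(keyword.lower()) / len(tokens) if tokens else 0.0

    def looks_stuffed(text, keyword, threshold=0.15):
        # Invented threshold purely for illustration.
        return keyword_density(text, keyword) > threshold

    # Hypothetical example: a nonsense block repeating a keyword trips the check.
    # looks_stuffed("plumber plumber plumber cheap plumber london plumber", "plumber")  ->  True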



3.2 Off Page Factors
   • Site Reliability
Unreliable or slow sites provide a poor user experience, and so will have a penalty
applied. You can be warned if this happens if you sign up for Google Webmaster
Tools [26].

   • Popularity of the Site
From aggregated ISP data that search engines buy, and search traffic [27].

   • Incoming Links / PageRank
The link structure of the internet is a useful pointer to a website's popularity.
Anchor text on incoming links related to the query shows a search engine the page
is related to the query. Links that remain for a long time, from sites that have
many links pointing to themselves, are rated highly. Links that are in boilerplate
areas or sitewide may be ignored. Links that are all identical in anchor text (i.e.
blatantly machine generated), from spammy websites (bad neighbourhoods [28]),
or thought to be paid for with the intention of manipulating rankings or spam, can
result in penalties. Links from sites that are most likely owned by the same
owner, detected either from Whois data or if the sites are hosted within the
same Class C IP range, are likely considered less reliable signals of importance. A
normal rate of growth of incoming links is preferred, as opposed to the bursty starts
and stops [29] that indicate link building campaigns [30].

 25 See   Improving Web Spam Classiers Using Link Structure for a very interesting Yahoo
patent on detecting spam based on the number of inbound and outbound links
 26 See http://www.mattcutts.com/blog/site-speed/
 27 See http://trends.google.com/websites?q=bing.comgeo=alldate=all         and      http://
www.compete.com
  28 See http://www.google.com/support/webmasters/bin/answer.py?answer=35769
  29 See http://www.seobook.com/link-growth-profile
  30 See http://www.wolf-howl.com/seo/google-patent-analysis/



   • Other indirect signals of a website's popularity
Other data can include mentions in chats, emails and social networks.

   • Links from trusted websites
The proximity on the web graph to important, trusted sites (links from old, high
PageRank websites at the centre of the old heavily interconnected internet are
useful signals that a website can be trusted and is important [31]).

   • Links from other sites that rank for the query
Results may be reordered based on how they link to each other.

   • Geographical Location
If the geographical location of the server, the website according to directories, the
top level domain or the location set in Google Webmaster Tools matches that of
the user, it is a signal that the page will be more relevant to the user, particularly
for location sensitive searches.


   • User Click Data
If users often search again after clicking on the site's result, that is an indicator
that the page is not a good match for the query. The personal history of results
clicked, and the pattern of related searches, may help indicate what a user is looking
for [32].

   • Domain Information
Older domains are likely trusted more. Google is a domain registrar so has ex-
tensive Whois information, and validates that address information associated
with domains is correct.


   • Manual Reviews
Google Quality Raters [33] manually review websites and tag them with cat-
egories such as essential to query, not relevant to query, or spam.



3.3 Google PageRank Notes
Google's PageRank was the innovation that propelled Google to the top of
the search engine pile. Whilst its implementation has changed much since its
original description, and many other factors are now taken into account, it is
still at the heart of modern search engines so some extra notes will be made on
it here.
 31 See http://www.touchgraph.com/seo and type in http://www.nasa.gov for a visual graph
 32 See http://www.seobythesea.com/?p=334
 33 See http://searchengineland.com/the-google-quality-raters-handbook-13575




3.3.1      Short Description

The key point is that PageRank considers each link a vote, and links from pages
which have many links themselves are considered more important. Or as Google
puts it:
    PageRank reflects our view of the importance of web pages by considering
more than 500 million variables and 2 billion terms. Pages that we believe are
important pages receive a higher PageRank and are more likely to appear at the
top of the search results. PageRank also considers the importance of each page
that casts a vote, as votes from some pages are considered to have greater value,
thus giving the linked page greater value.


3.3.2      Mathematical Description

It's not essential to have a mathematical understanding of how PageRank is cal-
culated, but for those familiar with basic graph theory and algebra it is useful.
You may wish to skip this section, and read a slightly less mathematical de-
scription [34]. For a more complete treatment of the mathematics see the original
PageRank paper [35], Deeper Inside PageRank by Amy N. Langville and
Carl D. Meyer, and this thesis [36]. The following is summarised from Sketching
Landscapes of Page Farms [37] by Bin Zhou and Jian Pei:
    The Web can be modeled as a directed Web graph G = (V, E), where V is
the set of Web pages, and E is the set of hyperlinks. A link from page p to page
q is denoted by edge p → q. An edge p → q can also be written as a tuple (p,
q).
    PageRank measures the importance of a page p by considering how collec-
tively other Web pages point to p directly or indirectly. Formally, for a Web
page p, the PageRank score is defined as:

    PR(p) = (1 - d)/N + d · Σ_{p_i ∈ M(p)} PR(p_i) / OutDeg(p_i)

    Where M(p) = { q | q → p } is the set of pages having a hyperlink pointing
to p, OutDeg(p_i) is the out-degree of p_i (i.e., the number of hyperlinks from
p_i pointing to some pages other than p_i), N is the total number of pages in the
Web graph, and d is a damping factor (0.85 in
the original PageRank implementation) which models the random transitions of
the web. If a damping factor of 0.5 is used then at each page there is a 50/50

  34 See the introductions of http://www.sirgroane.net/google-page-rank/, http://www.
webworkshop.net/pagerank.html or the Wikipedia article
  35 At http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
  36 http://web.engr.oregonstate.edu/~sheldon/papers/thesis.pdf
  37 See http://www.cs.sfu.ca/~bzhou/personal/paper/sdm07_page_farm.pdf




chance of the surfer clicking a link, or jumping to a random page on the internet.
Without the damping factor the PageRank of any page with an outgoing link
would be 0.
    To calculate the PageRank scores for all pages in a graph, one can assign a
random PageRank score value to each node in the graph, then apply the above
equation iteratively until the PageRank scores in the graph converge.
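    A minimal power-iteration sketch of that procedure. The four-page graph is invented,
and real implementations add dangling-page handling and the speed optimisations discussed
later.

    def pagerank(links, d=0.85, iterations=50):
        """Iteratively compute PageRank for a graph given as {page: [pages it links to]}."""
        pages = list(links)
        n = len(pages)
        pr = {p: 1.0 / n for p in pages}             # start from a uniform guess
        for _ in range(iterations):
            new = {}
            for p in pages:
                # Share of PageRank passed on by every page q that links to p.
                incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
                new[p] = (1 - d) / n + d * incoming
            pr = new
        return pr

    # Hypothetical four-page web: A and B link to each other, C and D only link out.
    # pagerank({"A": ["B"], "B": ["A"], "C": ["A"], "D": ["A", "B"]})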
    The Google toolbar shows a logarithmic scale out of 10, not the actual internal
data. For example:
        Domain        Calculated PageRank          PageRank displayed in Toolbar
        small.com                 47                                  2
    medium1.com                 54093                                 5
    medium2.com                 84063                                 5
        big.com                1234567                                7
        big2.com               2364854                                7
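    For illustration only, a toolbar-style value could be derived from an internal score
with a logarithm; the base below is a guess, since neither the base nor the scaling Google
uses is public.

    import math

    def toolbar_value(internal_score, base=8.0):
        """Map an internal score to a 0-10 toolbar figure; base=8 is an arbitrary assumption."""
        return min(10, max(0, int(math.log(max(internal_score, 1), base))))

    # Roughly matches the middle rows of the table above; the exact mapping is unknown.
    # [toolbar_value(s) for s in (47, 54093, 84063, 1234567, 2364854)]  ->  [1, 5, 5, 6, 7]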


3.3.3    Interesting Notes on the Original Implementation of PageR-
         ank

From PageRank Uncovered [38], essential reading for those looking to understand
PageRank from an SEO perspective:

   • PageRank is a multiplier, applied after relevant results are found
Remember, PageRank alone cannot get you high rankings. We've mentioned
before that PageRank is a multiplier; so if your score for all other factors is 0
and your PageRank is twenty billion, then you still score 0 (last in the results).
This is not to say PageRank is worthless, but there is some confusion over when
PageRank is useful and when it is not. This leads to many misinterpretations
of its worth. The only way to clear up these misinterpretations is to point out
when PageRank is not worthwhile. If you perform any broad search on Google, it
will appear as if you've found several thousand results. However, you can only
view the first 1000 of them. Understanding why this is so explains why you
should always concentrate on on-the-page factors and anchor text first, and
PageRank last.

   • Each page is born with a small amount of PageRank
A page that is in the Google index has a vote, however small. Thus, the more
pages you have in the index, the more overall vote you are likely to have.
Or, simply put, bigger sites tend to hold a greater total amount of PageRank
within their site (as they have more pages to work with).
    Note that Google's original algorithm has most likely been amended since
to detect and reduce PageRank hoarding, and generating PageRank by massive
interlinking on auto generated pages. Also for quicker calculations an approx-

 38 See http://www.bbs-consultant.net/IMG/pdf_PageRank.pdf




imation of PageRank which only gives certain seed pages PageRank may be
used [39].
    Interestingly, however, there are examples of this working; see How to get
billions of pages indexed in Google at http://www.threadwatch.org/node/
6999. In a related issue, at one point 10% of MSN Search's (now known as
Bing) German index was computer generated content on a single domain [40].


3.3.4         Optimal Linking Strategies

Deciding how to interlink pages that you own or have influence over is tricky;
interlinking can be a good signal that pages are related and on a certain
topic, build PageRank and control PageRank flow. However, heavy interlinking
can be a signal of manipulation and spam, and different linking structures can
make different sites in your possession rank higher. The mathematics gets tricky
fast; here is a quick overview of the literature today:

   • Note from Web Spam Taxonomy
Though written about spam farms, the math holds true for good commercial
sites too. Essentially this states that maximum PageRank for a target page
is achieved by linking only to the target page from forums, blogs etc., then
interlinking the network of sites owned (as if there are no outlinks on a page the
random surfer will jump to a random page on the Internet).




    1. Inaccessible pages are those that a spammer cannot modify. These are
the pages out of reach; the spammer cannot influence their outgoing links. (Note
that a spammer can still point to inaccessible pages.)
    2. Accessible pages are maintained by others (presumably not affiliated with
the spammer), but can still be modified in a limited way by a spammer. For
example, a spammer may be able to post a comment to a blog entry, and that
comment may contain a link to a spam site.
   3. Own pages are maintained by the spammer, who thus has full control over
their contents.
   We can observe how the presented structure maximizes the total PageRank
score of the spam farm, and of page t in particular:
   1. All available n own pages are part of the spam farm, maximizing the static
score total PageRank.
   2. All m accessible pages point to the spam farm, maximizing the incoming
score incoming PageRank.

 39 For more on why this shouldn't work see http://www.pagerank.dk/Pagerank/Generate-pagerank.htm
 40 See   http://research.microsoft.com/pubs/65144/sigir2005.pdf




3. Links pointing outside the spam farm are suppressed, making PRout out-
going PageRank zero.
   4.   All pages within the farm have some outgoing links, rendering a zero
PRsink score component.
    Within the spam farm, the score of page t is maximal because:
    1. All accessible and own pages point directly to the target, maximizing its
incoming score PRin(t).
    2. The target points to all other own pages. Without such links, t would have
lost a significant part of its score (PRsink(t) > 0), and the own pages would have
been unreachable from outside the spam farm. Note that it would not be wise to
add links from the target to pages outside the farm, as those would decrease the
total PageRank of the spam farm.

   • From Link Spam Alliances
The analysis that we have presented shows how the PageRank of target pages can
be maximized in spam farms. Most importantly, we find that there is an entire
class of farm structures that yield the largest achievable target PageRank score.
All such optimal farm structures share the following properties:

  1. All boosting pages point to and only to the target.

  2. All hijacked pages point to the target.

  3. There are some links from the target to one or more boosting pages.

   • From Maximizing PageRank via Outlinks
In this paper we provide the general shape of an optimal link structure for a
website in order to maximize its PageRank. This structure with a forward chain
and every possible backward link may be not intuitive. To our knowledge, it
has never been mentioned, while topologies like a clique, a ring or a star are
considered in the literature on collusion and alliance between pages. Moreover,
this optimal structure gives new insight into the affirmation of Bianchini et al.
that, in order to maximize the PageRank of a website, hyperlinks to the rest
of the webgraph should be in pages with a small PageRank and that have many
internal hyperlinks. More precisely, we have seen that the leaking pages must be
chosen with respect to the mean number of visits before zapping they give to the
website, rather than their PageRank.

   • From The Effect of New Links on PageRank by Xie
Theorem: The optimal linking strategy for a Web page is to have only one out-
going link pointing to a Web page with a shortest mean first passage time back
to the original page.
    Conclusions: ... We conclude that having no outgoing link is a bad policy
and that the best policy is to link to pages from the same Web community.
Surprisingly, a new incoming link might not be good news if a page that points
to us gives many other irrelevant links at the same time.
    Reading this paper fully, it is only in very particular circumstances that a
new incoming link is not good news.



3.3.5    Implementations to make computing PageRank faster

There have been a number of proposed improvements to the original PageRank
algorithm to improve the speed of calculation [41], and to adapt it to be better at
determining quality results. No search engine calculates PageRank as shown in
the naive algorithm in the original paper [42].


3.3.6    HITS

HITS is another ranking algorithm that takes into account the pattern of links
found throughout the web, and it was released just before PageRank in 1999.
HITS treats some pages on the web as authorities, which are good documents
on a topic, and hubs, which mostly link to authorities.
    A page is given a high authority score by being linked to by pages that are
recognized as Hubs for information. A page is given a high hub score by linking
to nodes that are considered to be authorities on the subject.
    Unlike PageRank, which is query independent and so computed at index-
ing time, HITS hub and authority scores are query dependent and so computed
(though likely cached) at query time.
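    A minimal sketch of the HITS iteration on an invented toy graph; production systems
run it only over the subgraph of pages related to the query.

    def hits(links, iterations=20):
        """Compute hub and authority scores for a graph given as {page: [pages it links to]}."""
        pages = list(links)
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iterations):
            # A page's authority is the sum of the hub scores of the pages linking to it.
            auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
            # A page's hub score is the sum of the authority scores of the pages it links to.
            hub = {p: sum(auth[q] for q in links[p]) for p in pages}
            # Normalise so scores stay bounded across iterations.
            auth_total, hub_total = sum(auth.values()) or 1.0, sum(hub.values()) or 1.0
            auth = {p: s / auth_total for p, s in auth.items()}
            hub = {p: s / hub_total for p, s in hub.items()}
        return hub, auth

    # Hypothetical graph: H1 and H2 mostly link out (hubs), A is mostly linked to (authority).
    # hub, auth = hits({"H1": ["A"], "H2": ["A"], "A": ["H1"]})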


3.3.7    Is linking out a good thing?

Whilst TEOMA is the only search engine that uses HITS at its core, its think-
ing has heavily influenced search engine designers - so it is likely that linking
out to high quality authorities can positively influence either a page's ranking
(though potentially negatively, if designers want authorities rather than hubs to
appear in their results [43]), or the importance of the other links it contains. Many
webmasters fear linking out to sites as they would rather keep links internal to
prevent PageRank flowing out (many webmasters also nofollow links for similar
reasons, though note that this form of PageRank sculpting no longer works according
to Matt Cutts, Google's head of [anti] web spam).
    Matt Cutts also said a number of years ago:
    Of course, folks never know when we're going to adjust our scoring. It's
pretty easy to spot domains that are hoarding PageRank; that can be just another
factor in scoring.
    Some search engines are even concerned about people linking out too much;
whilst crawlers can now index a large number of links on a page, a very large
number of outbound links often indicates that a site has been hacked with spam
links or is machine generated.
    A spammer might manually add a number of outgoing links to well-known
pages, hoping to increase the page's hub score. At the same time, the most

  41 For example,   see Computing PageRank using Power Extrapolation and Ecient PageR-
ank Approximation via Graph Aggregation
  42 Matt Cutts discusses a couple of the implementation details at http://www.mattcutts.
com/blog/more-info-on-pagerank/
  43 See http://www.wolf-howl.com/seo/seo-case-study-outbound-links/ and Deeper In-
side PageRank, discussed earlier




widespread method for creating a massive number of outgoing links is direc-
tory cloning [44].


3.3.8      TrustRank / Bad Page Rank

It's likely that after results are generated based on relevance, PageRank is then
applied to help order them, then TrustRank to help order the results further. A site
may lose trust every time it fails some kind of spam test (for example if a large
number of reciprocal links are found, cloaking, duplicate content, fake Whois data)
and gain trust for certain properties (domain age, traffic, being one of a number of
important seed sites that are manually tagged as trusted sites). These initial
trust scores could then be propagated in a similar way to PageRank, so linking
to and from bad neighborhoods would negatively affect the site's TrustRank
through association [45].
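    A sketch of that propagation idea in the style of the pagerank() example earlier,
biasing the random jump towards a hand-picked seed set. The graph and seeds are invented,
and this follows the spirit of the Combating Web Spam with TrustRank paper rather than
any engine's actual implementation.

    def trustrank(links, seeds, d=0.85, iterations=50):
        """Propagate trust from seed pages through a link graph (a biased PageRank)."""
        pages = list(links)
        # Only trusted seed pages receive the random-jump share of score.
        jump = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
        trust = dict(jump)
        for _ in range(iterations):
            trust = {p: (1 - d) * jump[p]
                        + d * sum(trust[q] / len(links[q]) for q in pages if p in links[q])
                     for p in pages}
        return trust

    # Hypothetical graph where "seed" is a manually reviewed trusted site; the isolated
    # spam cluster receives no trust at all.
    # trustrank({"seed": ["a"], "a": ["b"], "b": ["a"], "spam": ["spam2"], "spam2": ["spam"]}, seeds={"seed"})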




   From SEO By The Sea:
    In 2004, a Yahoo whitepaper was published which described how the search
engine might attempt to identify web spam by looking at how different pages
linked to each other. That paper was mistakenly attributed to Google by a large
number of people, most likely because Google was in the process of trademarking
the term TrustRank around the same time, but for different reasons. Surpris-
ingly, Google was granted a patent on something it referred to as Trust Rank
in 2009, though the concept behind it was different than Yahoo's description of
TrustRank. Instead of looking at the ways that different sites linked to each
other, Google's Trust Rank works to have pages ranked according to a measure
of the trust associated with entities that have provided labels for the documents.

 44 See   Web Spam Taxonomy
 45 See http://bakara.eng.tau.ac.il/semcomm/GKRT.pdf and http://www.freepatentsonline.com/7603350.html
and http://www.cs.toronto.edu/vldb04/protected/eProceedings/contents/pdf/RS15P3.PDF


...
    If you've ever heard or seen the phrase TrustRank before, it's possible that
whoever was writing about it, or referring to it, was discussing a paper titled Com-
bating Web Spam with TrustRank (pdf). While the paper was the joint work of
researchers from Stanford University and Yahoo!, many writers have attributed
it to Google since its publication date in 2004. The confusion over who came
up with the idea of TrustRank wasn't helped by Google trademarking the term
TrustRank in 2005. That trademark was abandoned by Google on February
29, 2008, according to the records at the US PTO TESS database. However, a
patent called Search result ranking based on trust deals with something called
trust rank, filed on May 9, 2006.
    Google mentions distrust and trust changes as indicators. More than trust
analysis, trust variation analysis is on the road. Fake reviews, sponsored blogs
and e-commerce trust network influence are pointed out.
    The paper A Cautious Surfer for PageRank comments on why TrustRank
shouldn't be overused:
    However, the goal of a search engine is to find good quality results; spam-free
is a necessary but not sufficient condition for high quality. If we use a trust-based
algorithm alone to simply replace PageRank for ranking purposes, some good
quality pages will be unfairly demoted and replaced, for example, by pages within
the trusted seed sets, even though they may be much less authoritative. Considered
from another angle, such trust-based algorithms propagate trust through paths
originating from the seed set; as a result, some good quality pages may get low
value if they are not well connected to those seeds.


3.3.9      Improvements to Google's ranking algorithms

There have been a number of notable algorithm changes which made consider-
able changes to results pages, though often the effects were later scaled
back slightly.

   • NoFollow
Matt Cutts and Jason Shellen created the nofollow specification to help limit
the effect of, and incentive for, blog spam. If a search engine comes across a link
tagged as nofollow, it will not treat the link as a vote, i.e. as a positive signal in
rankings. Areas where untrusted users can post content are often tagged nofol-
low; roughly 80% of content management systems (the software that websites
run on) implement nofollow.
    The HTML code of a NoFollow link:
    <a href="signin.php" rel="nofollow">sign in</a>

   • Increasing use of anchor text
Even the original PageRank algorithm took into account the anchor text of links,
so links were used to give both a number that indicated the site's popularity
and information about the content of a document, and so its relevance for user
queries.



   • Google Bombing Prevention, 2nd February 2007
Google Bombing is the process of massively linking to a page with a specific
anchor text, to give PageRank but more importantly indications that the doc-
ument is related to the anchor text. For example, in 1999 a number of bloggers
grouped together to link to Microsoft.com with the anchor text more evil than
Satan himself. This resulted in Microsoft being placed number one in searches
for more evil than Satan himself despite not having the phrase anywhere on its
page. Detecting a sudden influx of links with identical anchor text is very easy,
and in 2007 Google changed their indexing structure so that Google bombs such
as miserable failure would typically return commentary, discussions, and ar-
ticles about the tactic itself. Matt Cutts said the Google bombs had not been
a very high priority for us. Over time, we've seen more people assume that
they are Google's opinion, or that Google has hand-coded the results for these
Google-bombed queries. That's not true, and it seemed like it was worth trying
to correct that perception. [46] Some Google bombs still work, particularly those
targeting unusual phrases, with varied anchor text, over a period of time,
within paragraphs of text.


   • Florida, November 2003
Results for highly commercial queries, likely identified from the cost of
AdWords, became heavily filtered so that more trusted academic websites and less
commercially optimised websites ranked. Some of these changes resulted in less
relevance, for example a user searching for buy bricks probably didn't want to
mainly see websites about the process of creating bricks, and were rolled back.
For more, see footnotes 47 and 48.

   • Bourbon, June 2005
A penalty was applied to sites with unusually fast or bursty patterns of link
growth.


   • Jagger, October 2005
A penalty was applied to sites with unusually large amounts of reciprocal links,
and new methods for detecting hidden text were introduced.


   • Big Daddy, December 2005
According to Matt Cutts, the sites punished were those where our algorithms had
very low trust in the inlinks or the outlinks of that site. Examples that might
cause that include excessive reciprocal links, linking to spammy neighborhoods
on the web, or link buying/selling.49

 46 See http://answers.google.com/answers/main?cmd=threadviewid=179922
 47 http://www.searchengineguide.com/barry-lloyd/been-gazumped-by-google-trying-to-make-sense-of-the-florida-upd
php
 48 http://www.seoresearchlabs.com/seo-research-labs-google-report.pdf
 49 See http://www.webworkshop.net/googles-big-daddy-update.html



• Caffeine, October 2010
A faster indexing system that changed results little, but allowed for fresher
results and paved the way for some of the later Panda updates.50

   • Panda, April 2011
A penalty applied to content deemed low quality, detected primarily from user
data. Websites which contained masses of articles, focusing on quantity over
quality, were often hit.51



4       Detecting Spam and Manipulation
You will often hear that your site has to look natural to the search engines.
Just what natural means is hard to define, but essentially it means the profile
of a site whose popularity was never engineered or promoted, and was instead
based on people luckily coming across it and deciding to recommend it to their
friends with links. What's more, you also need to make your site look popular:
creating no links to your site yourself will look natural, but you will have no
chance of competing with people who do unless you have the cash to buy large
amounts of advertising. This section briefly covers what search engines consider
to be acceptable, when and how they can detect violations, and what the
potential penalties are.



4.1 Google Webmaster Guidelines
Google have created a page called Webmaster Guidelines to inform users of what
they consider to be acceptable methods of promoting your website. Whilst the
lines for crossing general principles such as Would I do this if search engines
didn't exist? are somewhat vague, they do offer some specific notes on what not
to do:


   •   Avoid hidden text or hidden links.

   •   Don't use cloaking or sneaky redirects.

   •   Don't send automated queries to Google.

   •   Don't load pages with irrelevant keywords.

   •   Don't create multiple pages, subdomains, or domains with substantially
       duplicate content.

 50 See http://googleblog.blogspot.com/2010/06/our-new-search-index-caffeine.html
 51 See http://blog.searchmetrics.com/us/2011/04/12/googles-panda-update-rolls-out-to-uk/
and http://www.seobook.com/questioning-questions and http://googlewebmastercentral.blogspot.com/2011/05/more-guidance-on-building-high-quality.html




•   Don't create pages with malicious behavior, such as phishing or installing
     viruses, Trojans, or other badware.

   •   Avoid doorway pages created just for search engines, or other cookie-cutter
       approaches such as affiliate programs with little or no original content.

   •   If your site participates in an affiliate program, make sure that your site
       adds value. Provide unique and relevant content that gives users a reason
       to visit your site first.

Most of the methods listed above are naive and easy to detect. Google have been
fairly successful in aligning successful manipulation with the creation of
genuine content, though without any promotion it is unlikely that even the best
content will be noticed.



4.2 Penalties
Penalties52 that Google apply to detected manipulation vary in length of time
and effect, from small ranking penalties for certain keywords on a page through
to site-wide bans, depending upon the sophistication of the manipulation methods
and the quality of the offending site. If you believe you have had a penalty
applied, you can submit a Google Reconsideration Request (http://www.google.com/support/webmasters/bin/answer.py?answer=35843)
from Google Webmaster Tools, once you have fixed the offending issues.



4.3 Detecting Manipulation in Content
There is a fascinating paper by Microsoft which details a number of methods for
detecting spam pages in a search engine's index based on their content. A simple
way is to use Bayesian filters (one is included with Ignite SEO to test your
content as the search engines would), so for example seeing the phrase buy pills
would be a strong indicator of spam. Most of the research is on detecting
blatantly computer-generated lists of keywords, which is fairly easy to do.
Detecting the quality of human-written content is very difficult, so unless you
are endlessly repeating your keywords, if you are writing your own content you
can be reasonably happy with its quality in the search engines' eyes.
   The following graphs are cut from Detecting Spam Web Pages through Content
Analysis53 by Microsoft Research employees.
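   To make the idea concrete, below is a minimal, hand-rolled naive Bayes sketch
of this kind of content filter, in which phrases like buy pills push a document
towards the spam class. The tiny training set and the Laplace smoothing are
illustrative assumptions, not the method used in the Microsoft paper or in any
particular product.

    # Illustrative naive Bayes sketch; not the Microsoft paper's method.
    import math
    from collections import Counter

    def tokenize(text):
        return text.lower().split()

    def train(docs):
        """docs: list of (text, label) pairs with label in {"spam", "ham"}."""
        counts = {"spam": Counter(), "ham": Counter()}
        totals = Counter()
        for text, label in docs:
            counts[label].update(tokenize(text))
            totals[label] += 1
        return counts, totals

    def classify(text, counts, totals):
        vocab = set(counts["spam"]) | set(counts["ham"])
        scores = {}
        for label in ("spam", "ham"):
            logp = math.log(totals[label] / sum(totals.values()))
            denom = sum(counts[label].values()) + len(vocab)
            for word in tokenize(text):
                logp += math.log((counts[label][word] + 1) / denom)  # Laplace smoothing
            scores[label] = logp
        return max(scores, key=scores.get)

    docs = [("buy pills cheap pills online", "spam"),
            ("viagra pills buy now", "spam"),
            ("how search engines rank documents", "ham"),
            ("building links to your website", "ham")]
    counts, totals = train(docs)
    print(classify("cheap pills for sale", counts, totals))   # expected: spam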


 52 http://www.forbes.com/2007/04/29/sanar-google-skyfacet-tech-cx_ag_0430googhell.html
 53 http://cs.wellesley.edu/~cs315/Papers/Ntoulas-DetectingSpamThroughContentAnalysis.pdf


4.4 Detecting Manipulation in Links
Much research has focused on detecting spam pages through their backlinks or
outlinks. Yahoo obtained a patent that uses the rate of link growth to detect
manipulation. Essentially a constant rate of new backlinks, perhaps with a small
growth over time, is expected for a typical site. A saw-tooth pattern of inlinks
is a strong indicator of backlink campaigns that start and stop (though it could
also be an indicator of, say, a site that releases new software monthly).
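   A minimal sketch of this kind of check is given below: given a monthly series
of newly discovered backlinks, flag a site whose counts swing wildly from month
to month. The statistic (coefficient of variation) and the threshold are
illustrative assumptions, not the method claimed in Yahoo's patent.

    # Illustrative heuristic; not Yahoo's patented method.
    from statistics import mean, pstdev

    def looks_bursty(new_links_per_month, threshold=0.75):
        """Return True if month-to-month new-link counts vary a lot."""
        avg = mean(new_links_per_month)
        if avg == 0:
            return False
        return pstdev(new_links_per_month) / avg > threshold

    steady = [40, 42, 45, 44, 48, 50, 53]     # natural, slowly growing profile
    sawtooth = [5, 400, 8, 390, 6, 410, 7]    # campaigns switching on and off
    print(looks_bursty(steady))     # False
    print(looks_bursty(sawtooth))   # True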




   In their paper, Fetterly et al. analyse the indegree (incoming/backlinks) and
outdegree (links on the page) distributions of web pages:
   Most web pages have in and outdegrees that follow a power-law distribution.
Occasionally, however, search engines encounter substantially more pages with
the exact same in or outdegrees than what is predicted by the distribution
formula. The authors find that the vast majority of such outliers are spam pages.
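   Below is a minimal sketch of that outlier check: count how many pages share
each exact in-degree and compare the counts against what a power law would
predict. The exponent, the 10x-expected cutoff and the minimum support of 30
pages are illustrative assumptions, not Fetterly et al.'s actual procedure.

    # Illustrative heuristic; not Fetterly et al.'s procedure.
    import random
    from collections import Counter

    def degree_outliers(indegrees, alpha=2.1, factor=10, min_pages=30):
        freq = Counter(indegrees)                  # degree -> number of pages
        norm = sum(k ** -alpha for k in freq)      # normalise over observed degrees
        total = sum(freq.values())
        outliers = []
        for k, observed in sorted(freq.items()):
            expected = total * (k ** -alpha) / norm
            if observed >= min_pages and observed > factor * expected:
                outliers.append((k, observed, round(expected, 1)))
        return outliers

    random.seed(0)
    # ~10,000 pages with roughly power-law in-degrees, plus 500 pages that all
    # have exactly 37 inlinks (e.g. a template-driven link farm).
    natural = [int(random.paretovariate(1.1)) for _ in range(10000)]
    farm = [37] * 500
    print(degree_outliers(natural + farm))         # flags the spike at degree 37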
   As discussed in the Trust Rank section earlier, a large amount of links from
sites that have already been detected as linking to spam (so-called
untrustworthy hubs) is a negative indicator. Links from unrelated websites,
reciprocal links, links outside of the main content, links from sites that are
known to host paid links and many other signals are likely taken into
consideration.
   Zhang et al. have identified a method for detecting unusually highly
interconnected groups of web pages. More methods of identifying manipulative
sites are listed in Link Spam Alliances by Gyöngyi and Garcia-Molina.



4.5 Other Methods
If you think a competitor has been using methods that violate the webmaster
guidelines, you can report them to Google54. It is good practice to ensure that
any site you wish to keep for a long time, and from which you expect to get
reasonable amounts of traffic, stays within these guidelines.
   Google will sometimes manually review websites without prompting: Google
Quality Raters inspect sites for relevance to results but can also flag web
pages as spam. Particular markets are inspected more often than others.



 54 https://www.google.com/webmasters/tools/spamreport?hl=enpli=1




Part II
Practice
5    An Example Campaign
Now we've covered the theory, it's time for a real-world example of putting it
into practice.



5.1 Company Profile
John runs a driving school in Springfield, Ohio. He has a website he has owned
for a couple of years, which ranks around the second page for most searches
related to driving schools in Ohio and receives about 20 visitors a day, a third
from search engines and two thirds from links from local websites.
   A quick search for what he imagines would be his main keyword, driving
school Springfield Ohio, has a company directory site at the top followed by
other directories, companies and people asking on forums for recommendations.
This mix of relevant small companies' websites and small pages on big websites
indicates the keyword is of medium difficulty to rank for.



5.2 Goals
John thinks if he can get his site to rank 3rd instead of around the middle of
the second page for his core keywords, he will increase his search traffic by
around 1000%, his overall traffic by about 300%, and roughly double his sales.
He aims to do this over a period of roughly one month.



5.3 Competitor Research
John finds his main competitors by searching, and gets estimates of their
traffic sources using sites such as compete.com and serversiders.com. A tool
such as Ignite SEO can automatically build SEO reports of competitors, listing
their paid and organic keywords, demographics and backlinks. Looking at the HTML
source code of some of his competitors reveals their targeted keywords in the
<meta name="keywords" content="keyword1, keyword2"> tag.



5.4 Keyword Research
John takes his initial guesses of what potential customers might search for,
together with those from his competitors and his existing traffic, and expands
this list using the Google Keyword Tool55 and Google Insights56.

 55 https://adwords.google.co.uk/select/KeywordToolExternal
 56 http://www.google.com/insights




5.5 Content Creation
John takes his keywords and creates a small amount of content on his website
containing them. He then quickly creates a large amount of content and creates
sites hosted on free hosting sites57, each one targeting a different keyword.
The content generator section of Ignite SEO58 is perfect for this.


5.6 Website Check
Before investing in off-site promotion (i.e. link building), it is worth
performing a quick check that the site is search engine friendly. Creating an
account in Google Webmaster Tools will let you know if Google has any issues
indexing your website, and it is worth ensuring navigation isn't over-reliant on
JavaScript or Flash.



5.7 Link Building
This is the core process that will actually improve John's rankings. By looking
at his competitors' backlinks using Yahoo's linkdomain: command, John replicates
their links to his website by visiting each site one by one. Using a tool such
as Ignite SEO, he can automatically build links to the hosted sites he quickly
created in 5.5, without the risk of a link campaign negatively affecting the
rankings of his core website. Other signals of quality, such as Facebook and
Twitter recommendations, are built here.



5.8 Analysis
The success of the campaign is measured with a good tracking system such as
Google Analytics, as well as by tracking the new incoming links with Google
Webmaster Tools and Yahoo's link: command. The results are compared with the
goals, and the whole process is refined and repeated.

 57 http://igniteresearch.net/which-web-2-0-ranks-best-hubpages-vs-squidoo-vs-tumblr-vs-blogspot-etc/
 58 http://igniteresearch.net




About the Author
Christopher Doman is a partner of Ignite Research, a firm specialising in
software and consultancy for search engine marketing. He holds a BA in Computer
Science from the University of Cambridge.




                                     32

Weitere ähnliche Inhalte

Was ist angesagt? (9)

Thesis_Report
Thesis_ReportThesis_Report
Thesis_Report
 
Dotcomology the science of making money online
Dotcomology   the science of making money onlineDotcomology   the science of making money online
Dotcomology the science of making money online
 
Dotcomology
DotcomologyDotcomology
Dotcomology
 
Google Search Quality Rating Program General Guidelines 2011
Google Search Quality Rating Program General Guidelines 2011Google Search Quality Rating Program General Guidelines 2011
Google Search Quality Rating Program General Guidelines 2011
 
Upstill_thesis_2000
Upstill_thesis_2000Upstill_thesis_2000
Upstill_thesis_2000
 
report.doc
report.docreport.doc
report.doc
 
Social Media Marketing- Fashion Merchandising- Final Project
Social Media Marketing- Fashion Merchandising- Final Project Social Media Marketing- Fashion Merchandising- Final Project
Social Media Marketing- Fashion Merchandising- Final Project
 
CallQ scope and user specification summary
CallQ scope and user specification summaryCallQ scope and user specification summary
CallQ scope and user specification summary
 
Access Tutorial
Access TutorialAccess Tutorial
Access Tutorial
 

Andere mochten auch

Done rerea dlink spam alliances good
Done rerea dlink spam alliances goodDone rerea dlink spam alliances good
Done rerea dlink spam alliances goodJames Arnold
 
Done reread thecomputationalcomplexityoflinkbuilding
Done reread thecomputationalcomplexityoflinkbuildingDone reread thecomputationalcomplexityoflinkbuilding
Done reread thecomputationalcomplexityoflinkbuildingJames Arnold
 
Done reread the effect of new links on google pagerank
Done reread the effect of new links on google pagerankDone reread the effect of new links on google pagerank
Done reread the effect of new links on google pagerankJames Arnold
 
Done rerea dspamguide2003
Done rerea dspamguide2003Done rerea dspamguide2003
Done rerea dspamguide2003James Arnold
 
Done rerea dlink-farm-spam(2)
Done rerea dlink-farm-spam(2)Done rerea dlink-farm-spam(2)
Done rerea dlink-farm-spam(2)James Arnold
 
Done rerea dwebspam paper good
Done rerea dwebspam paper goodDone rerea dwebspam paper good
Done rerea dwebspam paper goodJames Arnold
 
Motivation Enhancement Therapy
Motivation Enhancement TherapyMotivation Enhancement Therapy
Motivation Enhancement Therapyjacod1
 

Andere mochten auch (7)

Done rerea dlink spam alliances good
Done rerea dlink spam alliances goodDone rerea dlink spam alliances good
Done rerea dlink spam alliances good
 
Done reread thecomputationalcomplexityoflinkbuilding
Done reread thecomputationalcomplexityoflinkbuildingDone reread thecomputationalcomplexityoflinkbuilding
Done reread thecomputationalcomplexityoflinkbuilding
 
Done reread the effect of new links on google pagerank
Done reread the effect of new links on google pagerankDone reread the effect of new links on google pagerank
Done reread the effect of new links on google pagerank
 
Done rerea dspamguide2003
Done rerea dspamguide2003Done rerea dspamguide2003
Done rerea dspamguide2003
 
Done rerea dlink-farm-spam(2)
Done rerea dlink-farm-spam(2)Done rerea dlink-farm-spam(2)
Done rerea dlink-farm-spam(2)
 
Done rerea dwebspam paper good
Done rerea dwebspam paper goodDone rerea dwebspam paper good
Done rerea dwebspam paper good
 
Motivation Enhancement Therapy
Motivation Enhancement TherapyMotivation Enhancement Therapy
Motivation Enhancement Therapy
 

Ähnlich wie Seo book

Dimensional modeling in a bi environment
Dimensional modeling in a bi environmentDimensional modeling in a bi environment
Dimensional modeling in a bi environmentdivjeev
 
Report on e-Notice App (An Android Application)
Report on e-Notice App (An Android Application)Report on e-Notice App (An Android Application)
Report on e-Notice App (An Android Application)Priyanka Kapoor
 
Analytics configuration reference_sc61_a4
Analytics configuration reference_sc61_a4Analytics configuration reference_sc61_a4
Analytics configuration reference_sc61_a4samsherwood
 
Analysis Of The Modern Methods For Web Positioning
Analysis Of The Modern Methods For Web PositioningAnalysis Of The Modern Methods For Web Positioning
Analysis Of The Modern Methods For Web PositioningPaweł Kowalski
 
bkremer-report-final
bkremer-report-finalbkremer-report-final
bkremer-report-finalBen Kremer
 
Thesis Nha-Lan Nguyen - SOA
Thesis Nha-Lan Nguyen - SOAThesis Nha-Lan Nguyen - SOA
Thesis Nha-Lan Nguyen - SOANha-Lan Nguyen
 
Migrating to netcool precision for ip networks --best practices for migrating...
Migrating to netcool precision for ip networks --best practices for migrating...Migrating to netcool precision for ip networks --best practices for migrating...
Migrating to netcool precision for ip networks --best practices for migrating...Banking at Ho Chi Minh city
 
digiinfo website project report
digiinfo website project reportdigiinfo website project report
digiinfo website project reportABHIJEET KHIRE
 
Yii blog-1.1.9
Yii blog-1.1.9Yii blog-1.1.9
Yii blog-1.1.9Netechsrl
 
Research: Developing an Interactive Web Information Retrieval and Visualizati...
Research: Developing an Interactive Web Information Retrieval and Visualizati...Research: Developing an Interactive Web Information Retrieval and Visualizati...
Research: Developing an Interactive Web Information Retrieval and Visualizati...Roman Atachiants
 
Hello, android introducing google’s mobile development platform, 2nd editio...
Hello, android   introducing google’s mobile development platform, 2nd editio...Hello, android   introducing google’s mobile development platform, 2nd editio...
Hello, android introducing google’s mobile development platform, 2nd editio...Kwanzoo Dev
 
RDGB Corporate Profile
RDGB Corporate ProfileRDGB Corporate Profile
RDGB Corporate ProfileRejaul Islam
 
Stock_Market_Prediction_using_Social_Media_Analysis
Stock_Market_Prediction_using_Social_Media_AnalysisStock_Market_Prediction_using_Social_Media_Analysis
Stock_Market_Prediction_using_Social_Media_AnalysisOktay Bahceci
 

Ähnlich wie Seo book (20)

Aregay_Msc_EEMCS
Aregay_Msc_EEMCSAregay_Msc_EEMCS
Aregay_Msc_EEMCS
 
Dimensional modeling in a bi environment
Dimensional modeling in a bi environmentDimensional modeling in a bi environment
Dimensional modeling in a bi environment
 
Report on e-Notice App (An Android Application)
Report on e-Notice App (An Android Application)Report on e-Notice App (An Android Application)
Report on e-Notice App (An Android Application)
 
Analytics configuration reference_sc61_a4
Analytics configuration reference_sc61_a4Analytics configuration reference_sc61_a4
Analytics configuration reference_sc61_a4
 
Analysis Of The Modern Methods For Web Positioning
Analysis Of The Modern Methods For Web PositioningAnalysis Of The Modern Methods For Web Positioning
Analysis Of The Modern Methods For Web Positioning
 
bkremer-report-final
bkremer-report-finalbkremer-report-final
bkremer-report-final
 
Thesis Nha-Lan Nguyen - SOA
Thesis Nha-Lan Nguyen - SOAThesis Nha-Lan Nguyen - SOA
Thesis Nha-Lan Nguyen - SOA
 
Migrating to netcool precision for ip networks --best practices for migrating...
Migrating to netcool precision for ip networks --best practices for migrating...Migrating to netcool precision for ip networks --best practices for migrating...
Migrating to netcool precision for ip networks --best practices for migrating...
 
Ibm tivoli ccmdb implementation recommendations
Ibm tivoli ccmdb implementation recommendationsIbm tivoli ccmdb implementation recommendations
Ibm tivoli ccmdb implementation recommendations
 
digiinfo website project report
digiinfo website project reportdigiinfo website project report
digiinfo website project report
 
Report-V1.5_with_comments
Report-V1.5_with_commentsReport-V1.5_with_comments
Report-V1.5_with_comments
 
Yii blog-1.1.9
Yii blog-1.1.9Yii blog-1.1.9
Yii blog-1.1.9
 
Master_Thesis
Master_ThesisMaster_Thesis
Master_Thesis
 
Research: Developing an Interactive Web Information Retrieval and Visualizati...
Research: Developing an Interactive Web Information Retrieval and Visualizati...Research: Developing an Interactive Web Information Retrieval and Visualizati...
Research: Developing an Interactive Web Information Retrieval and Visualizati...
 
Hello, android introducing google’s mobile development platform, 2nd editio...
Hello, android   introducing google’s mobile development platform, 2nd editio...Hello, android   introducing google’s mobile development platform, 2nd editio...
Hello, android introducing google’s mobile development platform, 2nd editio...
 
Yii2 guide
Yii2 guideYii2 guide
Yii2 guide
 
Rapidminer 4.4-tutorial
Rapidminer 4.4-tutorialRapidminer 4.4-tutorial
Rapidminer 4.4-tutorial
 
Final Report
Final ReportFinal Report
Final Report
 
RDGB Corporate Profile
RDGB Corporate ProfileRDGB Corporate Profile
RDGB Corporate Profile
 
Stock_Market_Prediction_using_Social_Media_Analysis
Stock_Market_Prediction_using_Social_Media_AnalysisStock_Market_Prediction_using_Social_Media_Analysis
Stock_Market_Prediction_using_Social_Media_Analysis
 

Mehr von James Arnold

Done reread sketchinglandscapesofpagefarmsnpcomplete
Done reread sketchinglandscapesofpagefarmsnpcompleteDone reread sketchinglandscapesofpagefarmsnpcomplete
Done reread sketchinglandscapesofpagefarmsnpcompleteJames Arnold
 
Done reread sketchinglandscapesofpagefarmsnpcomplete(3)
Done reread sketchinglandscapesofpagefarmsnpcomplete(3)Done reread sketchinglandscapesofpagefarmsnpcomplete(3)
Done reread sketchinglandscapesofpagefarmsnpcomplete(3)James Arnold
 
Done rerea dquality-rater-guidelines-2007 (1)
Done rerea dquality-rater-guidelines-2007 (1)Done rerea dquality-rater-guidelines-2007 (1)
Done rerea dquality-rater-guidelines-2007 (1)James Arnold
 
Done reread sketchinglandscapesofpagefarmsnpcomplete(2)
Done reread sketchinglandscapesofpagefarmsnpcomplete(2)Done reread sketchinglandscapesofpagefarmsnpcomplete(2)
Done reread sketchinglandscapesofpagefarmsnpcomplete(2)James Arnold
 
Done rerea dquality-rater-guidelines-2007 (1)(2)
Done rerea dquality-rater-guidelines-2007 (1)(2)Done rerea dquality-rater-guidelines-2007 (1)(2)
Done rerea dquality-rater-guidelines-2007 (1)(2)James Arnold
 
Done rerea dquality-rater-guidelines-2007 (1)(3)
Done rerea dquality-rater-guidelines-2007 (1)(3)Done rerea dquality-rater-guidelines-2007 (1)(3)
Done rerea dquality-rater-guidelines-2007 (1)(3)James Arnold
 
Done reread maximizingpagerankviaoutlinks
Done reread maximizingpagerankviaoutlinksDone reread maximizingpagerankviaoutlinks
Done reread maximizingpagerankviaoutlinksJames Arnold
 
Done reread maximizingpagerankviaoutlinks(3)
Done reread maximizingpagerankviaoutlinks(3)Done reread maximizingpagerankviaoutlinks(3)
Done reread maximizingpagerankviaoutlinks(3)James Arnold
 
Done reread maximizingpagerankviaoutlinks(2)
Done reread maximizingpagerankviaoutlinks(2)Done reread maximizingpagerankviaoutlinks(2)
Done reread maximizingpagerankviaoutlinks(2)James Arnold
 
Done rerea dlink-farm-spam
Done rerea dlink-farm-spamDone rerea dlink-farm-spam
Done rerea dlink-farm-spamJames Arnold
 
Done rerea dlink-farm-spam(3)
Done rerea dlink-farm-spam(3)Done rerea dlink-farm-spam(3)
Done rerea dlink-farm-spam(3)James Arnold
 
Done reread deeperinsidepagerank
Done reread deeperinsidepagerankDone reread deeperinsidepagerank
Done reread deeperinsidepagerankJames Arnold
 
Done reread detecting phrase-level duplication on the world wide we
Done reread detecting phrase-level duplication on the world wide weDone reread detecting phrase-level duplication on the world wide we
Done reread detecting phrase-level duplication on the world wide weJames Arnold
 

Mehr von James Arnold (14)

Done reread sketchinglandscapesofpagefarmsnpcomplete
Done reread sketchinglandscapesofpagefarmsnpcompleteDone reread sketchinglandscapesofpagefarmsnpcomplete
Done reread sketchinglandscapesofpagefarmsnpcomplete
 
Done reread sketchinglandscapesofpagefarmsnpcomplete(3)
Done reread sketchinglandscapesofpagefarmsnpcomplete(3)Done reread sketchinglandscapesofpagefarmsnpcomplete(3)
Done reread sketchinglandscapesofpagefarmsnpcomplete(3)
 
Done rerea dquality-rater-guidelines-2007 (1)
Done rerea dquality-rater-guidelines-2007 (1)Done rerea dquality-rater-guidelines-2007 (1)
Done rerea dquality-rater-guidelines-2007 (1)
 
Done reread sketchinglandscapesofpagefarmsnpcomplete(2)
Done reread sketchinglandscapesofpagefarmsnpcomplete(2)Done reread sketchinglandscapesofpagefarmsnpcomplete(2)
Done reread sketchinglandscapesofpagefarmsnpcomplete(2)
 
Done rerea dquality-rater-guidelines-2007 (1)(2)
Done rerea dquality-rater-guidelines-2007 (1)(2)Done rerea dquality-rater-guidelines-2007 (1)(2)
Done rerea dquality-rater-guidelines-2007 (1)(2)
 
Done rerea dquality-rater-guidelines-2007 (1)(3)
Done rerea dquality-rater-guidelines-2007 (1)(3)Done rerea dquality-rater-guidelines-2007 (1)(3)
Done rerea dquality-rater-guidelines-2007 (1)(3)
 
Done reread maximizingpagerankviaoutlinks
Done reread maximizingpagerankviaoutlinksDone reread maximizingpagerankviaoutlinks
Done reread maximizingpagerankviaoutlinks
 
Done reread maximizingpagerankviaoutlinks(3)
Done reread maximizingpagerankviaoutlinks(3)Done reread maximizingpagerankviaoutlinks(3)
Done reread maximizingpagerankviaoutlinks(3)
 
Done reread maximizingpagerankviaoutlinks(2)
Done reread maximizingpagerankviaoutlinks(2)Done reread maximizingpagerankviaoutlinks(2)
Done reread maximizingpagerankviaoutlinks(2)
 
Done rerea dlink-farm-spam
Done rerea dlink-farm-spamDone rerea dlink-farm-spam
Done rerea dlink-farm-spam
 
Done rerea dlink-farm-spam(3)
Done rerea dlink-farm-spam(3)Done rerea dlink-farm-spam(3)
Done rerea dlink-farm-spam(3)
 
Done reread deeperinsidepagerank
Done reread deeperinsidepagerankDone reread deeperinsidepagerank
Done reread deeperinsidepagerank
 
Done reread detecting phrase-level duplication on the world wide we
Done reread detecting phrase-level duplication on the world wide weDone reread detecting phrase-level duplication on the world wide we
Done reread detecting phrase-level duplication on the world wide we
 
Seo book
Seo bookSeo book
Seo book
 

Kürzlich hochgeladen

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 

Kürzlich hochgeladen (20)

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 

Seo book

  • 1. 1
  • 2. Contents I Theory 5 0.1 Why SEO is important . . . . . . . . . . . . . . . . . . . . . . . . 5 0.2 Dierent needs from SEO . . . . . . . . . . . . . . . . . . . . . . 5 1 What is a Search Engine? 7 1.1 History of Search Engines . . . . . . . . . . . . . . . . . . . . . . 7 1.2 Important Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.2.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.2.2 Dynamic Data . . . . . . . . . . . . . . . . . . . . . . . . 8 1.2.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.2.4 Spam and Manipulation . . . . . . . . . . . . . . . . . . 8 1.3 How a Search Engine works . . . . . . . . . . . . . . . . . . . . . 9 1.3.1 Text acquisition . . . . . . . . . . . . . . . . . . . . . . . 10 1.3.2 Duplicate Content Detection . . . . . . . . . . . . . . . . 10 1.3.3 Text transformation . . . . . . . . . . . . . . . . . . . . . 11 1.3.4 Index Creation . . . . . . . . . . . . . . . . . . . . . . . . 12 1.3.5 User Interaction . . . . . . . . . . . . . . . . . . . . . . . 12 1.3.6 Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.3.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2 How good can a search engine be? 13 2.1 NP Hard Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2 AI Hard Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3 Competitors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3 Ranking Factors 15 3.1 On Page Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.2 O Page Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.3 Google PageRank Notes . . . . . . . . . . . . . . . . . . . . . . . 18 3.3.1 Short Description . . . . . . . . . . . . . . . . . . . . . . . 19 3.3.2 Mathematical Description . . . . . . . . . . . . . . . . . . 19 3.3.3 Interesting Notes on the Original Implementation of PageR- ank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.3.4 Optimal Linking Strategies . . . . . . . . . . . . . . . . . 21 3.3.5 Implementation to make computing PageRank faster . . . 23 3.3.6 HITS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.3.7 Is linking out a good thing? . . . . . . . . . . . . . . . . . 23 3.3.8 TrustRank / Bad Page Rank . . . . . . . . . . . . . . . . 24 3.3.9 Improvements to Google's ranking algorithms . . . . . . . 25 2
  • 3. 4 Detecting Spam and Manipulation 27 4.1 Google Webmaster Guidelines . . . . . . . . . . . . . . . . . . . . 27 4.2 Penalties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.3 Detecting Manipulation in Content . . . . . . . . . . . . . . . . . 28 4.4 Detecting Manipulation in Links . . . . . . . . . . . . . . . . . . 28 4.5 Other Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 II Practice 29 5 An Example Campaign 30 5.1 Company Prole . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 5.2 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 5.3 Competitor Research . . . . . . . . . . . . . . . . . . . . . . . . . 30 5.4 Keyword Research . . . . . . . . . . . . . . . . . . . . . . . . . . 30 5.5 Content Creation . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 5.6 Website Check . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 5.7 Link Building . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 5.8 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3
  • 4. Preface This book aims to provide a general overview of how search engines rank doc- uments in practice, the core of which will remain true even as Search Engine's algorithms are rened. 4
  • 5. Part I Theory 0.1 Why SEO is important ˆ A higher search engine result will receive exponentially greater clicks than a lower one For example, if a search was repeated 1000 times by dierent users, this is typically how many clicks each result would get. Position Clicks 1 222 2 63 3 45 4 32 5 26 6 21 7 18 8 16 9 15 10 16 Source: Leaked Aol Click Data ˆ Paid adverts have low click through rates, and get expensive quickly Search Engine % Organic Click Through Rate % Paid Result Click Through Rate Google 72 28 Yahoo 61 39 MSN 71 29 AOL 50 50 Average 63 37 88% of online search dollars are spent on paid results, even though 85% of searchers click on organic results. Vanessa Fox, Marketing in the Age of Google, May 3, 2010 0.2 Dierent needs from SEO There are many dierent reasons you may wish to engage in optimising your search results, including ˆ Money - Sales for e-commerce sites are directly correlated with trac. ˆ Reputation - Some companies go to the extent of pushing negative arti- cles down in the rankings. 5
  • 6. ˆ Branding - Coming up top in the results pages is impressive to customers, and is particularly important in industries where reputation is extremely important. 6
  • 7. 1 What is a Search Engine? 1.1 History of Search Engines The rst mechanised information retrieval sysyems were built by the US military to analyse the mass of documents being captured from the Germans. Research was boosted when the UK and US governments funded research to reduce a perceived science gap with the USSR. By the time the internet was becoming commonplace in the early 1990s information retrieval was at an advanced stage. Complicated methods, primarily statistical, had been developed an archives of thousands of documents could be searched in seconds. Web search engines are a special case of information retrieval systems, ap- plied to the massive collection of documents available on the internet. A typical search engine in 1990 was split into two parts: a web spider that traverses the web following links and creating a local index of the pages, then traditional in- formation retrieval methods to search the index for pages relevant to the users query and order the pages by some ranking function. Many factors inuence a person's decision about what is relevant, such as the current task, context and freshness. In 1998 pages were primarily ranked by their contextual content. Since this is entirely controlled by the owner of the page, results were easy to manipulate and as the Internet became ever more commercialized the noise from spam in SERP's (search engine results pages) made search a frustrating activity. It was also hard to discern websites which more people would want to visit, for example a celebrities ocial home page, from less wanted websites with similar content, for example a site. For these reasons directory sites such as Yahoo were still popular, despite being out of date and making the user work out the relevance Google's founders Larry Page and Sergey Brin's Page Rank innovation (named after Larry Page), and that of a similar algorithm also released in 1998 called Hyperlink-induced Topic Search (HITS) by Jon Kleinberg, was to use the addi- tional meta information from the link structure of the Internet. A more detailed description of Page Rank will follow in [chapter], but for now Google's own de- scription will suce. PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves important weight more heavily and help to make other pages important. Whilst it is impossible to know how Google has evolved their algorithms since the 1998 paper that launched page rank, and how real world ecient implementation diers from the theory, as Google themselves say the PageRank algorithm remains the heart of Google's software ... and continues to provide the basis for all of [their] web search tools. The search engines continue to evolve at a blistering pace, improving their ranking algorithms (Google says 7
  • 8. 1 there are now over 200 ranking factors considered for each search ), and indexing a growing Internet more rapidly. 1.2 Important Issues The building of a system as complex as a modern search engine is all about balancing dierent positive qualities. For example, you could eectively prevent low quality spam by paying humans to review every document on the web, but the cost would be immense. Or you could speed up your search engine by considering only every other document your spider encounters, but the relevance of results would suer. Some things, such as getting a computer to analyse a document to with the same quality as a human, are theoretically impossible today, but Google in particular is pushing boundaries and getting ever closer. Search engines have some particular considerations: 1.2.1 Performance The response time to a user's query must be lightening fast. 1.2.2 Dynamic Data Unlike a traditional information retrieval system in a library the pages on the Internet are constantly changing. 1.2.3 Scalability Search engines need to work with billions of users searching through trillions of documents, distributed across the Earth. 1.2.4 Spam and Manipulation Actively engaging against other humans to maintain the relevancy of results is relatively unique to search engines. In a library system you may have an author that creates a long title packed with words their readers may be interested in, but that's about the worst of it. When designing your search engine you are in a constant battle with adversaries who will attempt to reverse engineer your algorithm to nd the easiest ways to aect your restyles. A common term for this relation ship is Adverse rial Information Retrieval. The relationship between the owner of a Web site trying to rank high on a search engine and the search engine designer is an adversarial relationship in a zero-sum game. That is, assuming the results were better before, every gain for the web site owner is a loss for the search engine designer. Classifying where your eorts cross helping a search engine be aware of your web site's content and popularity, which should help to improve a search engine's results, and start instead ranking beyond your means and start decreasing the quality of a search engine's results can be 1 See http://googlewebmastercentral.blogspot.com/2008/10/ good-times-with-inbound-links.html 8
  • 9. somewhat tricky. The practicalities of what search engines consider to be spam, and as importantly what they can detect and x, will be discussed later. 2 According to Web Spam Taxonomy , approximately 10-15% of indexed content on the web is spam. What is considered spam and duplicate content varies, which makes this statistic hard to verify. There is a core of about 56 million pages 3 that are highly interlinked at the center of the Internet, and are less likely to be spam. Document's further away (in link steps) from this core are more likely be spam. Deciding the quality of a document well (say whether it is a page written by an expert in the eld, or generated by a computer program using natural language processing) is an AI Complete problem, that is it won't be possible until we have articial intelligence that can match that of a human. However, search engines hope to get spam under control by lessening the nancial incentive of spam. This quote from a Microsoft Research paper 4 ex- presses this nicely: Eectively detecting web spam is essentially an arms race be- tween search engines and site operators. It is almost certain that we will have to adapt our methods overtime, to accommodate for new spam methods that the spammers use. It is our hope that our work will help the users enjoy a better search experience on the web.Victory does not require perfection, just a rate of detec-tion that alters the economic balance for a would-be spammer. It is our hope that continued research on this front can make eective spam more expensive than genuine content. 5 Google developers for their part describe web spam as the following , citing the detrimental impact it has upon users These manipulated documents can be referred to as spam. When a user receives a manipulated document in the search results and clicks on the link to go to the manipulated document, the document is very often an advertisement for goods or services unrelated to the search query or a pornography website or the manipulated document automatically forwards the user on to a website unrelated to the user's query. 1.3 How a Search Engine works A typical search engine can split into two parts: Indexing, where the Internet is transformed into an internal representation that can be eciently searched. The query process, where the index is searched for the user query and documents are ranked and returned to the user in a list. Indexing 2 Zoltán Gyöngyi and Hector Garcia-Molina, Stanford University. First International Work- shop on Adversarial Information Retrieval on the Web, May 2005 3 See On Determining Communities in the Web by K Verbeurg 4 See Detecting Spam Web Pages through Content Analysis by A Ntoulas 5 See patent 7302645: Methods and systems for identifying manipulated articles 9
  • 10. 1.3.1 Text acquisition A crawler starts at a seed site such as the DMOZ directory, then repeatedly follows links to nd documents across the web, storing the content of the pages and associated meta data (such as the date of indexing, which page linked to the site). In a modern search engine the crawler is constantly running, downloading thousands of pages simultaneously, to continuously update and expand the in- dex. A good crawler will cover a large percentage of the pages on the Internet, and visit popular pages frequently to keep its index fresh. A crawler will connect to the web server and use a HTTP request to retrieve the document, if it has changed. On average, Web page updates follow the Poisson distribution - that is the crawler can expect the time until the web page updates next time to follow an exponential distribution. Crawlers are now also indexing near real time data through varying sources such as access to RSS Feeds and the Twitter API, and are able to index a range of formats such as PDF's and Flash. These formats are converted into a common intermediate format such as XML. A crawler can also be asked to update its copy of a page via methods such as a ping or XML sitemap, but the update time will still be up to the crawler. The document data store stores the text and meta data the crawler retrieves, it must allow for very fast access to a large amount of documents. Text can be compressed relatively easily, and pages are typically indexed by a hash of their URL. Google's original patent used a system called BigTable, Google now keeps documents in sections called shards distributed over a range of data centres (this oers performance, redundancy and security benets). 1.3.2 Duplicate Content Detection Detecting exact duplicates is easy, remove the boilerplate content (menus etc.) then compare the core text through check sums. Detecting near duplicates is harder, particularly if you want to build an algorithm that is fast enough to compare a document against every other document in the index. To perform faster duplicate detection, nger prints of a document are taken. A simple ngerprinting algorithm for this is outlined here: 1. Parse the document into words, and remove formatting content such as punctuation and HTML tags. 2. The words are grouped into groups of words (called n-grams, a 3-gram being 3 words, 4-gram 4 words etc.) 3. Some of these n-grams are selected to represent a document 4. The selected n-grams are hashed to create a shorter description 5. The hash values are stored in a quick look up database 6. The documents are compared by looking at overlaps of ngerprints. 10
  • 11. Fingerprinting in action A paper 6 by four Google employees found the following statistics across their index of the web. Number of tokens: 1,024,908,267,229 Number of sentences: 95,119,665,584 Number of unigrams: 13,588,391 Number of bigrams: 314,843,401 Number of trigrams: 977,069,902 Number of fourgrams: 1,313,818,354 Number of vegrams: 1,176,470,663 Most common trigram in English: all rights reserved Detecting unusual patterns of n-grams can also be used to detect low qual- ity/spam documents . 7 1.3.3 Text transformation Tokenization is the process of splitting a series of characters up into separate words. These tokens are then parsed to look for tokens such as a /a to nd which parts of the text is plain text, links and such. ˆ Identifying Content Sections of documents that are just content are found, in an attempt to ignore boiler plate content such as navigation menus. A simple way is to look for sections where there are few HTML tags, more complicated methods consider the visual layout of the page. ˆ Stopping Common words such as the and and are removed to increase the eciency of the search engine, resulting in a slight loss in accuracy. In general, the more unusual a word the better it is at determining if a document is relevant. 6 See N-gram Statistics in English and Chinese: Similarities and Dierences 7 See http://www.seobythesea.com/?p=5108 11
  • 12. ˆ Stemming Stemming reduces words to just their stem, for example computer and com- puting become comput. Typically around a 10% improvement is seen in relevance in English, and up to 50% in Arabic. The classic stemmer algorithm is the Porter Stemmer which works through a series of rules such as replace sses with ss to stresses - stress. ˆ Information Extraction Trying to determine the meaning of text is very dicult in general, but certain words can give clues. For example the phrase x has worked at y is useful when building an index of employees. 1.3.4 Index Creation Document statistics such as the count of words are stored for use in ranking algorithms. 8 is created to allow for fast full text searches. An inverted index 9 The index is distributed across multiple data centres across the globe . 1.3.5 User Interaction The user is provided with an interface in which to give their query. The query is then transformed, using similar techniques to with documents such as stem- ming, as well as spell checking and expanding the query to nd other queries synonymous with the users query. After ranking the document set, a top set of results are displayed together with snippets to show how they were matched. 1.3.6 Ranking A scoring function calculates scores for documents. Some parts of the scoring can be performed at query time, others at document processing time. 1.3.7 Evaluation Users queries and their actions are logged in detail for improve results. For example, if a user clicks on a result then quickly performs the same search again, it is likely that they clicked a poor result. 8 An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its document in a set of documents. The purpose of an inverted index is to allow fast full text searches, at a cost of increased processing when a document is added http://en.wikipedia.org/wiki/Inverted_index to the database. 9A approach is at http://highscalability.com/ good overview of Google's shard google-architecture 12
  • 13. 2 How good can a search engine be? There are some very specic limits in computer science as to what a computer program is capable of doing, and these have direct consequences for how search engines can index and rank your web pages. The two core sets or problems are NP-Complete problems, which for large sets of data take too long to solve perfectly, and AI-Complete problems, which can't be done perfectly until we have computers that are intelligent as people. That doesn't mean search engines can't make approximations, for example nding the shortest route on a map is a NP-Complete problem yet Google maps still manages to plot pretty good routes 10 . 2.1 NP Hard Problems Polynomial (P) problems can be solved in polynomial time, that is relatively quickly. Non Polynomial (NP) problems cannot be solved in polynomial time, that is they can't be solved for any reasonably large set of inputs such as a number of web pages. The time taken to solve the NP hard problem (in red) grows extremely quickly as the size of the problem grows. These concepts become complex quickly, but the key thing to pick up is that if a problem is NP Hard there is no way it can ever be solved perfectly for something as large as a search engines index, and approximations will have to be used. There are some NP Hard problems that are of particular interest to SEO: ˆ The Hamiltonian Path Problem - Detecting a greedy network (IE if you interlink your web pages to hoard page rank) in the structure of a Hamil- tonian path 11 is an NP hard problem ˆ Detecting Page Farms (the set of pages that link to a page) is NP hard 12 10 http://www.youtube.com/watch?v=-0ErpE8tQbw 11 http://en.wikipedia.org/wiki/Hamiltonian_path 12 See Sketching Landscapes of Page Farms by Bin Zhou and Jian Pei 13
• Detecting phrase-level duplication in a search engine's index 13.

2.2 AI Hard Problems

AI-hard problems require intelligence matching that of a human being to be solved. Examples include the Turing Test (tricking a human into thinking they are talking to a human, not a computer), recognising difficult CAPTCHAs, and translating text as well as an expert (who wouldn't be perfect either).

During a question-and-answer session after a presentation at his alma mater, Stanford University, in May 2002, Page said that Google would fulfil its mission only when its search engine was AI-complete, and said something similar in an interview with Newsweek and then Playboy. "I think we're pretty far along compared to 10 years ago," he said. "At the same time, where can you go? Certainly if you had all the world's information directly attached to your brain, or an artificial brain that was smarter than your brain, you'd be better. Between that and today, there's plenty of space to cover." What would a perfect search engine look like? we asked. "It would be the mind of God." 14 "And, actually, the ultimate search engine, which would understand, you know, exactly what you wanted when you typed in a query, and it would give you the exact right thing back — in computer science we call that artificial intelligence. That means it would be smart, and we're a long way from having smart computers." 15

Of particular interest to SEO is that fully understanding the meaning of human text is an AI-complete problem, and even getting close to understanding words in context is very difficult 16. This means automatically judging the quality of reasonable computer-generated text against that of a human expert is tricky. It's not unusual to see websites packed with decent computer-generated text (detecting which automatically is an AI-complete problem) and single phrases stitched together from a variety of sources (detecting which is an NP-complete problem) ranking for Google Trends results. This is particularly hard to stop because for new news items there are fewer fresh sources available to choose from; this results in search engine poisoning 17. Any site that receives a large amount of traffic from this will eventually be visited manually by a Google employee, and penalised manually 18.

Google's solution to the very similar machine translation problem is interesting: rather than attempting to build AI, they use their massive resources and the data stored from web pages and user queries to build a reliable statistical engine — their approach isn't necessarily far smarter than their competitors', but their resources make them the best translator out there.

13 See Detecting phrase-level duplication on the world wide web by Microsoft Research employees
14 http://searchenginewatch.com/2156601
15 http://tech.fortune.cnn.com/2011/02/17/is-something-wrong-with-google/
16 http://en.wikipedia.org/wiki/Natural_language_understanding
17 http://igniteresearch.net/spam-in-poisoned-world-cup-results/
18 http://www.google.co.uk/search?q=Google+Spam+Recognition+Guide+for+Quality+Rater
2.3 Competitors

Although not a classic computer science problem, a big limit on how search engines can treat possible spam is that competitors could attempt to make your website look like it was spamming, lowering your ranking and increasing theirs. For example, if your website suddenly receives an influx of low-quality links from sites known to link to spam, how would Google know whether you naively ordered this or a competitor did? This is an unsolvable problem, short of non-stop surveillance of all website owners. This is what Google has to say on the matter 19:

There's almost nothing a competitor can do to harm your ranking or have your site removed from our index. If you're concerned about another site linking to yours, we suggest contacting the webmaster of the site in question. Google aggregates and organizes information published on the web; we don't control the content of these pages.

I can say from experience that Google bowling most certainly does happen, and there are a couple of experiments written up on the web 20, though it would be very difficult to Google bowl a popular website. Essentially, if a small percentage of the links to a site are most likely spam they are just ignored; if a large percentage are likely spam then the links may result in a penalty rather than just being ignored. It seems likely that poor-quality links are increasingly being ignored. The paper Link Spam Alliances from Stanford, the Google founders' alma mater, discusses methods, now somewhat dated, of detecting and punishing potential link spam. Note that link spam isn't the only way that sites can potentially be Google bowled: if your competitor fills your comment section with duplicate content about organ enlargement and links to known phishing sites, it is unlikely to help your rankings. Google now also takes into account users choosing to block sites from results 21, presumably with a negative effect.

3 Ranking Factors

Google engineers update their algorithms daily 22. They then run many tests to check they have the right balance between all these factors. The following is from an interview with Google's Udi Manber.

Q: How do you determine that a change actually improves a set of results?
A: We ran over 5,000 experiments last year. Probably 10 experiments for every successful launch. We launch on the order of 100 to 120 a quarter.

19 http://www.google.com/support/webmasters/bin/answer.py?answer=34449
20 http://bit.ly/jEKzMa
21 http://googlewebmastercentral.blogspot.com/2011/04/high-quality-sites-algorithm-goes.html
22 http://www.nytimes.com/2007/06/03/business/yourmoney/03google
We have dozens of people working just on the measurement part. We have statisticians who know how to analyze data, we have engineers to build the tools. We have at least 5 or 10 tools where I can go and see here are 5 bad things that happened. Like this particular query got bad results because it didn't find something, or the pages were slow, or we didn't get some spell correction.

I have created a spreadsheet that shows how a search engine may calculate the ranking of a trivial set of documents for a particular query; you can view it and try changing things yourself at http://igniteresearch.net/poodle-a-simple-emulation-of-search-engine-ranking-factors/. (A small code sketch in the same spirit appears at the end of section 3.2.)

3.1 On Page Factors

• Keywords
Repetitions of the words in the query in the document, particularly in key areas such as the title and headers, are positive signals of relevance. The proximity of the words to each other is important, particularly having the exact query in the document. A very large number of repetitions, particularly in non-grammatical sentences, can be a negative signal of spam. Presence of the query words in the domain and URL are useful signals of relevance. Phrases related to the query are also positive signals of relevance (see Latent Semantic Indexing). The meta keywords HTML tag, <meta name="keywords" content="my, keywords">, is largely ignored by modern search engines 23.

• Quality
A number of different authors on a website, good grammar and spelling, and long pages written at reasonable time intervals are positive signs of high-quality content 24.

• Geographical Locality
Mentions of an address close to the user show the document may be geographically relevant to the user, particularly for geographically sensitive queries such as "plumbers in london".

• Freshness
For time-dependent queries, such as news events, recent pages are more likely to be helpful to the user. See Google's Quality Deserves Freshness drive, of which Google's faster-indexing Caffeine update was a part.

• Duplicate Content
Large percentages of content duplicated either from the same site or from others are an indicator of poor-quality content, and users will only want to see the canonical copy.

23 See http://googlewebmastercentral.blogspot.com/2009/09/google-does-not-use-keywords-meta-tag.html
24 See http://www.seobythesea.com/?p=541
• Adverts
A very large number of adverts can reduce the user experience, and affiliate links are often associated with heavily SEO-manipulated websites.

• Outbound Links
Links to spammy or phishing websites, or an unusually large number of outbound links across a number of pages, are common indicators of a page that users will not want to visit 25.

• Spam
An unusual repetition of keywords, particularly outside of sentences, is a sign of spam. Techniques such as hidden text and sneaky JavaScript redirects are relatively easy to detect and punish.

3.2 Off Page Factors

• Site Reliability
Unreliable or slow sites provide a poor user experience, and so will have a penalty applied. You can be warned if this happens if you sign up for Google Webmaster Tools 26.

• Popularity of the Site
From aggregated ISP data that search engines buy, and from search traffic 27.

• Incoming Links / PageRank
The link structure of the internet is a useful pointer to a website's popularity. Anchor text on incoming links related to the query shows a search engine the page is related to the query. Links that remain for a long time, from sites that have many links pointing to themselves, are rated highly. Links that are in boilerplate areas or are sitewide may be ignored. Links that are all identical in anchor text (i.e. blatantly machine generated), from spammy websites (bad neighbourhoods 28), thought to be paid for with the intention of manipulating rankings, or spam can result in penalties. Links from sites that are most likely owned by the same owner, detected either from Whois data or from the sites being hosted within the same Class C IP range, are likely considered less reliable signals of importance. A normal rate of growth of incoming links is expected, as opposed to bursty starts and stops 29 that indicate link building campaigns 30.

25 See Improving Web Spam Classifiers Using Link Structure for a very interesting Yahoo patent on detecting spam based on the number of inbound and outbound links
26 See http://www.mattcutts.com/blog/site-speed/
27 See http://trends.google.com/websites?q=bing.com&geo=all&date=all and http://www.compete.com
28 See http://www.google.com/support/webmasters/bin/answer.py?answer=35769
29 See http://www.seobook.com/link-growth-profile
30 See http://www.wolf-howl.com/seo/google-patent-analysis/
• Other indirect signals of a website's popularity
Other data can include mentions in chats, emails and social networks.

• Links from trusted websites
Proximity on the web graph to important, trusted sites. Links from old, high-PageRank websites at the centre of the old, heavily interconnected internet are useful signals that a website can be trusted and is important 31.

• Links from other sites that rank for the query
Results may be reordered based on how they link to each other.

• Geographical Location
If the geographical location of the server, the website's location according to directories, its top-level domain or the location set in Google Webmaster Tools match that of the user, it is a signal that the page will be more relevant to the user, particularly for location-sensitive searches.

• User Click Data
If users often search again after clicking on the site's result, that is an indicator that the page is not a good match for the query. The personal history of results clicked, and the pattern of related searches, may help indicate what a user is looking for 32.

• Domain Information
Older domains are likely trusted more. Google is a domain registrar, so it has extensive Whois information, and it validates that the address information associated with domains is correct.

• Manual Reviews
Google Quality Raters 33 manually review websites and tag them with categories such as "essential to query", "not relevant to query" or "spam".

31 See http://www.touchgraph.com/seo and type in http://www.nasa.gov for a visual graph
32 See http://www.seobythesea.com/?p=334
33 See http://searchengineland.com/the-google-quality-raters-handbook-13575
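As a toy illustration of how a handful of the factors above might be combined — in the same spirit as the ranking spreadsheet mentioned at the start of this chapter — here is a short Python sketch. Every signal, weight and threshold below is invented purely for illustration; real engines combine hundreds of signals with weights tuned against test queries, and nothing here is Google's actual formula.

from dataclasses import dataclass

@dataclass
class PageSignals:
    """A handful of invented on-page and off-page signals for one page."""
    title_contains_query: bool
    body_keyword_count: int
    word_count: int
    incoming_links: int        # crude stand-in for PageRank
    spammy_links_ratio: float  # fraction of incoming links judged spammy, 0.0 - 1.0
    load_time_seconds: float

# Invented weights -- a real engine tunes values like these experimentally.
WEIGHTS = {"title": 3.0, "keyword_density": 10.0, "links": 1.0, "speed_penalty": 0.5}

def score(page: PageSignals) -> float:
    """Combine a few signals into a single relevance-ish number."""
    density = page.body_keyword_count / max(page.word_count, 1)
    density = min(density, 0.05)                      # heavy repetition stops helping
    link_score = page.incoming_links * (1.0 - page.spammy_links_ratio)
    return (
        WEIGHTS["title"] * page.title_contains_query
        + WEIGHTS["keyword_density"] * density
        + WEIGHTS["links"] * link_score ** 0.5        # diminishing returns on links
        - WEIGHTS["speed_penalty"] * max(page.load_time_seconds - 2.0, 0.0)
    )

if __name__ == "__main__":
    honest = PageSignals(True, 12, 600, 40, 0.05, 1.2)
    stuffed = PageSignals(True, 300, 900, 400, 0.9, 4.0)
    print(round(score(honest), 2), round(score(stuffed), 2))

The point of the sketch is the shape of the calculation: keyword stuffing is capped, spammy links are discounted, link value has diminishing returns, and slow pages are penalised — each of which mirrors a factor described above.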
3.3 Google PageRank Notes

Google's PageRank was the innovation that propelled Google to the top of the search engine pile. Whilst its implementation has changed much since its original description, and many other factors are now taken into account, it is still at the heart of modern search engines, so some extra notes are made on it here.

3.3.1 Short Description

The key point is that PageRank considers each link a vote, and links from pages which have many links pointing to themselves are considered more important. Or as Google puts it:

PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results. PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value.

3.3.2 Mathematical Description

It's not essential to have a mathematical understanding of how PageRank is calculated, but for those familiar with basic graph theory and algebra it is useful. You may wish to skip this section and read a slightly less mathematical description 34. For a more complete treatment of the mathematics see the original PageRank paper 35, Deeper Inside PageRank by Amy N. Langville and Carl D. Meyer, and this thesis 36.

The following is summarised from Sketching Landscapes of Page Farms 37 by Bin Zhou and Jian Pei:

The Web can be modeled as a directed Web graph G = (V, E), where V is the set of Web pages and E is the set of hyperlinks. A link from page p to page q is denoted by the edge p → q. An edge p → q can also be written as a tuple (p, q). PageRank measures the importance of a page p by considering how collectively other Web pages point to p, directly or indirectly. Formally, for a Web page p, the PageRank score is defined as:

PR(p) = (1 − d)/N + d · Σ_{p_i ∈ M(p)} PR(p_i) / OutDeg(p_i)

where M(p) = { q | q → p } is the set of pages having a hyperlink pointing to p, OutDeg(p_i) is the out-degree of p_i (i.e. the number of hyperlinks from p_i pointing to pages other than p_i), N = |V| is the total number of pages, and d is a damping factor (0.85 in the original PageRank implementation) which models the random transitions of the web. If a damping factor of 0.5 is used, then at each page there is a 50/50 chance of the surfer clicking a link or jumping to a random page on the internet. Without the damping factor, the PageRank of any page with an outgoing link would be 0.

To calculate the PageRank scores for all pages in a graph, one can assign a random initial PageRank score to each node in the graph, then apply the above equation iteratively until the PageRank scores in the graph converge (see the short sketch below).

34 See the introductions of http://www.sirgroane.net/google-page-rank/, http://www.webworkshop.net/pagerank.html or the Wikipedia article
35 At http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf
36 http://web.engr.oregonstate.edu/~sheldon/papers/thesis.pdf
37 See http://www.cs.sfu.ca/~bzhou/personal/paper/sdm07_page_farm.pdf
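The following is a minimal sketch of that iterative (power-iteration) computation in plain Python, using d = 0.85 and a made-up three-page web. It is the textbook formulation only: dangling pages are simply ignored, and none of the sparse-matrix machinery a real index needs is shown.

def pagerank(graph, d=0.85, iterations=50):
    """Compute PageRank for a graph given as {page: [pages it links to]}.

    Implements PR(p) = (1 - d)/N + d * sum(PR(q) / out_degree(q)) over the
    pages q that link to p, iterated until the scores settle.
    Pages with no outlinks (dangling pages) are simply ignored here.
    """
    pages = list(graph)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}          # start from a uniform guess

    for _ in range(iterations):
        new_ranks = {}
        for p in pages:
            incoming = sum(
                ranks[q] / len(graph[q])          # q shares its rank over its outlinks
                for q in pages
                if p in graph[q]                  # q links to p
            )
            new_ranks[p] = (1.0 - d) / n + d * incoming
        ranks = new_ranks
    return ranks

if __name__ == "__main__":
    # A tiny made-up web: A and B link to each other and to C; C links back to A.
    toy_web = {
        "A": ["B", "C"],
        "B": ["A", "C"],
        "C": ["A"],
    }
    for page, score in sorted(pagerank(toy_web).items()):
        print(page, round(score, 3))

On this toy graph, A ends up with the highest score, since every other page links to it; the scores sum to roughly 1, matching the formulation above.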
The Google toolbar shows PageRank on a logarithmic scale out of 10, not the actual internal data. For example:

Domain        Calculated PageRank    PageRank displayed in Toolbar
small.com     47                     2
medium1.com   54093                  5
medium2.com   84063                  5
big.com       1234567                7
big2.com      2364854                7

3.3.3 Interesting Notes on the Original Implementation of PageRank

From PageRank Uncovered 38, essential reading for those looking to understand PageRank from an SEO perspective:

• PageRank is a multiplier, applied after relevant results are found
Remember, PageRank alone cannot get you high rankings. We've mentioned before that PageRank is a multiplier; so if your score for all other factors is 0 and your PageRank is twenty billion, then you still score 0 (last in the results). This is not to say PageRank is worthless, but there is some confusion over when PageRank is useful and when it is not. This leads to many misinterpretations of its worth. The only way to clear up these misinterpretations is to point out when PageRank is not worthwhile. If you perform any broad search on Google, it will appear as if you've found several thousand results. However, you can only view the first 1000 of them. Understanding why this is so explains why you should always concentrate on on-page factors and anchor text first, and PageRank last.

• Each page is born with a small amount of PageRank
A page that is in the Google index has a vote, however small. Thus, the more pages you have in the index, the more overall vote you are likely to have. Or, simply put, bigger sites tend to hold a greater total amount of PageRank within their site (as they have more pages to work with).

Note that Google's original algorithm has most likely been amended since, to detect and reduce PageRank hoarding and the generation of PageRank by massive interlinking on auto-generated pages.

38 See http://www.bbs-consultant.net/IMG/pdf_PageRank.pdf
Also, for quicker calculations, an approximation of PageRank which only gives certain seed pages PageRank may be used 39. Interestingly, however, there are examples of this working; see "How to get billions of pages indexed in Google" at http://www.threadwatch.org/node/6999. In a related issue, at one point 10% of MSN Search's (now known as Bing) German index was computer-generated content on a single domain 40.

3.3.4 Optimal Linking Strategies

Deciding how to interlink pages that you own or have influence over is tricky; interlinking can be a good signal that pages are related and on a certain topic, can build PageRank, and can control PageRank flow. However, heavy interlinking can be a signal of manipulation and spam, and different linking structures can make different sites in your possession rank higher. The mathematics gets tricky fast; here is a quick overview of the literature today:

• Note from Web Spam Taxonomy
Though written about spam farms, the maths holds true for good commercial sites too. Essentially this states that maximum PageRank for a target page is achieved by linking only to the target page from forums, blogs etc., and then interlinking the network of sites owned (since if there are no outlinks on a page, the random surfer will jump to a random page on the Internet).

1. Inaccessible pages are those that a spammer cannot modify. These are the pages out of reach; the spammer cannot influence their outgoing links. (Note that a spammer can still point to inaccessible pages.)
2. Accessible pages are maintained by others (presumably not affiliated with the spammer), but can still be modified in a limited way by a spammer. For example, a spammer may be able to post a comment to a blog entry, and that comment may contain a link to a spam site.
3. Own pages are maintained by the spammer, who thus has full control over their contents.

We can observe how the presented structure maximizes the total PageRank score of the spam farm, and of page t in particular:

1. All available n own pages are part of the spam farm, maximizing the static PageRank score component.
2. All m accessible pages point to the spam farm, maximizing the incoming PageRank score component.

39 For more on why this shouldn't work see http://www.pagerank.dk/Pagerank/Generate-pagerank.htm
40 See http://research.microsoft.com/pubs/65144/sigir2005.pdf
3. Links pointing outside the spam farm are suppressed, making the outgoing PageRank component PR_out zero.
4. All pages within the farm have some outgoing links, rendering a zero PR_sink score component.

Within the spam farm, the score of page t is maximal because:

1. All accessible and own pages point directly to the target, maximizing its incoming score PR_in(t).
2. The target points to all other own pages. Without such links, t would have lost a significant part of its score (PR_sink(t) > 0), and the own pages would have been unreachable from outside the spam farm.

Note that it would not be wise to add links from the target to pages outside the farm, as those would decrease the total PageRank of the spam farm.

• From Link Spam Alliances
The analysis that we have presented shows how the PageRank of target pages can be maximized in spam farms. Most importantly, we find that there is an entire class of farm structures that yield the largest achievable target PageRank score. All such optimal farm structures share the following properties:

1. All boosting pages point to and only to the target.
2. All hijacked pages point to the target.
3. There are some links from the target to one or more boosting pages.

• From Maximizing PageRank via Outlinks
In this paper we provide the general shape of an optimal link structure for a website in order to maximize its PageRank. This structure, with a forward chain and every possible backward link, may not be intuitive. To our knowledge, it has never been mentioned, while topologies like a clique, a ring or a star are considered in the literature on collusion and alliances between pages. Moreover, this optimal structure gives new insight into the affirmation of Bianchini et al. that, in order to maximize the PageRank of a website, hyperlinks to the rest of the web graph should be in pages with a small PageRank and that have many internal hyperlinks. More precisely, we have seen that the leaking pages must be chosen with respect to the mean number of visits before zapping that they give to the website, rather than their PageRank.

• From The Effect of New Links on PageRank, by Xie
Theorem: The optimal linking strategy for a Web page is to have only one outgoing link, pointing to a Web page with a shortest mean first passage time back to the original page.
Conclusions: ... We conclude that having no outgoing link is a bad policy and that the best policy is to link to pages from the same Web community. Surprisingly, a new incoming link might not be good news if a page that points to us gives many other irrelevant links at the same time.

Reading this paper fully, it is only in very particular circumstances that a new incoming link is not good news.
3.3.5 Implementation to make computing PageRank faster

There have been a number of proposed improvements to the original PageRank algorithm to improve the speed of calculation 41, and to adapt it to be better at determining quality results. No search engine calculates PageRank as shown in the naive algorithm in the original paper 42.

3.3.6 HITS

HITS is another ranking algorithm that takes into account the pattern of links found throughout the web; it was released just before PageRank, in 1999. HITS treats some pages on the web as authorities, which are good documents on a topic, and others as hubs, which mostly link to authorities. A page is given a high authority score by being linked to by pages that are recognized as hubs for information. A page is given a high hub score by linking to nodes that are considered to be authorities on the subject. Unlike PageRank, which is query-independent and so computed at indexing time, HITS hub and authority scores are query-dependent and so computed (though likely cached) at query time. A minimal sketch of the hub/authority iteration follows below.

41 For example, see Computing PageRank using Power Extrapolation and Efficient PageRank Approximation via Graph Aggregation
42 Matt Cutts discusses a couple of the implementation details at http://www.mattcutts.com/blog/more-info-on-pagerank/
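Here is a minimal Python sketch of the standard hub/authority iteration. The graph is made up, and a real deployment would run this only over the query-specific subgraph rather than the whole web; that step is skipped here.

from math import sqrt

def hits(graph, iterations=50):
    """Compute hub and authority scores for {page: [pages it links to]}.

    Authority(p) grows with the hub scores of pages linking to p;
    Hub(p) grows with the authority scores of the pages p links to.
    """
    pages = list(graph)
    hubs = {p: 1.0 for p in pages}
    auths = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum of hub scores of pages pointing at p.
        auths = {p: sum(hubs[q] for q in pages if p in graph[q]) for p in pages}
        # Hub update: sum of authority scores of the pages p points at.
        hubs = {p: sum(auths[q] for q in graph[p]) for p in pages}
        # Normalise so the scores do not grow without bound.
        a_norm = sqrt(sum(v * v for v in auths.values())) or 1.0
        h_norm = sqrt(sum(v * v for v in hubs.values())) or 1.0
        auths = {p: v / a_norm for p, v in auths.items()}
        hubs = {p: v / h_norm for p, v in hubs.items()}
    return hubs, auths

if __name__ == "__main__":
    # Toy graph: a "directory" page links out to two content pages,
    # which link to each other.
    toy_web = {
        "directory": ["guide", "reference"],
        "guide": ["reference"],
        "reference": ["guide"],
    }
    hubs, auths = hits(toy_web)
    print({p: round(v, 2) for p, v in hubs.items()})
    print({p: round(v, 2) for p, v in auths.items()})

In this toy graph the "directory" page earns its value as a hub (it only links out), while the two content pages pick up authority — the hub/authority split the section describes.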
3.3.7 Is linking out a good thing?

Whilst TEOMA is the only search engine that uses HITS at its core, its thinking has heavily influenced search engine designers, so it is likely that linking out to high-quality authorities can positively influence either a page's ranking (though potentially negatively, if designers want authorities rather than hubs to appear in their results 43), or the importance of the other links it contains. Many webmasters fear linking out to sites as they would rather keep links internal to prevent PageRank flowing out (many webmasters also nofollow links for similar reasons; note that this form of PageRank sculpting no longer works, according to Matt Cutts, Google's head of [anti] web spam). Matt Cutts also said a number of years ago:

Of course, folks never know when we're going to adjust our scoring. It's pretty easy to spot domains that are hoarding PageRank; that can be just another factor in scoring.

Some search engines are even concerned about people linking out too much: whilst crawlers can now index a large number of links on a page, a very large number of outbound links often indicates that a site has been hacked with spam links or is machine generated. A spammer might manually add a number of outgoing links to well-known pages, hoping to increase the page's hub score. At the same time, the most widespread method for creating a massive number of outgoing links is directory cloning 44.

3.3.8 TrustRank / Bad Page Rank

It's likely that after results are generated based on relevance, PageRank is applied to help order them, and then TrustRank to help order the results further. A site may lose trust every time it fails some kind of spam test (for example if a large number of reciprocal links are found, cloaking, duplicate content, or fake Whois data) and gain trust for certain properties (domain age, traffic, or being one of a number of important seed sites that are manually tagged as trusted). These initial trust scores could then be propagated in a similar way to PageRank, so linking to and from bad neighbourhoods would negatively affect the site's TrustRank through association 45.

From SEO By The Sea:

In 2004, a Yahoo whitepaper was published which described how the search engine might attempt to identify web spam by looking at how different pages linked to each other. That paper was mistakenly attributed to Google by a large number of people, most likely because Google was in the process of trademarking the term TrustRank around the same time, but for different reasons. Surprisingly, Google was granted a patent on something it referred to as Trust Rank in 2009, though the concept behind it was different than Yahoo's description of TrustRank. Instead of looking at the ways that different sites linked to each other, Google's Trust Rank works to have pages ranked according to a measure of the trust associated with entities that have provided labels for the documents.

43 See http://www.wolf-howl.com/seo/seo-case-study-outbound-links/ and Deeper Inside PageRank, discussed earlier
44 See Web Spam Taxonomy
45 See http://bakara.eng.tau.ac.il/semcomm/GKRT.pdf and http://www.freepatentsonline.com/7603350.html and http://www.cs.toronto.edu/vldb04/protected/eProceedings/contents/pdf/RS15P3.PDF
... If you've ever heard or seen the phrase TrustRank before, it's possible that whoever was writing about it, or referring to it, was discussing a paper titled Combating Web Spam with TrustRank (pdf). While the paper was the joint work of researchers from Stanford University and Yahoo!, many writers have attributed it to Google since its publication date in 2004. The confusion over who came up with the idea of TrustRank wasn't helped by Google trademarking the term TrustRank in 2005. That trademark was abandoned by Google on February 29, 2008, according to the records in the US PTO TESS database. However, a patent called Search result ranking based on trust, filed on May 9, 2006, deals with something called trust rank.

Google mentions distrust and trust changes as indicators. More than trust analysis, trust variation analysis is on the way. Fake reviews, sponsored blogs and e-commerce trust network influence are pointed out.

The paper A Cautious Surfer for PageRank comments on why TrustRank shouldn't be overused:

However, the goal of a search engine is to find good quality results; being spam-free is a necessary but not sufficient condition for high quality. If we use a trust-based algorithm alone to simply replace PageRank for ranking purposes, some good quality pages will be unfairly demoted and replaced, for example, by pages within the trusted seed sets, even though they may be much less authoritative. Considered from another angle, such trust-based algorithms propagate trust through paths originating from the seed set; as a result, some good quality pages may get low value if they are not well connected to those seeds.

3.3.9 Improvements to Google's ranking algorithms

There have been a number of notable algorithm changes which made considerable differences to results pages, though often the effects were later scaled back slightly.

• NoFollow
Matt Cutts and Jason Shellen created the nofollow specification to help limit the effect of, and incentive for, blog spam. If a search engine comes across a link tagged as nofollow, it will not treat the link as a vote, i.e. as a positive signal in rankings. Areas where untrusted users can post content are often tagged nofollow; roughly 80% of content management systems (the software that websites run on) implement nofollow. The HTML code of a nofollow link: <a href="signin.php" rel="nofollow">sign in</a>

• Increasing use of anchor text
Even the original PageRank algorithm took into account the anchor text of links, so links were used to give both a number indicating a site's popularity and information about the content of a document, and so its relevance for user queries.
• Google Bombing Prevention, 2nd February 2007
Google bombing is the process of massively linking to a page with specific anchor text, to give PageRank but, more importantly, indications that the document is related to the anchor text. For example, in 1999 a number of bloggers grouped together to link to Microsoft.com with the anchor text "more evil than Satan himself". This resulted in Microsoft being placed number one in searches for "more evil than Satan himself" despite not having the phrase anywhere on its page. Detecting a sudden influx of links with identical anchor text is very easy, and in 2007 Google changed their indexing structure so that Google bombs such as "miserable failure" would typically return commentary, discussions, and articles about the tactic itself. Matt Cutts said the Google bombs had not been "a very high priority for us. Over time, we've seen more people assume that they are Google's opinion, or that Google has hand-coded the results for these Google-bombed queries. That's not true, and it seemed like it was worth trying to correct that perception." 46 Some Google bombs still work, particularly those targeting unusual phrases, with varied anchor text, over a period of time, within paragraphs of text.

• Florida, November 2003
Results for highly commercial queries, likely identified from the cost of AdWords, became heavily filtered so that more trusted academic websites and less commercially optimised websites ranked. Some of these changes resulted in less relevance — for example, a user searching for "buy bricks" probably didn't want to mainly see websites about the process of creating bricks — and were rolled back. For more see 47 and 48.

• Bourbon, June 2005
A penalty was applied to sites with unusually fast or bursty patterns of link growth.

• Jagger, October 2005
A penalty was applied to sites with unusually large amounts of reciprocal links, along with new methods for detecting hidden text.

• Big Daddy, December 2005
According to Matt Cutts, the sites punished were those "where our algorithms had very low trust in the inlinks or the outlinks of that site. Examples that might cause that include excessive reciprocal links, linking to spammy neighborhoods on the web, or link buying/selling." 49

46 See http://answers.google.com/answers/main?cmd=threadview&id=179922
47 http://www.searchengineguide.com/barry-lloyd/been-gazumped-by-google-trying-to-make-sense-of-the-florida-update.php
48 http://www.seoresearchlabs.com/seo-research-labs-google-report.pdf
49 See http://www.webworkshop.net/googles-big-daddy-update.html
• Caffeine, October 2010
A faster indexing system that changed results little, but allowed for fresher results and some of the later Panda updates 50.

• Panda, April 2011
A penalty applied to content deemed low quality, detected primarily from user data. Websites which contained masses of articles, focusing on quantity over quality, were often hit 51.

4 Detecting Spam and Manipulation

You will often hear that your site has to look "natural" to the search engines. Just what natural means is hard to define, but essentially it means the profile of a site whose popularity was never engineered or promoted, and was instead based on people luckily coming across it and deciding to recommend it to their friends with links. What's more, you also need to make your site look popular: creating no links to your site yourself will look natural, but you will have no chance of competing with people who do, unless you have the cash to buy large amounts of advertising. This section briefly covers what search engines consider to be acceptable, when and how they can detect violations, and what the potential penalties are.

4.1 Google Webmaster Guidelines

Google have created a page called Webmaster Guidelines to inform users of what they consider to be acceptable methods of promoting your website. Whilst the lines for crossing general principles such as "Would I do this if search engines didn't exist?" are somewhat vague, they do offer some specific notes on what not to do:

• Avoid hidden text or hidden links.
• Don't use cloaking or sneaky redirects.
• Don't send automated queries to Google.
• Don't load pages with irrelevant keywords.
• Don't create multiple pages, subdomains, or domains with substantially duplicate content.

50 See http://googleblog.blogspot.com/2010/06/our-new-search-index-caffeine.html
51 See http://blog.searchmetrics.com/us/2011/04/12/googles-panda-update-rolls-out-to-uk/ and http://www.seobook.com/questioning-questions and http://googlewebmastercentral.blogspot.com/2011/05/more-guidance-on-building-high-quality.html
• Don't create pages with malicious behaviour, such as phishing or installing viruses, trojans, or other badware.
• Avoid doorway pages created just for search engines, or other cookie-cutter approaches such as affiliate programs with little or no original content.
• If your site participates in an affiliate program, make sure that your site adds value. Provide unique and relevant content that gives users a reason to visit your site first.

Most of the methods listed above are naive and easy to detect. Google have been fairly successful in aligning successful manipulation with creating genuine content, though without any promotion it is unlikely even the best content will be noticed.

4.2 Penalties

The penalties 52 that Google applies to detected manipulation vary in length of time and effect, from small ranking penalties for certain keywords on a page to site-wide bans, depending upon the sophistication of the manipulating methods and the quality of the offending site. If you believe you have had one applied, you can submit a Google Reconsideration Request (http://www.google.com/support/webmasters/bin/answer.py?answer=35843) from Google Webmaster Tools, once you have fixed the offending issues.

4.3 Detecting Manipulation in Content

There is a fascinating paper by Microsoft which details a number of methods for detecting spam pages in a search engine's index based on their content. A simple way is to use Bayesian filters (one is included with Ignite SEO to test your content as the search engines would); for example, seeing the phrase "buy pills" would be a strong indicator of spam. Most of the research is on detecting blatantly computer-generated lists of keywords, which is fairly easy to do. Detecting the quality of human-written content is very difficult, so unless you are endlessly repeating your keywords, if you are writing your own content you can be reasonably happy with its quality in a search engine's eyes. For illustrative graphs, see Detecting Spam Web Pages through Content Analysis 53 by Microsoft Research employees. (A minimal sketch of the Bayesian-filter idea follows below.)

52 http://www.forbes.com/2007/04/29/sanar-google-skyfacet-tech-cx_ag_0430googhell.html
53 http://cs.wellesley.edu/~cs315/Papers/Ntoulas-DetectingSpamThroughContentAnalysis.pdf
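The following Python sketch shows the general Bayesian-filter idea mentioned above: a tiny naive Bayes classifier over word counts with Laplace smoothing. The training snippets are invented for illustration; this is neither the Ignite SEO filter nor the classifier from the Microsoft paper, just the mechanics of the technique.

import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayesSpamFilter:
    """Word-count naive Bayes: estimate P(spam | words) via Bayes' rule,
    with add-one (Laplace) smoothing so unseen words don't zero things out."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.counts[label].update(tokenize(text))
        self.docs[label] += 1

    def spam_probability(self, text):
        vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        spam_total = sum(self.counts["spam"].values())
        ham_total = sum(self.counts["ham"].values())
        # Start from the (log) prior odds of spam vs ham.
        log_odds = math.log(max(self.docs["spam"], 1)) - math.log(max(self.docs["ham"], 1))
        for word in tokenize(text):
            p_spam = (self.counts["spam"][word] + 1) / (spam_total + len(vocab))
            p_ham = (self.counts["ham"][word] + 1) / (ham_total + len(vocab))
            log_odds += math.log(p_spam) - math.log(p_ham)
        return 1.0 / (1.0 + math.exp(-log_odds))

if __name__ == "__main__":
    f = NaiveBayesSpamFilter()
    f.train("buy cheap pills online casino bonus", "spam")
    f.train("cheap pills no prescription buy now", "spam")
    f.train("our driving school offers lessons in the city centre", "ham")
    f.train("book a driving lesson with a qualified instructor", "ham")
    print(round(f.spam_probability("buy cheap pills"), 2))          # high
    print(round(f.spam_probability("driving lessons in the city"), 2))  # low

With even this toy training set, phrases like "buy cheap pills" score as very likely spam while ordinary service copy does not, which is exactly the kind of signal the text describes.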
4.4 Detecting Manipulation in Links

Much research has focused on detecting spam pages through their backlinks or outlinks. Yahoo obtained a patent that uses the rate of link growth to detect manipulation. Essentially, a constant rate of new backlinks, perhaps with a small growth over time, is expected for a typical site. A saw-tooth pattern of inlinks is a strong indicator of backlink campaigns that start and stop (though it could also be an indicator of, say, a site that releases new software monthly).

In their paper, Fetterly et al. analyse the indegree (incoming/backlinks) and outdegree (links on the page) distributions of web pages:

Most web pages have in- and outdegrees that follow a power-law distribution. Occasionally, however, search engines encounter substantially more pages with the exact same in- or outdegrees than what is predicted by the distribution formula. The authors find that the vast majority of such outliers are spam pages.

As discussed in the TrustRank section earlier, a large number of links from sites that have already been detected as linking to spam (so-called untrustworthy hubs) is a negative indicator. Links from unrelated websites, reciprocal links, links placed outside of content, and links from sites known to host paid links, along with many other signals, are likely taken into consideration. Zhang et al. have identified a method for identifying unusually highly interconnected groups of web pages. More methods of identifying manipulative sites are listed in Link Spam Alliances by Gyöngyi and Garcia-Molina.

4.5 Other Methods

If you think a competitor has been using methods that violate the webmaster guidelines, you can report them to Google 54. It is good practice to ensure that any site you wish to keep for a long time, and expect to get reasonable amounts of traffic from, stays within the guidelines: Google will sometimes manually review websites without prompting, and Google Quality Raters inspect sites for relevance to results but can also tag web pages as spam. Particular markets are inspected more often than others.

54 https://www.google.com/webmasters/tools/spamreport?hl=en&pli=1
Part II
Practice

5 An Example Campaign

Now we've covered the theory, it's time for a real-world example of putting it into practice.

5.1 Company Profile

John runs a driving school in Springfield, Ohio. He has a website he has owned for a couple of years, which ranks around the second page for most searches related to driving schools in Ohio and receives about 20 visitors a day, a third from search engines and two thirds from links from local websites. A quick search for what he imagines would be his main keyword, "driving school Springfield Ohio", has a company directory site at the top, followed by other directories, companies, and people asking on forums for recommendations. This mix of relevant small companies' websites and small pages on big websites indicates the keyword is of medium difficulty to rank for.

5.2 Goals

John thinks that if he can get his site to rank 3rd instead of around the middle of the second page for his core keywords, he will increase his search traffic by around 1000%, his overall traffic by about 300%, and roughly double his sales. (Since search currently supplies about a third of his 20 daily visitors, an elevenfold increase in search traffic takes him from roughly 20 to around 85 visitors a day, approximately a 300% overall increase.) He aims to do this over a period of roughly one month.

5.3 Competitor Research

John finds his main competitors by searching, and gets estimates of their traffic sources using sites such as compete.com and serversiders.com. A tool such as Ignite SEO can automatically build SEO reports on competitors, listing their paid and organic keywords, demographics and backlinks. Looking at the HTML source code of some of his competitors reveals their targeted keywords in the <meta name="keywords" content="keyword1, keyword2"> tag.

5.4 Keyword Research

John takes his initial guesses at what potential customers might search for, plus those from his competitors and his existing traffic, and expands this list using the Google Keyword Tool 55 and Google Insights 56.

55 https://adwords.google.co.uk/select/KeywordToolExternal
56 http://www.google.com/insights
5.5 Content Creation

John takes his keywords and creates a small amount of content containing them on his website. He then creates a large amount of content quickly and builds sites hosted on free hosting services 57 with it, each one targeting a different keyword. The content generator section of Ignite SEO 58 is perfect for this.

5.6 Website Check

Before investing in off-site promotion (i.e. link building), it is worth performing a quick check that the site is search engine friendly. Creating an account in Google Webmaster Tools will let you know if Google has any issues indexing your website, and it is worth ensuring navigation isn't over-reliant on JavaScript or Flash.

5.7 Link Building

This is the core process that will actually improve John's rankings. By looking at his competitors' backlinks using Yahoo's linkdomain: command, John replicates their links to his website by visiting each site one by one. Using a tool such as Ignite SEO, he can automatically build links to the hosted sites he quickly created in 5.5, without the risk of a link campaign negatively affecting the rankings of his core website. Other signals of quality, such as Facebook and Twitter recommendations, are built here.

5.8 Analysis

The success of the campaign is measured with a good tracking system such as Google Analytics, as well as by tracking the new incoming links with Google Webmaster Tools and Yahoo's link: command. The results are compared with the goals, and the whole process is refined and repeated.

57 http://igniteresearch.net/which-web-2-0-ranks-best-hubpages-vs-squidoo-vs-tumblr-vs-blogspot-etc/
58 http://igniteresearch.net
About the Author

Christopher Doman is a partner at Ignite Research, a firm specialising in software and consultancy for search engine marketing. He holds a BA in Computer Science from the University of Cambridge.