2 December 2005 
Web Information Systems 
Web Search 
Prof. Beat Signer 
Department of Computer Science 
Vrije Universiteit Brussel 
http://www.beatsigner.com
December 5, 2014 Beat Signer - Department of Computer Science - bsigner@vub.ac.be 2 
Search Engine Result Pages (SERP)
Search Engine Result Pages (SERP) ...
Vertical Search Result Pages
Search Engine Market Share (2013)
Search Engine Result Page
• A variety of information is shown on a search engine result page (SERP)
  - organic search results
  - non-organic search results
  - meta-information about the result (e.g. number of result pages)
  - vertical navigation
  - advanced search options
  - query refinement suggestions
  - ...
Search Engine History
• Early "search engines" include various systems, starting with Bush's Memex
• Archie (1990)
  - first Internet search engine
  - indexing of files on FTP servers
• W3Catalog (September 1993)
  - first "web search engine"
  - mirroring and integration of manually maintained catalogues
• JumpStation (December 1993)
  - first web search engine combining crawling, indexing and searching
Search Engine History ...
• In the following two years (1994/1995) many new search engines appeared
  - AltaVista, Infoseek, Excite, Inktomi, Yahoo!, ...
• Two categories of early Web search solutions
  - full text search
    - based on an index that is automatically created by a web crawler in combination with an indexer
    - e.g. AltaVista or Infoseek
  - manually maintained classification (hierarchy) of webpages
    - significant human editing effort
    - e.g. Yahoo!
Information Retrieval
• Precision and recall can be used to measure the performance of different information retrieval algorithms

  $$\text{precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|}$$

  $$\text{recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|}$$

[Figure: document collection D1 to D10 with the relevant documents D3, D5, D8, D9; the query returns D1, D3, D8, D9, D10]

  $$\text{precision} = \frac{3}{5} = 0.6 \qquad \text{recall} = \frac{3}{4} = 0.75$$
Information Retrieval ...
• Often a combination of precision and recall, the so-called F-score (harmonic mean), is used as a single measure

  $$\text{F-score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

[Figure: query result D1, D3, D8, D9, D10: precision = 0.6, recall = 0.75, F-score = 0.67]
[Figure: query result D1, D2, D3, D5, D8, D9, D10: precision = 0.57, recall = 1, F-score = 0.73]
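The three measures can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function name `precision_recall_f` is hypothetical, and the relevant set {D3, D5, D8, D9} is an assumption read off the example figure (it reproduces the numbers shown on the slides).

```python
def precision_recall_f(relevant, retrieved):
    """Return (precision, recall, F-score) for two sets of document ids."""
    hits = len(relevant & retrieved)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_score

# Assumed example data (read off the figure, not stated explicitly in the text).
relevant = {"D3", "D5", "D8", "D9"}
result_a = {"D1", "D3", "D8", "D9", "D10"}
result_b = result_a | {"D2", "D5"}

for result in (result_a, result_b):
    p, r, f = precision_recall_f(relevant, result)
    print(round(p, 2), round(r, 2), round(f, 2))
# 0.6 0.75 0.67
# 0.57 1.0 0.73
```

The second query returns every relevant document (recall 1) at the price of a lower precision, and the F-score still prefers it.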
Boolean Model
• Based on set theory and boolean logic
• Exact matching of documents to a user query
• Uses the boolean AND, OR and NOT operators
  - query: Shopping AND Ghent AND NOT Delhaize
  - computation: 101110 AND 100111 AND 000111 = 000110
  - result: document set {D4,D5}
[Figure: inverted index shown as a term-document matrix with the terms Bank, Delhaize, Ghent, Metro, Shopping and Train as rows and the documents D1 to D6 as columns; a 1 marks a term that occurs in a document]
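The query evaluation can be sketched with integer bitmasks, one bit per document. This is an illustrative sketch only: the names `INDEX`, `DOCS` and `docs_of` are hypothetical, and only the three rows that appear in the computation are included (the Delhaize row is taken as the complement of the 000111 used for NOT Delhaize).

```python
DOCS = ["D1", "D2", "D3", "D4", "D5", "D6"]

# One bit per document, leftmost bit = D1.
INDEX = {
    "Shopping": 0b101110,
    "Ghent":    0b100111,
    "Delhaize": 0b111000,  # complement of 000111
}
ALL_DOCS = (1 << len(DOCS)) - 1  # 111111, used to keep NOT within six bits

def docs_of(bits):
    """Translate a bitmask back into a list of document names."""
    n = len(DOCS)
    return [DOCS[i] for i in range(n) if bits & (1 << (n - 1 - i))]

# Shopping AND Ghent AND NOT Delhaize
result = INDEX["Shopping"] & INDEX["Ghent"] & (~INDEX["Delhaize"] & ALL_DOCS)
print(format(result, "06b"), docs_of(result))  # 000110 ['D4', 'D5']
```

Each operator maps to a single bitwise instruction over a whole row of the index.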
Boolean Model ...
• Advantages
  - relatively easy to implement and scalable
  - fast query processing based on parallel scanning of indexes
• Disadvantages
  - does not pay attention to synonymy
    - different words with similar meaning
  - does not pay attention to polysemy
    - a single word with different meanings
  - no ranking of output
  - often the user has to learn a special syntax such as the use of double quotes to search for phrases
• Variants of the boolean model form the basis of many search engines
Vector Space Model
• Algebraic model representing text documents and queries as vectors based on the index terms
  - one dimension for each term
• Compute the similarity (angle) between the query vector and the document vectors
• Advantages
  - simple model based on linear algebra
  - partial matching with relevance scoring for results
  - potential query re-evaluation based on user relevance feedback
• Disadvantages
  - computationally expensive (similarity measures for each query)
  - limited scalability
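The similarity computation can be sketched as follows. This is a minimal sketch under assumptions that are not in the slides: raw term counts are used as vector weights, the three documents are made up for illustration, and the helper names `vectorise` and `cosine` are hypothetical.

```python
import math
from collections import Counter

def vectorise(text):
    """Represent a text as a sparse term-count vector (one dimension per term)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine of the angle between two sparse vectors (1.0 = same direction)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical mini collection.
docs = {
    "D1": "shopping in ghent",
    "D2": "train to ghent",
    "D3": "bank metro",
}
query = vectorise("shopping ghent")

# Rank documents by decreasing similarity to the query.
ranking = sorted(docs, key=lambda d: cosine(query, vectorise(docs[d])), reverse=True)
print(ranking)  # ['D1', 'D2', 'D3']
```

Unlike the boolean model, D2 still receives a (lower) score although it matches only one of the two query terms.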
Web Search Engines
• Most web search engines are based on traditional information retrieval techniques, but these have to be adapted to deal with the characteristics of the Web
  - immense amount of web resources (>50 billion webpages)
  - hyperlinked resources
  - dynamic content with frequent updates
  - self-organised web resources
• Evaluation of performance
  - no standard collections
  - often based on user studies (satisfaction)
• Not only precision and recall but also the query answer time is an important issue
Web Search Engine Architecture
[Figure: architecture diagram with the components WWW, Crawler, URL Pool, Storage Manager, Page Repository, Indexers, Document Index, Special Indexes, URL Handler, URL Repository, Query Handler, Ranking and Client; annotations in the diagram: "content already added?", "filter", "normalisation and duplicate elimination", "inverted index"]
Web Crawler
• A web crawler or spider is used to create an index of webpages to be used by a web search engine
  - any web search is then based on this index
• A web crawler has to deal with the following issues
  - freshness
    - the index should be updated regularly (based on webpage update frequency)
  - quality
    - since not all webpages can be indexed, the crawler should give priority to "high quality" pages
  - scalability
    - it should be possible to increase the crawl rate by just adding additional servers (modular architecture)
    - e.g. the estimated number of Google servers in 2007 was 1,000,000 (including not only the crawler but the entire Google platform)
Web Crawler ...
  - distribution
    - the crawler should be able to run in a distributed manner (computer centres all over the world)
  - robustness
    - the Web contains a lot of pages with errors and a crawler has to deal with these problems
    - e.g. deal with a web server that creates an unlimited number of "virtual web pages" (crawler trap)
  - efficiency
    - resources (e.g. network bandwidth) should be used as efficiently as possible
  - crawl rates
    - the crawler should pay attention to existing web server policies (e.g. revisit-after HTML meta tag or robots.txt file)

robots.txt:
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
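How a crawler can honour these rules may be sketched with Python's standard `urllib.robotparser` module. The crawler name `ExampleCrawler` and the `www.example.com` URLs are placeholders; a real crawler would load the file from the server (for instance with `set_url` and `read`) instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content shown above, parsed from a string for illustration.
rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Placeholder crawler name and URLs.
allowed = parser.can_fetch("ExampleCrawler", "http://www.example.com/index.html")
blocked = parser.can_fetch("ExampleCrawler", "http://www.example.com/tmp/cache.html")
print(allowed, blocked)  # True False
```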
Pre-1998 Web Search
• Find all documents for a given query term
  - use information retrieval (IR) solutions
    - boolean model
    - vector space model
    - ...
  - ranking based on "on-page factors"
  → problem: poor quality of search results (order)
• Larry Page and Sergey Brin proposed to compute the absolute quality of a page called PageRank
  - based on the number and quality of pages linking to a page (votes)
  - query-independent
Origins of PageRank
• Developed as part of an academic project at Stanford University
  - research platform to aid understanding of large-scale web data and enable researchers to easily experiment with new search technologies
• Larry Page and Sergey Brin worked on the project about a new kind of search engine (1995-1998) which finally led to a functional prototype called Google
[Photos: Larry Page, Sergey Brin]
PageRank
• A page Pi has a high PageRank Ri if
  - there are many pages linking to it
  - or, if there are some pages with a high PageRank linking to it
• Total score = IR score × PageRank
[Figure: link graph of the pages P1 to P8 with their PageRank values R1 to R8]
Basic PageRank Algorithm

  $$R(P_i) = \sum_{P_j \in B_i} \frac{R(P_j)}{L_j}$$

• where
  - $B_i$ is the set of pages that link to page $P_i$
  - $L_j$ is the number of outgoing links for page $P_j$
[Figure: three-page example P1, P2, P3 with the initial values 1, 1, 1 and the updated values 1.5, 1.5, 0.75]
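The numbers in the three-page example can be reproduced by hand. The following worked step is a sketch under two assumptions that are not stated on the slide: the links are P1 → P2, P2 → P1, P2 → P3 and P3 → P1 (so $L_1 = 1$, $L_2 = 2$, $L_3 = 1$), and each newly computed value is used immediately for the next page.

```latex
\begin{align*}
R(P_1) &= \frac{R(P_2)}{L_2} + \frac{R(P_3)}{L_3} = \frac{1}{2} + \frac{1}{1} = 1.5 \\
R(P_2) &= \frac{R(P_1)}{L_1} = \frac{1.5}{1} = 1.5 \\
R(P_3) &= \frac{R(P_2)}{L_2} = \frac{1.5}{2} = 0.75
\end{align*}
```

Under these assumptions the values already have the ratio 2 : 2 : 1, so a further round of updates leaves them unchanged.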
Matrix Representation
• Let us define a hyperlink matrix H

  $$H_{ij} = \begin{cases} 1/L_j & \text{if } P_j \in B_i \\ 0 & \text{otherwise} \end{cases}$$

[Figure: example graph with the pages P1, P2, P3]

  $$H = \begin{bmatrix} 0 & 1/2 & 1 \\ 1 & 0 & 0 \\ 0 & 1/2 & 0 \end{bmatrix} \quad \text{and} \quad R = [R(P_i)]$$

  $$R = HR \;\Rightarrow\; R \text{ is an eigenvector of } H \text{ with eigenvalue } 1$$
Matrix Representation ...
• We can use the power method to find R

  $$R_{t+1} = H R_t$$

  - sparse matrix H with 40 billion columns and rows but only an average of 10 non-zero entries in each column
• For our example

  $$H = \begin{bmatrix} 0 & 1/2 & 1 \\ 1 & 0 & 0 \\ 0 & 1/2 & 0 \end{bmatrix}$$

  this results in $R = [2 \;\; 2 \;\; 1]$ or, normalised, $[0.4 \;\; 0.4 \;\; 0.2]$
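The power method can be sketched in plain Python for the three-page example. The matrix below is the example H as reconstructed from the slide (rows and columns in the order P1, P2, P3); the function name `power_method`, the uniform start vector and the fixed number of iterations are illustrative choices, and a dense list of lists is only practical for such a toy example.

```python
def power_method(H, iterations=100):
    """Repeatedly apply R <- H R, starting from a uniform vector."""
    n = len(H)
    R = [1.0 / n] * n
    for _ in range(iterations):
        R = [sum(H[i][j] * R[j] for j in range(n)) for i in range(n)]
    return R

# Column j holds 1/L_j for every page that P_j links to.
H = [
    [0.0, 0.5, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.0],
]

R = power_method(H)
print([round(r, 2) for r in R])  # [0.4, 0.4, 0.2]
```

Because every column of this H sums to 1, the entries of R keep summing to 1 in every iteration, which gives the normalised result directly.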
Dangling Pages (Rank Sink)
• Problem with pages that have no outgoing links (e.g. P2)
[Figure: example graph with the pages P1 and P2]

  $$H = \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix} \quad \text{and} \quad R = [0 \;\; 0]$$

• Stochastic adjustment
  - if page $P_j$ has no outgoing links then replace each entry of column j with 1/n, where n is the number of pages

  $$C = \begin{bmatrix} 0 & 1/2 \\ 0 & 1/2 \end{bmatrix} \quad \text{and} \quad S = H + C = \begin{bmatrix} 0 & 1/2 \\ 1 & 1/2 \end{bmatrix}$$

• New stochastic matrix S always has a stationary vector R
  - can also be interpreted as a Markov chain
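The stochastic adjustment can be sketched as a small function that fills every all-zero column with a uniform value. The function name `stochastic_adjustment` is hypothetical; following the two-page example, a dangling column is filled with 1/n, where n is the number of pages.

```python
def stochastic_adjustment(H):
    """Return S = H + C: every all-zero column of H is replaced by 1/n."""
    n = len(H)
    S = [row[:] for row in H]  # copy so that H itself stays unchanged
    for j in range(n):
        if all(H[i][j] == 0 for i in range(n)):  # page j has no outgoing links
            for i in range(n):
                S[i][j] = 1.0 / n
    return S

# Two-page example: P1 links to P2, P2 is a dangling page.
H = [[0.0, 0.0],
     [1.0, 0.0]]
S = stochastic_adjustment(H)
print(S)  # [[0.0, 0.5], [1.0, 0.5]]
```

After the adjustment every column sums to 1, so the rank that reaches P2 is handed on instead of disappearing.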
Strongly Connected Pages (Graph)
• Add new transition probabilities between all pages
  - with probability d we follow the hyperlink structure S
  - with probability 1-d we choose a random page
  - matrix G becomes irreducible
• Google matrix G reflects a random surfer
  - no modelling of back button

  $$G = dS + (1-d)\,\frac{1}{n}\,\mathbf{1} \qquad R = GR$$

  where $\mathbf{1}$ is the n × n matrix of ones
[Figure: example graph with the pages P1 to P5; the added transitions are labelled 1-d]
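The construction of G and the computation of R = GR can be sketched as follows. The value d = 0.85 is an assumption (the slides do not fix d), the function names are hypothetical, and S is the two-page stochastic matrix from the dangling page example.

```python
def google_matrix(S, d=0.85):
    """G = d*S + (1-d)*(1/n)*ones, computed entry by entry."""
    n = len(S)
    return [[d * S[i][j] + (1 - d) / n for j in range(n)] for i in range(n)]

def pagerank(G, iterations=100):
    """Power method on G, starting from a uniform vector."""
    n = len(G)
    R = [1.0 / n] * n
    for _ in range(iterations):
        R = [sum(G[i][j] * R[j] for j in range(n)) for i in range(n)]
    return R

# Stochastic matrix of the two-page example (P1 links to P2, P2 was dangling).
S = [[0.0, 0.5],
     [1.0, 0.5]]

G = google_matrix(S)   # d = 0.85 is an assumed value
R = pagerank(G)
print([round(r, 3) for r in R])
```

Every entry of G is positive, so the random surfer can reach any page from any other page and the iteration converges regardless of the start vector.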
Examples

  $$G = dS + (1-d)\,\frac{1}{n}\,\mathbf{1}$$

[Figure: example graph with the PageRank values A1 = 0.26, A2 = 0.37, A3 = 0.37]
Examples ...

  $$G = dS + (1-d)\,\frac{1}{n}\,\mathbf{1}$$

[Figure: page groups A and B with the PageRank values A1 = 0.13, A2 = 0.185, A3 = 0.185, B1 = 0.13, B2 = 0.185, B3 = 0.185; P(A) = 0.5, P(B) = 0.5]
Examples
• PageRank leakage

  $$G = dS + (1-d)\,\frac{1}{n}\,\mathbf{1}$$

[Figure: page groups A and B with the PageRank values A1 = 0.10, A2 = 0.14, A3 = 0.14, B1 = 0.22, B2 = 0.20, B3 = 0.20; P(A) = 0.38, P(B) = 0.62]
Examples ...

  $$G = dS + (1-d)\,\frac{1}{n}\,\mathbf{1}$$

[Figure: page groups A and B with the PageRank values A1 = 0.3, A2 = 0.23, A3 = 0.18, B1 = 0.10, B2 = 0.095, B3 = 0.095; P(A) = 0.71, P(B) = 0.29]
Examples
• PageRank feedback

  $$G = dS + (1-d)\,\frac{1}{n}\,\mathbf{1}$$

[Figure: page groups A and B with the PageRank values A1 = 0.35, A2 = 0.24, A3 = 0.18, B1 = 0.09, B2 = 0.07, B3 = 0.07; P(A) = 0.77, P(B) = 0.23]
Examples ...

  $$G = dS + (1-d)\,\frac{1}{n}\,\mathbf{1}$$

[Figure: page groups A and B with the PageRank values A1 = 0.33, A2 = 0.17, A3 = 0.175, A4 = 0.125, B1 = 0.08, B2 = 0.06, B3 = 0.06; P(A) = 0.80, P(B) = 0.20]
Google Webmaster Tools
• Various services and information about a website
• Site configuration
  - submission of sitemap
  - crawler access
  - URLs of indexed pages
  - settings
    - e.g. preferred domain
• Your site on the web
  - search queries
  - keywords
  - internal and external links
Google Webmaster Tools ...
• Diagnostics
  - crawl rates and errors
  - HTML suggestions
• Use HTML suggestions for on-page factor optimisation
  - meta description
    - duplicate meta descriptions
    - too long meta descriptions
  - title tag
    - missing or duplicate title tags
    - too long or too short title tags
  - non-indexable content
• Similar tools offered by other search engines
  - e.g. Bing Webmaster Tools
XML Sitemaps
• List of URLs that should be crawled and indexed
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.tenera.ch/trommelreibe-classic-p-2259-l-de.html</loc>
    <lastmod>2013-07-06</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.4</priority>
  </url>
  <url>
    <loc>https://www.tenera.ch/universalmesser-weiss-p-34-l-de.html</loc>
    <lastmod>2012-12-05</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.1</priority>
  </url>
  ...
</urlset>
XML Sitemaps ...
• All major search engines support the sitemap format
• The URLs in a sitemap are not guaranteed to be added to a search engine's index
  - a sitemap helps search engines to find pages that are not yet indexed
• Additional metadata might be provided to search engines
  - relative page relevance (priority)
  - date of last modification (lastmod)
  - update frequency (changefreq)
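A sitemap with these metadata fields can be generated with Python's standard `xml.etree.ElementTree` module. This is a minimal sketch: the function name `build_sitemap`, the dictionary layout and the `www.example.com` entry are made up for illustration, and the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """Build a <urlset> document from a list of dicts with loc and optional metadata."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # Only loc is required; the other three fields are optional hints.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical entry for illustration.
xml_text = build_sitemap([
    {"loc": "https://www.example.com/", "lastmod": "2014-12-05",
     "changefreq": "weekly", "priority": 0.8},
])
print(xml_text)
```

The XML declaration is not part of the returned string; it would be added when the document is written to a file.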
Questions
• Is PageRank fair?
• What about Google's power and influence?
• What about Web 2.0 or Web 3.0 and web search?
  - "non-existent" webpages such as offered by Rich Internet Applications (e.g. using AJAX) may bring problems for traditional search engines (hidden web)
  - new forms of social search
    - Delicious
    - ...
  - social marketing
The Google Effect
• A recent study by Sparrow et al. shows that people are less likely to remember things that they believe to be accessible online
  - Internet as a transactive memory
• Does our memory work differently in the age of Google?
• What implications will the future of the Internet and new search have?
Search Engine Marketing (SEM)
• For many companies Internet marketing has become a big business
• Search engine marketing (SEM) aims to increase the visibility of a website
  - search engine optimisation (SEO)
  - paid search advertising (non-organic search)
  - social media marketing
• SEO should not be decoupled from a website's content, structure, design and used technologies
• SEO has to be seen as a continuous process in a rapidly changing environment
  - different search engines with regular changes in ranking
Structural Choices
• Keep the website structure as flat as possible
  - minimise link depth
  - avoid pages with much more than 100 links
• Think about your website's internal link structure
  - which pages are directly linked from the homepage?
  - create many internal links for important pages
  - be "careful" about where to put outgoing links
    - PageRank leakage
  - use keyword-rich anchor texts
  - dynamically create links between related content
    - e.g. "customers who bought this also bought ..." or "visitors who viewed this also viewed ..."
• Increase the number of pages
Technological Choices
• Use an SEO-friendly content management system (CMS)
• Dynamic URLs vs. static URLs
  - avoid session IDs and parameters in URL
  - use URL rewriting to get descriptive URLs containing keywords
• Think carefully about the use of dynamic content
  - Rich Internet Applications (RIAs) based on AJAX etc.
  - content hidden behind pull-down menus etc.
• Address webpages consistently
  - http://www.vub.ac.be ≠ http://www.vub.ac.be/index.php
• Some notes about the Google toolbar
  - shows logarithmic PageRank value (from 0 to 10)
  - information not frequently updated (Google dance)
Consistent Addressing of Webpages
Search Engine Optimisations
• Different things can be optimised
  - on-page factors
  - off-page factors
• It is assumed that some search engines use more than 200 on-page and off-page factors for their ranking
• Difference between optimisation and breaking the "search engine rules"
  - white hat and black hat optimisations
• A bad ranking or removal from the index can cost a company a lot of money or even mark the end of the company
  - e.g. supplemental index ("Google hell")
Positive On-Page Factors
• Use of keywords at relevant places
  - in title tag (preferably one of the first words)
  - in URL
  - in domain name
  - in header tags (e.g. <h1>)
  - multiple times in body text
• Provide metadata
  - e.g. <meta name="description"> also used by search engines to create the text snippets on the SERPs
• Quality of HTML code
• Uniqueness of content across the website
• Page freshness (changes from time to time)
Negative On-Page Factors
• Links to "bad neighbourhood"
• Link selling
  - in 2007 Google announced a campaign against paid links that transfer PageRank
• Over-optimisation penalty (keyword stuffing)
• Text with same colour as background (hidden content)
• Automatic redirect via the refresh meta tag
• Cloaking
  - different pages for spider and user
• Malware being hosted on the page
Negative On-Page Factors ...
• Duplicate or similar content
• Duplicate page titles or meta tags
• Slow page load time
• Any copyright violations
• ...
Positive Off-Page Factors
• Links from pages with a high PageRank
• Keywords in anchor text of inbound links
• Links from topically relevant sites
• High clickthrough rate (CTR) from search engine for a given keyword
• Listed in DMOZ / Open Directory Project (ODP) and Yahoo directories
• High number of shares on social networks
  - e.g. Facebook, Google+ or Twitter
Positive Off-Page Factors ...
• Site age (stability)
  - Google sandbox?
• Domain expiration date
• High PageRank
• ...
Negative Off-Page Factors
• Site often not accessible to crawlers
  - e.g. server problem
• High bounce rate
  - users immediately press the back button
• Link buying
  - rapidly increasing number of inbound links
• Use of link farms
• Participation in link sharing programmes
• Links from bad neighbourhood?
• Competitor attack (e.g. via duplicate content)?
Black Hat Optimisations (Don'ts)
• Link farms
• Spamdexing in guestbooks, Wikipedia etc.
  - "solution": <a rel="nofollow" href="...">...</a>
• Keyword stuffing
  - overuse of keywords
    - content keyword stuffing
    - image keyword stuffing
    - keywords in meta tags
    - invisible text with keywords
• Selling/buying links
  - "big" business until 2007
  - costs based on the PageRank of the linking site
Black Hat Optimisations (Don'ts) ...
• Doorway pages (cloaking)
  - doorway pages are normally just designed for search engines
    - user is automatically redirected to the target page
  - e.g. BMW Germany and Ricoh Germany banned in February 2006
Nofollow Link Example
• Nofollow value for hyperlinks introduced by Google in 2005 to avoid spamdexing
  - <a rel="nofollow" href="...">...</a>
• Links with a nofollow value were not counted in the PageRank computation
  - division by number of outgoing links
  - e.g. page with 9 outgoing links and 3 of them are nofollow links
    - PageRank divided by 6 and distributed across the 6 "really linked pages"
• SEO experts started to use (misuse) the nofollow links for PageRank sculpting
  - control flow of PageRank within a website
Nofollow Link Example ...
• In June 2009 Google decided to treat nofollow links differently to avoid PageRank sculpting
  - division by total number of outgoing links
  - e.g. page with 9 outgoing links and 3 of them are nofollow links
    - PageRank divided by 9 and distributed across the 6 "really linked pages"
  - no longer a good solution to prevent spamdexing since we lose (diffuse) some PageRank
• SEO experts started to use alternative techniques to replace nofollow links
  - e.g. obfuscated JavaScript links
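The effect of the two nofollow treatments on the 9-link example can be sketched numerically. This is a simplified illustration: the function name `share_per_followed_link` is hypothetical, the page is given a PageRank of 1.0 for convenience, and the damping factor is ignored.

```python
def share_per_followed_link(page_rank, total_links, nofollow_links, post_2009):
    """PageRank passed to each followed link under the old or the new treatment."""
    followed = total_links - nofollow_links
    if followed <= 0:
        return 0.0
    # Before June 2009 only followed links were counted; afterwards all links are.
    divisor = total_links if post_2009 else followed
    return page_rank / divisor

# Page with 9 outgoing links, 3 of them nofollow.
before = share_per_followed_link(1.0, 9, 3, post_2009=False)  # 1/6 per followed link
after = share_per_followed_link(1.0, 9, 3, post_2009=True)    # 1/9 per followed link
lost = 1.0 - 6 * after                                        # share that is no longer passed on
print(round(before, 3), round(after, 3), round(lost, 3))      # 0.167 0.111 0.333
```

Under the new treatment a third of the page's PageRank is no longer passed on, which is why nofollow stopped being useful for PageRank sculpting.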
Product Search
• Various shopping and price comparison sites import product data
  - some of them are free, for others one has to pay
• Google Product Search
  - started as Froogle, became Google Products and now Google Product Search
  - product data uploaded to Google Base
  - very effective vertical search
Non-Organic Search
• In addition to the so-called organic search, websites can also participate in non-organic web search
  - cost-per-impression (CPI)
  - cost-per-click (CPC)
• The non-organic web search should be treated independently from the organic web search
• Quality of the landing page can have an impact on the non-organic web search performance!
• The Google AdWords programme is an example of a commercial non-organic web search service
  - other services include Yahoo! Advertising Solutions, Facebook Ads, ...
Google AdWords
• pay-per-click (PPC) or cost-per-thousand (CPM)
• Campaigns and ad groups
• Two types of advertising
  - search
  - content network
    - Google AdSense
• Highly customisable ads
  - region
  - language
  - daytime
  - ...
Google AdWords ...
• Excellent control and monitoring for AdWords users
  - cost per conversion
• In 2013 Google's total advertising revenues were 51 billion USD
Conclusions
• Web information retrieval techniques have to deal with the specific characteristics of the Web
• PageRank algorithm
  - absolute quality of a page based on incoming links
  - based on random surfer model
  - computed as eigenvector of Google matrix G
• PageRank is just one (important) factor
• Various implications for website development and SEO
Exercise 10
• Web Search and PageRank
References
• L. Page, S. Brin, R. Motwani and T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, January 1998
• S. Brin and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Computer Networks and ISDN Systems, 30(1-7), April 1998
• A.N. Langville and C.D. Meyer, Google's PageRank and Beyond: The Science of Search Engine Rankings, Princeton University Press, July 2006
• PageRank Calculator
  - http://www.webworkshop.net/pagerank_calculator.php
References ...
• B. Sparrow, J. Liu and D.M. Wegner, Google Effects on Memory: Cognitive Consequences of Having Information at Our Fingertips, Science, July 2011
• Google Webmaster Tools
  - http://www.google.com/webmasters/
• The W3C Markup Validation Service
  - http://validator.w3.org
• SEOmoz
  - http://moz.com
2 December 2005 
Next Lecture 
Security, Privacy and Trust

Personalised Learning Environments Based on Knowledge Graphs and the Zone of ...Beat Signer
 
Cross-Media Technologies and Applications - Future Directions for Personal In...
Cross-Media Technologies and Applications - Future Directions for Personal In...Cross-Media Technologies and Applications - Future Directions for Personal In...
Cross-Media Technologies and Applications - Future Directions for Personal In...Beat Signer
 
Bridging the Gap: Managing and Interacting with Information Across Media Boun...
Bridging the Gap: Managing and Interacting with Information Across Media Boun...Bridging the Gap: Managing and Interacting with Information Across Media Boun...
Bridging the Gap: Managing and Interacting with Information Across Media Boun...Beat Signer
 
Codeschool in a Box: A Low-Barrier Approach to Packaging Programming Curricula
Codeschool in a Box: A Low-Barrier Approach to Packaging Programming CurriculaCodeschool in a Box: A Low-Barrier Approach to Packaging Programming Curricula
Codeschool in a Box: A Low-Barrier Approach to Packaging Programming CurriculaBeat Signer
 
The RSL Hypermedia Metamodel and Its Application in Cross-Media Solutions
The RSL Hypermedia Metamodel and Its Application in Cross-Media Solutions The RSL Hypermedia Metamodel and Its Application in Cross-Media Solutions
The RSL Hypermedia Metamodel and Its Application in Cross-Media Solutions Beat Signer
 
Case Studies and Course Review - Lecture 12 - Information Visualisation (4019...
Case Studies and Course Review - Lecture 12 - Information Visualisation (4019...Case Studies and Course Review - Lecture 12 - Information Visualisation (4019...
Case Studies and Course Review - Lecture 12 - Information Visualisation (4019...Beat Signer
 
Dashboards - Lecture 11 - Information Visualisation (4019538FNR)
Dashboards - Lecture 11 - Information Visualisation (4019538FNR)Dashboards - Lecture 11 - Information Visualisation (4019538FNR)
Dashboards - Lecture 11 - Information Visualisation (4019538FNR)Beat Signer
 
Interaction - Lecture 10 - Information Visualisation (4019538FNR)
Interaction - Lecture 10 - Information Visualisation (4019538FNR)Interaction - Lecture 10 - Information Visualisation (4019538FNR)
Interaction - Lecture 10 - Information Visualisation (4019538FNR)Beat Signer
 
View Manipulation and Reduction - Lecture 9 - Information Visualisation (4019...
View Manipulation and Reduction - Lecture 9 - Information Visualisation (4019...View Manipulation and Reduction - Lecture 9 - Information Visualisation (4019...
View Manipulation and Reduction - Lecture 9 - Information Visualisation (4019...Beat Signer
 
Visualisation Techniques - Lecture 8 - Information Visualisation (4019538FNR)
Visualisation Techniques - Lecture 8 - Information Visualisation (4019538FNR)Visualisation Techniques - Lecture 8 - Information Visualisation (4019538FNR)
Visualisation Techniques - Lecture 8 - Information Visualisation (4019538FNR)Beat Signer
 
Design Guidelines and Principles - Lecture 7 - Information Visualisation (401...
Design Guidelines and Principles - Lecture 7 - Information Visualisation (401...Design Guidelines and Principles - Lecture 7 - Information Visualisation (401...
Design Guidelines and Principles - Lecture 7 - Information Visualisation (401...Beat Signer
 
Data Processing and Visualisation Frameworks - Lecture 6 - Information Visual...
Data Processing and Visualisation Frameworks - Lecture 6 - Information Visual...Data Processing and Visualisation Frameworks - Lecture 6 - Information Visual...
Data Processing and Visualisation Frameworks - Lecture 6 - Information Visual...Beat Signer
 
Data Presentation - Lecture 5 - Information Visualisation (4019538FNR)
Data Presentation - Lecture 5 - Information Visualisation (4019538FNR)Data Presentation - Lecture 5 - Information Visualisation (4019538FNR)
Data Presentation - Lecture 5 - Information Visualisation (4019538FNR)Beat Signer
 
Analysis and Validation - Lecture 4 - Information Visualisation (4019538FNR)
Analysis and Validation - Lecture 4 - Information Visualisation (4019538FNR)Analysis and Validation - Lecture 4 - Information Visualisation (4019538FNR)
Analysis and Validation - Lecture 4 - Information Visualisation (4019538FNR)Beat Signer
 
Data Representation - Lecture 3 - Information Visualisation (4019538FNR)
Data Representation - Lecture 3 - Information Visualisation (4019538FNR)Data Representation - Lecture 3 - Information Visualisation (4019538FNR)
Data Representation - Lecture 3 - Information Visualisation (4019538FNR)Beat Signer
 
Human Perception and Colour Theory - Lecture 2 - Information Visualisation (4...
Human Perception and Colour Theory - Lecture 2 - Information Visualisation (4...Human Perception and Colour Theory - Lecture 2 - Information Visualisation (4...
Human Perception and Colour Theory - Lecture 2 - Information Visualisation (4...Beat Signer
 
Introduction - Lecture 1 - Information Visualisation (4019538FNR)
Introduction - Lecture 1 - Information Visualisation (4019538FNR)Introduction - Lecture 1 - Information Visualisation (4019538FNR)
Introduction - Lecture 1 - Information Visualisation (4019538FNR)Beat Signer
 
Towards a Framework for Dynamic Data Physicalisation
Towards a Framework for Dynamic Data PhysicalisationTowards a Framework for Dynamic Data Physicalisation
Towards a Framework for Dynamic Data PhysicalisationBeat Signer
 

Mehr von Beat Signer (20)

Introduction - Lecture 1 - Human-Computer Interaction (1023841ANR)
Introduction - Lecture 1 - Human-Computer Interaction (1023841ANR)Introduction - Lecture 1 - Human-Computer Interaction (1023841ANR)
Introduction - Lecture 1 - Human-Computer Interaction (1023841ANR)
 
Indoor Positioning Using the OpenHPS Framework
Indoor Positioning Using the OpenHPS FrameworkIndoor Positioning Using the OpenHPS Framework
Indoor Positioning Using the OpenHPS Framework
 
Personalised Learning Environments Based on Knowledge Graphs and the Zone of ...
Personalised Learning Environments Based on Knowledge Graphs and the Zone of ...Personalised Learning Environments Based on Knowledge Graphs and the Zone of ...
Personalised Learning Environments Based on Knowledge Graphs and the Zone of ...
 
Cross-Media Technologies and Applications - Future Directions for Personal In...
Cross-Media Technologies and Applications - Future Directions for Personal In...Cross-Media Technologies and Applications - Future Directions for Personal In...
Cross-Media Technologies and Applications - Future Directions for Personal In...
 
Bridging the Gap: Managing and Interacting with Information Across Media Boun...
Bridging the Gap: Managing and Interacting with Information Across Media Boun...Bridging the Gap: Managing and Interacting with Information Across Media Boun...
Bridging the Gap: Managing and Interacting with Information Across Media Boun...
 
Codeschool in a Box: A Low-Barrier Approach to Packaging Programming Curricula
Codeschool in a Box: A Low-Barrier Approach to Packaging Programming CurriculaCodeschool in a Box: A Low-Barrier Approach to Packaging Programming Curricula
Codeschool in a Box: A Low-Barrier Approach to Packaging Programming Curricula
 
The RSL Hypermedia Metamodel and Its Application in Cross-Media Solutions
The RSL Hypermedia Metamodel and Its Application in Cross-Media Solutions The RSL Hypermedia Metamodel and Its Application in Cross-Media Solutions
The RSL Hypermedia Metamodel and Its Application in Cross-Media Solutions
 
Case Studies and Course Review - Lecture 12 - Information Visualisation (4019...
Case Studies and Course Review - Lecture 12 - Information Visualisation (4019...Case Studies and Course Review - Lecture 12 - Information Visualisation (4019...
Case Studies and Course Review - Lecture 12 - Information Visualisation (4019...
 
Dashboards - Lecture 11 - Information Visualisation (4019538FNR)
Dashboards - Lecture 11 - Information Visualisation (4019538FNR)Dashboards - Lecture 11 - Information Visualisation (4019538FNR)
Dashboards - Lecture 11 - Information Visualisation (4019538FNR)
 
Interaction - Lecture 10 - Information Visualisation (4019538FNR)
Interaction - Lecture 10 - Information Visualisation (4019538FNR)Interaction - Lecture 10 - Information Visualisation (4019538FNR)
Interaction - Lecture 10 - Information Visualisation (4019538FNR)
 
View Manipulation and Reduction - Lecture 9 - Information Visualisation (4019...
View Manipulation and Reduction - Lecture 9 - Information Visualisation (4019...View Manipulation and Reduction - Lecture 9 - Information Visualisation (4019...
View Manipulation and Reduction - Lecture 9 - Information Visualisation (4019...
 
Visualisation Techniques - Lecture 8 - Information Visualisation (4019538FNR)
Visualisation Techniques - Lecture 8 - Information Visualisation (4019538FNR)Visualisation Techniques - Lecture 8 - Information Visualisation (4019538FNR)
Visualisation Techniques - Lecture 8 - Information Visualisation (4019538FNR)
 
Design Guidelines and Principles - Lecture 7 - Information Visualisation (401...
Design Guidelines and Principles - Lecture 7 - Information Visualisation (401...Design Guidelines and Principles - Lecture 7 - Information Visualisation (401...
Design Guidelines and Principles - Lecture 7 - Information Visualisation (401...
 
Data Processing and Visualisation Frameworks - Lecture 6 - Information Visual...
Data Processing and Visualisation Frameworks - Lecture 6 - Information Visual...Data Processing and Visualisation Frameworks - Lecture 6 - Information Visual...
Data Processing and Visualisation Frameworks - Lecture 6 - Information Visual...
 
Data Presentation - Lecture 5 - Information Visualisation (4019538FNR)
Data Presentation - Lecture 5 - Information Visualisation (4019538FNR)Data Presentation - Lecture 5 - Information Visualisation (4019538FNR)
Data Presentation - Lecture 5 - Information Visualisation (4019538FNR)
 
Analysis and Validation - Lecture 4 - Information Visualisation (4019538FNR)
Analysis and Validation - Lecture 4 - Information Visualisation (4019538FNR)Analysis and Validation - Lecture 4 - Information Visualisation (4019538FNR)
Analysis and Validation - Lecture 4 - Information Visualisation (4019538FNR)
 
Data Representation - Lecture 3 - Information Visualisation (4019538FNR)
Data Representation - Lecture 3 - Information Visualisation (4019538FNR)Data Representation - Lecture 3 - Information Visualisation (4019538FNR)
Data Representation - Lecture 3 - Information Visualisation (4019538FNR)
 
Human Perception and Colour Theory - Lecture 2 - Information Visualisation (4...
Human Perception and Colour Theory - Lecture 2 - Information Visualisation (4...Human Perception and Colour Theory - Lecture 2 - Information Visualisation (4...
Human Perception and Colour Theory - Lecture 2 - Information Visualisation (4...
 
Introduction - Lecture 1 - Information Visualisation (4019538FNR)
Introduction - Lecture 1 - Information Visualisation (4019538FNR)Introduction - Lecture 1 - Information Visualisation (4019538FNR)
Introduction - Lecture 1 - Information Visualisation (4019538FNR)
 
Towards a Framework for Dynamic Data Physicalisation
Towards a Framework for Dynamic Data PhysicalisationTowards a Framework for Dynamic Data Physicalisation
Towards a Framework for Dynamic Data Physicalisation
 

KÃŧrzlich hochgeladen

Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxShobhayan Kirtania
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Dr. Mazin Mohamed alkathiri
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfchloefrazer622
 
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...Pooja Nehwal
 

KÃŧrzlich hochgeladen (20)

Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptx
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
 

Web Search - Lecture 10 - Web Information Systems (4011474FNR)

  • 1. 2 December 2005 Web Information Systems Web Search Prof. Beat Signer Department of Computer Science Vrije Universiteit Brussel http://www.beatsigner.com
  • 2. December 5, 2014 Beat Signer - Department of Computer Science - bsigner@vub.ac.be 2 Search Engine Result Pages (SERP)
  • 3. Search Engine Result Pages (SERP) ...
  • 4. Vertical Search Result Pages
  • 5. Search Engine Market Share (2013)
  • 6. Search Engine Result Page
      ▪ There is a variety of information shown on a search engine result page (SERP)
          ▪ organic search results
          ▪ non-organic search results
          ▪ meta-information about the result (e.g. number of result pages)
          ▪ vertical navigation
          ▪ advanced search options
          ▪ query refinement suggestions
          ▪ ...
  • 7. Search Engine History
      ▪ Early "search engines" include various systems starting with Bush's Memex
      ▪ Archie (1990)
          ▪ first Internet search engine
          ▪ indexing of files on FTP servers
      ▪ W3Catalog (September 1993)
          ▪ first "web search engine"
          ▪ mirroring and integration of manually maintained catalogues
      ▪ JumpStation (December 1993)
          ▪ first web search engine combining crawling, indexing and searching
  • 8. Search Engine History ...
      ▪ In the following two years (1994/1995) many new search engines appeared
          ▪ AltaVista, Infoseek, Excite, Inktomi, Yahoo!, ...
      ▪ Two categories of early Web search solutions
          ▪ full text search
              - based on an index that is automatically created by a web crawler in combination with an indexer
              - e.g. AltaVista or Infoseek
          ▪ manually maintained classification (hierarchy) of webpages
              - significant human editing effort
              - e.g. Yahoo!
  • 9. Information Retrieval
      ▪ Precision and recall can be used to measure the performance of different information retrieval algorithms
          $\text{precision} = \dfrac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|}$
          $\text{recall} = \dfrac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|}$
      [Figure: document collection D1-D10 and a query returning D1, D3, D8, D9, D10]
          $\text{precision} = \frac{3}{5} = 0.6$ and $\text{recall} = \frac{3}{4} = 0.75$
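The two definitions above can be sketched with Python sets. The figure does not survive in this transcript, so the concrete membership of the `relevant` set below is an assumption chosen only to match the slide's counts (5 retrieved, 4 relevant, 3 in common); the function name is also illustrative.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two sets of document ids."""
    hits = relevant & retrieved  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Assumed sets: only the sizes (5, 4 and an overlap of 3) come from the slide.
retrieved = {"D1", "D3", "D8", "D9", "D10"}
relevant = {"D3", "D5", "D8", "D9"}

precision, recall = precision_recall(relevant, retrieved)
print(precision, recall)
```

Returning 0.0 for an empty set is a design choice of this sketch; other conventions exist.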
  • 10. Information Retrieval ...
      ▪ Often a combination of precision and recall, the so-called F-score (harmonic mean) is used as a single measure
          $\text{F-score} = 2 \times \dfrac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$
      [Figure: query returning D1, D3, D8, D9, D10 from the collection D1-D10]
          $\text{precision} = 0.6$, $\text{recall} = 0.75$, $\text{F-score} = 0.67$
      [Figure: the same query result extended with D5 and D2]
          $\text{precision} = 0.57$, $\text{recall} = 1$, $\text{F-score} = 0.73$
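As a quick check of the two F-score values, here is a minimal sketch of the harmonic mean. It assumes that the precision of 0.57 in the second example is 4/7 rounded; the function name is illustrative.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


first = f_score(0.6, 0.75)     # first example
second = f_score(4 / 7, 1.0)   # second example, assuming 0.57 stands for 4/7
print(round(first, 2), round(second, 2))
```

Because the harmonic mean is pulled towards the smaller of the two values, the second result scores higher only because its recall gain outweighs the small loss in precision.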
  • 11. Boolean Model
      ▪ Based on set theory and boolean logic
      ▪ Exact matching of documents to a user query
      ▪ Uses the boolean AND, OR and NOT operators
          ▪ query: Shopping AND Ghent AND NOT Delhaize
          ▪ computation: 101110 AND 100111 AND 000111 = 000110
          ▪ result: document set {D4,D5}
      [Figure: inverted index shown as a bit matrix with the terms Bank, Delhaize, Ghent, Metro, Shopping, Train, ... as rows and the documents D1-D6 as columns]
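The computation on this slide can be reproduced with integer bit operations. The sketch below uses only the three index rows that appear in the computation; the `Delhaize` row is inferred from the fact that NOT Delhaize is given as 000111. The helper name and the masking approach are choices of this sketch.

```python
DOCS = ["D1", "D2", "D3", "D4", "D5", "D6"]

# One bit per document, leftmost bit = D1.
INDEX = {
    "Shopping": 0b101110,
    "Ghent": 0b100111,
    "Delhaize": 0b111000,  # inferred: NOT Delhaize is 000111 in the computation
}
MASK = (1 << len(DOCS)) - 1  # keeps the result of NOT within the six document bits


def matching_docs(bits):
    """Translate a bit vector back into the list of document names."""
    width = len(DOCS)
    return [doc for pos, doc in enumerate(DOCS) if bits >> (width - 1 - pos) & 1]


# Shopping AND Ghent AND NOT Delhaize
result = INDEX["Shopping"] & INDEX["Ghent"] & (~INDEX["Delhaize"] & MASK)
print(format(result, "06b"), matching_docs(result))
```

The mask is needed because Python integers are unbounded, so a plain `~` would produce a negative number rather than a six-bit complement.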
  • 12. Boolean Model ...
      ▪ Advantages
          ▪ relatively easy to implement and scalable
          ▪ fast query processing based on parallel scanning of indexes
      ▪ Disadvantages
          ▪ does not pay attention to synonymy
              - different words with similar meaning
          ▪ does not pay attention to polysemy
              - a single word with different meanings
          ▪ no ranking of output
          ▪ often the user has to learn a special syntax such as the use of double quotes to search for phrases
      ▪ Variants of the boolean model form the basis of many search engines
  • 13. Vector Space Model
      ▪ Algebraic model representing text documents and queries as vectors based on the index terms
          ▪ one dimension for each term
      ▪ Compute the similarity (angle) between the query vector and the document vectors
      ▪ Advantages
          ▪ simple model based on linear algebra
          ▪ partial matching with relevance scoring for results
          ▪ potential query reevaluation based on user relevance feedback
      ▪ Disadvantages
          ▪ computationally expensive (similarity measures for each query)
          ▪ limited scalability
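A minimal sketch of the similarity computation, assuming raw term frequencies as vector components and the cosine of the angle as the similarity measure. The slide does not fix a weighting scheme, so this is one possible instance; the toy documents and function names are hypothetical.

```python
import math
from collections import Counter


def to_vector(text):
    """Term-frequency vector with one dimension per distinct term."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy collection, not taken from the slides.
documents = {
    "D1": "shopping in ghent",
    "D2": "train to ghent",
    "D3": "bank holiday",
}
query = to_vector("ghent shopping")

ranking = sorted(
    documents,
    key=lambda d: cosine(query, to_vector(documents[d])),
    reverse=True,
)
print(ranking)
```

Unlike the boolean model, D2 still receives a non-zero score although it matches only one of the two query terms, which is the partial matching listed under the advantages. The loop over all documents also shows why the similarity computation becomes expensive for large collections.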
  • 14. Web Search Engines
      ▪ Most web search engines are based on traditional information retrieval techniques but they have to be adapted to deal with the characteristics of the Web
          ▪ immense amount of web resources (>50 billion webpages)
          ▪ hyperlinked resources
          ▪ dynamic content with frequent updates
          ▪ self-organised web resources
      ▪ Evaluation of performance
          ▪ no standard collections
          ▪ often based on user studies (satisfaction)
      ▪ Of course not only the precision and recall but also the query answer time is an important issue
  • 15. Web Search Engine Architecture
      [Figure: architecture diagram with the labels WWW, Crawler, URL Pool, Storage Manager, Page Repository, "content already added?", Document Index, Special Indexes, URL Handler, Indexers, URL Repository, "filter", "normalisation and duplicate elimination", Client, Query Handler, inverted index, Ranking]
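The "normalisation and duplicate elimination" step of the architecture could look roughly like the sketch below. The normalisation rules (lower-casing scheme and host, dropping the fragment, defaulting an empty path to `/`) and all names are assumptions for illustration; the slide does not specify them.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise(url):
    """Lower-case scheme and host, drop the fragment, default the path to '/'."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def add_new_urls(candidates, seen, pool):
    """Append URLs that were not seen before to the pool (duplicate elimination)."""
    for url in candidates:
        key = normalise(url)
        if key not in seen:
            seen.add(key)
            pool.append(key)


seen, pool = set(), []
add_new_urls(
    ["http://Example.org", "http://example.org/#top", "http://example.org/a?x=1"],
    seen,
    pool,
)
print(pool)
```

The first two candidates normalise to the same URL, so only two entries end up in the pool.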
  • 16. Web Crawler
      ▪ A web crawler or spider is used to create an index of webpages to be used by a web search engine
          ▪ any web search is then based on this index
      ▪ A web crawler has to deal with the following issues
          ▪ freshness
              - the index should be updated regularly (based on webpage update frequency)
          ▪ quality
              - since not all webpages can be indexed, the crawler should give priority to "high quality" pages
          ▪ scalability
              - it should be possible to increase the crawl rate by just adding additional servers (modular architecture)
              - e.g. the estimated number of Google servers in 2007 was 1'000'000 (including not only the crawler but the entire Google platform)
  • 17. Web Crawler ...
          ▪ distribution
              - the crawler should be able to run in a distributed manner (computer centers all over the world)
          ▪ robustness
              - the Web contains a lot of pages with errors and a crawler has to deal with these problems
              - e.g. deal with a web server that creates an unlimited number of "virtual web pages" (crawler trap)
          ▪ efficiency
              - resources (e.g. network bandwidth) should be used in the most efficient way
          ▪ crawl rates
              - the crawler should pay attention to existing web server policies (e.g. revisit-after HTML meta tag or robots.txt file)
      robots.txt:
          User-agent: *
          Disallow: /cgi-bin/
          Disallow: /tmp/
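A crawler can check the robots.txt rules shown on this slide before fetching a page. The sketch below uses `urllib.robotparser` from the Python standard library and parses the rules from a string so that no network access is needed. The user agent `ExampleBot` and the host `example.org` are placeholders.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse in memory instead of fetching a URL

# "ExampleBot" and example.org are placeholders for illustration.
allowed = parser.can_fetch("ExampleBot", "http://example.org/index.html")
blocked = parser.can_fetch("ExampleBot", "http://example.org/tmp/cache.html")
print(allowed, blocked)
```

In a real crawler the parser would typically be filled via `set_url()` and `read()` once per host and the result cached.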
  • 18. Pre-1998 Web Search
      ▪ Find all documents for a given query term
          ▪ use information retrieval (IR) solutions
              - boolean model
              - vector space model
              - ...
          ▪ ranking based on "on-page factors"
              → problem: poor quality of search results (order)
      ▪ Larry Page and Sergey Brin proposed to compute the absolute quality of a page called PageRank
          ▪ based on the number and quality of pages linking to a page (votes)
          ▪ query-independent
  • 19. Origins of PageRank
      ▪ Developed as part of an academic project at Stanford University
          ▪ research platform to aid understanding of large-scale web data and enable researchers to easily experiment with new search technologies
      ▪ Larry Page and Sergey Brin worked on the project about a new kind of search engine (1995-1998) which finally led to a functional prototype called Google
      [Photos: Larry Page, Sergey Brin]
  • 20. PageRank
      ▪ A page $P_i$ has a high PageRank $R_i$ if
          ▪ there are many pages linking to it
          ▪ or, if there are some pages with a high PageRank linking to it
      ▪ Total score = IR score × PageRank
      [Figure: link graph of pages P1-P8 with PageRanks R1-R8]
  • 21. Basic PageRank Algorithm
          $R(P_i) = \sum_{P_j \in B_i} \dfrac{R(P_j)}{L_j}$
      ▪ where
          ▪ $B_i$ is the set of pages that link to page $P_i$
          ▪ $L_j$ is the number of outgoing links for page $P_j$
      [Figure: three pages P1, P2, P3 with the values 1, 1, 1 and 1.5, 1.5, 0.75]
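The sum formula can be sketched directly on a link table. The link structure below (P1 links to P2, P2 links to P1 and P3, P3 links to P1) is an assumption reconstructed from the three-page example, since the figure itself is not part of the transcript. With page-by-page in-place updates starting from 1, 1, 1, one pass yields 1.5, 1.5, 0.75, which appears to match the figure's values.

```python
# Outgoing links per page (assumed reconstruction of the three-page example).
links = {"P1": ["P2"], "P2": ["P1", "P3"], "P3": ["P1"]}


def update_in_place(rank, links):
    """One pass of R(P_i) = sum over P_j in B_i of R(P_j) / L_j, page by page."""
    for page in rank:
        backlinks = [src for src, targets in links.items() if page in targets]  # B_i
        rank[page] = sum(rank[src] / len(links[src]) for src in backlinks)
    return rank


rank = update_in_place({"P1": 1.0, "P2": 1.0, "P3": 1.0}, links)
print(rank)
```

Updating in place means that P2 already sees the new value of P1 within the same pass; a synchronous update would instead compute every new value from the old vector.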
  • 22. Matrix Representation
      ▪ Let us define a hyperlink matrix $H$
          $H_{ij} = \begin{cases} \frac{1}{L_j} & \text{if } P_j \in B_i \\ 0 & \text{otherwise} \end{cases}$
          $H = \begin{bmatrix} 0 & \frac{1}{2} & 1 \\ 1 & 0 & 0 \\ 0 & \frac{1}{2} & 0 \end{bmatrix}$
          $R = [R(P_i)]$ and $R = HR$
          → $R$ is an eigenvector of $H$ with eigenvalue 1
      [Figure: link graph of the pages P1, P2, P3]
  • 23. Matrix Representation ...
      ▪ We can use the power method to find $R$
          $R_{t+1} = H R_t$
          ▪ sparse matrix $H$ with 40 billion columns and rows but only an average of 10 non-zero entries in each column
      ▪ For our example $H = \begin{bmatrix} 0 & \frac{1}{2} & 1 \\ 1 & 0 & 0 \\ 0 & \frac{1}{2} & 0 \end{bmatrix}$ this results in $R = \begin{bmatrix} 2 & 2 & 1 \end{bmatrix}$ or $\begin{bmatrix} 0.4 & 0.4 & 0.2 \end{bmatrix}$
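A minimal sketch of the power method on the three-page example. The matrix entries are a reconstruction of the example (an assumption, chosen so that the stated result 0.4, 0.4, 0.2 is a fixed point). The uniform start vector and the fixed iteration count are also choices of this sketch, and dense lists stand in for the sparse matrix mentioned on the slide.

```python
# Column-stochastic hyperlink matrix: H[i][j] = 1/L_j if P_j links to P_i, else 0.
H = [
    [0.0, 0.5, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.0],
]


def power_method(matrix, iterations=100):
    """Iterate R_{t+1} = H R_t starting from a uniform vector."""
    n = len(matrix)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        rank = [sum(matrix[i][j] * rank[j] for j in range(n)) for i in range(n)]
    return rank


R = power_method(H)
print([round(r, 3) for r in R])
```

Because every column of this `H` sums to 1, the entries of the vector keep summing to 1, so the iteration converges to the normalised form of the eigenvector.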
  • 24. Dangling Pages (Rank Sink)
      ▪ Problem with pages that have no outgoing links (e.g. P2)
      $$H = \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix} \quad \text{and} \quad R = [0 \;\; 0]$$
      ▪ Stochastic adjustment
          ▪ if page Pj has no outgoing links then replace all entries of column j with 1/n, where n is the number of pages
      $$C = \begin{bmatrix} 0 & 1/2 \\ 0 & 1/2 \end{bmatrix} \quad \text{and} \quad S = H + C = \begin{bmatrix} 0 & 1/2 \\ 1 & 1/2 \end{bmatrix}$$
      ▪ New stochastic matrix S always has a stationary vector R
          ▪ can also be interpreted as a Markov chain
      [Figure: two pages P1 and P2, with P1 linking to P2]
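The stochastic adjustment can be sketched as follows, assuming the same dense list-of-lists representation as a toy model. The function name `stochastic_adjustment` is hypothetical. The replacement value is 1/n (with n the number of pages), which is what the entries 1/2 of the two-page matrix C correspond to.

```python
def stochastic_adjustment(H):
    """Return S = H + C: every all-zero column (dangling page) becomes 1/n."""
    n = len(H)
    S = [row[:] for row in H]  # copy so that H itself is left untouched
    for j in range(n):
        is_dangling = all(H[i][j] == 0 for i in range(n))
        if is_dangling:
            for i in range(n):
                S[i][j] = 1.0 / n
    return S


# Two-page example: P1 links to P2, P2 has no outgoing links
H = [
    [0.0, 0.0],
    [1.0, 0.0],
]

S = stochastic_adjustment(H)
print(S)  # [[0.0, 0.5], [1.0, 0.5]]
```

After the adjustment every column of S sums to 1, which is what makes S a stochastic matrix with a stationary vector.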
  • 25. Strongly Connected Pages (Graph)
      ▪ Add new transition probabilities between all pages
          ▪ with probability d we follow the hyperlink structure S
          ▪ with probability 1-d we choose a random page
          ▪ matrix G becomes irreducible
      $$G = dS + (1-d)\frac{1}{n}\mathbf{1} \qquad R = GR$$
        where $\mathbf{1}$ is the n × n matrix with all entries equal to 1
      ▪ Google matrix G reflects a random surfer
          ▪ no modelling of back button
      [Figure: graph of pages P1 to P5 with additional random-jump transitions labelled 1-d]
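As a sketch of how the Google matrix could be built and used, the following Python snippet computes G from a small stochastic matrix S and finds R with the power method. The slides do not state a value for d, so d = 0.85 is an assumption here (a commonly quoted choice), and the function names are illustrative.

```python
def google_matrix(S, d=0.85):
    """G = d*S + (1 - d) * (1/n) * ones, written out entry by entry."""
    n = len(S)
    return [[d * S[i][j] + (1.0 - d) / n for j in range(n)] for i in range(n)]


def pagerank(S, d=0.85, iterations=100):
    """Power method on the Google matrix: repeatedly apply R = G R."""
    G = google_matrix(S, d)
    n = len(G)
    R = [1.0 / n] * n
    for _ in range(iterations):
        R = [sum(G[i][j] * R[j] for j in range(n)) for i in range(n)]
    return R


# Stochastic matrix of the two-page dangling example (P1 -> P2, P2 dangling)
S = [
    [0.0, 0.5],
    [1.0, 0.5],
]

G = google_matrix(S)
R = pagerank(S)
print([round(r, 3) for r in R])
```

Every entry of G is strictly positive, so the random surfer can reach any page from any other page, which is the irreducibility property mentioned on the slide.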
  • 26. Examples
      $$G = dS + (1-d)\frac{1}{n}\mathbf{1}$$
      [Figure: site A with PageRank values A1 = 0.26, A2 = 0.37, A3 = 0.37]
  • 27. Examples ...
      $$G = dS + (1-d)\frac{1}{n}\mathbf{1}$$
      [Figure: sites A and B with PageRank values A1 = 0.13, A2 = 0.185, A3 = 0.185, B1 = 0.13, B2 = 0.185, B3 = 0.185; P(A) = 0.5, P(B) = 0.5]
  • 28. Examples
      ▪ PageRank leakage
      $$G = dS + (1-d)\frac{1}{n}\mathbf{1}$$
      [Figure: sites A and B with PageRank values A1 = 0.10, A2 = 0.14, A3 = 0.14, B1 = 0.22, B2 = 0.20, B3 = 0.20; P(A) = 0.38, P(B) = 0.62]
  • 29. Examples ...
      $$G = dS + (1-d)\frac{1}{n}\mathbf{1}$$
      [Figure: sites A and B with PageRank values A1 = 0.3, A2 = 0.23, A3 = 0.18, B1 = 0.10, B2 = 0.095, B3 = 0.095; P(A) = 0.71, P(B) = 0.29]
  • 30. Examples
      ▪ PageRank feedback
      $$G = dS + (1-d)\frac{1}{n}\mathbf{1}$$
      [Figure: sites A and B with PageRank values A1 = 0.35, A2 = 0.24, A3 = 0.18, B1 = 0.09, B2 = 0.07, B3 = 0.07; P(A) = 0.77, P(B) = 0.23]
  • 31. Examples ...
      $$G = dS + (1-d)\frac{1}{n}\mathbf{1}$$
      [Figure: sites A and B with PageRank values A1 = 0.33, A2 = 0.17, A3 = 0.175, A4 = 0.125, B1 = 0.08, B2 = 0.06, B3 = 0.06; P(A) = 0.80, P(B) = 0.20]
  • 32. Google Webmaster Tools
      ▪ Various services and information about a website
      ▪ Site configuration
          ▪ submission of sitemap
          ▪ crawler access
          ▪ URLs of indexed pages
          ▪ settings
              - e.g. preferred domain
      ▪ Your site on the web
          ▪ search queries
          ▪ keywords
          ▪ internal and external links
  • 33. Google Webmaster Tools ...
      ▪ Diagnostics
          ▪ crawl rates and errors
          ▪ HTML suggestions
      ▪ Use HTML suggestions for on-page factor optimisation
          ▪ meta description
              - duplicate meta descriptions
              - too long meta descriptions
          ▪ title tag
              - missing or duplicate title tags
              - too long or too short title tags
          ▪ non-indexable content
      ▪ Similar tools offered by other search engines
          ▪ e.g. Bing Webmaster Tools
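Some of these on-page checks can be approximated locally. The sketch below is not how Webmaster Tools works internally. It only illustrates the idea for a single page with Python's standard `html.parser` module. The class and function names are hypothetical, and the length limits (60 and 160 characters) are arbitrary assumptions rather than documented thresholds.

```python
from html.parser import HTMLParser


class HeadCollector(HTMLParser):
    """Collect <title> texts and meta description contents from a page."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self.descriptions = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
            self.titles.append("")
        elif tag == "meta" and (attributes.get("name") or "").lower() == "description":
            self.descriptions.append(attributes.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data


def on_page_issues(html, max_title=60, max_description=160):
    """Return a list of simple on-page problems (limits are assumptions)."""
    collector = HeadCollector()
    collector.feed(html)
    issues = []
    if not collector.titles:
        issues.append("missing title tag")
    elif len(collector.titles[0].strip()) > max_title:
        issues.append("title too long")
    if not collector.descriptions:
        issues.append("missing meta description")
    elif len(collector.descriptions[0]) > max_description:
        issues.append("meta description too long")
    return issues


page = "<html><head><title>Web Search</title></head><body><p>Hello</p></body></html>"
print(on_page_issues(page))  # ['missing meta description']
```

Detecting duplicate titles or descriptions would additionally require comparing the collected values across all pages of a site, which this single-page sketch does not attempt.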
  • 34. XML Sitemaps
      ▪ List of URLs that should be crawled and indexed

        <?xml version="1.0" encoding="UTF-8"?>
        <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
          <url>
            <loc>https://www.tenera.ch/trommelreibe-classic-p-2259-l-de.html</loc>
            <lastmod>2013-07-06</lastmod>
            <changefreq>weekly</changefreq>
            <priority>0.4</priority>
          </url>
          <url>
            <loc>https://www.tenera.ch/universalmesser-weiss-p-34-l-de.html</loc>
            <lastmod>2012-12-05</lastmod>
            <changefreq>weekly</changefreq>
            <priority>0.1</priority>
          </url>
          ...
        </urlset>
  • 35. XML Sitemaps ...
      ▪ All major search engines support the sitemap format
      ▪ The URLs of a sitemap are not guaranteed to be added to a search engine's index
          ▪ helps search engines to find pages that are not yet indexed
      ▪ Additional metadata might be provided to search engines
          ▪ relative page relevance (priority)
          ▪ date of last modification (lastmod)
          ▪ update frequency (changefreq)
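A sitemap with these metadata fields could be generated with a few lines of Python, for instance with the standard `xml.etree.ElementTree` module. This is a minimal sketch: the helper name `build_sitemap` and the example.com URLs are placeholders, and the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap string from dicts with 'loc' and optional metadata."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # Only emit the optional elements that were actually provided
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")


# Placeholder URLs for illustration only
sitemap_xml = build_sitemap([
    {"loc": "https://www.example.com/", "lastmod": "2014-12-05",
     "changefreq": "weekly", "priority": 0.8},
    {"loc": "https://www.example.com/about.html"},
])
print(sitemap_xml)
```

The returned string does not include the XML declaration, so a real file would still need the `<?xml version="1.0" encoding="UTF-8"?>` line written in front of it.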
  • 36. Questions
      ▪ Is PageRank fair?
      ▪ What about Google's power and influence?
      ▪ What about Web 2.0 or Web 3.0 and web search?
          ▪ "non-existent" webpages such as offered by Rich Internet Applications (e.g. using AJAX) may bring problems for traditional search engines (hidden web)
          ▪ new forms of social search
              - Delicious
              - ...
          ▪ social marketing
  • 37. The Google Effect
      ▪ A recent study by Sparrow et al. shows that people are less likely to remember things that they believe to be accessible online
          ▪ Internet as a transactive memory
      ▪ Does our memory work differently in the age of Google?
      ▪ What implications will the future of the Internet and new search have?
  • 38. Search Engine Marketing (SEM)
      ▪ For many companies Internet marketing has become a big business
      ▪ Search engine marketing (SEM) aims to increase the visibility of a website
          ▪ search engine optimisation (SEO)
          ▪ paid search advertising (non-organic search)
          ▪ social media marketing
      ▪ SEO should not be decoupled from a website's content, structure, design and used technologies
      ▪ SEO has to be seen as a continuous process in a rapidly changing environment
          ▪ different search engines with regular changes in ranking
  • 39. Structural Choices
      ▪ Keep the website structure as flat as possible
          ▪ minimise link depth
          ▪ avoid pages with much more than 100 links
      ▪ Think about your website's internal link structure
          ▪ which pages are directly linked from the homepage?
          ▪ create many internal links for important pages
          ▪ be "careful" about where to put outgoing links
              - PageRank leakage
          ▪ use keyword-rich anchor texts
          ▪ dynamically create links between related content
              - e.g. "customers who bought this also bought ..." or "visitors who viewed this also viewed ..."
      ▪ Increase the number of pages
  • 40. Technological Choices
      ▪ Use SEO-friendly content management system (CMS)
      ▪ Dynamic URLs vs. static URLs
          ▪ avoid session IDs and parameters in URL
          ▪ use URL rewriting to get descriptive URLs containing keywords
      ▪ Think carefully about the use of dynamic content
          ▪ Rich Internet Applications (RIAs) based on AJAX etc.
          ▪ content hidden behind pull-down menus etc.
      ▪ Address webpages consistently
          ▪ http://www.vub.ac.be ≠ http://www.vub.ac.be/index.php
      ▪ Some notes about the Google toolbar
          ▪ shows logarithmic PageRank value (from 0 to 10)
          ▪ information not frequently updated (Google dance)
  • 41. Consistent Addressing of Webpages
  • 42. Search Engine Optimisations
      ▪ Different things can be optimised
          ▪ on-page factors
          ▪ off-page factors
      ▪ It is assumed that some search engines use more than 200 on-page and off-page factors for their ranking
      ▪ Difference between optimisation and breaking the "search engine rules"
          ▪ white hat and black hat optimisations
      ▪ A bad ranking or removal from index can cost a company a lot of money or even mark the end of the company
          ▪ e.g. supplemental index ("Google hell")
  • 43. Positive On-Page Factors
      ▪ Use of keywords at relevant places
          ▪ in title tag (preferably one of the first words)
          ▪ in URL
          ▪ in domain name
          ▪ in header tags (e.g. <h1>)
          ▪ multiple times in body text
      ▪ Provide metadata
          ▪ e.g. <meta name="description"> also used by search engines to create the text snippets on the SERPs
      ▪ Quality of HTML code
      ▪ Uniqueness of content across the website
      ▪ Page freshness (changes from time to time)
  • 44. Negative On-Page Factors
      ▪ Links to "bad neighbourhood"
      ▪ Link selling
          ▪ in 2007 Google announced a campaign against paid links that transfer PageRank
      ▪ Over optimisation penalty (keyword stuffing)
      ▪ Text with same colour as background (hidden content)
      ▪ Automatic redirect via the refresh meta tag
      ▪ Cloaking
          ▪ different pages for spider and user
      ▪ Malware being hosted on the page
  • 45. Negative On-Page Factors ...
      ▪ Duplicate or similar content
      ▪ Duplicate page titles or meta tags
      ▪ Slow page load time
      ▪ Any copyright violations
      ▪ ...
  • 46. Positive Off-Page Factors
      ▪ Links from pages with a high PageRank
      ▪ Keywords in anchor text of inbound links
      ▪ Links from topically relevant sites
      ▪ High clickthrough rate (CTR) from search engine for a given keyword
      ▪ Listed in DMOZ / Open Directory Project (ODP) and Yahoo directories
      ▪ High number of shares on social networks
          ▪ e.g. Facebook, Google+ or Twitter
  • 47. Positive Off-Page Factors ...
      ▪ Site age (stability)
          ▪ Google sandbox?
      ▪ Domain expiration date
      ▪ High PageRank
      ▪ ...
  • 48. Negative Off-Page Factors
      ▪ Site often not accessible to crawlers
          ▪ e.g. server problem
      ▪ High bounce rate
          ▪ users immediately press the back button
      ▪ Link buying
          ▪ rapidly increasing number of inbound links
      ▪ Use of link farms
      ▪ Participation in link sharing programmes
      ▪ Links from bad neighbourhood?
      ▪ Competitor attack (e.g. via duplicate content)?
  • 49. Black Hat Optimisations (Don'ts)
      ▪ Link farms
      ▪ Spamdexing in guestbooks, Wikipedia etc.
          ▪ "solution": <a rel="nofollow" href="...">...</a>
      ▪ Keyword stuffing
          ▪ overuse of keywords
              - content keyword stuffing
              - image keyword stuffing
              - keywords in meta tags
              - invisible text with keywords
      ▪ Selling/buying links
          ▪ "big" business until 2007
          ▪ costs based on the PageRank of the linking site
  • 50. Black Hat Optimisations (Don'ts) ...
      ▪ Doorway pages (cloaking)
          ▪ doorway pages are normally just designed for search engines
              - user is automatically redirected to the target page
          ▪ e.g. BMW Germany and Ricoh Germany banned in February 2006
  • 51. Nofollow Link Example
      ▪ Nofollow value for hyperlinks introduced by Google in 2005 to avoid spamdexing
          ▪ <a rel="nofollow" href="...">...</a>
      ▪ Links with a nofollow value were not counted in the PageRank computation
          ▪ division by the number of outgoing links without a nofollow value
          ▪ e.g. page with 9 outgoing links and 3 of them are nofollow links
              - PageRank divided by 6 and distributed across the 6 "really linked pages"
      ▪ SEO experts started to use (misuse) the nofollow links for PageRank sculpting
          ▪ control flow of PageRank within a website
  • 52. Nofollow Link Example ...
      ▪ In June 2009 Google decided to treat nofollow links differently to avoid PageRank sculpting
          ▪ division by total number of outgoing links
          ▪ e.g. page with 9 outgoing links and 3 of them are nofollow links
              - PageRank divided by 9 and distributed across the 6 "really linked pages"
          ▪ no longer a good solution to prevent spamdexing since we lose (diffuse) some PageRank
      ▪ SEO experts start to use alternative techniques to replace nofollow links
          ▪ e.g. obfuscated JavaScript links
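The arithmetic behind the two nofollow treatments can be made explicit with a small sketch. The function name, its parameters and the PageRank value of 1.0 are hypothetical. Only the rule itself (dividing by the followed links before June 2009 and by all outgoing links afterwards) is taken from the slides.

```python
def share_per_followed_link(page_rank, total_links, nofollow_links, since_2009=True):
    """PageRank passed along each followed link of a page."""
    followed = total_links - nofollow_links
    if followed <= 0:
        return 0.0  # nothing is passed on if no link is followed
    # Before June 2009: divide by the followed links only.
    # Since June 2009: divide by all outgoing links, nofollow ones included.
    divisor = total_links if since_2009 else followed
    return page_rank / divisor


page_rank = 1.0  # hypothetical PageRank of the linking page
old_share = share_per_followed_link(page_rank, 9, 3, since_2009=False)  # 1/6 per link
new_share = share_per_followed_link(page_rank, 9, 3, since_2009=True)   # 1/9 per link

# Under the newer rule the 6 followed links together pass on only 6/9 of the
# PageRank; the remaining 3/9 is lost instead of being redistributed.
lost = page_rank - 6 * new_share
print(old_share, new_share, lost)
```

This shows why PageRank sculpting stopped working: adding nofollow to a link no longer increases the share that the remaining links receive.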
  • 53. Product Search
      ▪ Various shopping and price comparison sites import product data
          ▪ some of them are free, for others one has to pay
      ▪ Google Product Search
          ▪ started as Froogle, became Google Products and now Google Product Search
          ▪ product data uploaded to Google Base
          ▪ very effective vertical search
  • 54. Non-Organic Search
      ▪ In addition to the so-called organic search, websites can also participate in non-organic web search
          ▪ cost per impression (CPI)
          ▪ cost-per-click (CPC)
      ▪ The non-organic web search should be treated independently from the organic web search
      ▪ Quality of the landing page can have an impact on the non-organic web search performance!
      ▪ The Google AdWords programme is an example of a commercial non-organic web search service
          ▪ other services include Yahoo! Advertising Solutions, Facebook Ads, ...
  • 55. Google AdWords
      ▪ Pay-per-click (PPC) or cost-per-thousand (CPM)
      ▪ Campaigns and ad groups
      ▪ Two types of advertising
          ▪ search
          ▪ content network
              - Google AdSense
      ▪ Highly customisable ads
          ▪ region
          ▪ language
          ▪ daytime
          ▪ ...
  • 56. Google AdWords ...
      ▪ Excellent control and monitoring for AdWords users
          ▪ cost per conversion
      ▪ In 2013 Google's total advertising revenues were 51 billion USD
  • 57. Conclusions
      ▪ Web information retrieval techniques have to deal with the specific characteristics of the Web
      ▪ PageRank algorithm
          ▪ absolute quality of a page based on incoming links
          ▪ based on random surfer model
          ▪ computed as eigenvector of Google matrix G
      ▪ PageRank is just one (important) factor
      ▪ Various implications for website development and SEO
  • 58. Exercise 10
      ▪ Web Search and PageRank
  • 59. References
      ▪ L. Page, S. Brin, R. Motwani and T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, January 1998
      ▪ S. Brin and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Computer Networks and ISDN Systems, 30(1-7), April 1998
      ▪ A.N. Langville and C.D. Meyer, Google's PageRank and Beyond: The Science of Search Engine Rankings, Princeton University Press, July 2006
      ▪ PageRank Calculator
          ▪ http://www.webworkshop.net/pagerank_calculator.php
  • 60. References ...
      ▪ B. Sparrow, J. Liu and D.M. Wegner, Google Effects on Memory: Cognitive Consequences of Having Information at Our Fingertips, Science, July 2011
      ▪ Google Webmaster Tools
          ▪ http://www.google.com/webmasters/
      ▪ The W3C Markup Validation Service
          ▪ http://validator.w3.org
      ▪ SEOmoz
          ▪ http://moz.com
  • 61. Next Lecture
      Security, Privacy and Trust