1. How do I develop and found a search engine?
Files and file search on the Internet
from the perspective of FindFiles.net
Claudius Gros
Institute for Theoretical Physics
Goethe University Frankfurt, Germany
http://www.findfiles.net
2. overview
data on the Internet
- Mime types and statistics
- the file search engine FindFiles.net
science with data files
- neuropsychological constraints to human data production
4. Internet hosts
Hosts - Domains - Sites
[source: Netcraft.com]
• 2011: ∼ 100 Mio active domains
5. Internet users - email
users in 2010
• 2 · 10^9 worldwide
• 825 · 10^6 Asia
• 475 · 10^6 Europe
• 266 · 10^6 North America
emails in 2010
• 107 · 10^12 - number of emails sent
• 89% - share of spam emails (success rate: 1:12 Mio ?)
• 1.9 · 10^9 - number of email users
[source: Pingdom.com]
6. Internet - social media
2010
• 152 · 10^6 - number of blogs
• 25 · 10^9 - number of tweets on Twitter
• 600 · 10^6 - Facebook accounts
30 · 10^9 - pieces of content: links, notes, images, ...
20 · 10^6 - number of activated apps (per day)
[source: Pingdom.com]
7. social media - images and videos
streaming videos - tubes
• 2 · 10^9 - videos watched per day on YouTube (one per Internet user)
• 35 - hours of video uploaded (every minute)
images, pictures
• 5 · 10^9 - photos hosted by Flickr
• 3000 - photos uploaded (per minute)
[source: Pingdom.com]
8. social media - blogs & bookmarking
blogs are everywhere
• 2010: 50 - 100 · 10^6 blogs
social is everything
• Digg, Mister Wong, Delicious, ...
• social shopping, ...
http://www.delicious.com
9. Internet - rules of thumb
2010
• 1 movie per day per user (YouTube)
• 1 search query per day for every 2 users (Google)
• every Internet user uses email
• 5 "true" emails per day per Internet user
• 25% of Internet users use novel social media
• most domains are blogs
11. slow beginnings for startups on the Internet
Twitter
• linear scale - exponential or linear growth?
Apr 2011: 155 Mio daily tweets
12. growth is generically not exponential
Google
[figure: Google yearly revenues; y-axis: revenues per year (billions, US-$), 0 to 30; x-axis: year, 2000 to 2010]
• linear scale
13. Internet: the winner takes all
flow of attention in complex networks
[figure: network of linked hosts: www.small.net, www.small.org, www.big.com, www.medium.com, www.small.com, www.small.de]
• in-degree distribution p_k
heavy tails
• preferential attachment
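A minimal sketch of preferential attachment, under a simple growth rule that is an assumption here and not taken from the slides: every new host sets one link to an existing host, chosen with probability proportional to that host's in-degree plus one. The function name `grow_network` and its parameters are hypothetical.

```python
import random

def grow_network(n_hosts, seed=0):
    """Grow a network by preferential attachment; return the in-degree of every host."""
    rng = random.Random(seed)
    in_degree = [0]   # start with a single host without incoming links
    # 'targets' lists host i exactly (in_degree[i] + 1) times, so a uniform
    # draw picks a host with probability proportional to in-degree + 1.
    targets = [0]
    for new_host in range(1, n_hosts):
        chosen = rng.choice(targets)
        in_degree[chosen] += 1
        targets.append(chosen)     # the chosen host gained one incoming link
        in_degree.append(0)
        targets.append(new_host)   # the new host enters with weight one
    return in_degree

degrees = grow_network(20000)
share_without_links = degrees.count(0) / len(degrees)
print("largest in-degree:", max(degrees))
print("share of hosts without incoming links:", round(share_without_links, 2))
```

Under this rule a few early hosts collect a large number of links while most hosts receive none, which is the heavy tail named on the slide. The exponent produced by this toy rule need not match the one measured for the Internet.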
14. in-degree distribution
power law - scale invariant
[figure: log-log plot of the number of hosts vs. the number of incoming links (10^0 to 10^6); linear fit, slope -2.2: 7.51 - 2.2*x]
[source: FindFiles.net]
• scaling constant for 20 years - starting at one!
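A linear fit in a log-log plot can be sketched as an ordinary least-squares line through the points (log10 k, log10 count). The data below are synthetic and generated from the fitted line quoted on the slide, not from the crawl itself; `fit_loglog` is a hypothetical helper name.

```python
import math

def fit_loglog(k_values, counts):
    """Least-squares line through (log10 k, log10 count); returns (intercept, slope)."""
    xs = [math.log10(k) for k in k_values]
    ys = [math.log10(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Synthetic counts lying exactly on the line 7.51 - 2.2*x (not real crawl data).
k_values = [1, 10, 100, 1000, 10000]
counts = [10 ** (7.51 - 2.2 * math.log10(k)) for k in k_values]
intercept, slope = fit_loglog(k_values, counts)
print(round(intercept, 2), round(slope, 2))
```

The slope of the line is the exponent of the power law with a negative sign, and the intercept is the logarithm of the number of hosts at in-degree one.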
15. limiting diverging in-degree distribution
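$$p_k \propto \frac{1}{k^{\alpha}}, \qquad \langle k \rangle \propto \int_{K_c}^{\infty} \frac{k}{k^{\alpha}}\, dk \sim \left. \frac{k^{2-\alpha}}{2-\alpha} \right|_{K_c}^{\infty}$$
• diverging mean in-degree: $\lim_{\alpha \to 2} \langle k \rangle \to \infty$
Internet: α ≈ 1.9 - 2.2
limiting dominating tail
» limiting winners take all «
• makes life difficult for small startups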
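The role of the exponent can be checked numerically. The sketch below replaces the integral by a discrete sum over k from `k_min` to `k_max`; both cutoffs and the function name are assumptions made for illustration.

```python
def mean_in_degree(alpha, k_max, k_min=1):
    """Mean of k for p_k proportional to k**(-alpha) on k_min <= k <= k_max."""
    norm = 0.0
    first_moment = 0.0
    for k in range(k_min, k_max + 1):
        weight = k ** (-alpha)
        norm += weight
        first_moment += k * weight
    return first_moment / norm

# Mean in-degree for growing upper cutoffs and three exponents.
for alpha in (1.9, 2.2, 3.0):
    means = [mean_in_degree(alpha, k_max) for k_max in (10**1, 10**3, 10**5)]
    print(alpha, [round(m, 2) for m in means])
```

For α ≤ 2 the mean keeps growing with the upper cutoff. For α > 2 it converges, the more slowly the closer α is to 2.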
16. the big two uphill fights
a new Internet startup needs to ...
• fight for attention
• fight for novelty
traffic and quality
heavy-tailed in-degree distribution makes it difficult to attract traffic
extremely high service standards act as effective entry barriers
18. public data on the Internet
280 Million domains in 2011
10-30 data files per domain
Internet Media type - Mime type
• categorization of all file types
email attachments
browser add-ons
∼ 600 Mime types in use
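As an illustration of how a file is assigned to a Mime category, the sketch below uses the `mimetypes` module of the Python standard library, which guesses the type from the file extension. This is not the method of the FindFiles.net crawler, which the slides do not describe; the file names are hypothetical.

```python
import mimetypes
from collections import Counter

def mime_category(filename):
    """Return the top-level Mime category of a file name, or None if unknown."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None:
        return None
    return mime_type.split("/")[0]

# Hypothetical file names, for illustration only.
files = ["photo.jpg", "logo.gif", "song.mp3", "clip.mp4", "notes.txt", "paper.pdf"]
print(Counter(mime_category(name) for name in files))
```

A Mime type such as image/jpeg consists of a category (image) and a subtype (jpeg); statistics per category only need the part before the slash.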
24. the Wikipedia/DMOZ corpus
all outgoing links of
• Wikipedia (all languages)
• DMOZ - open directory project (all languages)
7.7 Mio hosts (domains)
252 Mio data files (FindFiles.net crawler)
analysis of file size distribution
• tails & scaling behaviour
25. number of files per domain
files per host vs. in-degree
• most files hosted on small domains
26. file size distribution
number of files of given size
[figure: log(number of files) vs. file size [Bytes], 10 B to 10 G; curves: all Mime categories and the Mime categories application/, audio/, image/, text/, video/]
• 252 Mio files in total - 9 orders of magnitude
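A size distribution spanning nine orders of magnitude is usually counted in bins of equal width on a logarithmic scale. The sketch below shows one way to do this; the binning of the original analysis is not given in the slides, and the sizes and the helper name `log_histogram` are made up for illustration.

```python
import math
from collections import Counter

def log_histogram(sizes, bins_per_decade=1):
    """Count file sizes (in bytes) in bins of equal width on a log10 scale."""
    counts = Counter()
    for size in sizes:
        if size > 0:  # log10 is undefined for empty files
            counts[math.floor(math.log10(size) * bins_per_decade)] += 1
    return dict(sorted(counts.items()))

# Hypothetical sizes in bytes, for illustration only.
sizes = [12, 950, 1_024, 48_000, 52_000, 3_500_000, 7_000_000_000]

# With one bin per decade, bin index n covers 10^n <= size < 10^(n+1).
for bin_index, count in log_histogram(sizes).items():
    print(f"10^{bin_index} B <= size < 10^{bin_index + 1} B: {count}")
```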
27. power-law scaling of image-size distribution
• compression - gif: lossless; jpeg: lossy
[figure: log(number of files) vs. file size [Bytes], 1 K to 1 G; curves: all Mime categories, Mime type image/jpeg (linear fits, slopes -2 and -4), Mime type image/gif (linear fit, slope -2.45)]
• kink at 4 Mbytes: amateur - professional
28. lognormal multimedia size distribution
all audio and video Mime types
[figure: log(number of files) vs. file size [Bytes], 1 K to 10 G; curves: all Mime categories, Mime category video/ and Mime category audio/, each with a quadratic fit (lognormal distribution)]
• quadratic fit - lognormal distribution
29. lognormal distribution vs. power-law scaling
file-size distribution p(s)
lognormal: $p(s) \propto e^{-[\log(s)-\mu]^2/\sigma^2}$, power law: $p(s) \propto s^{-|\alpha|}$
not a Taylor-series correction
$\log(p(s)) \propto \alpha \log(s) - \beta \log^2(s)$
images: α < 0, β = 0
audio/video: α > 0, β > 0
[figures: image size distribution with linear fits (slide 27) and audio/video size distribution with quadratic fits (slide 28), both log(number of files) vs. file size [Bytes]]
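The quadratic form follows from taking the logarithm of the lognormal expression and expanding the square. A short worked version, using only the symbols of this slide:

```latex
\begin{align}
\log p(s) &= -\frac{[\log(s)-\mu]^2}{\sigma^2} + \text{const} \\
          &= \frac{2\mu}{\sigma^2}\,\log(s) \;-\; \frac{1}{\sigma^2}\,\log^2(s) \;-\; \frac{\mu^2}{\sigma^2} + \text{const}
\end{align}
```

Comparing with $\alpha \log(s) - \beta \log^2(s)$ gives $\alpha = 2\mu/\sigma^2$ and $\beta = 1/\sigma^2$, so β > 0 for any lognormal and α > 0 whenever µ > 0. A pure power law is the case β = 0, a straight line in the log-log plot.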
30. one vs. two-dimensional cost functions
economic cost functions for data production
• size
storage costs
production costs
psychophysical cost functions for data production
• size (images)
time needed to take an image is independent of resolution
• size and time (audio & video)
time and resolution are psychophysically distinct variables
31. Weber-Fechner law
• neuropsychological cost functions are logarithmic in
sensory stimulus intensity
number of objects
time perception
music: tone pitch ∝ log(frequency) (octave)
photometry: brightness ∝ log(intensity) (lumen)
acoustics: sound level ∝ log(intensity) [decibel]
information production: number of objects / time
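Two of the logarithmic scales named above can be written out directly. The sketch uses the standard definitions of the decibel and the octave; the reference values (1e-12 W/m^2 and 440 Hz) are conventional choices and not taken from the slides.

```python
import math

def sound_level_db(intensity, reference=1e-12):
    """Sound level in decibel relative to a reference intensity in W/m^2."""
    return 10 * math.log10(intensity / reference)

def octaves(frequency, reference=440.0):
    """Pitch distance in octaves relative to a reference frequency in Hz."""
    return math.log2(frequency / reference)

# Each tenfold increase of the intensity adds the same 10 dB.
for intensity in (1e-6, 1e-5, 1e-4):
    print(intensity, round(sound_level_db(intensity), 1))

# Each doubling of the frequency adds one octave.
for frequency in (440.0, 880.0, 1760.0):
    print(frequency, octaves(frequency))
```

In both cases equal ratios of the physical quantity map to equal steps of the perceived quantity, which is the content of the Weber-Fechner law.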
32. information entropy
Shannon information entropy
$-\int p(s)\,\log(p(s))\,ds, \qquad \int p(s)\,ds = 1$
for a distribution function p(s)
• a measure for the information content
Shannon coding theorem
The minimal number of bytes needed to encode a transmission is
given by the information entropy of the signal statistics
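For a discrete set of symbols the integral becomes a sum. The sketch below computes the entropy of the symbol frequencies of a short message in bits per symbol, assuming that symbols are encoded independently of each other; the example message and the function name are arbitrary.

```python
import math
from collections import Counter

def entropy_bits(data):
    """Shannon entropy of the symbol frequencies in 'data', in bits per symbol."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

message = b"abracadabra"
h = entropy_bits(message)
print(f"{h:.3f} bits per symbol, at least {h * len(message) / 8:.2f} bytes in total")
```

Four equally frequent symbols give exactly 2 bits per symbol, while a message made of a single repeated symbol gives 0.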
34. physical vs. neuropsychological cost functions
images
physical: exponential [not seen]
1-dim neuro: power law [linear]
[figure: image size distribution, log(number of files) vs. file size [Bytes], 1 K to 1 G; curves: all Mime categories, Mime type image/jpeg (linear fits, slopes -2 and -4), Mime type image/gif (linear fit, slope -2.45)]
audio/video
physical: exponential [not seen]
2-dim neuro: lognormal [quadratic]
[figure: audio/video size distribution, log(number of files) vs. file size [Bytes], 1 K to 10 G; curves: all Mime categories, Mime categories video/ and audio/, each with a quadratic fit (lognormal distribution)]
35. global human data production
basic assumptions
• information production as underlying driving force
information entropy as a suitable measure
• law of large numbers
average over production processes / producing agents
compression/technology correspond to rescaling
data production on a global level is characterized by
neuropsychological cost functions and not by economic constraints
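How a cost function fixes the shape of the distribution can be sketched with a standard maximum-entropy argument. The assumption here is that the entropy is maximized at fixed mean cost; the cost function c(s) and the Lagrange multipliers λ and ν are notation introduced for this sketch and do not appear in the slides.

```latex
\begin{align}
0 &= \frac{\delta}{\delta p(s)} \left[ -\int p \log p \, ds - \lambda \int c(s)\, p \, ds - \nu \int p \, ds \right]
   = -\log p(s) - 1 - \lambda\, c(s) - \nu \\
\Rightarrow \quad p(s) &\propto e^{-\lambda c(s)} \\
c(s) = s: \quad p(s) &\propto e^{-\lambda s} \\
c(s) = \log(s): \quad p(s) &\propto e^{-\lambda \log(s)} = s^{-\lambda}
\end{align}
```

A cost that is linear in the file size gives an exponential distribution, while a cost that is logarithmic in the size, as suggested by the Weber-Fechner law, gives a power law. The case of two variables is not worked out here.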
36. the Internet & complex system theory
complex system theory - still an emerging field
many models and paradigms yet to be formulated
network theory / game theory / allocation problems
macroecology / systems biology / cognitive systems theory
...
• information entropy maximization
human data production on a global level
neuropsychological cost functions
...
37. graduate level textbook
• Information theory and complexity
• Phase transitions and self-organized criticality
• Life at the edge of chaos and punctuated equilibrium
• Cognitive system theory and diffusive emotional control
second edition 2010