How do I develop and found a search engine?
Files and file search on the Internet
from the perspective of FindFiles.net


Claudius Gros

Institute for Theoretical Physics
Goethe University Frankfurt, Germany

http://www.findfiles.net

                                              1
overview


                          data on the Internet

– Mime types and statistics

– the file search engine FindFiles.net



                        science with data files

– neuropsychological constraints on human data production




                                                            2
Internet – Statistics




                        3
Internet hosts

Hosts – Domains – Sites




                                    [source: Netcraft.com]

•   2011: ∼ 100 million active domains

                                                             4
Internet users – email

users in 2010
•   2 · 10^9 worldwide

•   825 · 10^6 Asia

•   475 · 10^6 Europe

•   266 · 10^6 North America

emails in 2010
•   107 · 10^12 – number of emails sent

•   89% – share of spam emails (success rate: 1 in 12 million?)

•   1.9 · 10^9 – number of email users
                                                       [source: Pingdom.com]
                                                                               5
Internet – social media

2010

•   152 · 10^6 – number of blogs

•   25 · 10^9 – number of tweets on Twitter


•   600 · 10^6 – Facebook accounts

       30 · 10^9 – pieces of content: links, notes, images, ...

       20 · 10^6 – number of activated apps (per day)
                                                             [source: Pingdom.com]




                                                                                     6
social media – images and videos

streaming videos – tubes
•   2 · 10^9 – videos watched per day on YouTube (one per Internet user)

•   35 – hours of video uploaded (every minute)


images, pictures
•   5 · 10^9 – photos hosted by Flickr

•   3000 – photos uploaded (per minute)

                                                         [source: Pingdom.com]




                                                                                 7
social media – blogs & bookmarking

blogs are everywhere

• 2010: 50–100 · 10^6 blogs

social is everything

•   Digg, Mister Wong, Delicious, ...

•   social shopping, ...


 http://www.delicious.com



                                        8
Internet – rules of thumb

2010

• 1 movie per day per user (YouTube)
• 1 search query per day for every 2 users (Google)

• every Internet user uses email
• 5 'true' emails per day per Internet user

• 25% of Internet users use novel social media
• most domains are blogs
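
Several of these rules of thumb are simple ratios of the 2010 figures quoted on the preceding slides. A minimal sketch of that arithmetic, using only numbers from those slides (the variable names are illustrative, not from the talk):

```python
# 2010 figures quoted on the preceding slides (source: Pingdom.com)
internet_users = 2e9          # Internet users worldwide
videos_per_day = 2e9          # videos watched per day on YouTube
facebook_accounts = 600e6     # Facebook accounts
emails_per_year = 107e12      # emails sent in 2010
email_users = 1.9e9           # email users
spam_share = 0.89             # share of spam emails

# one video per day per Internet user
videos_per_user_per_day = videos_per_day / internet_users

# share of Internet users who use email
email_user_share = email_users / internet_users

# emails per email user per day, with and without spam
emails_per_user_per_day = emails_per_year / 365 / email_users
non_spam_per_user_per_day = emails_per_user_per_day * (1 - spam_share)

# Facebook accounts per Internet user
facebook_share = facebook_accounts / internet_users

print(f"videos per user per day:   {videos_per_user_per_day:.1f}")
print(f"email users / all users:   {email_user_share:.2f}")
print(f"emails per user per day:   {emails_per_user_per_day:.0f}")
print(f"non-spam per user per day: {non_spam_per_user_per_day:.0f}")
print(f"Facebook accounts / users: {facebook_share:.2f}")
```

The non-spam figure obtained this way is larger than the 5 'true' emails per day of the slide, which presumably applies a stricter notion of a 'true' email; the slides do not give that breakdown.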



                                                      9
Internet Startups




                    10
slow beginnings for startups on the Internet

Twitter




• linear scale – exponential or linear growth?
    Apr 2011 – 155 million daily tweets
                                                 11
growth is generically not exponential

Google
[Figure: Google yearly revenues, 2000 to 2010; vertical axis: revenues per year (billions, US-$), 0 to 30; horizontal axis: year]

• linear scale

                                                                                                              12
Internet: the winner takes all

flow of attention in complex networks

[Figure: network of linked sites: www.big.com, www.medium.com, www.small.net, www.small.org, www.small.com, www.small.de]

• in-degree distribution p_k
     heavy tails

• preferential attachment
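
Preferential attachment can be illustrated with a small simulation. The sketch below is not the model behind the slides; it assumes the simplest textbook variant, in which every new site adds one link and picks its target with probability proportional to (in-degree + 1). The function and variable names are hypothetical.

```python
import random
from collections import Counter


def grow_network(n_nodes, seed=0):
    """Return the in-degree of every site after n_nodes sites have been added."""
    rng = random.Random(seed)
    in_degree = [0]   # site 0 starts without incoming links
    urn = [0]         # site i occurs (in_degree[i] + 1) times in the urn
    for new_site in range(1, n_nodes):
        target = rng.choice(urn)      # probability proportional to in-degree + 1
        in_degree[target] += 1
        urn.append(target)            # the target gained one incoming link
        in_degree.append(0)
        urn.append(new_site)          # the new site enters with weight 1
    return in_degree


degrees = grow_network(10_000)
histogram = Counter(degrees)          # in-degree k -> number of sites with that k
print("largest in-degree:", max(degrees))
print("sites without incoming links:", histogram[0])
```

In such a run most sites end up with few or no incoming links while a handful of early sites collect a large share of them, which is the heavy tail named on the slide.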
                                                                        13
in-degree distribution

power law – scale invariant

[Figure: log(number of hosts) vs. number of incoming links (10^0 to 10^6, logarithmic axis); data and linear fit, slope -2.2: 7.51-2.2*x]
                                                                                           [source: FindFiles.net]

• scaling constant for 20 years – starting at one!
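
Reading a power-law exponent off such a plot amounts to a straight-line fit in log-log coordinates. A minimal sketch, using synthetic points generated from the fit line quoted on the slide (7.51-2.2*x) rather than the FindFiles.net data, which is not reproduced here:

```python
import math


def linear_fit(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# synthetic host counts following N(k) = 10^7.51 * k^(-2.2)
in_degrees = [10 ** e for e in range(0, 7)]
host_counts = [10 ** 7.51 * k ** -2.2 for k in in_degrees]

# fit a straight line to log10(N) as a function of log10(k)
log_k = [math.log10(k) for k in in_degrees]
log_n = [math.log10(n) for n in host_counts]
intercept, slope = linear_fit(log_k, log_n)
print(f"intercept {intercept:.2f}, slope {slope:.2f}")
```

The fitted slope is the negative of the power-law exponent; with real, noisy counts the sparsely populated tail bins would need more careful treatment than this sketch gives them.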
                                                                                                                    14
limiting diverging in-degree distribution

     $p_k \propto \frac{1}{k^{\alpha}}, \qquad \langle k \rangle \propto \int_{K_c}^{\infty} \frac{k}{k^{\alpha}}\, dk \sim \left. \frac{k^{2-\alpha}}{2-\alpha} \right|_{K_c}^{\infty}$

• diverging mean in-degree: $\lim_{\alpha \to 2} \langle k \rangle \to \infty$



     Internet: α ≈ 1.9–2.2

     limiting dominating tail

                        » limiting winners take all «

• makes life difficult for small startups
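
The divergence of the mean in-degree can be made concrete with the closed form of the integral. The sketch below assumes a continuous, normalized power law with lower cutoff K_c (called `k_min` here, a hypothetical parameter name), for which the mean is K_c (α-1)/(α-2) when α > 2 and infinite otherwise.

```python
def mean_in_degree(alpha, k_min=1.0):
    """Mean of p(k) = (alpha-1) * k_min**(alpha-1) * k**(-alpha) for k >= k_min."""
    if alpha <= 2:
        return float("inf")   # the integral of k * p(k) diverges
    return k_min * (alpha - 1) / (alpha - 2)


# the mean grows without bound as alpha approaches 2 from above
for alpha in (3.0, 2.5, 2.2, 2.05, 2.0):
    print(alpha, mean_in_degree(alpha))
```

For exponents in the range quoted for the Internet, the mean is therefore either very large or formally infinite, so in practice it is set by the few largest sites.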

                                                                15
the two big uphill fights

a new Internet startup needs to ...

• fight for attention
• fight for novelty

traffic and quality

a heavy-tailed in-degree distribution makes it difficult to attract traffic

extremely high service standards act as effective entry barriers




                                                                         16
FindFiles.net – a new file search engine




                                          17
public data on the Internet

280 million domains in 2011

     10-30 data files per domain




Internet Media type – Mime type

• categorization of all file types
     email attachments
     browser add-ons
     ∼ 600 Mime types in use


                                     18
Mime types

major Mime categories
• together: 99%

    33.2%       application/
     2.9%       audio/
    58.0%       image/
     5.1%       text/
     0.7%       video/

Mime types – examples
 application/pdf                                  audio/mpeg
 application/msword                               audio/midi
 application/vnd.android.package-archive          chemical/x-pdb
 application/vnd.ms-powerpoint                    image/jpeg
 application/jar                                  image/vnd.djvu
 application/x-deb                                text/xml
 application/x-gzip                               model/vrml
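
A Mime type splits into a category and a subtype at the slash. As a minimal sketch of how files could be grouped into the major categories, the example below uses Python's standard `mimetypes` module, which guesses the type from the file extension. The file names are hypothetical, and this is not a description of how FindFiles.net classifies files.

```python
import mimetypes
from collections import Counter

# hypothetical file names, chosen for illustration only
file_names = ["thesis.pdf", "song.mp3", "photo.jpg", "logo.png", "notes.txt"]

categories = Counter()
for name in file_names:
    mime_type, _encoding = mimetypes.guess_type(name)
    if mime_type is None:
        continue                       # extension not known to the local table
    category = mime_type.split("/")[0] + "/"
    categories[category] += 1
    print(f"{name:12s} {mime_type}")

print(dict(categories))
```

A crawler would more likely rely on the Content-Type header sent by the server or on the file contents; guessing from the extension is only the simplest stand-in.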
                                                                   19
FindFiles.net

search engine for data files
                                        G. Kaczor & C. Gros 2011




• supports all Mime types
                              http://www.findfiles.net
                                                               20
FindFiles.net – some stats

 daily queries




                                                  [source: FindFiles.net]

• 400 million data files                    20 million hosts crawled
    10 million mp3 files
    10 000 apps for Symbian/Android smartphones
    ...
                                                                            21
blogs, legal issues & financing

blog & press coverage
                   http://www.findfiles.net/publicrelations


copyright & illegal files

• files protected by copyright/licence are not indexed (nofollow)
• links to pirated files are removed from the index
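
The slide does not say how FindFiles.net implements this. One plausible reading of "(nofollow)" is that links marked `rel="nofollow"` are skipped by the crawler; the sketch below illustrates only that assumption, with a hypothetical class name and page fragment.

```python
from html.parser import HTMLParser


class FollowableLinks(HTMLParser):
    """Collect href targets of <a> tags that are not marked rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens, or be missing
        rel_values = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if href and "nofollow" not in rel_values:
            self.links.append(href)


# hypothetical page fragment
page = (
    '<a href="/data/open.csv">open data</a>'
    '<a rel="nofollow" href="/data/licensed.zip">licensed archive</a>'
)
parser = FollowableLinks()
parser.feed(page)
print(parser.links)
```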

financing

• network – Unibator
• banks are cautious – most startups fail

                                                                   22
Science with Data Files




                          23
the Wikipedia/DMOZ corpus

all outgoing links of

• Wikipedia (all languages)
• DMOZ – Open Directory Project (all languages)

    7.7 million hosts (domains)

    252 million data files (FindFiles.net crawler)


analysis of file size distribution

• tails & scaling behaviour


                                                  24
number of files per domain

files per host vs. in-degree




• most files are hosted on small domains

                                      25
file size distribution

number of files of given size

[Figure: log(number of files) vs. file size [Bytes], 10 B to 10 G; curves for all Mime categories and for the categories application/, audio/, image/, text/, video/]

• 252 million files in total – 9 orders of magnitude
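
A distribution spanning nine orders of magnitude is usually counted in logarithmic bins. How the slides bin the data is not stated; the sketch below assumes one bin per decade of file size and uses made-up sizes for illustration.

```python
import math
from collections import Counter


def log_binned_histogram(sizes_in_bytes):
    """Count files per decade of file size: bin d holds sizes in [10^d, 10^(d+1))."""
    bins = Counter()
    for size in sizes_in_bytes:
        if size > 0:
            bins[int(math.floor(math.log10(size)))] += 1
    return bins


# hypothetical file sizes in bytes, for illustration only
sizes = [12, 950, 1_024, 20_480, 35_000, 4_200_000, 7_000_000, 1_500_000_000]
histogram = log_binned_histogram(sizes)
for decade in sorted(histogram):
    count = histogram[decade]
    print(f"10^{decade} B: {count} files, log10(count) = {math.log10(count):.2f}")
```

Plotting log10(count) against the decade gives the kind of log-log representation used on the slides, in which a power law appears as a straight line.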
                                                                                       26
power-law scaling of image-size distribution

• compression                           gif: lossless; jpeg: lossy

[Figure: log(number of files) vs. file size [Bytes], 1 K to 1 G; all Mime categories; Mime type image/jpeg with linear fits of slope -2 and slope -4; Mime type image/gif with linear fit of slope -2.45]

• kink at 4 Mbytes: amateur – professional
                                                                                               27
lognormal multimedia size distribution

all audio and video Mime types
[Figure: log(number of files) vs. file size [Bytes], 1 K to 10 G; all Mime categories; Mime category video/ and Mime category audio/, each with a quadratic fit (lognormal distribution)]

• quadratic fit – lognormal distribution
                                                                                                     28
lognormal distribution vs. power-law scaling

file-size distribution p(s)

     $e^{-[\log(s)-\mu]^2/\sigma^2}$          vs.          $s^{-\alpha}$

not a Taylor-series correction

     $\log(p(s)) \propto \alpha \log(s) - \beta \log^2(s)$


     images:      α < 0,   β = 0
     audio/video: α > 0,   β > 0

[Figures repeated from the two preceding slides: image size distributions with linear fits (left); audio/video size distributions with quadratic fits (right)]
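
The quadratic form follows directly from expanding the exponent of the lognormal. The short derivation below uses the slide's form of the lognormal, without the conventional factor 1/s and the factor 2 in the exponent; including them would change the values of α and β but not the quadratic form.

```latex
\begin{align}
\log p(s) &= -\frac{[\log(s)-\mu]^2}{\sigma^2} + \mathrm{const} \\
          &= \frac{2\mu}{\sigma^2}\,\log(s) \;-\; \frac{1}{\sigma^2}\,\log^2(s)
             \;-\; \frac{\mu^2}{\sigma^2} + \mathrm{const}
\end{align}
% comparison with  alpha * log(s) - beta * log^2(s):
%   alpha = 2 mu / sigma^2  (> 0 for mu > 0),   beta = 1 / sigma^2  (> 0)
```

A pure power law corresponds to β = 0 exactly, which is why the lognormal is not a small correction to it.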
                                                                                                                                                                                                   29
one vs. two-dimensional cost functions

economical cost functions for data production

• size
    storage costs
    production costs

psychophysical cost functions for data production

• size (images)
    time needed to take an image is independent of resolution

• size and time (audio & video)
    time and resolution are psychophysically distinct variables


                                                                30
Weber-Fechner law

• neuropsychological cost functions are logarithmic in
    sensory stimulus intensity
    number of objects
    time perception


 music:          tone pitch      ∝   log(frequency)   (octave)
 photometry:     brightness      ∝   log(intensity)   (lumen)
 acoustics:      sound level     ∝   log(intensity)   [decibel]


    information production: number of objects / time
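
Two of the logarithmic scales listed above can be written down directly. A minimal sketch with hypothetical helper names, using the standard definitions of the octave (log2 of a frequency ratio) and the decibel (10 · log10 of an intensity ratio):

```python
import math


def octaves(frequency_hz, reference_hz):
    """Pitch interval in octaves: log2 of the frequency ratio."""
    return math.log2(frequency_hz / reference_hz)


def decibels(intensity, reference_intensity):
    """Sound level difference in decibel: 10 * log10 of the intensity ratio."""
    return 10 * math.log10(intensity / reference_intensity)


# doubling the frequency adds one octave
print(octaves(880.0, 440.0), octaves(1760.0, 440.0))

# a factor 100 in intensity adds 20 dB
print(decibels(100.0, 1.0), decibels(10_000.0, 1.0))
```

Equal ratios of the stimulus map to equal steps of the perceived quantity, which is the sense in which these cost functions are logarithmic.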



                                                                  31
information entropy

Shannon information entropy

     $-\int p(s) \log(p(s))\, ds, \qquad \int p(s)\, ds = 1$

for a distribution function p(s)

• a measure of the information content

Shannon coding theorem

     The minimal number of bytes needed to encode a transmission is
     given by the information entropy of the signal statistics
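
For a discrete set of symbols the integral becomes a sum. The sketch below is this discrete analogue, measured in bits (logarithm to base 2); the two example distributions are made up for illustration.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy -sum p*log2(p) of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]   # four equally likely symbols
skewed = [0.7, 0.1, 0.1, 0.1]        # one symbol dominates

print(entropy_bits(uniform))
print(entropy_bits(skewed))
```

The uniform case needs 2 bits per symbol; the skewed distribution has lower entropy and can therefore be encoded with fewer bits per symbol on average.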


                                                                  32
neuropsychological cost functions
conditional entropy maximization

     $\delta \left[ -\int p(s) \log(p(s))\, ds - \lambda \int p(s)\, c(s)\, ds \right] = 0$


 Shannon information entropy:       $-\int p(s) \log(p(s))\, ds$
                cost function:            c(s)
         file size distribution:           p(s)

maximal-entropy file size distributions

     $p(s) \propto e^{-\lambda c(s)} \sim \left\{ \begin{array}{lll} \text{exponential} & c(s) \propto s & \text{physical} \\ \text{power law} & c(s) \propto \log(s) & \text{1-dim neuro} \\ \text{lognormal} & c(s) \propto \log^2(s) & \text{2-dim neuro} \end{array} \right.$
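
The step from the variational condition to the exponential form can be spelled out as follows. The second multiplier ν, which enforces the normalization ∫ p(s) ds = 1, is notation introduced here and does not appear on the slide; proportionality constants of c(s) are absorbed into λ.

```latex
\begin{align}
0 &= \frac{\delta}{\delta p(s)} \left[ -\int p \log p \, ds
       - \lambda \int p \, c \, ds - \nu \left( \int p \, ds - 1 \right) \right]
   = -\log p(s) - 1 - \lambda\, c(s) - \nu \\
\Longrightarrow \quad p(s) &= e^{-1-\nu}\, e^{-\lambda c(s)} \;\propto\; e^{-\lambda c(s)}
\end{align}
% special cases:
%   c(s) = s         ->  p(s) ~ exp(-lambda s)          (exponential)
%   c(s) = log(s)    ->  p(s) ~ s^(-lambda)             (power law)
%   c(s) = log^2(s)  ->  p(s) ~ exp(-lambda log^2(s))   (lognormal form)
```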


                                                                  33
physical vs. neuropsychological cost functions
images

 physical       exponential   [not seen]
 1-dim neuro    power law     [linear]

[Figure repeated: image size distributions (Mime types image/jpeg and image/gif) with linear fits; log(number of files) vs. file size [Bytes]]

audio/video

 physical       exponential   [not seen]
 2-dim neuro    lognormal     [quadratic]

[Figure repeated: size distributions of the Mime categories video/ and audio/ with quadratic (lognormal) fits; log(number of files) vs. file size [Bytes]]
                                                                                                                                       34
global human data production

basic assumptions

• information production as underlying driving force
    information entropy as a suitable measure

• law of large numbers
    average over production processes / producing agents
    compression/technology correspond to rescaling


          data production on a global level is characterized by
          neuropsychological cost functions and not by economic
          constraints


                                                               35
the Internet & complex system theory

complex system theory – still an emerging field

    many models and paradigms yet to be formulated

    network theory / game theory / allocation problems
    macroecology / systems biology / cognitive systems theory
    ...

• information entropy maximization
    human data production on a global level
    neuropsychological cost functions
    ...



                                                                36
graduate level textbook


                           • Information theory and complexity

                           • Phase transitions and
                              self-organized criticality

                           • Life at the edge of chaos and
                              punctuated equilibrium

                           • Cognitive system theory
                              and diffusive emotional control
                              and diffusive emotional control



                          second edition 2010

                                                                37

@9999965857 šŸ«¦ Sexy Desi Call Girls Karol Bagh šŸ’“ High Profile Escorts Delhi šŸ«¶
Ā 
(šŸ‘‰ļ¾Ÿ9999965857 ļ¾Ÿ)šŸ‘‰ VIP Call Girls Friends Colony šŸ‘‰ Delhi šŸ‘ˆ : 9999 Cash Payment...
(šŸ‘‰ļ¾Ÿ9999965857 ļ¾Ÿ)šŸ‘‰ VIP Call Girls Friends Colony šŸ‘‰ Delhi šŸ‘ˆ : 9999 Cash Payment...(šŸ‘‰ļ¾Ÿ9999965857 ļ¾Ÿ)šŸ‘‰ VIP Call Girls Friends Colony šŸ‘‰ Delhi šŸ‘ˆ : 9999 Cash Payment...
(šŸ‘‰ļ¾Ÿ9999965857 ļ¾Ÿ)šŸ‘‰ VIP Call Girls Friends Colony šŸ‘‰ Delhi šŸ‘ˆ : 9999 Cash Payment...
Ā 
Best investment platform in india-Falcon Invoice Discounting
Best investment platform in india-Falcon Invoice DiscountingBest investment platform in india-Falcon Invoice Discounting
Best investment platform in india-Falcon Invoice Discounting
Ā 
Call Girls In Amritsar šŸ’ÆCall Us šŸ” 76967 34778šŸ” šŸ’ƒ Independent Escort In Amritsar
Call Girls In Amritsar šŸ’ÆCall Us šŸ” 76967 34778šŸ” šŸ’ƒ Independent Escort In AmritsarCall Girls In Amritsar šŸ’ÆCall Us šŸ” 76967 34778šŸ” šŸ’ƒ Independent Escort In Amritsar
Call Girls In Amritsar šŸ’ÆCall Us šŸ” 76967 34778šŸ” šŸ’ƒ Independent Escort In Amritsar
Ā 
VVIP Pune Call Girls Handewadi WhatSapp Number 8005736733 With Elite Staff An...
VVIP Pune Call Girls Handewadi WhatSapp Number 8005736733 With Elite Staff An...VVIP Pune Call Girls Handewadi WhatSapp Number 8005736733 With Elite Staff An...
VVIP Pune Call Girls Handewadi WhatSapp Number 8005736733 With Elite Staff An...
Ā 
VIP 7001035870 Find & Meet Hyderabad Call Girls Shamshabad high-profile Call ...
VIP 7001035870 Find & Meet Hyderabad Call Girls Shamshabad high-profile Call ...VIP 7001035870 Find & Meet Hyderabad Call Girls Shamshabad high-profile Call ...
VIP 7001035870 Find & Meet Hyderabad Call Girls Shamshabad high-profile Call ...
Ā 
Collective Mining | Corporate Presentation - May 2024
Collective Mining | Corporate Presentation - May 2024Collective Mining | Corporate Presentation - May 2024
Collective Mining | Corporate Presentation - May 2024
Ā 
(šŸ‘‰ļ¾Ÿ9999965857 ļ¾Ÿ)šŸ‘‰ Russian Call Girls Aerocity šŸ‘‰ Delhi šŸ‘ˆ : 9999 Cash Payment F...
(šŸ‘‰ļ¾Ÿ9999965857 ļ¾Ÿ)šŸ‘‰ Russian Call Girls Aerocity šŸ‘‰ Delhi šŸ‘ˆ : 9999 Cash Payment F...(šŸ‘‰ļ¾Ÿ9999965857 ļ¾Ÿ)šŸ‘‰ Russian Call Girls Aerocity šŸ‘‰ Delhi šŸ‘ˆ : 9999 Cash Payment F...
(šŸ‘‰ļ¾Ÿ9999965857 ļ¾Ÿ)šŸ‘‰ Russian Call Girls Aerocity šŸ‘‰ Delhi šŸ‘ˆ : 9999 Cash Payment F...
Ā 
young call girls in Mahavir Nagar šŸ” 9953056974 šŸ” Delhi escort Service
young call girls in Mahavir Nagar šŸ” 9953056974 šŸ” Delhi escort Serviceyoung call girls in Mahavir Nagar šŸ” 9953056974 šŸ” Delhi escort Service
young call girls in Mahavir Nagar šŸ” 9953056974 šŸ” Delhi escort Service
Ā 
@9999965857 šŸ«¦ Sexy Desi Call Girls Vaishali šŸ’“ High Profile Escorts Delhi šŸ«¶
@9999965857 šŸ«¦ Sexy Desi Call Girls Vaishali šŸ’“ High Profile Escorts Delhi šŸ«¶@9999965857 šŸ«¦ Sexy Desi Call Girls Vaishali šŸ’“ High Profile Escorts Delhi šŸ«¶
@9999965857 šŸ«¦ Sexy Desi Call Girls Vaishali šŸ’“ High Profile Escorts Delhi šŸ«¶
Ā 

C gros-webscience-talk

  • 1. How do I develop and found a search engine? Files and file search in the Internet from the perspective of FindFiles.net. Claudius Gros, Institute for Theoretical Physics, Goethe University Frankfurt, Germany. http://www.findfiles.net
  • 2. overview. Data in the Internet: Mime types and statistics; the file search engine FindFiles.net. Science with data files: neuropsychological constraints on human data production
  • 4. Internet hosts. Hosts – Domains – Sites [source: Netcraft.com] • 2011: ∼100 million active domains
  • 5. Internet users – email. Users in 2010 • 2 · 10^9 worldwide • 825 · 10^6 Asia • 475 · 10^6 Europe • 266 · 10^6 North America. Emails in 2010 • 107 · 10^12 – number of emails sent • 89% – share of spam emails (success rate: 1 in 12 million?) • 1.9 · 10^9 – number of email users [source: Pingdom.com]
  • 6. Internet – social media. 2010 • 152 · 10^6 – number of blogs • 25 · 10^9 – number of tweets on Twitter • 600 · 10^6 – Facebook accounts; 30 · 10^9 – pieces of content: links, notes, images, ...; 20 · 10^6 – number of activated apps (per day) [source: Pingdom.com]
  • 7. social media – images and videos. Streaming videos – tubes • 2 · 10^9 – videos watched per day on YouTube (one per Internet user) • 35 – hours of video uploaded (every minute). Images, pictures • 5 · 10^9 – photos hosted by Flickr • 3000 – photos uploaded (per minute) [source: Pingdom.com]
  • 8. social media – blogs & bookmarking. Blogs are everywhere • 2010: 50–100 · 10^6 blogs. Social is everything • Digg, Mister Wong, Delicious, ... • social shopping, ... http://www.delicious.com
  • 9. Internet – rules of thumb. 2010 • 1 movie per day per user (YouTube) • 1 search query per day for every 2 users (Google) • every Internet user uses email • 5 'true' emails per day per Internet user • 25% of Internet users use novel social media • most domains are blogs
  • 11. slow beginnings for startups in the Internet. Twitter • linear scale – exponential or linear growth? Apr 2011 – 155 daily tweets
  • 12. growth is generically not exponential. [Figure: Google yearly revenues (billions, US-$) vs. year, 2000 to 2010] • linear scale
  • 13. Internet: the winner takes all. Flow of attention in complex networks [diagram of linked hosts: www.small.net, www.small.org, www.big.com, www.medium.com, www.small.com, www.small.de] • in-degree distribution $p_k$: heavy tails • preferential attachment
  • 14. in-degree distribution. Power law – scale invariant. [Figure: log(number of hosts) vs. number of incoming links, 10^0 to 10^6; linear fit, slope -2.2: 7.51 - 2.2*x; source: FindFiles.net] • scaling constant for 20 years – starting at one!
  • 15. limiting diverging in-degree distribution. $p_k \propto \frac{1}{k^{\alpha}}$, $\langle k \rangle \propto \int_{K_c}^{\infty} \frac{k}{k^{\alpha}}\,dk \sim \frac{k^{2-\alpha}}{2-\alpha}$ • diverging mean in-degree: $\lim_{\alpha \to 2} \langle k \rangle \to \infty$. Internet: $\alpha \approx 1.9 - 2.2$. Limiting dominating tail: »limiting winners take all« • makes life difficult for small startups
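The divergence on slide 15 can be made concrete with a small sketch. It assumes a continuous power law $p_k \propto k^{-\alpha}$ above a lower cutoff, for which the normalized mean is $\langle k \rangle = K_c\,(\alpha-1)/(\alpha-2)$ when $\alpha > 2$. The function name `mean_in_degree` and the parameter `k_min` (standing in for $K_c$) are illustrative and not taken from the talk.

```python
def mean_in_degree(alpha: float, k_min: float = 1.0) -> float:
    """Mean of a continuous power law p(k) ~ k**(-alpha) for k >= k_min.

    Assumption: no upper cutoff. The integral of k * k**(-alpha) then
    converges only for alpha > 2; otherwise the mean is infinite.
    """
    if alpha <= 2.0:
        return float("inf")
    # Ratio of the first moment to the normalization integral.
    return k_min * (alpha - 1.0) / (alpha - 2.0)


if __name__ == "__main__":
    # The mean grows without bound as alpha approaches 2 from above.
    for alpha in (3.0, 2.5, 2.2, 2.05, 2.0, 1.9):
        print(f"alpha = {alpha:4.2f}  mean in-degree = {mean_in_degree(alpha)}")
```

The exponent range quoted on the slide (1.9 to 2.2) straddles the point where this mean stops being finite, which is the sense in which a few hosts dominate the tail.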
  • 16. the big two uphill fights. A new Internet startup needs to ... • fight for attention • fight for novelty. Traffic and quality: the heavy-tailed in-degree distribution makes it difficult to attract traffic; extremely high service standards act as effective entry barriers
  • 17. FindFiles.net – a new file search engine
  • 18. public data on the Internet. 280 million domains in 2011; 10-30 data files per domain. Internet Media type – Mime type • categorization of all file types; email attachments; browser add-ons; about 600 Mime types in use
  • 19. Mime types. Major Mime categories: 33.2% application/, 2.9% audio/, 58.0% image/, 5.1% text/, 0.7% video/ • together: 99%. Mime types – examples: application/pdf, application/msword, application/vnd.android.package-archive, application/vnd.ms-powerpoint, application/jar, application/x-deb, application/x-gzip, audio/mpeg, audio/midi, chemical/x-pdb, image/jpeg, image/vnd.djvu, text/xml, model/vrml
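As a rough illustration of how shares per major Mime category, like those on slide 19, could be tallied, the sketch below guesses the Mime type from a file name with Python's standard `mimetypes` module. This is an assumption made for the example: the talk does not say how FindFiles.net determines Mime types, and a crawler could just as well use the server's Content-Type header. The helper names `major_category` and `category_shares` are hypothetical.

```python
import mimetypes
from collections import Counter


def major_category(filename: str) -> str:
    """Return the major Mime category (e.g. 'image/') guessed from a file name."""
    mime, _encoding = mimetypes.guess_type(filename)
    if mime is None:
        return "unknown"
    return mime.split("/", 1)[0] + "/"


def category_shares(filenames: list[str]) -> dict[str, float]:
    """Fraction of files falling into each major Mime category."""
    counts = Counter(major_category(name) for name in filenames)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


if __name__ == "__main__":
    sample = ["paper.pdf", "photo.jpg", "logo.jpg", "song.mp3"]
    for category, share in sorted(category_shares(sample).items()):
        print(f"{category:14s} {share:.1%}")
```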
  • 20. FindFiles.net: search engine for data files. G. Kaczor & C. Gros 2011 • supports all Mime types. http://www.findfiles.net
  • 21. FindFiles.net – some stats. [Figure: daily queries; source: FindFiles.net] • 400 million data files; 20 million hosts crawled; 10 million mp3 files; 10 000 apps for Symbian/Android smartphones; ...
  • 22. blogs, legal issues & financing. Blog & press coverage: http://www.findfiles.net/publicrelations. Copyright & non-legal files • files protected by copyright/licence are not indexed (nofollow) • links to pirate files are removed from the index. Financing • network – Unibator • banks are cautious – most startups fail
  • 23. Science with Data Files
  • 24. the Wikipedia/DMOZ corpus. All outgoing links of • Wikipedia (all languages) • DMOZ – open directory project (all languages). 7.7 million hosts (domains); 252 million data files (FindFiles.net crawler). Analysis of file size distribution • tails & scaling behaviour
  • 25. number of files per domain. [Figure: files per host vs. in-degree] • most files hosted on small domains
  • 26. file size distribution: number of files of given size. [Figure: log(number of files) vs. file size [Bytes], 10 B to 10 G, for all Mime categories and for the categories application/, audio/, image/, text/ and video/] • 252 million files in total – 9 orders of magnitude
  • 27. power-law scaling of image-size distribution • compression: gif lossless; jpeg lossy. [Figure: log(number of files) vs. file size [Bytes], 1 K to 1 G; all Mime categories; Mime type image/jpeg with linear fits, slopes -2 and -4; Mime type image/gif with linear fit, slope -2.45] • kink at 4 Mbytes: amateur – professional
  • 28. lognormal multimedia size distribution: all audio and video Mime types. [Figure: log(number of files) vs. file size [Bytes], 1 K to 10 G; all Mime categories; Mime categories video/ and audio/, each with a quadratic fit (lognormal distribution)] • quadratic fit – lognormal distribution
  • 29. lognormal distribution vs. power-law scaling. File-size distribution $p(s)$: lognormal $p(s) \sim e^{-[\log(s)-\mu]^2/\sigma^2}$ vs. power law $p(s) \sim s^{-\alpha}$; not a Taylor-series correction. $\log(p(s)) \propto \alpha \log(s) - \beta \log^2(s)$; images: $\alpha < 0$, $\beta = 0$; audio/video: $\alpha > 0$, $\beta > 0$. [Figure: the image and audio/video plots of slides 27 and 28, side by side]
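The distinction on slide 29 comes down to curvature on a log-log plot: a power law is a straight line ($\beta = 0$), a lognormal is a downward parabola ($\beta > 0$). The sketch below checks this on idealized, noise-free curves rather than on the FindFiles.net data. All function names and the parameter values (`alpha=2.0`, `mu=6.0`, `sigma=2.0`) are assumptions chosen for illustration, and base-10 logarithms are used to match the plot axes.

```python
import math
from typing import Callable


def log_power_law(s: float, alpha: float = 2.0) -> float:
    """log10 of an unnormalized power law s**(-alpha)."""
    return -alpha * math.log10(s)


def log_lognormal(s: float, mu: float = 6.0, sigma: float = 2.0) -> float:
    """Exponent of an unnormalized lognormal, -(log10(s) - mu)**2 / sigma**2."""
    return -((math.log10(s) - mu) ** 2) / sigma**2


def loglog_curvature(log_p: Callable[[float], float], sizes: list[float]) -> list[float]:
    """Second finite difference of log p with respect to log10(size).

    Assumes the sizes are equally spaced on a logarithmic axis.
    """
    x = [math.log10(s) for s in sizes]
    y = [log_p(s) for s in sizes]
    h = x[1] - x[0]
    return [(y[i - 1] - 2.0 * y[i] + y[i + 1]) / h**2 for i in range(1, len(y) - 1)]


if __name__ == "__main__":
    sizes = [10.0**k for k in range(3, 10)]  # 1 K to 1 G bytes
    print("power law :", loglog_curvature(log_power_law, sizes))
    print("lognormal :", loglog_curvature(log_lognormal, sizes))
```

In the slide's notation the curvature equals $-2\beta$, so a value near zero points to a power law and a clearly negative value to a lognormal.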
  • 30. one- vs. two-dimensional cost functions. Economic cost functions for data production • size: storage costs, production costs. Psychophysical cost functions for data production • size (images): the time needed to take an image is independent of resolution • size and time (audio & video): time and resolution are psychophysically distinct variables
  • 31. Weber-Fechner law • neuropsychological cost functions are logarithmic in sensory stimulus intensity: number of objects, time perception. Music: tone pitch ∝ log(frequency) (octave). Photometry: brightness ∝ log(intensity) (lumen). Acoustics: sound level ∝ log(intensity) [decibel]. Information production: number of objects / time
  • 32. information entropy. Shannon information entropy: $-\int p(s)\log(p(s))\,ds$, with $\int p(s)\,ds = 1$ for a distribution function $p(s)$ • a measure for the information content. Shannon coding theorem: the minimal number of bytes needed to encode a transmission is given by the information entropy of the signal statistics
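A minimal sketch of the entropy on slide 32, using the discrete analogue $-\sum_i p_i \log_2 p_i$ (in bits) instead of the integral over a continuous $p(s)$. The function name `shannon_entropy_bits` is illustrative.

```python
import math


def shannon_entropy_bits(probs: list[float]) -> float:
    """Discrete Shannon entropy -sum(p * log2(p)) in bits.

    Raises ValueError if the probabilities do not sum to one.
    Terms with p == 0 contribute nothing, by the usual convention.
    """
    if not math.isclose(sum(probs), 1.0, rel_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    print(shannon_entropy_bits([0.5, 0.5]))  # fair coin
    print(shannon_entropy_bits([1 / 8] * 8))  # eight equally likely symbols
    print(shannon_entropy_bits([0.9, 0.1]))  # biased coin, less than one bit
```

In line with the coding theorem quoted on the slide, a source with eight equally likely symbols needs three bits per symbol on average, while a strongly biased source needs fewer.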
  • 33. neuropsychological cost functions. Conditional entropy maximization: $\delta\left[-\int p(s)\log(p(s))\,ds - \lambda\int p(s)\,c(s)\,ds\right] = 0$. Shannon information entropy: $-\int p(s)\log(p(s))\,ds$; cost function: $c(s)$; file size distribution: $p(s)$. Maximal file size distributions: $p(s) \propto e^{-\lambda c(s)}$, which is exponential for $c(s) \propto s$ (physical), a power law for $c(s) \propto \log(s)$ (1-dim neuro), and lognormal for $c(s) \propto \log^2(s)$ (2-dim neuro)
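The step from the variational condition on slide 33 to $p(s) \propto e^{-\lambda c(s)}$ can be spelled out as follows. This is a sketch and not the author's own derivation. It adds a second Lagrange multiplier $\nu$ for the normalization $\int p(s)\,ds = 1$, which the slide does not show explicitly; the symbol $\nu$ is chosen here to avoid a clash with the lognormal parameter $\mu$ of slide 29.

```latex
\begin{align}
0 &= \frac{\delta}{\delta p(s)}\left[
      -\int p\,\log p\,ds
      \;-\; \lambda \int p\,c\,ds
      \;-\; \nu\left(\int p\,ds - 1\right)\right]
   = -\log p(s) - 1 - \lambda\,c(s) - \nu \\
\Rightarrow\quad p(s) &= e^{-1-\nu}\,e^{-\lambda c(s)} \;\propto\; e^{-\lambda c(s)} \\[4pt]
c(s) \propto s &:\quad p(s) \propto e^{-\lambda s}
   && \text{(exponential)} \\
c(s) \propto \log(s) &:\quad p(s) \propto e^{-\lambda \log(s)} = s^{-\lambda}
   && \text{(power law)} \\
c(s) \propto \log^2(s) &:\quad p(s) \propto e^{-\lambda \log^2(s)}
   && \text{(lognormal)}
\end{align}
```

A cost with both a linear and a quadratic term in $\log(s)$ gives $\log p(s) = \alpha\log(s) - \beta\log^2(s) + \text{const}$, the form fitted on slide 29.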
  • 34. physical vs. neuropsychological cost functions. [Figure, two panels of log(number of files) vs. file size [Bytes], repeating the plots of slides 27 and 28. Images: physical, exponential [not seen]; 1-dim neuro, power law [linear]. Audio/video: physical, exponential [not seen]; 2-dim neuro, lognormal [quadratic]]
  • 35. global human data production. Basic assumptions • information production as the underlying driving force; information entropy as a suitable measure • law of large numbers: average over production processes / producing agents; compression/technology correspond to rescaling. Data production on a global level is characterized by neuropsychological cost functions and not by economic constraints
  • 36. the Internet & complex system theory. Complex system theory – still an emergent field; many models and paradigms yet to be formulated: network theory / game theory / allocation problems; macroecology / systems biology / cognitive systems theory; ... • information entropy maximization: human data production on a global level; neuropsychological cost functions; ...
  • 37. graduate level textbook • Information theory and complexity • Phase transitions and self-organized criticality • Life at the edge of chaos and punctuated equilibrium • Cognitive system theory and diffusive emotional control. Second edition 2010