Outline                           WIRE Project                   Web Crawler               Conclusions




            WIRE: an Open Source Web Information
                    Retrieval Environment

                           Carlos Castillo and Ricardo Baeza-Yates
                                            Center for Web Research
                                             http://www.cwr.cl/
                                          CS Dept., University of Chile


                                              OSWIR 2005
                                           Compiegne, France
                                           September 19, 2005

Carlos Castillo and Ricardo Baeza-Yates                                        Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                        http://www.cwr.cl/




          1 WIRE Project



          2 Web Crawler



          3 Conclusions







Motivation


          Study subsets of the Web (1-50 million pages)
            ✓ We want high performance
            ✓ We want to keep as much data as possible
            ✓ We want to study scheduling algorithms
            ✗ wget is not enough
            ✗ Large-scale crawlers were not publicly available







General Architecture

          [Figure: architecture diagram. Components shown: Crawling,
           Focused Crawling, Importing, Collection, Text Index, Text Search,
           XML Index, XML Search, Statistics, Extracting, Clustering,
           Classification.]






Characteristics



          - Roughly 25,000 lines of open-source C/C++ code
          - Asynchronous DNS and HTTP requests, small memory
            and processing requirements (except during the analysis)
          - Highly configurable: rate of download, parser parameters,
            scheduling policy, etc.







Web Crawler

          [Figure: four crawler modules arranged around a central Collection]

            Manager:    page score calculations, long-term scheduling
            Harvester:  short-term scheduling, network transfers
            Gatherer:   parsing, link extraction
            Seeder:     link resolving, robots exclusions





Scheduling


          Future Value - Current Value = Profit

          Page  quality  freshness  visited?  Future Value  Current Value  Profit
          P1    0.4      0.1        1         0.4           0.04           0.36
          P2    0.7      0.9        1         0.7           0.63           0.07
          P3    0.6      -          0         0.6           0              0.6
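
The three example pages on this slide are consistent with a simple rule: the future value equals the quality, the current value equals quality times freshness for a visited page (and 0 for a page never visited), and the profit is the difference. The sketch below reproduces those numbers under that reading. The rule is inferred from the figures on the slide, and the names `Page`, `future_value`, `current_value` and `profit` are hypothetical, not taken from the WIRE code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    name: str
    quality: float              # estimated quality score in [0, 1]
    freshness: Optional[float]  # None when the page was never downloaded
    visited: bool


def future_value(page: Page) -> float:
    # Assumption: a freshly downloaded copy is worth the page's full quality.
    return page.quality


def current_value(page: Page) -> float:
    # Assumption: a stored copy is worth quality * freshness; no copy is worth 0.
    if not page.visited or page.freshness is None:
        return 0.0
    return page.quality * page.freshness


def profit(page: Page) -> float:
    # Gain obtained by downloading the page now.
    return future_value(page) - current_value(page)


pages = [
    Page("P1", 0.4, 0.1, True),
    Page("P2", 0.7, 0.9, True),
    Page("P3", 0.6, None, False),
]

# Under this sketch, pages with the highest profit are downloaded first.
ranked = sorted(pages, key=profit, reverse=True)
for p in ranked:
    print(p.name, round(profit(p), 2))
```

With these inputs the unvisited page P3 ranks first, then the stale page P1, and the still-fresh page P2 ranks last.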




Downloading pages


          [Figure: the World Wide Web drawn as Web sites S1 to S7, each
           holding Web pages Pi,j (page j of site i). Pages per site:
           S1: 4, S2: 6, S3: 2, S4: 5, S5: 4, S6: 3, S7: 5.]




Storing contents
          [Figure: storing a document in three numbered steps]

            1. hash(document): content seen?
            2. Free space list
            3. Disk storage
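
The three numbered steps in the figure can be sketched as follows. This is a toy in-memory model, not the WIRE storage format: the class `ContentStore`, the choice of SHA-1, the first-fit allocation and the idea that the free space list holds holes left by deleted documents are all assumptions made for illustration.

```python
import hashlib


class ContentStore:
    """Toy stand-in for the storage step (all names are hypothetical)."""

    def __init__(self):
        self.seen = {}            # content digest -> (offset, length)
        self.free = []            # (offset, length) holes that can be reused
        self.disk = bytearray()   # stands in for the on-disk storage file

    def store(self, document):
        # Step 1: hash the document and ask "content seen?"
        digest = hashlib.sha1(document).hexdigest()
        if digest in self.seen:
            return self.seen[digest]      # duplicate: nothing new is written
        # Step 2: consult the free space list for a large-enough hole
        offset = self._allocate(len(document))
        # Step 3: write the bytes to storage
        self.disk[offset:offset + len(document)] = document
        self.seen[digest] = (offset, len(document))
        return self.seen[digest]

    def delete(self, document):
        # Forget the content and return its space to the free list.
        digest = hashlib.sha1(document).hexdigest()
        location = self.seen.pop(digest, None)
        if location is not None:
            self.free.append(location)

    def _allocate(self, size):
        # First fit: reuse the first hole that is big enough.
        for i, (offset, length) in enumerate(self.free):
            if length >= size:
                if length == size:
                    del self.free[i]
                else:
                    self.free[i] = (offset + size, length - size)
                return offset
        return len(self.disk)             # no hole fits: append at the end


store = ContentStore()
first = store.store(b"<html>hello</html>")
again = store.store(b"<html>hello</html>")   # same content, same location
```

Storing the same bytes twice returns the same location and leaves the storage size unchanged, which is the point of the "content seen?" check.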




URL parsing

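          [Figure: resolving a URL into identifiers in four numbered steps]

            URL: http://host.domain.com/dir/file.html

            1. h1('host.domain.com')
            2. Site table entry:      host.domain.com    -> 235
            3. h2('235 dir/file.html')
            4. Document table entry:  235 dir/file.html  -> 9421

            Result: SITE-ID = 235; DOC-ID = 9421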
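
The two-level lookup in the figure can be sketched as below. Python dictionaries stand in for the hash tables behind h1 and h2, and identifiers are handed out sequentially. Both choices, and the names `UrlIndex` and `resolve`, are assumptions for illustration and not the WIRE implementation. The starting counters are set so that the example reproduces the identifiers shown on the slide.

```python
import itertools
from urllib.parse import urlsplit


class UrlIndex:
    """Maps a URL to a (site_id, doc_id) pair; hypothetical sketch."""

    def __init__(self, first_site_id=1, first_doc_id=1):
        self.site_ids = {}   # host name -> site id (role of h1's table)
        self.doc_ids = {}    # "site_id path" -> doc id (role of h2's table)
        self._next_site = itertools.count(first_site_id)
        self._next_doc = itertools.count(first_doc_id)

    def resolve(self, url):
        parts = urlsplit(url)
        host = parts.netloc.lower()
        path = parts.path.lstrip("/")
        # Steps 1 and 2: look up the host name, assigning an id if it is new.
        if host not in self.site_ids:
            self.site_ids[host] = next(self._next_site)
        site_id = self.site_ids[host]
        # Steps 3 and 4: look up "site_id path", assigning an id if it is new.
        key = f"{site_id} {path}"
        if key not in self.doc_ids:
            self.doc_ids[key] = next(self._next_doc)
        return site_id, self.doc_ids[key]


index = UrlIndex(first_site_id=235, first_doc_id=9421)
print(index.resolve("http://host.domain.com/dir/file.html"))  # (235, 9421)
```

Keying the second table on the site identifier rather than on the full host name keeps the document keys short, since each host name is stored only once.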




Practical problems


          - The devil is in the details
          - Varying quality of service
          - Wrong DNS records, temporary DNS failures
          - HTTP responses without headers, with wrong headers
            or dates
          - HTML parsing has to be very tolerant
          - Duplicate pages, session-ids, etc.
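
As an illustration of the session-id problem in the list above, one common mitigation is to normalize URLs before checking whether they were already seen. The sketch below is a generic example, not what WIRE does: the function name `strip_session_ids` and the parameter names in `SESSION_PARAMS` are assumptions, and real sites use many other names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of query parameters that carry session identifiers.
SESSION_PARAMS = {"phpsessid", "jsessionid", "sid", "sessionid"}


def strip_session_ids(url):
    """Return the URL without session identifiers, so that two URLs
    differing only in the session id compare equal."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    # Also drop the ";jsessionid=..." form embedded in the path
    # (only the lowercase spelling is handled here).
    path = parts.path.split(";jsessionid=", 1)[0]
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), path, urlencode(kept), "")
    )


a = strip_session_ids("http://Example.com/page.php?id=7&PHPSESSID=abc123")
b = strip_session_ids("http://example.com/page.php?PHPSESSID=zzz999&id=7")
print(a == b)
```

Normalization of this kind only removes duplicates that share a URL pattern. Pages with identical content under unrelated URLs still need a content-based check.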







Data analysis


          - Includes link analysis and extraction of statistics (data is
            exported as .csv files)
          - Reports are generated using LaTeX and gnuplot
          - Report about documents: histograms of size, in- and
            out-degree, link scores, page depth, HTTP responses,
            age, media types, etc.
          - Report about sites: degree distribution in the hostgraph,
            maximum depth, pages per site, link structure, etc.
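
To make the kind of statistic listed above concrete, the sketch below computes in-degree and out-degree histograms from a small link list and writes one of them as CSV. The link data, the column names and the functions `degree_histograms` and `to_csv` are hypothetical. The actual WIRE export format is not described on the slide.

```python
import csv
import io
from collections import Counter

# Hypothetical link list: (source page, target page).
links = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]


def degree_histograms(links):
    """Return (in_hist, out_hist): degree -> number of pages with it."""
    out_degree = Counter(source for source, _ in links)
    in_degree = Counter(target for _, target in links)
    pages = set(out_degree) | set(in_degree)
    # Counter returns 0 for pages that have no links in one direction.
    in_hist = Counter(in_degree[page] for page in pages)
    out_hist = Counter(out_degree[page] for page in pages)
    return in_hist, out_hist


def to_csv(histogram):
    """Serialize a histogram as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["degree", "pages"])
    for degree in sorted(histogram):
        writer.writerow([degree, histogram[degree]])
    return buffer.getvalue()


in_hist, out_hist = degree_histograms(links)
print(to_csv(in_hist))
```

A file in this shape can be plotted directly with gnuplot by setting the datafile separator to a comma.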






Conclusions


           ✓ A tool for Web characterization studies
           ✓ Can be extended for other purposes
           ✓ Code and documentation available at
             http://www.cwr.cl/projects/

                                                 Thank you.




Carlos Castillo and Ricardo Baeza-Yates                                     Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                     http://www.cwr.cl/
Outline                           WIRE Project                Web Crawler               Conclusions



Conclusions


           V A tool for Web characterization studies
           V Can be extended for other purposes
           V Code and documentation available at
             http://www.cwr.cl/projects/

                                                 Thank you.




Carlos Castillo and Ricardo Baeza-Yates                                     Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                     http://www.cwr.cl/
Outline                           WIRE Project                Web Crawler               Conclusions



Conclusions


           V A tool for Web characterization studies
           V Can be extended for other purposes
           V Code and documentation available at
             http://www.cwr.cl/projects/

                                                 Thank you.




Carlos Castillo and Ricardo Baeza-Yates                                     Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                     http://www.cwr.cl/
Outline                           WIRE Project                Web Crawler               Conclusions



Conclusions


           V A tool for Web characterization studies
           V Can be extended for other purposes
           V Code and documentation available at
             http://www.cwr.cl/projects/

                                                 Thank you.




Carlos Castillo and Ricardo Baeza-Yates                                     Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                     http://www.cwr.cl/
Outline                           WIRE Project                Web Crawler               Conclusions



Conclusions


           V A tool for Web characterization studies
           V Can be extended for other purposes
           V Code and documentation available at
             http://www.cwr.cl/projects/

                                                 Thank you.




Carlos Castillo and Ricardo Baeza-Yates                                     Center for Web Research
WIRE: an Open Source Web Information Retrieval Environment                     http://www.cwr.cl/

Weitere ähnliche Inhalte

Ähnlich wie WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

Ceph used in Cancer Research at OICR
Ceph used in Cancer Research at OICRCeph used in Cancer Research at OICR
Ceph used in Cancer Research at OICRCeph Community
 
Time -Travel on the Internet
Time -Travel on the InternetTime -Travel on the Internet
Time -Travel on the InternetIRJET Journal
 
Invincea: Reasoning in Incident Response in Tapio
Invincea: Reasoning in Incident Response in TapioInvincea: Reasoning in Incident Response in Tapio
Invincea: Reasoning in Incident Response in TapioInvincea, Inc.
 
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInDataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInHakka Labs
 
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...Artefactual Systems - AtoM
 
Accelerating your Research with Microsoft Azure (June 2015)
Accelerating your Research with Microsoft Azure (June 2015)Accelerating your Research with Microsoft Azure (June 2015)
Accelerating your Research with Microsoft Azure (June 2015)Microsoft Azure for Research
 
GENI Engineering Conference -- Ian Foster
GENI Engineering Conference -- Ian FosterGENI Engineering Conference -- Ian Foster
GENI Engineering Conference -- Ian FosterIan Foster
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commonsJesse Wang
 
Devoxx 2010 | LAB : ReST in Java
Devoxx 2010 | LAB : ReST in JavaDevoxx 2010 | LAB : ReST in Java
Devoxx 2010 | LAB : ReST in JavaNGDATA
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunitiesAhmed AlSum
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSSri Ambati
 
Browser-Based Digital Preservation
Browser-Based Digital PreservationBrowser-Based Digital Preservation
Browser-Based Digital PreservationMat Kelly
 
Building Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopBuilding Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopNikolai Avteniev
 
Case Study for Ego-centric Citation Network
Case Study for Ego-centric Citation NetworkCase Study for Ego-centric Citation Network
Case Study for Ego-centric Citation NetworkMike Taylor
 
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)Blue BRIDGE
 
Descriptive Standards and Applications in Memory Institutions
Descriptive Standards and Applications in Memory InstitutionsDescriptive Standards and Applications in Memory Institutions
Descriptive Standards and Applications in Memory InstitutionsE. Murphy
 
Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)Robert Grossman
 
2012 09 aos-workshop-johanneskeizer
2012 09 aos-workshop-johanneskeizer2012 09 aos-workshop-johanneskeizer
2012 09 aos-workshop-johanneskeizerJohannes Keizer
 

Ähnlich wie WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne) (20)

Ceph used in Cancer Research at OICR
Ceph used in Cancer Research at OICRCeph used in Cancer Research at OICR
Ceph used in Cancer Research at OICR
 
Time -Travel on the Internet
Time -Travel on the InternetTime -Travel on the Internet
Time -Travel on the Internet
 
Invincea: Reasoning in Incident Response in Tapio
Invincea: Reasoning in Incident Response in TapioInvincea: Reasoning in Incident Response in Tapio
Invincea: Reasoning in Incident Response in Tapio
 
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInDataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
 
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
Technologie Proche: Imagining the Archival Systems of Tomorrow With the Tools...
 
Accelerating your Research with Microsoft Azure (June 2015)
Accelerating your Research with Microsoft Azure (June 2015)Accelerating your Research with Microsoft Azure (June 2015)
Accelerating your Research with Microsoft Azure (June 2015)
 
GENI Engineering Conference -- Ian Foster
GENI Engineering Conference -- Ian FosterGENI Engineering Conference -- Ian Foster
GENI Engineering Conference -- Ian Foster
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commons
 
Devoxx 2010 | LAB : ReST in Java
Devoxx 2010 | LAB : ReST in JavaDevoxx 2010 | LAB : ReST in Java
Devoxx 2010 | LAB : ReST in Java
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunities
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWS
 
Towards a Web of Services
Towards a Web of ServicesTowards a Web of Services
Towards a Web of Services
 
Browser-Based Digital Preservation
Browser-Based Digital PreservationBrowser-Based Digital Preservation
Browser-Based Digital Preservation
 
Building Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On HadoopBuilding Satori: Web Data Extraction On Hadoop
Building Satori: Web Data Extraction On Hadoop
 
Case Study for Ego-centric Citation Network
Case Study for Ego-centric Citation NetworkCase Study for Ego-centric Citation Network
Case Study for Ego-centric Citation Network
 
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
Using e-infrastructures for biodiversity conservation - Gianpaolo Coro (CNR)
 
Publishing Linked Data from RDB
Publishing Linked Data from RDBPublishing Linked Data from RDB
Publishing Linked Data from RDB
 
Descriptive Standards and Applications in Memory Institutions
Descriptive Standards and Applications in Memory InstitutionsDescriptive Standards and Applications in Memory Institutions
Descriptive Standards and Applications in Memory Institutions
 
Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)Architectures for Data Commons (XLDB 15 Lightning Talk)
Architectures for Data Commons (XLDB 15 Lightning Talk)
 
2012 09 aos-workshop-johanneskeizer
2012 09 aos-workshop-johanneskeizer2012 09 aos-workshop-johanneskeizer
2012 09 aos-workshop-johanneskeizer
 

WIRE, an open-source information retrieval environment (OSWIR 2005 Compiegne)

  • 1. WIRE: an Open Source Web Information Retrieval Environment. Carlos Castillo and Ricardo Baeza-Yates, Center for Web Research (http://www.cwr.cl/), CS Dept., University of Chile. OSWIR 2005, Compiegne, France, September 19, 2005.
  • 2. Outline: 1. WIRE Project; 2. Web Crawler; 3. Conclusions
  • 3. Motivation: study subsets of the Web (1-50 million pages)
      ✓ We want high performance
      ✓ We want to keep as much data as possible
      ✓ We want to study scheduling algorithms
      ✗ wget is not enough
      ✗ Large-scale crawlers were not publicly available
  • 9. General Architecture [diagram; component labels: XML Index, XML Search, Focused Crawling, Text Search, Text Index, Crawling, Collection, Statistics, Importing, Extracting, Clustering, Classification]
  • 10. Characteristics
      - Roughly 25,000 lines of open-source C/C++ code
      - Asynchronous DNS and HTTP requests; small memory and processing requirements (except during the analysis)
      - Highly configurable: rate of download, parser parameters, scheduling policy, etc.
  • 13. Web Crawler [diagram of modules around the Collection]
      - Manager: page score calculations, long-term scheduling
      - Seeder: link resolving, robots exclusions
      - Harvester: short-term scheduling, network transfers
      - Gatherer: parsing, link extraction
  • 14. Scheduling: Future Value - Current Value = Profit
      - P1: quality 0.4, freshness 0.1, visited? 1: 0.4 - 0.04 = Profit 0.36
      - P2: quality 0.7, freshness 0.9, visited? 1: 0.7 - 0.63 = Profit 0.07
      - P3: quality 0.6, freshness -, visited? 0: 0.6 - 0 = Profit 0.6
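The arithmetic on the scheduling slide can be reproduced with a short Python sketch. The slide gives only the numbers, so the formulas here are assumptions inferred from them: the current value of a visited page is taken as quality × freshness (zero if never visited), and the future value as the quality alone, as if a fresh copy were held. The names `Page`, `profit`, and `download_order` are hypothetical and are not taken from the WIRE code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    name: str
    quality: float
    freshness: Optional[float]  # None when the page was never downloaded
    visited: bool


def current_value(page: Page) -> float:
    # Assumption: an unvisited page contributes nothing to the collection.
    if not page.visited or page.freshness is None:
        return 0.0
    return page.quality * page.freshness


def future_value(page: Page) -> float:
    # Assumption: after a download the copy is fully fresh.
    return page.quality


def profit(page: Page) -> float:
    return future_value(page) - current_value(page)


def download_order(pages: List[Page]) -> List[Page]:
    # Assumption: pages with the highest profit are scheduled first.
    return sorted(pages, key=profit, reverse=True)


# The three pages from the slide.
PAGES = [
    Page("P1", 0.4, 0.1, True),
    Page("P2", 0.7, 0.9, True),
    Page("P3", 0.6, None, False),
]

if __name__ == "__main__":
    for page in download_order(PAGES):
        print(page.name, round(profit(page), 2))
```

Under these assumptions the unvisited page P3 comes first, followed by the stale P1 and then the nearly fresh P2, even though P2 has the highest quality.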
  • 15. Downloading pages [diagram: the World Wide Web as Web sites S1 to S7, each with its own column of Web pages Pi,j; pages per site: S1: 4, S2: 6, S3: 2, S4: 5, S5: 4, S6: 3, S7: 5]
  • 16. Storing contents [diagram: (1) compute hash(Document); (2) check "Content seen?"; (3) write to Disk Storage, which keeps a Free space list]
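A minimal Python sketch of the three steps on the storing slide, under stated assumptions: SHA-1 stands in for the unnamed hash function, an in-memory `bytearray` stands in for disk storage, and the free space list uses a first-fit policy. The class `DocumentStore` and its methods are hypothetical; the slide does not describe how WIRE implements these steps.

```python
import hashlib
from typing import Dict, List, Tuple


class DocumentStore:
    def __init__(self) -> None:
        self.storage = bytearray()                   # stand-in for the disk file
        self.seen: Dict[str, Tuple[int, int]] = {}   # digest -> (offset, length)
        self.free: List[Tuple[int, int]] = []        # free space list: (offset, length)

    def store(self, content: bytes) -> Tuple[int, bool]:
        """Return (offset, is_new). Duplicate content is not stored twice."""
        digest = hashlib.sha1(content).hexdigest()   # step 1: hash the document
        if digest in self.seen:                      # step 2: content seen?
            return self.seen[digest][0], False
        offset = self._allocate(len(content))        # step 3: find space and write
        self.storage[offset:offset + len(content)] = content
        self.seen[digest] = (offset, len(content))
        return offset, True

    def delete(self, content: bytes) -> None:
        """Forget a document and return its space to the free space list."""
        digest = hashlib.sha1(content).hexdigest()
        self.free.append(self.seen.pop(digest))

    def _allocate(self, size: int) -> int:
        # First fit: reuse the first hole that is large enough.
        for i, (offset, length) in enumerate(self.free):
            if length >= size:
                if length == size:
                    del self.free[i]
                else:
                    self.free[i] = (offset + size, length - size)
                return offset
        # No hole fits: grow the storage at the end.
        offset = len(self.storage)
        self.storage.extend(b"\0" * size)
        return offset
```

Storing the same bytes twice returns the first offset with `is_new` set to `False`, which is the effect of the "Content seen?" check; deleting a document frees a hole that a later, smaller document can reuse.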
  • 17. URL parsing, for http://host.domain.com/dir/file.html
      1. Compute h1('host.domain.com')
      2. Look up the host table: host.domain.com -> 235
      3. Compute h2('235 dir/file.html')
      4. Look up the path table: 235 dir/file.html -> 9421
      Result: SITE-ID = 235; DOC-ID = 9421
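The two-level lookup on the URL parsing slide can be sketched in Python as follows. This is an illustration under assumptions, not the WIRE implementation: Python dictionaries stand in for the hash tables behind h1 and h2, IDs are handed out sequentially (so the example yields small IDs rather than 235 and 9421), and the class name `UrlIndex` is hypothetical.

```python
from typing import Dict, Tuple
from urllib.parse import urlsplit


class UrlIndex:
    def __init__(self) -> None:
        self.site_ids: Dict[str, int] = {}  # host name -> SITE-ID
        self.doc_ids: Dict[str, int] = {}   # "SITE-ID path" -> DOC-ID

    def resolve(self, url: str) -> Tuple[int, int]:
        """Return (site_id, doc_id), assigning new IDs on first sight."""
        parts = urlsplit(url)
        host = parts.netloc.lower()
        path = parts.path.lstrip("/")
        # Steps 1-2: the host name alone determines the site ID.
        site_id = self.site_ids.setdefault(host, len(self.site_ids) + 1)
        # Steps 3-4: the site ID plus the path determines the document ID.
        key = f"{site_id} {path}"
        doc_id = self.doc_ids.setdefault(key, len(self.doc_ids) + 1)
        return site_id, doc_id
```

Keying the second table on the site ID rather than the full host name keeps the keys short, and the same path on two different hosts still maps to two different documents.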
  • 18. Practical problems: the devil is in the details
      - Varying quality of service
      - Wrong DNS records, temporary DNS failures
      - HTTP responses without headers, with wrong headers, dates
      - HTML parsing has to be very tolerant
      - Duplicate pages, session-ids, etc.
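To illustrate the session-id problem from the list above, here is a minimal Python sketch that removes session identifiers from a URL so that the same page is not stored once per session. The slide does not say how WIRE handles this; the parameter names in `SESSION_PARAMS` and the function `strip_session_ids` are assumptions chosen for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: a small, illustrative set of session parameter names.
SESSION_PARAMS = {"phpsessid", "jsessionid", "sid", "sessionid"}


def strip_session_ids(url: str) -> str:
    parts = urlsplit(url)
    # Some servers append the session to the path as ";jsessionid=...".
    path = parts.path.split(";jsessionid=")[0]
    # Keep every query parameter that is not a known session parameter.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    # The fragment is dropped: it never reaches the server.
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(query), ""))
```

Two URLs that differ only in their session identifier then map to the same string, so a crawler can treat them as a single page before any download takes place.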
  • 24. Data analysis
      - Includes link analysis and extraction of statistics (data is exported as .csv files)
      - Reports are generated using LaTeX and gnuplot
      - Report about documents: histograms of size, in- and out-degree, link scores, page depth, HTTP responses, age, media types, etc.
      - Report about sites: degree distribution in the hostgraph, maximum depth, pages per site, link structure, etc.
  • 28. Conclusions
      - A tool for Web characterization studies
      - Can be extended for other purposes
      - Code and documentation available at http://www.cwr.cl/projects/
      Thank you.