Web Scraping with Nutch and Solr - Part 2

Part 2 of a three-part presentation showing how Nutch and Solr may be used to crawl the web, extract data, and prepare it for loading into a data warehouse.


1. Web Scraping Using Nutch and Solr - Part 2

● The following example assumes that you have
  – Watched "Web scraping with Nutch and Solr" (part 1)
  – That movie's ID is cAiYBD4BQeE
  – Set up a Linux-based Nutch/Solr environment
  – Run the web scrape shown in that movie
● Now we will
  – Clean up that environment
  – Web scrape a parameterised URL
  – View the URLs in the data
2. Empty Nutch Database

● Clean up the Nutch crawl database
  – We previously used apache-nutch-1.6/nutch_start.sh
  – That script contained the -dir crawl option
  – Which created the apache-nutch-1.6/crawl directory
  – This directory contains our Nutch data
● Clean it up as follows
  – cd apache-nutch-1.6; rm -rf crawl
● Only do this because the directory contained dummy data!
● The next run of the script will recreate the directory
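As a shell sketch, the clean-up is just the following (assuming apache-nutch-1.6 sits in your home directory; adjust the path to match your own install):

  # move into the Nutch install directory (the path is an assumption)
  cd ~/apache-nutch-1.6
  # remove the crawl database left over from the previous run;
  # the next run of the start script will recreate it
  rm -rf crawl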
3. Empty Solr Database

● Clean the Solr database via a URL
  – Bookmark this URL
  – Only use it if you need to empty your data
● Run the following (with the Solr server running), e.g. via curl
  – curl http://localhost:8983/solr/update?commit=true -d '<delete><query>*:*</query></delete>'
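A slightly fuller version of that command, assuming a local Solr 4.x on the default port (the Content-Type header tells Solr the body is XML rather than form data):

  # delete every document in the index, then commit the change
  curl "http://localhost:8983/solr/update?commit=true" \
       -H "Content-Type: text/xml" \
       --data-binary '<delete><query>*:*</query></delete>'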
4. Set up Nutch

● Now we will do something more complex
● Web scrape a URL that has parameters, i.e.
  – http://<site>/<function>?var1=val1&var2=val2
● This web scrape will
  – Have extra URL characters '?=&'
  – Need greater search depth
  – Need better URL filtering
● Remember that you need to get permission before scraping a third-party web site
5. Nutch Configuration

● Change the seed file for Nutch
  – apache-nutch-1.6/urls/seed.txt
● In this instance I will use a URL of the form
  – http://somesite.co.nz/Search?DateRange=7&industry=62
  – (this is not a real URL, just an example)
● Change the conf/regex-urlfilter.txt entries, i.e.
  – # skip URLs containing certain characters
  – -[*!@]
  – # accept anything else
  – +^http://([a-z0-9]*\.)*somesite.co.nz/Search
● This will only consider somesite.co.nz Search URLs
● Both files are sketched in full below
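Putting the two files together, a minimal sketch of the configuration (somesite.co.nz is the slide's placeholder domain, not a real target):

  # apache-nutch-1.6/urls/seed.txt -- one start URL per line
  http://somesite.co.nz/Search?DateRange=7&industry=62

  # apache-nutch-1.6/conf/regex-urlfilter.txt -- relevant entries
  # skip URLs containing certain characters; note that '?', '=' and '&'
  # are deliberately NOT listed, so parameterised URLs pass through
  -[*!@]
  # accept only Search URLs on the target site
  +^http://([a-z0-9]*\.)*somesite.co.nz/Search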
6. Run Nutch

● Now run Nutch using the start script
  – cd apache-nutch-1.6 ; ./nutch_start.bash
● Monitor for errors in the Solr admin log window
● The Nutch crawl should end with
  – crawl finished: crawl
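The start script itself was built in part 1; as a reminder, a minimal sketch of what such a script might contain, assuming the one-shot crawl command that ships with Nutch 1.6 and a local Solr (the -depth and -topN values are illustrative, not the script's actual settings):

  #!/bin/bash
  # crawl the seed URLs under ./urls, index the results into Solr,
  # and keep crawl state under ./crawl (matches the -dir crawl option above)
  bin/nutch crawl urls -solr http://localhost:8983/solr/ \
      -dir crawl -depth 3 -topN 50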
7. Checking Data

● The data should now have been indexed in Solr
● In the Solr Admin window
  – Set 'Core Selector' = collection1
  – Click 'Query'
  – In the Query window, set the fl field = url
  – Click 'Execute Query'
● The result (next slide) shows the filtered list of URLs in Solr
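The same check can be run from the command line; a sketch assuming the default collection1 core and Solr's standard select handler:

  # list the first ten indexed documents, returning only the url field
  curl "http://localhost:8983/solr/collection1/select?q=*:*&fl=url&wt=json&rows=10"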
8. Checking Data

● (screenshot: the Solr query results showing the filtered list of URLs)
9. Results

● Congratulations, you have completed your second crawl
  – With parameterised URLs
  – With more complex URL filtering
  – With a Solr query search
10. Contact Us

● Feel free to contact us at
  – www.semtech-solutions.co.nz
  – info@semtech-solutions.co.nz
● We offer IT project consultancy
● We are happy to hear about your problems
● You pay only for the hours that you need to solve your problems
