Part 2 of a three-part presentation showing how Nutch and Solr can be used to crawl the web, extract data, and prepare it for loading into a data warehouse.
1. Web Scraping Using Nutch and Solr - Part 2
● The following example assumes that you have
– Watched the video “Web Scraping with Nutch and Solr”
– The ID of that video is cAiYBD4BQeE
– Set up a Linux-based Nutch/Solr environment
– Run the web scrape shown in that video
● Now we will
– Clean up that environment
– Web scrape a parameterised url
– View the urls in the data
2. Empty Nutch Database
● Clean up the Nutch crawl database
– The previous crawl used apache-nutch-1.6/nutch_start.sh
– That script contained the -dir crawl option
– This created the apache-nutch-1.6/crawl directory
– Which contains our Nutch data
● Clean this up with
– cd apache-nutch-1.6; rm -rf crawl
● Only do this because it contained dummy data !
● The next run of the script will create the directory again
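The clean-up step above can be wrapped in a small guard so that nothing is deleted when the directory is missing. This is a minimal sketch only: clean_crawl is a hypothetical helper name, and the default path assumes the apache-nutch-1.6 layout used in these slides.

```shell
# Hypothetical helper: remove the Nutch crawl directory if it exists.
# $1 is the Nutch home directory (assumed default: apache-nutch-1.6).
clean_crawl() {
    crawl_dir="${1:-apache-nutch-1.6}/crawl"
    if [ -d "$crawl_dir" ]; then
        # Same effect as "rm -rf crawl" run from inside the Nutch home.
        rm -rf -- "$crawl_dir"
        echo "removed $crawl_dir"
    else
        echo "nothing to remove at $crawl_dir"
    fi
}
```

Calling clean_crawl with no argument from the directory that holds apache-nutch-1.6 mirrors the command on the slide.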
3. Empty Solr Database
● Clean the Solr database via its update url
– Bookmark this url
– Only use it if you need to empty your data
● Run the following with curl ( with the Solr server running )
– curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-Type: text/xml'
-d '<delete><query>*:*</query></delete>'
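The delete request above posts a small XML body to the Solr update handler. The sketch below builds that body and shows, in comments, how it might be sent with curl. solr_delete_payload is a hypothetical helper name, and the Content-Type header is an addition that is not on the slide.

```shell
# Build the XML body for a Solr delete-by-query request.
# solr_delete_payload is a hypothetical helper, not part of Solr.
solr_delete_payload() {
    # Default to *:* (match every document), as on the slide.
    query="${1:-*:*}"
    # Escape XML special characters so the query cannot break the markup.
    escaped=$(printf '%s' "$query" |
        sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g')
    printf '<delete><query>%s</query></delete>\n' "$escaped"
}

# Example (not run here; needs a Solr server on localhost:8983):
# curl 'http://localhost:8983/solr/update?commit=true' \
#      -H 'Content-Type: text/xml' \
#      -d "$(solr_delete_payload)"
```

Passing a narrower query instead of the default would delete only the matching documents, which is safer when the index holds data worth keeping.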
4. Set up Nutch
● Now we will do something more complex
● Web scrape a url that has parameters i.e.
– http://<site>/<function>?var1=val1&var2=val2
● This web scrape will
– Have extra url characters '?=&'
– Need greater search depth
– Need better url filtering
● Remember that you need to get permission to scrape a third party web site
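To see why the characters '?', '=' and '&' matter, it can help to split a parameterised url into its name=value pairs. The following is a rough sketch; url_params is a hypothetical helper name and the url is the example form used in these slides, not a real site.

```shell
# Print each name=value pair of a url's query string on its own line.
# url_params is a hypothetical helper for illustration only.
url_params() {
    case "$1" in
        *\?*) query="${1#*\?}" ;;   # keep everything after the first '?'
        *)    return 0 ;;           # no query string, print nothing
    esac
    query="${query%%#*}"            # drop any #fragment
    printf '%s\n' "$query" | tr '&' '\n'
}

# Example:
# url_params 'http://somesite.co.nz/Search?DateRange=7&industry=62'
```

Each pair printed is one parameter that the crawler must be allowed to keep in the url, which is why the filter rules need to change.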
5. Nutch Configuration
● Change the seed file for Nutch
– apache-nutch-1.6/urls/seed.txt
● In this instance I will use a url of the form
– http://somesite.co.nz/Search?DateRange=7&industry=62
– ( this is not a real url – just an example )
● Change the conf/regex-urlfilter.txt entries i.e.
– # skip URLs containing certain characters
– -[*!@]
– # accept only the Search urls on somesite.co.nz
– +^http://([a-z0-9]*\.)*somesite\.co\.nz/Search
● This will only consider the somesite.co.nz Search urls
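Before crawling, the two filter rules can be tried out against a few sample urls. The sketch below imitates them with grep -E, assuming that rules are applied in order, that the first match decides, and that a url matching no rule is rejected. It is an approximation only: Nutch uses Java regular expressions, url_allowed is a hypothetical helper name, and the dots are escaped here so that they match literally.

```shell
# Return 0 if the url would pass the two filter rules, 1 otherwise.
# url_allowed is a hypothetical helper that approximates the rules.
url_allowed() {
    # Rule 1: -[*!@]  (reject urls containing any of these characters)
    if printf '%s\n' "$1" | grep -Eq '[*!@]'; then
        return 1
    fi
    # Rule 2: + accept only the Search urls on somesite.co.nz
    if printf '%s\n' "$1" |
        grep -Eq '^http://([a-z0-9]*\.)*somesite\.co\.nz/Search'; then
        return 0
    fi
    # No rule matched, so the url is dropped.
    return 1
}

# Example:
# url_allowed 'http://somesite.co.nz/Search?DateRange=7&industry=62' && echo keep
```

Note that '?' and '=' are absent from the skip rule, so parameterised urls are no longer rejected by it.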
6. Run Nutch
● Now run Nutch using the start script
– cd apache-nutch-1.6 ; ./nutch_start.bash
● Monitor the Solr admin log window for errors
● The Nutch crawl should end with
– crawl finished: crawl
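If the crawl output is captured in a file, the final message can be checked automatically. This is a small sketch under that assumption; crawl_finished is a hypothetical helper name and crawl.log is an assumed file name.

```shell
# Return 0 if the captured crawl output contains the final message.
# crawl_finished is a hypothetical helper; $1 is the log file path.
crawl_finished() {
    grep -q 'crawl finished: crawl' "$1"
}

# Example (not run here; needs the Nutch environment from the slides):
# cd apache-nutch-1.6
# ./nutch_start.bash 2>&1 | tee crawl.log
# crawl_finished crawl.log && echo "crawl completed"
```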
7. Checking Data
● Data should have been indexed in Solr
● In Solr Admin window
– Set 'Core Selector' = collection1
– Click 'Query'
– In Query window set fl field = url
– Click Execute Query
● The result ( next ) shows the filtered list of urls in Solr
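The same query can also be issued from the command line instead of the Admin window. The sketch below builds a select url that asks only for the url field. solr_urls_query is a hypothetical helper name, and the defaults assume a local Solr server with the collection1 core, as in these slides.

```shell
# Build a Solr select url that returns only the url field.
# solr_urls_query is a hypothetical helper for illustration only.
# $1 base url, $2 core name, $3 number of rows (all optional).
solr_urls_query() {
    base="${1:-http://localhost:8983/solr}"
    core="${2:-collection1}"
    rows="${3:-10}"
    # q=*%3A* is the url-encoded form of *:* (match all documents).
    printf '%s/%s/select?q=*%%3A*&fl=url&rows=%s&wt=json\n' \
        "$base" "$core" "$rows"
}

# Example (not run here; needs the Solr server running):
# curl "$(solr_urls_query)"
```

The fl=url parameter corresponds to the fl field set in the Query window, and rows limits how many documents come back.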
9. Results
● Congratulations, you have completed your second crawl
– With parameterised urls
– More complex url filtering
– With a Solr Query search
10. Contact Us
● Feel free to contact us at
– www.semtech-solutions.co.nz
– info@semtech-solutions.co.nz
● We offer IT project consultancy
● We are happy to hear about your problems
● You pay only for the hours that you need to solve your problems