Part 2 of a three-part presentation showing how Nutch and Solr may be used to crawl the web, extract data and prepare it for loading into a data warehouse.
1. Web Scraping Using Nutch and Solr - Part 2
• The following example assumes that you have
• Watched "Web Scraping Using Nutch and Solr" (part 1)
• The YouTube id of that movie is cAiYBD4BQeE
• Set up a Linux-based Nutch/Solr environment
• Run the web scrape shown in that movie
• Now we will
• Clean up that environment
• Web scrape a parameterised url
• View the urls in the data
2. Empty Nutch Database
• Clean up the Nutch crawl database
• Previously we used apache-nutch-1.6/nutch_start.sh
• This contained the -dir crawl option
• Which created the apache-nutch-1.6/crawl directory
• Which contains our Nutch data
• Clean this as
• cd apache-nutch-1.6; rm -rf crawl
• Only because it contained dummy data!
• The next run of the script will create the directory again
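The clean-up above can be sketched as a short script; the apache-nutch-1.6 layout is the one from part 1, and a dummy directory stands in for real crawl data so the sketch is safe to dry-run:

```shell
# Remove the Nutch crawl database created by a previous run.
# A dummy crawldb directory stands in for the old crawl data here.
mkdir -p apache-nutch-1.6/crawl/crawldb   # stand-in for previous crawl output
cd apache-nutch-1.6
rm -rf crawl                              # delete the old crawl data
cd ..
[ ! -d apache-nutch-1.6/crawl ] && echo "crawl directory removed"
```

The directory is recreated automatically by the next run of the start script, so deleting it is safe as long as the data really is disposable.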
3. Empty Solr Database
• Clean the Solr database via a url
• Bookmark this url
• Only use it if you need to empty your data
• Run the following (with the Solr server running)
• curl http://localhost:8983/solr/update?commit=true -d
'<delete><query>*:*</query></delete>'
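From the shell this is a curl POST of a delete-by-query message; a minimal sketch, assuming the default Solr port 8983 and single-core url layout from part 1 (the curl line is commented out so nothing runs without a live server):

```shell
# Delete-all update url for the local Solr index (assumed default port 8983).
DELETE_URL='http://localhost:8983/solr/update?commit=true'
echo "$DELETE_URL"
# With Solr running, empty the index with:
# curl "$DELETE_URL" -H 'Content-Type: text/xml' -d '<delete><query>*:*</query></delete>'
```

The Content-Type header tells Solr the body is an XML update message rather than form data; commit=true makes the deletion visible immediately.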
4. Set up Nutch
• Now we will do something more complex
• Web scrape a url that has parameters, i.e.
• http://<site>/<function>?var1=val1&var2=val2
• This web scrape will
• Have extra url characters '?=&'
• Need greater search depth
• Need better url filtering
• Remember that you need to get permission to scrape a third party web site
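As a concrete sketch of that url shape, using the placeholder site that appears later in these slides (somesite.co.nz is not a real site):

```shell
# Build a parameterised url of the form http://<site>/<function>?var1=val1&var2=val2
SITE='somesite.co.nz'   # placeholder host used throughout these slides
URL="http://${SITE}/Search?DateRange=7&industry=62"
echo "$URL"
# The extra characters ? = & are skipped by Nutch's default url filter,
# which is why the filter must be relaxed in the next slide.
```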
5. Nutch Configuration
• Change the seed file for Nutch
• apache-nutch-1.6/urls/seed.txt
• In this instance I will use a url of the form
• http://somesite.co.nz/Search?DateRange=7&industry=62
• (this is not a real url, just an example)
• Change the conf/regex-urlfilter.txt entries, i.e.
• # skip URLs containing certain characters
• -[*!@]
• # accept anything else
• +^http://([a-z0-9]*\.)*somesite\.co\.nz/Search
• This will only consider somesite Search urls
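The accept rule can be sanity-checked outside Nutch with grep; the pattern and seed url below are the examples from this slide:

```shell
# Verify that the regex-urlfilter accept pattern matches the seed url.
PATTERN='^http://([a-z0-9]*\.)*somesite\.co\.nz/Search'
SEED='http://somesite.co.nz/Search?DateRange=7&industry=62'
if echo "$SEED" | grep -Eq "$PATTERN"; then
  echo "seed url accepted"
else
  echo "seed url rejected"
fi
```

A quick check like this avoids running a full crawl only to discover the filter silently rejected every url.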
6. Run Nutch
• Now run Nutch using the start script
• cd apache-nutch-1.6 ; ./nutch_start.sh
• Monitor for errors in the Solr admin log window
• The Nutch crawl should end with
• crawl finished: crawl
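The start script itself is not reproduced in the slides; a plausible sketch, assuming the Nutch 1.6 all-in-one crawl command and the -dir crawl option mentioned earlier (the depth and topN values are illustrative, chosen for the deeper crawl this example needs):

```shell
#!/bin/bash
# Hypothetical nutch_start.sh - assumed content, not taken from the slides.
DEPTH=5     # crawl deeper to follow the parameterised Search urls
TOPN=100    # maximum urls fetched per level
# With Nutch 1.6 on the path, the crawl-and-index command would be:
# bin/nutch crawl urls -dir crawl -solr http://localhost:8983/solr/ -depth $DEPTH -topN $TOPN
echo "crawl depth=$DEPTH topN=$TOPN"
```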
7. Checking Data
• Data should have been indexed in Solr
• In the Solr Admin window
• Set 'Core Selector' = collection1
• Click 'Query'
• In the Query window set the fl field = url
• Click 'Execute Query'
• The result (next) shows the filtered list of urls in Solr
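The same query can also be run without the admin UI; a sketch assuming the default collection1 core, with the curl line commented so it only runs against a live server:

```shell
# Solr select query returning only the url field (assumed collection1 core).
QUERY_URL='http://localhost:8983/solr/collection1/select?q=*:*&fl=url&wt=json&rows=10'
echo "$QUERY_URL"
# With Solr running:
# curl "$QUERY_URL"
```

q=*:* matches every document, fl=url restricts the response to the url field, and rows caps how many documents come back.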
9. Results
• Congratulations, you have completed your second crawl
• With parameterised urls
• With more complex url filtering
• With a Solr Query search
10. Contact Us
• Feel free to contact us at
• www.semtech-solutions.co.nz
• info@semtech-solutions.co.nz
• We offer IT project consultancy
• We are happy to hear about your problems
• You can just pay for the hours that you need to solve your problems