SlideShare ist ein Scribd-Unternehmen logo
1 von 10
Web Scraping Using Nutch and Solr - Part 2
● The following example assumes that you have
– Watched “web scraping with nutch and solr”
– The above movie identity is cAiYBD4BQeE
– Set up Linux based Nutch/Solr environment
– Run the web scrape in the above movie
● Now we will
– Clean up that environment
– Web scrape a parameterised url
– View the urls in the data
Empty Nutch Database
● Clean up the Nutch crawl database
– Previously used apache-nutch-1.6/nutch_start.sh
– This contained -dir crawl option
– This created apache-nutch-1.6/crawl directory
– Which contains our Nutch data
● Clean this as
– cd apache-nutch-1.6; rm -rf crawl
● Only because it contained dummy data !
● Next run of script will create dir again
Empty Solr Database
● Clean Solr database via a url
– Book mark this url
– Only use it if you need to empty your data
● Run the following ( with solr server running )
– http://localhost:8983/solr/update?commit=true -d
'<delete><query>*:*</query></delete>'
Set up Nutch
● Now we will do something more complex
● Web scrape a url that has parameters i.e.
– http://<site>/<function>?var1=val1&var2=val2
● This web scrape will
– Have extra url characters '?=&'
– Need greater search depth
– Need better url filtering
● Remember that you need to get permission to scrape a third
party web site
Nutch Configuration
● Change seed file for Nutch
● apache-nutch-1.6/urls/seed.txt
● In this instance I will use a url of the form
– http://somesite.co.nz/Search?DateRange=7&industry=62
– ( this is not a real url – just an example )
● Change conf regex-urlfilter.txt entry i.e.
– # skip URLs containing certain characters
– -[*!@]
– # accept anything else
– +^http://([a-z0-9]*.)*somesite.co.nz/Search
● This will only consider some site Search urls
Run Nutch
● Now run nutch using start script
– cd apache-nutch-1.6 ; ./nutch_start.bash
● Monitor for errors in solr admin log window
● The Nutch crawl should end with
– crawl finished: crawl
Checking Data
● Data should have been indexed in Solr
● In Solr Admin window
– Set 'Core Selector' = collection1
– Click 'Query'
– In Query window set fl field = url
– Click Execute Query
● The result ( next ) shows the filtered list of urls in Solr
Checking Data
Results
● Congratulations you have completed your second crawl
– With parameterised urls
– More complex url filtering
– With a Solr Query search
Contact Us
● Feel free to contact us at
– www.semtech-solutions.co.nz
– info@semtech-solutions.co.nz
● We offer IT project consultancy
● We are happy to hear about your problems
● You can just pay for those hours that you need
● To solve your problems

Weitere Àhnliche Inhalte

Was ist angesagt?

Web Crawling with Apache Nutch
Web Crawling with Apache NutchWeb Crawling with Apache Nutch
Web Crawling with Apache Nutchsebastian_nagel
 
A quick introduction to Storm Crawler
A quick introduction to Storm CrawlerA quick introduction to Storm Crawler
A quick introduction to Storm CrawlerJulien Nioche
 
Nutch as a Web data mining platform
Nutch as a Web data mining platformNutch as a Web data mining platform
Nutch as a Web data mining platformabial
 
Meet Solr For The Tirst Again
Meet Solr For The Tirst AgainMeet Solr For The Tirst Again
Meet Solr For The Tirst AgainVarun Thacker
 
Nutch - web-scale search engine toolkit
Nutch - web-scale search engine toolkitNutch - web-scale search engine toolkit
Nutch - web-scale search engine toolkitabial
 
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)Mark Kerzner
 
8a. How To Setup HBase with Docker
8a. How To Setup HBase with Docker8a. How To Setup HBase with Docker
8a. How To Setup HBase with DockerFabio Fumarola
 
Making Apache Kafka Elastic with Apache Mesos
Making Apache Kafka Elastic with Apache MesosMaking Apache Kafka Elastic with Apache Mesos
Making Apache Kafka Elastic with Apache MesosJoe Stein
 
Caching. api. http 1.1
Caching. api. http 1.1Caching. api. http 1.1
Caching. api. http 1.1Artjoker Digital
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
Hadoop 2.x HDFS Cluster Installation (VirtualBox)
Hadoop 2.x  HDFS Cluster Installation (VirtualBox)Hadoop 2.x  HDFS Cluster Installation (VirtualBox)
Hadoop 2.x HDFS Cluster Installation (VirtualBox)Amir Sedighi
 
Get started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache MesosGet started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache MesosJoe Stein
 
8b. Column Oriented Databases Lab
8b. Column Oriented Databases Lab8b. Column Oriented Databases Lab
8b. Column Oriented Databases LabFabio Fumarola
 
Developing Frameworks for Apache Mesos
Developing Frameworks  for Apache MesosDeveloping Frameworks  for Apache Mesos
Developing Frameworks for Apache MesosJoe Stein
 
You know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msYou know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msJodok Batlogg
 
Distributed Data Processing Workshop - SBU
Distributed Data Processing Workshop - SBUDistributed Data Processing Workshop - SBU
Distributed Data Processing Workshop - SBUAmir Sedighi
 
SphinxSE with MySQL
SphinxSE with MySQLSphinxSE with MySQL
SphinxSE with MySQLRitesh Puthran
 
Implementing Hadoop on a single cluster
Implementing Hadoop on a single clusterImplementing Hadoop on a single cluster
Implementing Hadoop on a single clusterSalil Navgire
 
[2D1]Elasticsearch á„‰á…„á†Œá„‚á…łá†Œ á„Žá…Źá„Œá…„á†šá„’á…Ș
[2D1]Elasticsearch á„‰á…„á†Œá„‚á…łá†Œ á„Žá…Źá„Œá…„á†šá„’á…Ș[2D1]Elasticsearch á„‰á…„á†Œá„‚á…łá†Œ á„Žá…Źá„Œá…„á†šá„’á…Ș
[2D1]Elasticsearch á„‰á…„á†Œá„‚á…łá†Œ á„Žá…Źá„Œá…„á†šá„’á…ȘNAVER D2
 
Elasticsearch 1.x Cluster Installation (VirtualBox)
Elasticsearch 1.x Cluster Installation (VirtualBox)Elasticsearch 1.x Cluster Installation (VirtualBox)
Elasticsearch 1.x Cluster Installation (VirtualBox)Amir Sedighi
 

Was ist angesagt? (20)

Web Crawling with Apache Nutch
Web Crawling with Apache NutchWeb Crawling with Apache Nutch
Web Crawling with Apache Nutch
 
A quick introduction to Storm Crawler
A quick introduction to Storm CrawlerA quick introduction to Storm Crawler
A quick introduction to Storm Crawler
 
Nutch as a Web data mining platform
Nutch as a Web data mining platformNutch as a Web data mining platform
Nutch as a Web data mining platform
 
Meet Solr For The Tirst Again
Meet Solr For The Tirst AgainMeet Solr For The Tirst Again
Meet Solr For The Tirst Again
 
Nutch - web-scale search engine toolkit
Nutch - web-scale search engine toolkitNutch - web-scale search engine toolkit
Nutch - web-scale search engine toolkit
 
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
 
8a. How To Setup HBase with Docker
8a. How To Setup HBase with Docker8a. How To Setup HBase with Docker
8a. How To Setup HBase with Docker
 
Making Apache Kafka Elastic with Apache Mesos
Making Apache Kafka Elastic with Apache MesosMaking Apache Kafka Elastic with Apache Mesos
Making Apache Kafka Elastic with Apache Mesos
 
Caching. api. http 1.1
Caching. api. http 1.1Caching. api. http 1.1
Caching. api. http 1.1
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Hadoop 2.x HDFS Cluster Installation (VirtualBox)
Hadoop 2.x  HDFS Cluster Installation (VirtualBox)Hadoop 2.x  HDFS Cluster Installation (VirtualBox)
Hadoop 2.x HDFS Cluster Installation (VirtualBox)
 
Get started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache MesosGet started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache Mesos
 
8b. Column Oriented Databases Lab
8b. Column Oriented Databases Lab8b. Column Oriented Databases Lab
8b. Column Oriented Databases Lab
 
Developing Frameworks for Apache Mesos
Developing Frameworks  for Apache MesosDeveloping Frameworks  for Apache Mesos
Developing Frameworks for Apache Mesos
 
You know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msYou know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900ms
 
Distributed Data Processing Workshop - SBU
Distributed Data Processing Workshop - SBUDistributed Data Processing Workshop - SBU
Distributed Data Processing Workshop - SBU
 
SphinxSE with MySQL
SphinxSE with MySQLSphinxSE with MySQL
SphinxSE with MySQL
 
Implementing Hadoop on a single cluster
Implementing Hadoop on a single clusterImplementing Hadoop on a single cluster
Implementing Hadoop on a single cluster
 
[2D1]Elasticsearch á„‰á…„á†Œá„‚á…łá†Œ á„Žá…Źá„Œá…„á†šá„’á…Ș
[2D1]Elasticsearch á„‰á…„á†Œá„‚á…łá†Œ á„Žá…Źá„Œá…„á†šá„’á…Ș[2D1]Elasticsearch á„‰á…„á†Œá„‚á…łá†Œ á„Žá…Źá„Œá…„á†šá„’á…Ș
[2D1]Elasticsearch á„‰á…„á†Œá„‚á…łá†Œ á„Žá…Źá„Œá…„á†šá„’á…Ș
 
Elasticsearch 1.x Cluster Installation (VirtualBox)
Elasticsearch 1.x Cluster Installation (VirtualBox)Elasticsearch 1.x Cluster Installation (VirtualBox)
Elasticsearch 1.x Cluster Installation (VirtualBox)
 

Ähnlich wie Web scraping with nutch solr part 2

Creating Elasticsearch Snapshots
Creating Elasticsearch SnapshotsCreating Elasticsearch Snapshots
Creating Elasticsearch SnapshotsVic Hargrave
 
Getting started with the Lupus Nuxt.js Drupal Stack
Getting started with the Lupus Nuxt.js Drupal StackGetting started with the Lupus Nuxt.js Drupal Stack
Getting started with the Lupus Nuxt.js Drupal Stacknuppla
 
Training Slides: 203 - Backup & Recovery
Training Slides: 203 - Backup & RecoveryTraining Slides: 203 - Backup & Recovery
Training Slides: 203 - Backup & RecoveryContinuent
 
Selenium grid workshop london 2016
Selenium grid workshop london 2016Selenium grid workshop london 2016
Selenium grid workshop london 2016Marcus Merrell
 
NGINX High-performance Caching
NGINX High-performance CachingNGINX High-performance Caching
NGINX High-performance CachingNGINX, Inc.
 
Cloud computing 3702
Cloud computing 3702Cloud computing 3702
Cloud computing 3702Jess Coburn
 
Elastic101tutorial Percona Live Europe 2018
Elastic101tutorial Percona Live Europe 2018Elastic101tutorial Percona Live Europe 2018
Elastic101tutorial Percona Live Europe 2018Alex Cercel
 
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018 Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018 Antonios Giannopoulos
 
Scrapy talk at DataPhilly
Scrapy talk at DataPhillyScrapy talk at DataPhilly
Scrapy talk at DataPhillyobdit
 
Clug 2012 March web server optimisation
Clug 2012 March   web server optimisationClug 2012 March   web server optimisation
Clug 2012 March web server optimisationgrooverdan
 
MongoDB - Sharded Cluster Tutorial
MongoDB - Sharded Cluster TutorialMongoDB - Sharded Cluster Tutorial
MongoDB - Sharded Cluster TutorialJason Terpko
 
MongoDB – Sharded cluster tutorial - Percona Europe 2017
MongoDB – Sharded cluster tutorial - Percona Europe 2017MongoDB – Sharded cluster tutorial - Percona Europe 2017
MongoDB – Sharded cluster tutorial - Percona Europe 2017Antonios Giannopoulos
 
NGINX 101 - now with more Docker
NGINX 101 - now with more DockerNGINX 101 - now with more Docker
NGINX 101 - now with more Dockersarahnovotny
 
NGINX 101 - now with more Docker
NGINX 101 - now with more DockerNGINX 101 - now with more Docker
NGINX 101 - now with more DockerSarah Novotny
 
Performance Optimization using Caching | Swatantra Kumar
Performance Optimization using Caching | Swatantra KumarPerformance Optimization using Caching | Swatantra Kumar
Performance Optimization using Caching | Swatantra KumarSwatantra Kumar
 
Nagios Conference 2014 - Eric Mislivec - Getting Started With Nagios Core
Nagios Conference 2014 - Eric Mislivec - Getting Started With Nagios CoreNagios Conference 2014 - Eric Mislivec - Getting Started With Nagios Core
Nagios Conference 2014 - Eric Mislivec - Getting Started With Nagios CoreNagios
 
Introduction to Drupal - Installation, Anatomy, Terminologies
Introduction to Drupal - Installation, Anatomy, TerminologiesIntroduction to Drupal - Installation, Anatomy, Terminologies
Introduction to Drupal - Installation, Anatomy, TerminologiesGerald Villorente
 
Kubernetes Walk Through from Technical View
Kubernetes Walk Through from Technical ViewKubernetes Walk Through from Technical View
Kubernetes Walk Through from Technical ViewLei (Harry) Zhang
 

Ähnlich wie Web scraping with nutch solr part 2 (20)

Creating Elasticsearch Snapshots
Creating Elasticsearch SnapshotsCreating Elasticsearch Snapshots
Creating Elasticsearch Snapshots
 
Getting started with the Lupus Nuxt.js Drupal Stack
Getting started with the Lupus Nuxt.js Drupal StackGetting started with the Lupus Nuxt.js Drupal Stack
Getting started with the Lupus Nuxt.js Drupal Stack
 
Training Slides: 203 - Backup & Recovery
Training Slides: 203 - Backup & RecoveryTraining Slides: 203 - Backup & Recovery
Training Slides: 203 - Backup & Recovery
 
Selenium grid workshop london 2016
Selenium grid workshop london 2016Selenium grid workshop london 2016
Selenium grid workshop london 2016
 
NGINX High-performance Caching
NGINX High-performance CachingNGINX High-performance Caching
NGINX High-performance Caching
 
Cloud computing 3702
Cloud computing 3702Cloud computing 3702
Cloud computing 3702
 
Elastic101tutorial Percona Live Europe 2018
Elastic101tutorial Percona Live Europe 2018Elastic101tutorial Percona Live Europe 2018
Elastic101tutorial Percona Live Europe 2018
 
Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018 Elastic 101 tutorial - Percona Europe 2018
Elastic 101 tutorial - Percona Europe 2018
 
Scrapy talk at DataPhilly
Scrapy talk at DataPhillyScrapy talk at DataPhilly
Scrapy talk at DataPhilly
 
Clug 2012 March web server optimisation
Clug 2012 March   web server optimisationClug 2012 March   web server optimisation
Clug 2012 March web server optimisation
 
Sharded cluster tutorial
Sharded cluster tutorialSharded cluster tutorial
Sharded cluster tutorial
 
MongoDB - Sharded Cluster Tutorial
MongoDB - Sharded Cluster TutorialMongoDB - Sharded Cluster Tutorial
MongoDB - Sharded Cluster Tutorial
 
MongoDB – Sharded cluster tutorial - Percona Europe 2017
MongoDB – Sharded cluster tutorial - Percona Europe 2017MongoDB – Sharded cluster tutorial - Percona Europe 2017
MongoDB – Sharded cluster tutorial - Percona Europe 2017
 
NGINX 101 - now with more Docker
NGINX 101 - now with more DockerNGINX 101 - now with more Docker
NGINX 101 - now with more Docker
 
NGINX 101 - now with more Docker
NGINX 101 - now with more DockerNGINX 101 - now with more Docker
NGINX 101 - now with more Docker
 
Performance Optimization using Caching | Swatantra Kumar
Performance Optimization using Caching | Swatantra KumarPerformance Optimization using Caching | Swatantra Kumar
Performance Optimization using Caching | Swatantra Kumar
 
les03.pdf
les03.pdfles03.pdf
les03.pdf
 
Nagios Conference 2014 - Eric Mislivec - Getting Started With Nagios Core
Nagios Conference 2014 - Eric Mislivec - Getting Started With Nagios CoreNagios Conference 2014 - Eric Mislivec - Getting Started With Nagios Core
Nagios Conference 2014 - Eric Mislivec - Getting Started With Nagios Core
 
Introduction to Drupal - Installation, Anatomy, Terminologies
Introduction to Drupal - Installation, Anatomy, TerminologiesIntroduction to Drupal - Installation, Anatomy, Terminologies
Introduction to Drupal - Installation, Anatomy, Terminologies
 
Kubernetes Walk Through from Technical View
Kubernetes Walk Through from Technical ViewKubernetes Walk Through from Technical View
Kubernetes Walk Through from Technical View
 

Mehr von Mike Frampton

Apache Airavata
Apache AiravataApache Airavata
Apache AiravataMike Frampton
 
Apache MADlib AI/ML
Apache MADlib AI/MLApache MADlib AI/ML
Apache MADlib AI/MLMike Frampton
 
Apache MXNet AI
Apache MXNet AIApache MXNet AI
Apache MXNet AIMike Frampton
 
Apache Gobblin
Apache GobblinApache Gobblin
Apache GobblinMike Frampton
 
Apache Singa AI
Apache Singa AIApache Singa AI
Apache Singa AIMike Frampton
 
Apache CouchDB
Apache CouchDBApache CouchDB
Apache CouchDBMike Frampton
 
An introduction to Apache Mesos
An introduction to Apache MesosAn introduction to Apache Mesos
An introduction to Apache MesosMike Frampton
 
An introduction to Pentaho
An introduction to PentahoAn introduction to Pentaho
An introduction to PentahoMike Frampton
 

Mehr von Mike Frampton (20)

Apache Airavata
Apache AiravataApache Airavata
Apache Airavata
 
Apache MADlib AI/ML
Apache MADlib AI/MLApache MADlib AI/ML
Apache MADlib AI/ML
 
Apache MXNet AI
Apache MXNet AIApache MXNet AI
Apache MXNet AI
 
Apache Gobblin
Apache GobblinApache Gobblin
Apache Gobblin
 
Apache Singa AI
Apache Singa AIApache Singa AI
Apache Singa AI
 
Apache Ranger
Apache RangerApache Ranger
Apache Ranger
 
OrientDB
OrientDBOrientDB
OrientDB
 
Prometheus
PrometheusPrometheus
Prometheus
 
Apache Tephra
Apache TephraApache Tephra
Apache Tephra
 
Apache Kudu
Apache KuduApache Kudu
Apache Kudu
 
Apache Bahir
Apache BahirApache Bahir
Apache Bahir
 
Apache Arrow
Apache ArrowApache Arrow
Apache Arrow
 
JanusGraph DB
JanusGraph DBJanusGraph DB
JanusGraph DB
 
Apache Ignite
Apache IgniteApache Ignite
Apache Ignite
 
Apache Samza
Apache SamzaApache Samza
Apache Samza
 
Apache Flink
Apache FlinkApache Flink
Apache Flink
 
Apache Edgent
Apache EdgentApache Edgent
Apache Edgent
 
Apache CouchDB
Apache CouchDBApache CouchDB
Apache CouchDB
 
An introduction to Apache Mesos
An introduction to Apache MesosAn introduction to Apache Mesos
An introduction to Apache Mesos
 
An introduction to Pentaho
An introduction to PentahoAn introduction to Pentaho
An introduction to Pentaho
 

KĂŒrzlich hochgeladen

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 

KĂŒrzlich hochgeladen (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 

Web scraping with nutch solr part 2

  • 1. Web Scraping Using Nutch and Solr - Part 2 ● The following example assumes that you have – Watched “web scraping with nutch and solr” – The above movie identity is cAiYBD4BQeE – Set up Linux based Nutch/Solr environment – Run the web scrape in the above movie ● Now we will – Clean up that environment – Web scrape a parameterised url – View the urls in the data
  • 2. Empty Nutch Database ● Clean up the Nutch crawl database – Previously used apache-nutch-1.6/nutch_start.sh – This contained -dir crawl option – This created apache-nutch-1.6/crawl directory – Which contains our Nutch data ● Clean this as – cd apache-nutch-1.6; rm -rf crawl ● Only because it contained dummy data ! ● Next run of script will create dir again
  • 3. Empty Solr Database ● Clean Solr database via a url – Book mark this url – Only use it if you need to empty your data ● Run the following ( with solr server running ) – http://localhost:8983/solr/update?commit=true -d '<delete><query>*:*</query></delete>'
  • 4. Set up Nutch ● Now we will do something more complex ● Web scrape a url that has parameters i.e. – http://<site>/<function>?var1=val1&var2=val2 ● This web scrape will – Have extra url characters '?=&' – Need greater search depth – Need better url filtering ● Remember that you need to get permission to scrape a third party web site
  • 5. Nutch Configuration ● Change seed file for Nutch ● apache-nutch-1.6/urls/seed.txt ● In this instance I will use a url of the form – http://somesite.co.nz/Search?DateRange=7&industry=62 – ( this is not a real url – just an example ) ● Change conf regex-urlfilter.txt entry i.e. – # skip URLs containing certain characters – -[*!@] – # accept anything else – +^http://([a-z0-9]*.)*somesite.co.nz/Search ● This will only consider some site Search urls
  • 6. Run Nutch ● Now run nutch using start script – cd apache-nutch-1.6 ; ./nutch_start.bash ● Monitor for errors in solr admin log window ● The Nutch crawl should end with – crawl finished: crawl
  • 7. Checking Data ● Data should have been indexed in Solr ● In Solr Admin window – Set 'Core Selector' = collection1 – Click 'Query' – In Query window set fl field = url – Click Execute Query ● The result ( next ) shows the filtered list of urls in Solr
  • 9. Results ● Congratulations you have completed your second crawl – With parameterised urls – More complex url filtering – With a Solr Query search
  • 10. Contact Us ● Feel free to contact us at – www.semtech-solutions.co.nz – info@semtech-solutions.co.nz ● We offer IT project consultancy ● We are happy to hear about your problems ● You can just pay for those hours that you need ● To solve your problems