SlideShare ist ein Scribd-Unternehmen logo
1 von 21
Downloaden Sie, um offline zu lesen
Web Scraping 1-2-3 with
   Python + Scrapy
        Sammy Fung
    sammy.hk, gownjob.com
Today Agenda
● Some Cases
● Python and Scrapy
Web Scraping
● a computer software technique of extracting
  information from websites. (Wikipedia)
● for business, hobbies, research.......
● NOT talk about business cases today.
CableTV & NOWTV Programme
(Past)
● 2004.
● slow, slow, slow, or worst - can't connect.
● use Flash.
HK Observatory and Joint Typhoon
Warning Center
● no easy data exchange format, eg.
  RSS/Atom.
● We won't check websites everyday.
Transportation - KMB, PTES
● no map view on KMB website for a bus route
  in the past.
● Exteremly Poor, Ugly (or much worse) map
  UI on PTES.
My experiences on web scraping
● 2004: php
● year after: python
● recent year: python with scrapy
Document Types
● HTML, XML,......
● Text
● Others, eg. pictures, videos,......
Web Scraping
●   Look for right URLs to scrap.
●   Look for right content from webpages.
●   Saving data into data store.
●   When to run the web scraping program ?
What is Scrapy ?
● An open source web scraping framework for
  Python.
● Scrapy is a fast high-level screen scraping
  and web crawling framework, used to crawl
  websites and extract structured data from
  their pages. It can be used for a wide range
  of purposes, from data mining to monitoring
  and automated testing.
Features of Scrapy
● define data you want to scrapy
● write spider to extract data
● Built-in: selecting and extracting data from
  HTML and XML
● Built-in: JSON, CSV, XML output
● Interactive shell console
● Built-in: web service, telnet console, logging
● Others
Installation of Scrapy
●   pip
●   APT repo
●   RPM
●   tarball (binary/source)
Create new scrapy project
$ scrapy startproject mybot
mybot/
mybot/scrapy.cfg
mybot/mybot/items.py
mybot/mybot/pipeline.py
mybot/mybot/settings.py
mybot/mybot/spiders/myspider.py
etc.......
items.py
from scrapy.item import Item, Field

class HKOCurrentItem(Item):
 time = Field()
 station = Field()
 temperature = Field()
 humidity = Field()
 #......
spiders/hko_spider.py (1/5)
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from weatherhk.items import HKOCurrentItem

import datetime, re
spiders/hko_spider.py (2/5)
class HKOCurrentSpider(BaseSpider):
 name = "HKOCurrentSpider"
 #allowed_domains = ["www.weather.gov.hk"]
 start_urls = [
   "http://www.weather.gov.
hk/textonly/forecast/chinesewx.htm"
 ]
spiders/hko_spider.py (3/5)
def parse(self, response):
 hxs = HtmlXPathSelector(response)

 stations = []
 # Getting weather data from each stations.
 tx = hxs.select("//pre[1]/text()").re('[^n]*n')
spiders/hko_spider.py (4/5)
   for i in tx:
     if re.search(u'd 度',i):
       data = HKOCurrentItem()
       data['time'] = int(dt)
       data['station'] = self.station.code(i)
       data['temperature'] = int(re.findall(u'd+',i)
[0])
       stations.append(data)
spiders/hko_spider.py (5/5)
 return stations
pipelines.py (1/2)
class HKOCurrentPipeline(object):
 def process_item(self, item, spider):
   station = self.db[item['station']]
   storeditem = dict(item.__dict__)['_values']
pipelines.py (2/2)
   try:
     if 'temperature' in storeditem:
       lasttime = station.find({'temperature': {'$gt':
0}}).sort('time', -1).limit(1)
     if lasttime[0]['time'] != storeditem['time']:
       id = self.insert(storeditem)

  return item

Weitere ähnliche Inhalte

Was ist angesagt?

Go for Object Oriented Programmers or Object Oriented Programming without Obj...
Go for Object Oriented Programmers or Object Oriented Programming without Obj...Go for Object Oriented Programmers or Object Oriented Programming without Obj...
Go for Object Oriented Programmers or Object Oriented Programming without Obj...Steven Francia
 
Thinking in documents
Thinking in documentsThinking in documents
Thinking in documentsCésar Rodas
 
Dive into .git
Dive into .gitDive into .git
Dive into .gitnishio
 
世界のどこかで楽しくRubyでお仕事するために
世界のどこかで楽しくRubyでお仕事するために世界のどこかで楽しくRubyでお仕事するために
世界のどこかで楽しくRubyでお仕事するためにKuniaki Igarashi
 
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Kai Chan
 
Using MongoDB With Groovy
Using MongoDB With GroovyUsing MongoDB With Groovy
Using MongoDB With GroovyJames Williams
 
Java Development with MongoDB (James Williams)
Java Development with MongoDB (James Williams)Java Development with MongoDB (James Williams)
Java Development with MongoDB (James Williams)MongoSF
 
7 Common Mistakes in Go (2015)
7 Common Mistakes in Go (2015)7 Common Mistakes in Go (2015)
7 Common Mistakes in Go (2015)Steven Francia
 
Bo0oM - There's Nothing so Permanent as Temporary (PHDays IV, 2014)
Bo0oM - There's Nothing so Permanent as Temporary (PHDays IV, 2014)Bo0oM - There's Nothing so Permanent as Temporary (PHDays IV, 2014)
Bo0oM - There's Nothing so Permanent as Temporary (PHDays IV, 2014)Дмитрий Бумов
 
regular expressions and the world wide web
regular expressions and the world wide webregular expressions and the world wide web
regular expressions and the world wide webSergio Burdisso
 
7 Common mistakes in Go and when to avoid them
7 Common mistakes in Go and when to avoid them7 Common mistakes in Go and when to avoid them
7 Common mistakes in Go and when to avoid themSteven Francia
 
Python learning for Natural Language Processing (2nd)
Python learning for Natural Language Processing (2nd)Python learning for Natural Language Processing (2nd)
Python learning for Natural Language Processing (2nd)EunGi Hong
 
Golang slidesaudrey
Golang slidesaudreyGolang slidesaudrey
Golang slidesaudreyAudrey Lim
 
성장을 좋아하는 사람이, 성장하고 싶은 사람에게
성장을 좋아하는 사람이, 성장하고 싶은 사람에게성장을 좋아하는 사람이, 성장하고 싶은 사람에게
성장을 좋아하는 사람이, 성장하고 싶은 사람에게Seongyun Byeon
 
Painless Data Storage with MongoDB & Go
Painless Data Storage with MongoDB & Go Painless Data Storage with MongoDB & Go
Painless Data Storage with MongoDB & Go Steven Francia
 
Writing dumb tests
Writing dumb testsWriting dumb tests
Writing dumb testsLuke Lee
 
Spatial Mongo and Node.JS on Openshift JS.Everywhere 2012
Spatial Mongo and Node.JS on Openshift JS.Everywhere 2012Spatial Mongo and Node.JS on Openshift JS.Everywhere 2012
Spatial Mongo and Node.JS on Openshift JS.Everywhere 2012Steven Pousty
 

Was ist angesagt? (20)

Go for Object Oriented Programmers or Object Oriented Programming without Obj...
Go for Object Oriented Programmers or Object Oriented Programming without Obj...Go for Object Oriented Programmers or Object Oriented Programming without Obj...
Go for Object Oriented Programmers or Object Oriented Programming without Obj...
 
Thinking in documents
Thinking in documentsThinking in documents
Thinking in documents
 
Dive into .git
Dive into .gitDive into .git
Dive into .git
 
世界のどこかで楽しくRubyでお仕事するために
世界のどこかで楽しくRubyでお仕事するために世界のどこかで楽しくRubyでお仕事するために
世界のどこかで楽しくRubyでお仕事するために
 
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
Search Engine-Building with Lucene and Solr, Part 2 (SoCal Code Camp LA 2013)
 
Using MongoDB With Groovy
Using MongoDB With GroovyUsing MongoDB With Groovy
Using MongoDB With Groovy
 
Django Mongodb Engine
Django Mongodb EngineDjango Mongodb Engine
Django Mongodb Engine
 
Java Development with MongoDB (James Williams)
Java Development with MongoDB (James Williams)Java Development with MongoDB (James Williams)
Java Development with MongoDB (James Williams)
 
Fun with Python
Fun with PythonFun with Python
Fun with Python
 
7 Common Mistakes in Go (2015)
7 Common Mistakes in Go (2015)7 Common Mistakes in Go (2015)
7 Common Mistakes in Go (2015)
 
Bo0oM - There's Nothing so Permanent as Temporary (PHDays IV, 2014)
Bo0oM - There's Nothing so Permanent as Temporary (PHDays IV, 2014)Bo0oM - There's Nothing so Permanent as Temporary (PHDays IV, 2014)
Bo0oM - There's Nothing so Permanent as Temporary (PHDays IV, 2014)
 
regular expressions and the world wide web
regular expressions and the world wide webregular expressions and the world wide web
regular expressions and the world wide web
 
7 Common mistakes in Go and when to avoid them
7 Common mistakes in Go and when to avoid them7 Common mistakes in Go and when to avoid them
7 Common mistakes in Go and when to avoid them
 
Python learning for Natural Language Processing (2nd)
Python learning for Natural Language Processing (2nd)Python learning for Natural Language Processing (2nd)
Python learning for Natural Language Processing (2nd)
 
Golang slidesaudrey
Golang slidesaudreyGolang slidesaudrey
Golang slidesaudrey
 
성장을 좋아하는 사람이, 성장하고 싶은 사람에게
성장을 좋아하는 사람이, 성장하고 싶은 사람에게성장을 좋아하는 사람이, 성장하고 싶은 사람에게
성장을 좋아하는 사람이, 성장하고 싶은 사람에게
 
Painless Data Storage with MongoDB & Go
Painless Data Storage with MongoDB & Go Painless Data Storage with MongoDB & Go
Painless Data Storage with MongoDB & Go
 
Writing dumb tests
Writing dumb testsWriting dumb tests
Writing dumb tests
 
Kiosk / PHP
Kiosk / PHP Kiosk / PHP
Kiosk / PHP
 
Spatial Mongo and Node.JS on Openshift JS.Everywhere 2012
Spatial Mongo and Node.JS on Openshift JS.Everywhere 2012Spatial Mongo and Node.JS on Openshift JS.Everywhere 2012
Spatial Mongo and Node.JS on Openshift JS.Everywhere 2012
 

Andere mochten auch

Collecting web information with open source tools
Collecting web information with open source toolsCollecting web information with open source tools
Collecting web information with open source toolsSammy Fung
 
Crawling the web for fun and profit
Crawling the web for fun and profitCrawling the web for fun and profit
Crawling the web for fun and profitFederico Feroldi
 
When big data meet python @ COSCUP 2012
When big data meet python @ COSCUP 2012When big data meet python @ COSCUP 2012
When big data meet python @ COSCUP 2012Jimmy Lai
 
Web Crawling Modeling with Scrapy Models #TDC2014
Web Crawling Modeling with Scrapy Models #TDC2014Web Crawling Modeling with Scrapy Models #TDC2014
Web Crawling Modeling with Scrapy Models #TDC2014Bruno Rocha
 
Qredo gravytrain digital pitch '15
Qredo gravytrain digital pitch '15Qredo gravytrain digital pitch '15
Qredo gravytrain digital pitch '15Dan Whitehouse
 
RESTful API Design Fundamentals
RESTful API Design FundamentalsRESTful API Design Fundamentals
RESTful API Design FundamentalsHüseyin BABAL
 
Web Engineering - Web Applications versus Conventional Software
Web Engineering - Web Applications versus Conventional SoftwareWeb Engineering - Web Applications versus Conventional Software
Web Engineering - Web Applications versus Conventional SoftwareNosheen Qamar
 
XPath for web scraping
XPath for web scrapingXPath for web scraping
XPath for web scrapingScrapinghub
 

Andere mochten auch (20)

Collecting web information with open source tools
Collecting web information with open source toolsCollecting web information with open source tools
Collecting web information with open source tools
 
Crawling the web for fun and profit
Crawling the web for fun and profitCrawling the web for fun and profit
Crawling the web for fun and profit
 
Scrapy-101
Scrapy-101Scrapy-101
Scrapy-101
 
When big data meet python @ COSCUP 2012
When big data meet python @ COSCUP 2012When big data meet python @ COSCUP 2012
When big data meet python @ COSCUP 2012
 
Web Crawling Modeling with Scrapy Models #TDC2014
Web Crawling Modeling with Scrapy Models #TDC2014Web Crawling Modeling with Scrapy Models #TDC2014
Web Crawling Modeling with Scrapy Models #TDC2014
 
Scrapy
ScrapyScrapy
Scrapy
 
Qredo gravytrain digital pitch '15
Qredo gravytrain digital pitch '15Qredo gravytrain digital pitch '15
Qredo gravytrain digital pitch '15
 
CPA-Script
CPA-ScriptCPA-Script
CPA-Script
 
摘星
摘星摘星
摘星
 
Relevance Assessment Tool
Relevance Assessment ToolRelevance Assessment Tool
Relevance Assessment Tool
 
Scrapy workshop
Scrapy workshopScrapy workshop
Scrapy workshop
 
RESTful API Design Fundamentals
RESTful API Design FundamentalsRESTful API Design Fundamentals
RESTful API Design Fundamentals
 
Web::Scraper
Web::ScraperWeb::Scraper
Web::Scraper
 
Pydata-Python tools for webscraping
Pydata-Python tools for webscrapingPydata-Python tools for webscraping
Pydata-Python tools for webscraping
 
Website vs web app
Website vs web appWebsite vs web app
Website vs web app
 
Mobile Website vs Mobile App
Mobile Website vs Mobile AppMobile Website vs Mobile App
Mobile Website vs Mobile App
 
Web Engineering - Web Applications versus Conventional Software
Web Engineering - Web Applications versus Conventional SoftwareWeb Engineering - Web Applications versus Conventional Software
Web Engineering - Web Applications versus Conventional Software
 
Scraping the web with python
Scraping the web with pythonScraping the web with python
Scraping the web with python
 
XPath for web scraping
XPath for web scrapingXPath for web scraping
XPath for web scraping
 
Webscraping with asyncio
Webscraping with asyncioWebscraping with asyncio
Webscraping with asyncio
 

Ähnlich wie Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)

Open Source Weather Information Project with OpenStack Object Storage
Open Source Weather Information Project with OpenStack Object StorageOpen Source Weather Information Project with OpenStack Object Storage
Open Source Weather Information Project with OpenStack Object StorageSammy Fung
 
Monitoraggio del Traffico di Rete Usando Python ed ntop
Monitoraggio del Traffico di Rete Usando Python ed ntopMonitoraggio del Traffico di Rete Usando Python ed ntop
Monitoraggio del Traffico di Rete Usando Python ed ntopPyCon Italia
 
Creating Open Data with Open Source (beta2)
Creating Open Data with Open Source (beta2)Creating Open Data with Open Source (beta2)
Creating Open Data with Open Source (beta2)Sammy Fung
 
How do we develop open source software to help open data ? (MOSC 2013)
How do we develop open source software to help open data ? (MOSC 2013)How do we develop open source software to help open data ? (MOSC 2013)
How do we develop open source software to help open data ? (MOSC 2013)Sammy Fung
 
carrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIcarrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIYoni Davidson
 
Writing Fast Code - PyCon HK 2015
Writing Fast Code - PyCon HK 2015Writing Fast Code - PyCon HK 2015
Writing Fast Code - PyCon HK 2015Younggun Kim
 
Python. Why to learn?
Python. Why to learn?Python. Why to learn?
Python. Why to learn?Oleh Korkh
 
The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...Karthik Murugesan
 
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Jimmy Lai
 
How to scraping content from web for location-based mobile app.
How to scraping content from web for location-based mobile app.How to scraping content from web for location-based mobile app.
How to scraping content from web for location-based mobile app.Diep Nguyen
 
Writing Fast Code (JP) - PyCon JP 2015
Writing Fast Code (JP) - PyCon JP 2015Writing Fast Code (JP) - PyCon JP 2015
Writing Fast Code (JP) - PyCon JP 2015Younggun Kim
 
Website Monitoring with Distributed Messages/Tasks Processing (AMQP & RabbitM...
Website Monitoring with Distributed Messages/Tasks Processing (AMQP & RabbitM...Website Monitoring with Distributed Messages/Tasks Processing (AMQP & RabbitM...
Website Monitoring with Distributed Messages/Tasks Processing (AMQP & RabbitM...Jimmy DeadcOde
 
Python in Industry
Python in IndustryPython in Industry
Python in IndustryDharmit Shah
 
[KubeCon EU 2020] containerd Deep Dive
[KubeCon EU 2020] containerd Deep Dive[KubeCon EU 2020] containerd Deep Dive
[KubeCon EU 2020] containerd Deep DiveAkihiro Suda
 
PA-3 Debugging Wireless with Wireshark Including Large Trace Files, AirPcap &...
PA-3 Debugging Wireless with Wireshark Including Large Trace Files, AirPcap &...PA-3 Debugging Wireless with Wireshark Including Large Trace Files, AirPcap &...
PA-3 Debugging Wireless with Wireshark Including Large Trace Files, AirPcap &...Megumi Takeshita
 
Tornado Web Server Internals
Tornado Web Server InternalsTornado Web Server Internals
Tornado Web Server InternalsPraveen Gollakota
 
Streaming your Lyft Ride Prices - Flink Forward SF 2019
Streaming your Lyft Ride Prices - Flink Forward SF 2019Streaming your Lyft Ride Prices - Flink Forward SF 2019
Streaming your Lyft Ride Prices - Flink Forward SF 2019Thomas Weise
 
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...Flink Forward
 

Ähnlich wie Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version) (20)

Open Source Weather Information Project with OpenStack Object Storage
Open Source Weather Information Project with OpenStack Object StorageOpen Source Weather Information Project with OpenStack Object Storage
Open Source Weather Information Project with OpenStack Object Storage
 
Monitoraggio del Traffico di Rete Usando Python ed ntop
Monitoraggio del Traffico di Rete Usando Python ed ntopMonitoraggio del Traffico di Rete Usando Python ed ntop
Monitoraggio del Traffico di Rete Usando Python ed ntop
 
Creating Open Data with Open Source (beta2)
Creating Open Data with Open Source (beta2)Creating Open Data with Open Source (beta2)
Creating Open Data with Open Source (beta2)
 
How do we develop open source software to help open data ? (MOSC 2013)
How do we develop open source software to help open data ? (MOSC 2013)How do we develop open source software to help open data ? (MOSC 2013)
How do we develop open source software to help open data ? (MOSC 2013)
 
carrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-APIcarrow - Go bindings to Apache Arrow via C++-API
carrow - Go bindings to Apache Arrow via C++-API
 
Deploy your own P2P network
Deploy your own P2P networkDeploy your own P2P network
Deploy your own P2P network
 
Writing Fast Code - PyCon HK 2015
Writing Fast Code - PyCon HK 2015Writing Fast Code - PyCon HK 2015
Writing Fast Code - PyCon HK 2015
 
Python. Why to learn?
Python. Why to learn?Python. Why to learn?
Python. Why to learn?
 
The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...The magic behind your Lyft ride prices: A case study on machine learning and ...
The magic behind your Lyft ride prices: A case study on machine learning and ...
 
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
 
How to scraping content from web for location-based mobile app.
How to scraping content from web for location-based mobile app.How to scraping content from web for location-based mobile app.
How to scraping content from web for location-based mobile app.
 
Writing Fast Code (JP) - PyCon JP 2015
Writing Fast Code (JP) - PyCon JP 2015Writing Fast Code (JP) - PyCon JP 2015
Writing Fast Code (JP) - PyCon JP 2015
 
Website Monitoring with Distributed Messages/Tasks Processing (AMQP & RabbitM...
Website Monitoring with Distributed Messages/Tasks Processing (AMQP & RabbitM...Website Monitoring with Distributed Messages/Tasks Processing (AMQP & RabbitM...
Website Monitoring with Distributed Messages/Tasks Processing (AMQP & RabbitM...
 
Python in Industry
Python in IndustryPython in Industry
Python in Industry
 
[KubeCon EU 2020] containerd Deep Dive
[KubeCon EU 2020] containerd Deep Dive[KubeCon EU 2020] containerd Deep Dive
[KubeCon EU 2020] containerd Deep Dive
 
PA-3 Debugging Wireless with Wireshark Including Large Trace Files, AirPcap &...
PA-3 Debugging Wireless with Wireshark Including Large Trace Files, AirPcap &...PA-3 Debugging Wireless with Wireshark Including Large Trace Files, AirPcap &...
PA-3 Debugging Wireless with Wireshark Including Large Trace Files, AirPcap &...
 
Tornado Web Server Internals
Tornado Web Server InternalsTornado Web Server Internals
Tornado Web Server Internals
 
Crawler
CrawlerCrawler
Crawler
 
Streaming your Lyft Ride Prices - Flink Forward SF 2019
Streaming your Lyft Ride Prices - Flink Forward SF 2019Streaming your Lyft Ride Prices - Flink Forward SF 2019
Streaming your Lyft Ride Prices - Flink Forward SF 2019
 
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
Flink Forward San Francisco 2019: Streaming your Lyft Ride Prices - Thomas We...
 

Mehr von Sammy Fung

Python 爬網⾴工具 - Scrapy 介紹
Python 爬網⾴工具 - Scrapy 介紹Python 爬網⾴工具 - Scrapy 介紹
Python 爬網⾴工具 - Scrapy 介紹Sammy Fung
 
DevRel - Transform article writing from printing to online
DevRel - Transform article writing from printing to onlineDevRel - Transform article writing from printing to online
DevRel - Transform article writing from printing to onlineSammy Fung
 
Introduction to Open Source by opensource.hk (2019 Edition)
Introduction to Open Source by opensource.hk (2019 Edition)Introduction to Open Source by opensource.hk (2019 Edition)
Introduction to Open Source by opensource.hk (2019 Edition)Sammy Fung
 
My Open Source Journey - Developer and Community
My Open Source Journey - Developer and CommunityMy Open Source Journey - Developer and Community
My Open Source Journey - Developer and CommunitySammy Fung
 
Introduction to development with Django web framework
Introduction to development with Django web frameworkIntroduction to development with Django web framework
Introduction to development with Django web frameworkSammy Fung
 
香港中文開源軟件翻譯
香港中文開源軟件翻譯香港中文開源軟件翻譯
香港中文開源軟件翻譯Sammy Fung
 
Open Data and Web API
Open Data and Web APIOpen Data and Web API
Open Data and Web APISammy Fung
 
Global Open Source Development 2011-2014 Review and 2015 Forecast
Global Open Source Development 2011-2014 Review and 2015 ForecastGlobal Open Source Development 2011-2014 Review and 2015 Forecast
Global Open Source Development 2011-2014 Review and 2015 ForecastSammy Fung
 
Mozilla - Openness of the Web
Mozilla - Openness of the WebMozilla - Openness of the Web
Mozilla - Openness of the WebSammy Fung
 
Open Source Technology and Community
Open Source Technology and CommunityOpen Source Technology and Community
Open Source Technology and CommunitySammy Fung
 
Access Open Data with Open Source Software Tools
Access Open Data with Open Source Software ToolsAccess Open Data with Open Source Software Tools
Access Open Data with Open Source Software ToolsSammy Fung
 
Installation of LAMP Server with Ubuntu 14.10 Server Edition
Installation of LAMP Server with Ubuntu 14.10 Server EditionInstallation of LAMP Server with Ubuntu 14.10 Server Edition
Installation of LAMP Server with Ubuntu 14.10 Server EditionSammy Fung
 
Software Freedom and Open Source Community
Software Freedom and Open Source CommunitySoftware Freedom and Open Source Community
Software Freedom and Open Source CommunitySammy Fung
 
Building your own job site with Drupal
Building your own job site with DrupalBuilding your own job site with Drupal
Building your own job site with DrupalSammy Fung
 
Use open source software to develop ideas at work
Use open source software to develop ideas at workUse open source software to develop ideas at work
Use open source software to develop ideas at workSammy Fung
 
Software Freedom and Community
Software Freedom and CommunitySoftware Freedom and Community
Software Freedom and CommunitySammy Fung
 
Open Source Job Board
Open Source Job BoardOpen Source Job Board
Open Source Job BoardSammy Fung
 
Introduction of Mozilla Hong Kong (COSCUP 2014)
Introduction of Mozilla Hong Kong (COSCUP 2014)Introduction of Mozilla Hong Kong (COSCUP 2014)
Introduction of Mozilla Hong Kong (COSCUP 2014)Sammy Fung
 
Introduction of Open Source Job Board with Drupal CMS
Introduction of Open Source Job Board with Drupal CMSIntroduction of Open Source Job Board with Drupal CMS
Introduction of Open Source Job Board with Drupal CMSSammy Fung
 
Local Weather Information and GNOME Shell Extension
Local Weather Information and GNOME Shell ExtensionLocal Weather Information and GNOME Shell Extension
Local Weather Information and GNOME Shell ExtensionSammy Fung
 

Mehr von Sammy Fung (20)

Python 爬網⾴工具 - Scrapy 介紹
Python 爬網⾴工具 - Scrapy 介紹Python 爬網⾴工具 - Scrapy 介紹
Python 爬網⾴工具 - Scrapy 介紹
 
DevRel - Transform article writing from printing to online
DevRel - Transform article writing from printing to onlineDevRel - Transform article writing from printing to online
DevRel - Transform article writing from printing to online
 
Introduction to Open Source by opensource.hk (2019 Edition)
Introduction to Open Source by opensource.hk (2019 Edition)Introduction to Open Source by opensource.hk (2019 Edition)
Introduction to Open Source by opensource.hk (2019 Edition)
 
My Open Source Journey - Developer and Community
My Open Source Journey - Developer and CommunityMy Open Source Journey - Developer and Community
My Open Source Journey - Developer and Community
 
Introduction to development with Django web framework
Introduction to development with Django web frameworkIntroduction to development with Django web framework
Introduction to development with Django web framework
 
香港中文開源軟件翻譯
香港中文開源軟件翻譯香港中文開源軟件翻譯
香港中文開源軟件翻譯
 
Open Data and Web API
Open Data and Web APIOpen Data and Web API
Open Data and Web API
 
Global Open Source Development 2011-2014 Review and 2015 Forecast
Global Open Source Development 2011-2014 Review and 2015 ForecastGlobal Open Source Development 2011-2014 Review and 2015 Forecast
Global Open Source Development 2011-2014 Review and 2015 Forecast
 
Mozilla - Openness of the Web
Mozilla - Openness of the WebMozilla - Openness of the Web
Mozilla - Openness of the Web
 
Open Source Technology and Community
Open Source Technology and CommunityOpen Source Technology and Community
Open Source Technology and Community
 
Access Open Data with Open Source Software Tools
Access Open Data with Open Source Software ToolsAccess Open Data with Open Source Software Tools
Access Open Data with Open Source Software Tools
 
Installation of LAMP Server with Ubuntu 14.10 Server Edition
Installation of LAMP Server with Ubuntu 14.10 Server EditionInstallation of LAMP Server with Ubuntu 14.10 Server Edition
Installation of LAMP Server with Ubuntu 14.10 Server Edition
 
Software Freedom and Open Source Community
Software Freedom and Open Source CommunitySoftware Freedom and Open Source Community
Software Freedom and Open Source Community
 
Building your own job site with Drupal
Building your own job site with DrupalBuilding your own job site with Drupal
Building your own job site with Drupal
 
Use open source software to develop ideas at work
Use open source software to develop ideas at workUse open source software to develop ideas at work
Use open source software to develop ideas at work
 
Software Freedom and Community
Software Freedom and CommunitySoftware Freedom and Community
Software Freedom and Community
 
Open Source Job Board
Open Source Job BoardOpen Source Job Board
Open Source Job Board
 
Introduction of Mozilla Hong Kong (COSCUP 2014)
Introduction of Mozilla Hong Kong (COSCUP 2014)Introduction of Mozilla Hong Kong (COSCUP 2014)
Introduction of Mozilla Hong Kong (COSCUP 2014)
 
Introduction of Open Source Job Board with Drupal CMS
Introduction of Open Source Job Board with Drupal CMSIntroduction of Open Source Job Board with Drupal CMS
Introduction of Open Source Job Board with Drupal CMS
 
Local Weather Information and GNOME Shell Extension
Local Weather Information and GNOME Shell ExtensionLocal Weather Information and GNOME Shell Extension
Local Weather Information and GNOME Shell Extension
 

Kürzlich hochgeladen

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 

Kürzlich hochgeladen (20)

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 

Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)

  • 1. Web Scraping 1-2-3 with Python + Scrapy Sammy Fung sammy.hk, gownjob.com
  • 2. Today Agenda ● Some Cases ● Python and Scrapy
  • 3. Web Scraping ● a computer software technique of extracting information from websites. (Wikipedia) ● for business, hobbies, research....... ● NOT talk about business cases today.
  • 4. CableTV & NOWTV Programme (Past) ● 2004. ● slow, slow, slow, or worst - can't connect. ● use Flash.
  • 5. HK Observatory and Joint Typhoon Warning Center ● no easy data exchange format, eg. RSS/Atom. ● We won't check websites everyday.
  • 6. Transportation - KMB, PTES ● no map view on KMB website for a bus route in the past. ● Exteremly Poor, Ugly (or much worse) map UI on PTES.
  • 7. My experiences on web scraping ● 2004: php ● year after: python ● recent year: python with scrapy
  • 8. Document Types ● HTML, XML,...... ● Text ● Others, eg. pictures, videos,......
  • 9. Web Scraping ● Look for right URLs to scrap. ● Look for right content from webpages. ● Saving data into data store. ● When to run the web scraping program ?
  • 10. What is Scrapy ? ● An open source web scraping framework for Python. ● Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
  • 11. Features of Scrapy ● define data you want to scrapy ● write spider to extract data ● Built-in: selecting and extracting data from HTML and XML ● Built-in: JSON, CSV, XML output ● Interactive shell console ● Built-in: web service, telnet console, logging ● Others
  • 12. Installation of Scrapy ● pip ● APT repo ● RPM ● tarball (binary/source)
  • 13. Create new scrapy project $ scrapy startproject mybot mybot/ mybot/scrapy.cfg mybot/mybot/items.py mybot/mybot/pipeline.py mybot/mybot/settings.py mybot/mybot/spiders/myspider.py etc.......
  • 14. items.py from scrapy.item import Item, Field class HKOCurrentItem(Item): time = Field() station = Field() temperature = Field() humidity = Field() #......
  • 15. spiders/hko_spider.py (1/5) from scrapy.spider import BaseSpider from scrapy.selector import HtmlXPathSelector from weatherhk.items import HKOCurrentItem import datetime, re
  • 16. spiders/hko_spider.py (2/5) class HKOCurrentSpider(BaseSpider): name = "HKOCurrentSpider" #allowed_domains = ["www.weather.gov.hk"] start_urls = [ "http://www.weather.gov. hk/textonly/forecast/chinesewx.htm" ]
  • 17. spiders/hko_spider.py (3/5) def parse(self, response): hxs = HtmlXPathSelector(response) stations = [] # Getting weather data from each stations. tx = hxs.select("//pre[1]/text()").re('[^n]*n')
  • 18. spiders/hko_spider.py (4/5) for i in tx: if re.search(u'd 度',i): data = HKOCurrentItem() data['time'] = int(dt) data['station'] = self.station.code(i) data['temperature'] = int(re.findall(u'd+',i) [0]) stations.append(data)
  • 20. pipelines.py (1/2) class HKOCurrentPipeline(object): def process_item(self, item, spider): station = self.db[item['station']] storeditem = dict(item.__dict__)['_values']
  • 21. pipelines.py (2/2) try: if 'temperature' in storeditem: lasttime = station.find({'temperature': {'$gt': 0}}).sort('time', -1).limit(1) if lasttime[0]['time'] != storeditem['time']: id = self.insert(storeditem) return item