Python, Web Scraping and
Content Management:
Scrapy and Django
Sammy Fung
http://sammy.hk
OpenSource.HK Workshop 2014.07.05
Sammy Fung
● Perl → PHP → Python
● Linux → Open Source → Open Data
● Freelance → Startup
● http://sammy.hk
● sammy@sammy.hk
Open Data
Can a computer program read this?
Is this UI easy to understand?
Five Star Open Data
1. Make your stuff available on the Web (whatever format) under an open license.
2. Make it available as structured data (e.g., Excel instead of an image scan of a table).
3. Use non-proprietary formats (e.g., CSV instead of Excel).
4. Use URIs to denote things, so that people can point at your stuff.
5. Link your data to other data to provide context.
Source: 5stardata.info by Tim Berners-Lee, the inventor of the Web.
Open Data
● Data.One
– Led by the OGCIO of the Hong Kong Government.
– Uses the term “public sector information” (PSI) instead of “open data”.
– Much of the data is not available in a machine-readable format with a useful data structure.
– A lot of data still requires web scraping with customized data extraction to collect useful machine-readable data.
Web Scraping
with Scrapy
Web Scraping
a computer software
technique of extracting
information from websites.
(Wikipedia)
Scrapy
● Python.
● Open source web scraping framework.
● Scrape websites and extract structured data.
● From data mining to monitoring and
automated testing.
Scrapy
● Define your own data structures.
● Write spiders to extract data.
● Built-in XPath selectors for extracting data.
● Built-in JSON, CSV, XML output.
● Interactive shell console, telnet console, logging, and more.
scrapyd
● Scrapy web service daemon.
● pip install scrapyd
● Web API with simple Web UI:
– http://localhost:6800
● Web API Documentation:
– http://scrapyd.readthedocs.org/en/latest/api.html
scrapyd
● Examples:
– curl http://localhost:6800/listprojects.json
– curl http://localhost:6800/listspiders.json?project=default
● e.g. {"status": "ok", "spiders": ["pollutant24", "aqhi24"]}
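A crawl can also be scheduled through the same Web API, for example (assuming the default project and the aqhi24 spider listed above):
– curl http://localhost:6800/schedule.json -d project=default -d spider=aqhi24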
Scrapy Installation
$ apt-get install python python-virtualenv python-pip
$ virtualenv env
$ source env/bin/activate
$ pip install scrapy
Creating Scrapy Project
$ scrapy startproject <new project name>
newproject
|-- newproject
| |-- __init__.py
| |-- items.py
| |-- pipelines.py
| |-- settings.py
| |-- spiders
| | |-- __init__.py
|-- scrapy.cfg
Creating Scrapy Project
● Define your data structure
● Write your first spider
– Test with scrapy shell console
● Output / Store collected data
– Output with built-in supported formats
– Store to database / object store.
Define your data structure
items.py
from scrapy.item import Item, Field

class Hk0WeatherItem(Item):
    reporttime = Field()
    station = Field()
    temperture = Field()
    humidity = Field()
Write your first spider
● Import the class of your own data structure.
– $ scrapy genspider -t basic <YOUR SPIDER NAME> <DOMAIN>
– $ scrapy list
● Import any scrapy classes which you require.
– e.g. Spider, XPath Selector
● Extend the parse() function of a Spider class (see the sketch below).
● Test with the scrapy shell console.
– $ scrapy shell <URL>
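A minimal spider along these lines might look like the following sketch, in the Scrapy 0.24-era style of these slides; the class name, start URL and XPath expression are illustrative assumptions, not the actual hk0weather code:

from scrapy.spider import Spider
from scrapy.selector import Selector
from hk0weather.items import Hk0WeatherItem

class RegionalWeatherSpider(Spider):
    name = 'regionalwx'
    allowed_domains = ['hko.gov.hk']
    # Assumed start URL; the real spider targets the HKO regional weather page.
    start_urls = ['http://www.hko.gov.hk/']

    def parse(self, response):
        sel = Selector(response)
        item = Hk0WeatherItem()
        # Placeholder XPath; adapt it to the actual page structure.
        item['station'] = sel.xpath('//td[@class="station"]/text()').extract()
        return item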
Output / Store collected data
● Use the built-in JSON, CSV, XML output at the command line.
– $ scrapy crawl <Spider Name> -t json -o <Output File>
● pipelines.py (see the sketch below)
– Import the class of your own data structure.
– Extend the process_item() function.
– Add the pipeline to ITEM_PIPELINES in settings.py.
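A minimal sketch of such a pipeline and its registration; the cleanup step is only a placeholder:

from hk0weather.items import Hk0WeatherItem

class Hk0WeatherPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, Hk0WeatherItem):
            # Clean up / validate the collected fields here,
            # then store the item (e.g. to a database) or pass it on.
            pass
        return item

# settings.py
ITEM_PIPELINES = {
    'hk0weather.pipelines.Hk0WeatherPipeline': 300,
}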
Django web
framework
Creating django project
$ pip install django
$ django-admin.py startproject <Project name>
myproject
|-- manage.py
|-- myproject
|-- __init__.py
|-- settings.py
|-- urls.py
|-- wsgi.py
Creating django project
● Define django settings.
– Create database, tables and first django user.
● Create your own django app.
– or add existing django apps.
– Create database tables.
● Activate django admin UI.
– Add URL router to access admin UI.
Creating django project
● settings.py (see the sketch below)
– Define your database connection.
– Add your own app to INSTALLED_APPS.
– Define your own settings.
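A minimal sketch of those settings.py pieces for a Django 1.6-era project (the SQLite database and the app name myapp are assumptions):

import os
BASE_DIR = os.path.dirname(os.path.dirname(__file__))

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
    }
}

INSTALLED_APPS = (
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'myapp',
)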
Create django app
$ cd <Project Name>
$ python manage.py startapp <App Name>
myproject
|-- manage.py
|-- myproject
| |-- __init__.py
| |-- settings.py
| |-- urls.py
| |-- wsgi.py
|-- myapp
|-- admin.py
|-- __init__.py
|-- models.py
|-- tests.py
|-- views.py
Create django app
● Define your own data model.
● Define and activate your admin UI.
● Furthermore:
– Define your data views.
– Add URL routers to connect with data views.
Define django data model
● Define at models.py.
● Import django data model base class.
● Define your own data model class.
● Create database table(s).
– $ python manage.py syncdb
Define django data model
from django.db import models

class WeatherData(models.Model):
    reporttime = models.DateTimeField()
    station = models.CharField(max_length=3)
    temperture = models.FloatField(null=True, blank=True)
    humidity = models.IntegerField(null=True, blank=True)
Define django data model
● admin.py
– Import admin class
– Import your own data model class.
– Extend admin class for your data model.
– Register admin class
● with admin.site.register() function.
Define django data model
from django.contrib import admin
from myapp.models import WeatherData  # assuming the app name myapp used above

class WeatherDataAdmin(admin.ModelAdmin):
    list_display = ('reporttime', 'station', 'temperture', 'humidity', 'windspeed')
    list_filter = ['station']

admin.site.register(WeatherData, WeatherDataAdmin)
Enable django admin ui
● Add to INSTALLED_APPS in settings.py:
– django.contrib.admin
● Add the URL router in urls.py (see the sketch below), then:
– $ python manage.py runserver
● Access the admin UI:
– http://127.0.0.1:8000/admin
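A minimal urls.py sketch for the admin URL router, assuming a Django 1.6-era project as implied by syncdb above:

from django.conf.urls import patterns, include, url
from django.contrib import admin

admin.autodiscover()

urlpatterns = patterns('',
    url(r'^admin/', include(admin.site.urls)),
)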
Scrapy + Django
Scrapy + Django
● Define the django environment in the scrapy settings.
– Load the django configuration.
● Use the Scrapy DjangoItem class
– instead of the Item and Field classes.
– Define which django data model it should be linked with.
● Query and insert data in the scrapy pipelines (see the sketch below).
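A sketch of that wiring across the Scrapy settings, items and pipelines; paths and module names are assumptions to adapt, and the DjangoItem import path is the Scrapy 0.24-era one:

# Scrapy settings.py: point Scrapy at the Django project.
import os
import sys

sys.path.append('/path/to/myproject')  # assumed location of the Django project
os.environ['DJANGO_SETTINGS_MODULE'] = 'myproject.settings'

# Scrapy items.py: link the item to the Django model.
from scrapy.contrib.djangoitem import DjangoItem
from myapp.models import WeatherData

class Hk0WeatherItem(DjangoItem):
    django_model = WeatherData

# Scrapy pipelines.py: save each item through the Django ORM.
class Hk0WeatherPipeline(object):
    def process_item(self, item, spider):
        item.save()  # DjangoItem.save() creates a WeatherData row
        return item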
hk0weather
hk0weather
● Weather Data Project.
– https://github.com/sammyfung/hk0weather
– Converts weather information from HKO webpages to JSON data.
– python + scrapy + django
hk0weather
● Hong Kong Weather Data.
– 20+ HKO weather stations in Hong Kong.
– Regional weather data.
– Rainfall data.
– Weather forecast report.
hk0weather
● Set up and activate a python virtual environment, and install scrapy and django with pip.
● Clone hk0weather from GitHub
– $ git clone https://github.com/sammyfung/hk0weather.git
● Set up the database connection in Django and create the database, tables and first django user.
● Scrape regional weather data
– $ scrapy crawl regionalwx -t json -o regional.json
DEMO
Thank you!
http://sammy.hk
