SlideShare ist ein Scribd-Unternehmen logo
1 von 18
Downloaden Sie, um offline zu lesen
Intro to Data
ScrapingPRESENTED BY
DAVID SELASSIE OPOKU
@sdopoku
13 July 2015
Outline
1. Target audience
2. What is and Why Data Scraping?
3. Use cases
4. Basic steps & Best practices
5. Tools
6. Reference Resources
Target
Audience
This should be useful to ...
● Non-tech-savvy data journalists
● Advanced data journalists
● Web developers & data publishers
● School of Data fellows
● Open Data enthusiasts
What is &
Why Data
Scraping ?
Data Scraping: what is it ?
scrape [ verb ˈskrāp ]
: to remove from a surface by usually repeated strokes of an edged instrument
: to collect by or as if by scraping —often used with up or together <scrape up the
price of a ticket>
- Merriam Webster
“The transformation of unstructured data on the web, typically in HTML format, into
structured data that can be stored and analyzed in a central local database or
spreadsheet.”
- Wikipedia (web scraping)
When should you scrape data ?
● PDF Data
● HTML data
Machine-readable data
Example
Use Cases
Cases when you can scrape
● Create a dataset for a data workshop
● Create a database for a data -driven app
● Create a data visualisation for a story
Best
Practices
Best Practices For Scrapers
1. Scraping is not scary!
a. Use existing tools
2. Use a modern and friendly browser
a. Chrome, Firefox, Opera, Safari
b. Avoid Internet Explorer
3. Map out the process
a. Where does scraping fit in?
Best Practices For Data Publishers
1. Have a consistent structure
a. Websites
b. PDFs
2. Always think about your data end users
a. Before, during & after publishing
Steps
1. Map out the process/pipeline for your data project
2. Identify your data source (website, PDF, API?)
3. Decide on storage format for your scraped data
a. CSV file, Spreadsheet, Google docs
b. Database
4. Select scraping tool
5. Verify and Clean data
Tools
Tools: Web Browsers
Tools: Scraping Apps
1. Point and click
a. Scraper Google Chrome extension
b. ScraperWiki (Classic version)
c. Import.io, Kimono Labs, Webscraper.io
d. Tabula (PDF)
2. Programming (Python libraries)
a. Beautiful Soup
b. Pattern (PDF and HTML)
c. Scrapy
Tools: Storage & Sharing
1. Google Spreadsheets
2. Github
3. Datahub.io
Resources - Readings and Tools
1. Five data scraping tools for would-be data journalists
2. Making data on the web useful: scraping
3. Liberating HTML Data Tables
4. BeautifulSoup
5. Pattern
6. Scrapy
7. Datahub
8. Import.io
9. Kimono
10. Webscraper.io
11. Tabula

Weitere ähnliche Inhalte

Was ist angesagt?

Intro to web scraping with Python
Intro to web scraping with PythonIntro to web scraping with Python
Intro to web scraping with PythonMaris Lemba
 
Web Scraping using Python | Web Screen Scraping
Web Scraping using Python | Web Screen ScrapingWeb Scraping using Python | Web Screen Scraping
Web Scraping using Python | Web Screen ScrapingCynthiaCruz55
 
Web Scraping With Python
Web Scraping With PythonWeb Scraping With Python
Web Scraping With PythonRobert Dempsey
 
Getting started with Web Scraping in Python
Getting started with Web Scraping in PythonGetting started with Web Scraping in Python
Getting started with Web Scraping in PythonSatwik Kansal
 
Web scraping in python
Web scraping in python Web scraping in python
Web scraping in python Viren Rajput
 
Scraping data from the web and documents
Scraping data from the web and documentsScraping data from the web and documents
Scraping data from the web and documentsTommy Tavenner
 
Web scraping &amp; browser automation
Web scraping &amp; browser automationWeb scraping &amp; browser automation
Web scraping &amp; browser automationBHAWESH RAJPAL
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceMahir Haque
 
Big Data Visualization
Big Data VisualizationBig Data Visualization
Big Data VisualizationRaffael Marty
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceEdureka!
 
Data visualization introduction
Data visualization introductionData visualization introduction
Data visualization introductionManokamnaKochar1
 
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...Simplilearn
 
Big data and data science overview
Big data and data science overviewBig data and data science overview
Big data and data science overviewColleen Farrelly
 

Was ist angesagt? (20)

Web scraping
Web scrapingWeb scraping
Web scraping
 
Web Scraping Basics
Web Scraping BasicsWeb Scraping Basics
Web Scraping Basics
 
Tutorial on Web Scraping in Python
Tutorial on Web Scraping in PythonTutorial on Web Scraping in Python
Tutorial on Web Scraping in Python
 
Intro to web scraping with Python
Intro to web scraping with PythonIntro to web scraping with Python
Intro to web scraping with Python
 
Web Scraping using Python | Web Screen Scraping
Web Scraping using Python | Web Screen ScrapingWeb Scraping using Python | Web Screen Scraping
Web Scraping using Python | Web Screen Scraping
 
Web Scraping With Python
Web Scraping With PythonWeb Scraping With Python
Web Scraping With Python
 
Getting started with Web Scraping in Python
Getting started with Web Scraping in PythonGetting started with Web Scraping in Python
Getting started with Web Scraping in Python
 
Web scraping in python
Web scraping in python Web scraping in python
Web scraping in python
 
Search engine
Search engineSearch engine
Search engine
 
Web Scraping
Web ScrapingWeb Scraping
Web Scraping
 
Scraping data from the web and documents
Scraping data from the web and documentsScraping data from the web and documents
Scraping data from the web and documents
 
Web scraping &amp; browser automation
Web scraping &amp; browser automationWeb scraping &amp; browser automation
Web scraping &amp; browser automation
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Big Data Visualization
Big Data VisualizationBig Data Visualization
Big Data Visualization
 
Web mining
Web mining Web mining
Web mining
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Data visualization introduction
Data visualization introductionData visualization introduction
Data visualization introduction
 
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
 
Big data and data science overview
Big data and data science overviewBig data and data science overview
Big data and data science overview
 
Data Visualization.pptx
Data Visualization.pptxData Visualization.pptx
Data Visualization.pptx
 

Ähnlich wie Skillshare - Introduction to Data Scraping

Data Management and Horizon 2020
Data Management and Horizon 2020Data Management and Horizon 2020
Data Management and Horizon 2020Sarah Jones
 
The state of global research data initiatives: observations from a life on th...
The state of global research data initiatives: observations from a life on th...The state of global research data initiatives: observations from a life on th...
The state of global research data initiatives: observations from a life on th...Projeto RCAAP
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Srinath Perera
 
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo
 
Python Web Scraper for ACM and Google Scholar.pptx
Python Web Scraper for ACM and Google Scholar.pptxPython Web Scraper for ACM and Google Scholar.pptx
Python Web Scraper for ACM and Google Scholar.pptxASIMKHAN840563
 
Research Data Management
Research Data ManagementResearch Data Management
Research Data ManagementSarah Jones
 
Big data and machine learning / Gil Chamiel
Big data and machine learning / Gil Chamiel   Big data and machine learning / Gil Chamiel
Big data and machine learning / Gil Chamiel geektimecoil
 
HKU Data Curation MLIM7350 Student Project: Data Curation Workshop
HKU Data Curation MLIM7350 Student Project: Data Curation WorkshopHKU Data Curation MLIM7350 Student Project: Data Curation Workshop
HKU Data Curation MLIM7350 Student Project: Data Curation Workshopl_ernest
 
Manage Your Data! Navigating Data Services at the UW Libraries
Manage Your Data! Navigating Data Services at the UW LibrariesManage Your Data! Navigating Data Services at the UW Libraries
Manage Your Data! Navigating Data Services at the UW LibrariesJennifer Muilenburg
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 CareerBuilder.com
 
"Using Web 2.0 as a Weapon Against Corruption"
"Using Web 2.0 as a Weapon Against Corruption" "Using Web 2.0 as a Weapon Against Corruption"
"Using Web 2.0 as a Weapon Against Corruption" J T "Tom" Johnson
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentationTao Feng
 
Research Data Management: An Introductory Webinar from OpenAIRE and EUDAT
Research Data Management: An Introductory Webinar from OpenAIRE and EUDATResearch Data Management: An Introductory Webinar from OpenAIRE and EUDAT
Research Data Management: An Introductory Webinar from OpenAIRE and EUDATTony Ross-Hellauer
 
Research Data Management: An Introductory Webinar from OpenAIRE and EUDAT
Research Data Management: An Introductory Webinar from OpenAIRE and EUDATResearch Data Management: An Introductory Webinar from OpenAIRE and EUDAT
Research Data Management: An Introductory Webinar from OpenAIRE and EUDATOpenAIRE
 
Research Data Management Introduction: EUDAT/Open AIRE Webinar| www.eudat.eu |
Research Data Management Introduction: EUDAT/Open AIRE Webinar| www.eudat.eu | Research Data Management Introduction: EUDAT/Open AIRE Webinar| www.eudat.eu |
Research Data Management Introduction: EUDAT/Open AIRE Webinar| www.eudat.eu | EUDAT
 

Ähnlich wie Skillshare - Introduction to Data Scraping (20)

Data Management and Horizon 2020
Data Management and Horizon 2020Data Management and Horizon 2020
Data Management and Horizon 2020
 
The state of global research data initiatives: observations from a life on th...
The state of global research data initiatives: observations from a life on th...The state of global research data initiatives: observations from a life on th...
The state of global research data initiatives: observations from a life on th...
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack
 
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
 
Python Web Scraper for ACM and Google Scholar.pptx
Python Web Scraper for ACM and Google Scholar.pptxPython Web Scraper for ACM and Google Scholar.pptx
Python Web Scraper for ACM and Google Scholar.pptx
 
Research Data Management
Research Data ManagementResearch Data Management
Research Data Management
 
DMP & DMPonline
DMP & DMPonlineDMP & DMPonline
DMP & DMPonline
 
What is-rdm
What is-rdmWhat is-rdm
What is-rdm
 
unit 1 big data.pptx
unit 1 big data.pptxunit 1 big data.pptx
unit 1 big data.pptx
 
Meetup SF - Amundsen
Meetup SF  -  AmundsenMeetup SF  -  Amundsen
Meetup SF - Amundsen
 
Big data and machine learning / Gil Chamiel
Big data and machine learning / Gil Chamiel   Big data and machine learning / Gil Chamiel
Big data and machine learning / Gil Chamiel
 
HKU Data Curation MLIM7350 Student Project: Data Curation Workshop
HKU Data Curation MLIM7350 Student Project: Data Curation WorkshopHKU Data Curation MLIM7350 Student Project: Data Curation Workshop
HKU Data Curation MLIM7350 Student Project: Data Curation Workshop
 
Manage Your Data! Navigating Data Services at the UW Libraries
Manage Your Data! Navigating Data Services at the UW LibrariesManage Your Data! Navigating Data Services at the UW Libraries
Manage Your Data! Navigating Data Services at the UW Libraries
 
Publishing Linked Data using Schema.org
Publishing Linked Data using Schema.orgPublishing Linked Data using Schema.org
Publishing Linked Data using Schema.org
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018
 
"Using Web 2.0 as a Weapon Against Corruption"
"Using Web 2.0 as a Weapon Against Corruption" "Using Web 2.0 as a Weapon Against Corruption"
"Using Web 2.0 as a Weapon Against Corruption"
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
 
Research Data Management: An Introductory Webinar from OpenAIRE and EUDAT
Research Data Management: An Introductory Webinar from OpenAIRE and EUDATResearch Data Management: An Introductory Webinar from OpenAIRE and EUDAT
Research Data Management: An Introductory Webinar from OpenAIRE and EUDAT
 
Research Data Management: An Introductory Webinar from OpenAIRE and EUDAT
Research Data Management: An Introductory Webinar from OpenAIRE and EUDATResearch Data Management: An Introductory Webinar from OpenAIRE and EUDAT
Research Data Management: An Introductory Webinar from OpenAIRE and EUDAT
 
Research Data Management Introduction: EUDAT/Open AIRE Webinar| www.eudat.eu |
Research Data Management Introduction: EUDAT/Open AIRE Webinar| www.eudat.eu | Research Data Management Introduction: EUDAT/Open AIRE Webinar| www.eudat.eu |
Research Data Management Introduction: EUDAT/Open AIRE Webinar| www.eudat.eu |
 

Mehr von School of Data

School of Data - What is it?
School of Data - What is it?School of Data - What is it?
School of Data - What is it?School of Data
 
Skillshare - Creating Excel Dashboards
Skillshare - Creating Excel DashboardsSkillshare - Creating Excel Dashboards
Skillshare - Creating Excel DashboardsSchool of Data
 
Skillshare - Understanding extractives data
Skillshare - Understanding extractives dataSkillshare - Understanding extractives data
Skillshare - Understanding extractives dataSchool of Data
 
Skillshare - Regression Analysis for Data Journalism
Skillshare - Regression Analysis for Data JournalismSkillshare - Regression Analysis for Data Journalism
Skillshare - Regression Analysis for Data JournalismSchool of Data
 
Skillshare - Building a data literacy community in Nigeria
Skillshare - Building a data literacy community in NigeriaSkillshare - Building a data literacy community in Nigeria
Skillshare - Building a data literacy community in NigeriaSchool of Data
 
Skillshare - Using Kobo Toolbox for mobile data collection
Skillshare - Using Kobo Toolbox for mobile data collectionSkillshare - Using Kobo Toolbox for mobile data collection
Skillshare - Using Kobo Toolbox for mobile data collectionSchool of Data
 
Skillshare - Introduction to Timemapper
Skillshare - Introduction to TimemapperSkillshare - Introduction to Timemapper
Skillshare - Introduction to TimemapperSchool of Data
 
Skillshare - Let's talk about R in Data Journalism
Skillshare - Let's talk about R in Data JournalismSkillshare - Let's talk about R in Data Journalism
Skillshare - Let's talk about R in Data JournalismSchool of Data
 
From data to diagrams: an introduction to basic graphs and charts
From data to diagrams: an introduction to basic graphs and chartsFrom data to diagrams: an introduction to basic graphs and charts
From data to diagrams: an introduction to basic graphs and chartsSchool of Data
 
Introduction to Data Journalism
Introduction to Data JournalismIntroduction to Data Journalism
Introduction to Data JournalismSchool of Data
 
Skillshare getting feedback from training events
Skillshare  getting feedback from training events Skillshare  getting feedback from training events
Skillshare getting feedback from training events School of Data
 
Activism through the lens [english].pptx
Activism through the lens [english].pptxActivism through the lens [english].pptx
Activism through the lens [english].pptxSchool of Data
 
Gamification skillshare by Yuandra Ismiraldi
Gamification skillshare by Yuandra IsmiraldiGamification skillshare by Yuandra Ismiraldi
Gamification skillshare by Yuandra IsmiraldiSchool of Data
 
Facilitation skill share by Happy Feraren
Facilitation skill share by Happy FerarenFacilitation skill share by Happy Feraren
Facilitation skill share by Happy FerarenSchool of Data
 
Mapping Skillshare with School of Data
Mapping Skillshare with School of DataMapping Skillshare with School of Data
Mapping Skillshare with School of DataSchool of Data
 
Data Visualization & Design with School of Data
Data Visualization & Design with School of DataData Visualization & Design with School of Data
Data Visualization & Design with School of DataSchool of Data
 
Network mapping with School of Data
Network mapping with School of DataNetwork mapping with School of Data
Network mapping with School of DataSchool of Data
 

Mehr von School of Data (20)

School of Data - What is it?
School of Data - What is it?School of Data - What is it?
School of Data - What is it?
 
Skillshare - Creating Excel Dashboards
Skillshare - Creating Excel DashboardsSkillshare - Creating Excel Dashboards
Skillshare - Creating Excel Dashboards
 
Skillshare - Understanding extractives data
Skillshare - Understanding extractives dataSkillshare - Understanding extractives data
Skillshare - Understanding extractives data
 
Skillshare - Regression Analysis for Data Journalism
Skillshare - Regression Analysis for Data JournalismSkillshare - Regression Analysis for Data Journalism
Skillshare - Regression Analysis for Data Journalism
 
Skillshare - Building a data literacy community in Nigeria
Skillshare - Building a data literacy community in NigeriaSkillshare - Building a data literacy community in Nigeria
Skillshare - Building a data literacy community in Nigeria
 
Skillshare - Using Kobo Toolbox for mobile data collection
Skillshare - Using Kobo Toolbox for mobile data collectionSkillshare - Using Kobo Toolbox for mobile data collection
Skillshare - Using Kobo Toolbox for mobile data collection
 
Skillshare - Introduction to Timemapper
Skillshare - Introduction to TimemapperSkillshare - Introduction to Timemapper
Skillshare - Introduction to Timemapper
 
Skillshare - Let's talk about R in Data Journalism
Skillshare - Let's talk about R in Data JournalismSkillshare - Let's talk about R in Data Journalism
Skillshare - Let's talk about R in Data Journalism
 
Intro to open refine
Intro to open refineIntro to open refine
Intro to open refine
 
From data to diagrams: an introduction to basic graphs and charts
From data to diagrams: an introduction to basic graphs and chartsFrom data to diagrams: an introduction to basic graphs and charts
From data to diagrams: an introduction to basic graphs and charts
 
Introduction to Data Journalism
Introduction to Data JournalismIntroduction to Data Journalism
Introduction to Data Journalism
 
Skillshare getting feedback from training events
Skillshare  getting feedback from training events Skillshare  getting feedback from training events
Skillshare getting feedback from training events
 
Photography tips
Photography tipsPhotography tips
Photography tips
 
Activism through the lens [english].pptx
Activism through the lens [english].pptxActivism through the lens [english].pptx
Activism through the lens [english].pptx
 
Gamification skillshare by Yuandra Ismiraldi
Gamification skillshare by Yuandra IsmiraldiGamification skillshare by Yuandra Ismiraldi
Gamification skillshare by Yuandra Ismiraldi
 
Facilitation skill share by Happy Feraren
Facilitation skill share by Happy FerarenFacilitation skill share by Happy Feraren
Facilitation skill share by Happy Feraren
 
UX presentation
UX presentationUX presentation
UX presentation
 
Mapping Skillshare with School of Data
Mapping Skillshare with School of DataMapping Skillshare with School of Data
Mapping Skillshare with School of Data
 
Data Visualization & Design with School of Data
Data Visualization & Design with School of DataData Visualization & Design with School of Data
Data Visualization & Design with School of Data
 
Network mapping with School of Data
Network mapping with School of DataNetwork mapping with School of Data
Network mapping with School of Data
 

Kürzlich hochgeladen

04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 

Kürzlich hochgeladen (20)

04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 

Skillshare - Introduction to Data Scraping

  • 1. Intro to Data ScrapingPRESENTED BY DAVID SELASSIE OPOKU @sdopoku 13 July 2015
  • 2. Outline 1. Target audience 2. What is and Why Data Scraping? 3. Use cases 4. Basic steps & Best practices 5. Tools 6. Reference Resources
  • 4. This should be useful to ... ● Non-tech-savvy data journalists ● Advanced data journalists ● Web developers & data publishers ● School of Data fellows ● Open Data enthusiasts
  • 5. What is & Why Data Scraping ?
  • 6. Data Scraping: what is it ? scrape [ verb ˈskrāp ] : to remove from a surface by usually repeated strokes of an edged instrument : to collect by or as if by scraping —often used with up or together <scrape up the price of a ticket> - Merriam Webster “The transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet.” - Wikipedia (web scraping)
  • 7. When should you scrape data ? ● PDF Data ● HTML data Machine-readable data
  • 9. Cases when you can scrape ● Create a dataset for a data workshop ● Create a database for a data -driven app ● Create a data visualisation for a story
  • 11. Best Practices For Scrapers 1. Scraping is not scary! a. Use existing tools 2. Use a modern and friendly browser a. Chrome, Firefox, Opera, Safari b. Avoid Internet Explorer 3. Map out the process a. Where does scraping fit in?
  • 12. Best Practices For Data Publishers 1. Have a consistent structure a. Websites b. PDFs 2. Always think about your data end users a. Before, during & after publishing
  • 13. Steps 1. Map out the process/pipeline for your data project 2. Identify your data source (website, PDF, API?) 3. Decide on storage format for your scraped data a. CSV file, Spreadsheet, Google docs b. Database 4. Select scraping tool 5. Verify and Clean data
  • 14. Tools
  • 16. Tools: Scraping Apps 1. Point and click a. Scraper Google Chrome extension b. ScraperWiki (Classic version) c. Import.io, Kimono Labs, Webscraper.io d. Tabula (PDF) 2. Programming (Python libraries) a. Beautiful Soup b. Pattern (PDF and HTML) c. Scrapy
  • 17. Tools: Storage & Sharing 1. Google Spreadsheets 2. Github 3. Datahub.io
  • 18. Resources - Readings and Tools 1. Five data scraping tools for would-be data journalists 2. Making data on the web useful: scraping 3. Liberating HTML Data Tables 4. BeautifulSoup 5. Pattern 6. Scrapy 7. Datahub 8. Import.io 9. Kimono 10. Webscraper.io 11. Tabula