WEB SCRAPING
Dmytro Nekh
- Data scraping
- Types of data scraping
- Web scraping
- Process of web scraping
Data scraping
Data scraping is a technique in which a computer
program extracts data from human-readable output
produced by another program.
Types of data scraping
Screen scraping is the method of collecting screen display data from
one application and translating it so that another application is able to
display it.
Report mining is the extraction of data from human-readable
computer reports.
Web scraping is the extraction of data from the web: it turns
unstructured web data (including HTML) into structured data that you
can store on your local computer or in a database.
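As a rough illustration of turning HTML into structured data, the sketch below reads a small table and produces one record per row. It assumes Python's standard `html.parser` module; the sample markup and the names `SAMPLE_HTML`, `TableParser`, and `table_to_records` are hypothetical and not part of the original slides.

```python
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would download this HTML first.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
"""

class TableParser(HTMLParser):
    """Collects the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows, each a list of cell strings
        self._row = None       # row currently being read, or None
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Text outside table cells (such as indentation) is ignored.
        if self._in_cell:
            self._row[-1] += data.strip()

def table_to_records(html):
    """Return one dict per data row, keyed by the header row."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]

records = table_to_records(SAMPLE_HTML)
print(records)
```

The resulting list of dictionaries could then be written to a CSV file or a database table.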
Manual scraping: Copy-paste technique
Text Pattern Matching
This technique matches regular expressions against page text, using the
UNIX grep command or the regular-expression support in popular
programming languages.
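import re

def isPhoneNumber(text):
    # Not shown on the slide; assumed here to accept the form ddd-ddd-dddd.
    return re.fullmatch(r'\d{3}-\d{3}-\d{4}', text) is not None

message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'
# Slide a 12-character window over the message and test each chunk.
for i in range(len(message)):
    chunk = message[i:i+12]
    if isPhoneNumber(chunk):
        print('Phone number found: ' + chunk)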
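The same search can be written more directly with a regular expression, which is what the slide title refers to. This is a minimal sketch using Python's standard `re` module; the name `find_phone_numbers` and the exact pattern are assumptions, chosen to match numbers of the form ddd-ddd-dddd.

```python
import re

# Pattern for numbers such as 415-555-1011 (ddd-ddd-dddd).
# The lookarounds keep it from matching inside a longer run of digits.
PHONE_RE = re.compile(r'(?<!\d)\d{3}-\d{3}-\d{4}(?!\d)')

def find_phone_numbers(text):
    """Return every phone-number-shaped substring, in order of appearance."""
    return PHONE_RE.findall(text)

message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office.'
for number in find_phone_numbers(message):
    print('Phone number found: ' + number)
```

Letting the regex engine scan the string avoids testing every 12-character window by hand.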
Computer vision web-page analysis
There are efforts using machine learning and
computer vision that attempt to identify and extract
information from web pages by interpreting pages
visually as a human being might.
Vertical Aggregation
Vertical aggregation platforms are built by companies with large
computing resources and target specific verticals. Some run these
data-harvesting platforms in the cloud. The platforms create and monitor
bots for each vertical with virtually no human intervention. Because the
bots are generated automatically from the knowledge base for that
vertical, their efficiency is measured by the quality of the data they
extract.
HTML Parsing
HTML parsing is often done with JavaScript and targets linear or nested HTML pages. This fast and
robust method is used for text extraction, link extraction (for example, nested links or email
addresses), resource extraction, and so on.
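Link and email extraction can be sketched as follows. The slide names JavaScript; this example swaps in Python's standard `html.parser` module so that it matches the Python snippet used elsewhere in the deck. The sample markup and the names `LinkExtractor` and `extract_links` are hypothetical.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets of <a> tags, splitting out mailto: addresses."""

    def __init__(self):
        super().__init__()
        self.links = []   # ordinary link targets
        self.emails = []  # addresses taken from mailto: links

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        if href.startswith("mailto:"):
            self.emails.append(href[len("mailto:"):])
        else:
            self.links.append(href)

def extract_links(html):
    """Return (links, emails) found in the given HTML, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links, parser.emails

# Hypothetical page fragment with a nested list of links.
SAMPLE_HTML = """
<ul>
  <li><a href="/products">Products</a>
    <ul><li><a href="/products/widgets">Widgets</a></li></ul>
  </li>
  <li><a href="mailto:sales@example.com">Contact sales</a></li>
</ul>
"""

links, emails = extract_links(SAMPLE_HTML)
print(links)
print(emails)
```

Because the parser reports every start tag regardless of depth, nested links are picked up without any extra handling.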
DOM Parsing
The Document Object Model, or DOM, defines the style, structure, and
content of XML and HTML documents. DOM parsers are generally used by
scrapers that need an in-depth view of the structure of a web page.
One can use a DOM parser to get the nodes containing information, and
then use a tool like XPath to scrape web pages.
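A minimal sketch of this approach, assuming well-formed XHTML input: Python's standard `xml.etree.ElementTree` builds a node tree and supports a limited subset of XPath, which is enough to select nodes by tag and attribute. The sample markup and the name `scrape_products` are hypothetical; real-world HTML is often not well-formed and would need a more tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML fragment.
SAMPLE_XHTML = """
<html>
  <body>
    <div class="product">
      <h2>Widget</h2>
      <span class="price">9.99</span>
    </div>
    <div class="product">
      <h2>Gadget</h2>
      <span class="price">24.50</span>
    </div>
  </body>
</html>
"""

def scrape_products(markup):
    """Return (name, price) pairs for every div with class 'product'."""
    root = ET.fromstring(markup.strip())
    products = []
    # XPath-style query: any <div> descendant whose class is 'product'.
    for node in root.findall(".//div[@class='product']"):
        name = node.findtext("h2")
        price = node.findtext("span[@class='price']")
        products.append((name, float(price)))
    return products

products = scrape_products(SAMPLE_XHTML)
print(products)
```

The queries describe where the data sits in the tree, so they keep working if surrounding text changes but the structure stays the same.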
Simple DOM Parser
Tools for web scraping
- Selenium
- Import.io
- PhantomJS
- Scrapy
- etc.
Web Scraping Techniques and Process Explained
