WEB SCRAPING TO COLLECT DATA FROM ETL WITH PIPELINE
TEAM MEMBERS REGISTER NUMBER
PREETHA.K 19TD1526
SHIFANA FARVEEN. I 19TD1535
SUBASRI.S 19TD1538
Under the guidance of
Mrs.S.SARANYA…,
Assistant Professor
Department of CSE
RAAKCET.
Significance of Proposed model:
Mainly deals with SQL checks on data to ensure that the data flowing in
and flowing out is in line with the organizational requirements.
Data quality
Reduced data loss
Provides timely access
ETL has a promising future and its adoption is growing rapidly
Generate reproducible code
Distributed “Big Data” computation
Scaling a working pipeline
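The SQL checks mentioned above can be illustrated with a minimal sketch. The table name `matches`, its columns, and the three rules (no missing team, no negative goals, no duplicate rows) are assumptions made only for illustration and are not the project's actual schema.

```python
import sqlite3

# Hypothetical table and rules, chosen only to illustrate SQL data checks.
CHECKS = {
    "missing_team": "SELECT COUNT(*) FROM matches WHERE team IS NULL OR team = ''",
    "negative_goals": "SELECT COUNT(*) FROM matches WHERE goals < 0",
    "duplicate_rows": """
        SELECT COUNT(*) FROM (
            SELECT team, match_date FROM matches
            GROUP BY team, match_date HAVING COUNT(*) > 1
        )
    """,
}


def run_checks(conn):
    """Return the number of offending rows (or groups) for each check."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE matches (team TEXT, match_date TEXT, goals INTEGER)")
conn.executemany(
    "INSERT INTO matches VALUES (?, ?, ?)",
    [
        ("Arsenal", "2022-08-05", 2),
        ("Arsenal", "2022-08-05", 2),   # duplicate of the row above
        (None, "2022-08-06", 1),        # missing team
        ("Chelsea", "2022-08-06", -1),  # impossible value
    ],
)
check_results = run_checks(conn)
print(check_results)
```

A check that returns a non-zero count can be used to stop the pipeline or to flag the batch before it flows out to the destination.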
Technique:
Brings structure to your information as well as contributes to its clarity,
completeness, quality, and velocity
Data formats changing over time
Broken data connections
Contradictions between systems
Addressing the issues of different ETL components with the same
technology
Not considering data scaling
Failing to anticipate future data needs
Extract:
Most companies and businesses acquire data from a variety of sources, such as
CRM files, ERP files, emails, Excel sheets, Word documents, and log files.
During extraction, the ETL tool uses various connectors to extract relevant raw data
from their respective sources.
Even though it is possible to manually extract data, it is a time-consuming and error-
prone process. With the ETL tool, this extraction stage is made easier and faster.
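The idea of one connector per source can be sketched as follows. This is only an illustration using the Python standard library: the sample CSV text, the log format, and the function names `extract_csv` and `extract_log` are hypothetical and stand in for whatever connectors a real ETL tool provides.

```python
import csv
import io
import re

# In-memory stand-ins for a CSV export and a log file (hypothetical sample data).
CSV_SOURCE = "customer,amount\nAsha,120\nRavi,80\n"
LOG_SOURCE = "2023-01-05 INFO order customer=Meena amount=45\n2023-01-05 DEBUG heartbeat\n"

LOG_PATTERN = re.compile(r"customer=(\w+) amount=(\d+)")


def extract_csv(text):
    """Connector for CSV-like sources: yields one dict per data row."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"customer": row["customer"], "amount": row["amount"], "source": "csv"}


def extract_log(text):
    """Connector for log files: keeps only lines that match the order pattern."""
    for line in text.splitlines():
        match = LOG_PATTERN.search(line)
        if match:
            yield {"customer": match.group(1), "amount": match.group(2), "source": "log"}


# Both connectors emit the same record shape, so later stages can treat them alike.
raw_records = list(extract_csv(CSV_SOURCE)) + list(extract_log(LOG_SOURCE))
print(raw_records)
```

The values are left as raw strings on purpose; type conversion belongs to the transform stage.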
Transform:
After data extraction, create APIs to transform the data into the format that the
destination system expects as input.
Cleaning
Deduplication
Format revision
Key restructuring, etc.
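The transformation steps listed above (cleaning, deduplication, format revision, key restructuring) might look like this in plain Python. The field names, the DD/MM/YYYY input date format, and the `match_key` layout are assumptions for illustration only.

```python
from datetime import datetime

# Hypothetical raw records as they might arrive from the extract stage.
raw = [
    {"Team Name": " Arsenal ", "Date": "05/08/2022", "Goals": "2"},
    {"Team Name": "Arsenal", "Date": "05/08/2022", "Goals": "2"},   # duplicate after cleaning
    {"Team Name": "Chelsea", "Date": "06/08/2022", "Goals": ""},    # missing value
]


def transform(records):
    seen = set()
    result = []
    for rec in records:
        # Cleaning: trim whitespace and drop rows with a missing goal count.
        team = rec["Team Name"].strip()
        if not rec["Goals"].strip():
            continue
        # Format revision: DD/MM/YYYY text becomes an ISO date, goals become int.
        date = datetime.strptime(rec["Date"], "%d/%m/%Y").date().isoformat()
        goals = int(rec["Goals"])
        # Key restructuring: build one key from team and date.
        key = f"{team.lower()}_{date}"
        # Deduplication: keep only the first record for each key.
        if key in seen:
            continue
        seen.add(key)
        result.append({"match_key": key, "team": team, "match_date": date, "goals": goals})
    return result


transformed = transform(raw)
print(transformed)
```

Of the three input rows, only one survives: the second is a duplicate once whitespace is trimmed, and the third is dropped for its missing value.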
Load:
Data loading is the process where the newly transformed data is collectively
loaded into a new location.
Full load: loads the entire data set from source to destination. Suitable for
smaller source data.
Incremental/Delta load: loads only the source data that is not yet available in
the destination. Suitable when the source data size is huge. It is usually
implemented based on date.
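The difference between the two load strategies can be sketched with SQLite. The table name `target`, its columns, and the use of the latest loaded date as the cut-off are assumptions for illustration; a production pipeline may track its position differently.

```python
import sqlite3


def full_load(conn, rows):
    """Replace the whole destination table with the source rows."""
    conn.execute("DELETE FROM target")
    conn.executemany("INSERT INTO target VALUES (?, ?, ?)", rows)


def incremental_load(conn, rows):
    """Insert only rows dated after the newest date already in the destination."""
    last = conn.execute("SELECT MAX(match_date) FROM target").fetchone()[0]
    # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
    new_rows = [r for r in rows if last is None or r[1] > last]
    conn.executemany("INSERT INTO target VALUES (?, ?, ?)", new_rows)
    return len(new_rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (team TEXT, match_date TEXT, goals INTEGER)")

source = [("Arsenal", "2022-08-05", 2), ("Chelsea", "2022-08-06", 1)]
full_load(conn, source)

# Later the source grows by one row; only that row is loaded.
source.append(("Arsenal", "2022-08-13", 4))
added = incremental_load(conn, source)
total = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
print(added, total)
```

A date cut-off like this is simple, but it would miss a late-arriving row whose date is not newer than the latest one already loaded, so it suits sources that only append in date order.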
Steps on Proposed work:
Web Scraping - extracting valuable and interesting information from web
pages
The main target task is automated web data extraction.
Data can be extracted through the various source links.
Inspector - parsing HTML entails identifying HTML elements and
associated tags.
Parsing HTML Links with Beautiful Soup
Creates a parse tree for parsed pages that can be used to extract data from
HTML
Extract data using Pandas and Requests
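The parsing step above can be sketched with Requests and Beautiful Soup. The HTML fragment, the table id `standings`, and the helper names `fetch_html` and `parse_team_links` are hypothetical; the fragment stands in for a page that would normally be downloaded from one of the source links.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical page fragment; a real page would come from fetch_html(url).
SAMPLE_HTML = """
<table id="standings">
  <tr><th>Team</th><th>Points</th></tr>
  <tr><td><a href="/teams/arsenal">Arsenal</a></td><td>84</td></tr>
  <tr><td><a href="/teams/chelsea">Chelsea</a></td><td>44</td></tr>
</table>
"""


def fetch_html(url):
    """Download a page with Requests (not called here, to avoid network access)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def parse_team_links(html):
    """Build the parse tree and pull (team name, link) pairs out of the table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="standings")
    return [(a.get_text(strip=True), a["href"]) for a in table.find_all("a")]


team_links = parse_team_links(SAMPLE_HTML)
print(team_links)
```

The element and attribute to search for are usually found first with the browser's inspector, then passed to `find` or `find_all`.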
Get extracted data with Requests and Pandas
Cleaning and Merging Scraped Data With Pandas
Scraping Data for Multiple Seasons and Teams with a Loop
Final data Results and DataFrame
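The merge, loop, and clean-up steps listed above can be sketched with pandas. The `SCRAPED` dictionary is a hypothetical stand-in for tables that would be scraped per season and team, and the column names (`Date`, `Opponent`, `GF`, `Sh`) are assumptions, not the project's actual fields.

```python
import pandas as pd

# Hypothetical stand-in for scraped tables; a real run would fetch one page per
# season and team instead of reading from this dictionary.
SCRAPED = {
    (2022, "Arsenal"): (
        [{"Date": "2022-08-05", "Opponent": "Crystal Palace", "GF": "2"}],
        [{"Date": "2022-08-05", "Sh": "10"}],
    ),
    (2021, "Arsenal"): (
        [{"Date": "2021-08-13", "Opponent": "Brentford", "GF": "0"}],
        [{"Date": "2021-08-13", "Sh": "22"}],
    ),
}

frames = []
for (season, team), (match_rows, shooting_rows) in SCRAPED.items():
    matches = pd.DataFrame(match_rows)
    shooting = pd.DataFrame(shooting_rows)
    # Merge the two tables for the same team and season on the match date.
    merged = matches.merge(shooting, on="Date")
    merged["Season"] = season
    merged["Team"] = team
    frames.append(merged)

# Combine all seasons and teams, then clean the column names and types.
match_df = pd.concat(frames, ignore_index=True)
match_df.columns = [c.lower() for c in match_df.columns]
match_df["date"] = pd.to_datetime(match_df["date"])
match_df = match_df.astype({"gf": int, "sh": int})
print(match_df)
```

Collecting one small DataFrame per iteration and concatenating once at the end keeps the loop simple and avoids repeatedly copying a growing table.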