WEB SCRAPING TO COLLECT DATA WITH AN ETL PIPELINE
TEAM MEMBERS REGISTER NUMBER
SHIFANA FARVEEN. I 19TD1535
Under the guidance of
Department of CSE
Significance of Proposed Model:
Mainly deals with SQL checks on the data to ensure that the data flowing in
and flowing out is in line with the organizational requirements.
Reduced data loss
Provides timely access
The use of ETL is growing rapidly
Generate reproducible code
Distributed “Big Data” computation
Scaling a working pipeline
Brings structure to your information as well as contributes to its clarity,
completeness, quality, and velocity
Data formats changing over time
Broken data connections
Contradictions between systems
Addressing the issues of different ETL components with the same approach
Not considering data scaling
Failing to anticipate future data needs
Most companies and businesses acquire data from a variety of sources, such as
CRM files, ERP files, emails, Excel sheets, Word documents, and log files.
During extraction, the ETL tool uses various connectors to extract relevant raw data
from their respective sources.
Even though it is possible to manually extract data, it is a time-consuming and error-
prone process. With the ETL tool, this extraction stage is made easier and faster.
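As a rough sketch of this extraction stage in Python with the Requests library — the URL and the injectable `getter` hook are illustrative assumptions, not part of the original pipeline:

```python
import requests

def extract_raw(url, getter=requests.get, timeout=10.0):
    """Pull raw data from one source via an HTTP connector.

    `getter` is injectable so the connector can be swapped or stubbed
    in tests; it defaults to requests.get.
    """
    response = getter(url, timeout=timeout)
    response.raise_for_status()  # fail fast on a broken data connection
    return response.text
```

In a real pipeline, one such connector would exist per source system (CRM export, log endpoint, and so on).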
After data extraction, APIs are created to transform the data into the format
the destination system accepts as input.
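A minimal sketch of one such transformation step; the field names (`name`, `score`) are invented placeholders, since the actual destination schema is not specified here:

```python
def transform_record(raw):
    """Map one extracted record onto the destination schema.

    The field names are placeholder assumptions; a real pipeline
    would use the destination system's actual schema.
    """
    return {
        "name": raw.get("name", "").strip().title(),   # normalize text
        "score": float(raw.get("score", 0)),           # normalize type
    }
```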
Data loading is the process where the newly transformed data is collectively
loaded into a new location.
Full load — loading the entire data set from source to destination. Suitable for
smaller data volumes.
Incremental/Delta load — loading only the data from the source that is not yet
available in the destination. Suitable if the source data size is huge. Usually it is
implemented based on a date column.
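A minimal sketch of a date-based incremental load, assuming each row carries a `date` field and the destination is an appendable collection:

```python
from datetime import date

def incremental_load(source_rows, destination, last_loaded):
    """Append only rows newer than the last loaded date (delta load)."""
    new_rows = [row for row in source_rows if row["date"] > last_loaded]
    destination.extend(new_rows)
    return len(new_rows)  # number of rows loaded this run
```

A full load, by contrast, would simply copy every source row regardless of date.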
Steps of the Proposed Work:
Web Scraping — extracting valuable and interesting information from the web.
The main task targeted is automated web data extraction.
Data can be extracted through the various source links.
Inspector — parsing HTML entails identifying HTML elements and their attributes.
Parsing HTML links with Beautiful Soup
Creates a parse tree for parsed pages that can be used to extract data from HTML.
Extract data using Requests and Pandas
Cleaning and Merging Scraped Data With Pandas
Scraping Data for Multiple Seasons and Teams with a Loop
Final data Results and DataFrame
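The scraping steps above can be sketched end to end; the sample HTML, team names, and season values are invented placeholders standing in for real source pages:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Placeholder page standing in for one season's stats table.
SAMPLE_HTML = """
<table>
  <tr><th>Team</th><th>Points</th></tr>
  <tr><td><a href="/teams/alpha">Alpha</a></td><td>10</td></tr>
  <tr><td><a href="/teams/beta">Beta</a></td><td>7</td></tr>
</table>
"""

def parse_team_links(html):
    """Build a parse tree and extract every team link from it."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select("table a")]

def scrape_season(html, season):
    """Turn one season's HTML table into a DataFrame tagged with its season."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.select("table tr")
    headers = [th.get_text() for th in rows[0].find_all("th")]
    data = [[td.get_text() for td in row.find_all("td")] for row in rows[1:]]
    df = pd.DataFrame(data, columns=headers)
    df["Season"] = season
    return df

# Scrape multiple seasons with a loop, then clean and merge with pandas.
frames = [scrape_season(SAMPLE_HTML, season) for season in (2020, 2021)]
all_seasons = pd.concat(frames, ignore_index=True)
all_seasons["Points"] = all_seasons["Points"].astype(int)  # cleaning step
```

`all_seasons` is the final merged DataFrame; in the real pipeline each season would come from its own source link rather than one sample page.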