1. INTRODUCTION TO WEB SCRAPING USING PYTHON
Submitted By
www.computersciencejunction.in
2. Content
• What is Web Scraping?
• Need for Web Scraping
• Workflow
• Libraries used
• Why Python for web scraping
• Demo (Scrape a Website)
• Advantages of web scraping
• Limitations of web scraping
3. Web Scraping
Web scraping is a technique for fetching data and information from websites.
Everything you see on a webpage can be scraped.
It can be done in most programming languages; we will use Python because it makes the task easier.
4. Need for Web Scraping
• Web scraping, or web content extraction, can serve a wide range of purposes, for example:
Better access to company data
Market analysis at scale
Machine learning and large datasets
Stock market tracking
Tracking the latest trends
6. Continued..
Send a request and load the webpage.
(requests, urllib, http.client)
Parse the content for the desired data.
(Beautiful Soup, re, Scrapy)
Store the data the way you want.
(A minimal sketch of these three steps is shown below.)
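A minimal sketch of the three steps, assuming the requests library from the list above, a placeholder URL, and simple <h2> headings as the target data:

import csv
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'                            # placeholder target page (assumption)
response = requests.get(url)                           # step 1: send a request and load the webpage
soup = BeautifulSoup(response.text, 'html.parser')     # step 2: parse the content

# extract whatever data you are after; here, just the <h2> headings
titles = [h2.get_text(strip=True) for h2 in soup.find_all('h2')]

# step 3: store the data the way you want (a CSV file in this sketch)
with open('titles.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title'])
    writer.writerows([t] for t in titles)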
7. Libraries Used
Selenium
Selenium is a web testing library. It is used to automate browser activities.
BeautifulSoup
Beautiful Soup is a Python package for parsing HTML and XML documents. It creates parse trees that make it easy to extract data.
Pandas
Pandas is a library used for data manipulation and analysis.
(A short example using BeautifulSoup and pandas is shown below.)
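A tiny illustration of BeautifulSoup and pandas from the list above; the HTML snippet and the column name are made up for the example:

from bs4 import BeautifulSoup
import pandas as pd

html = "<ul><li class='item'>Book A</li><li class='item'>Book B</li></ul>"
soup = BeautifulSoup(html, 'html.parser')                            # build the parse tree
names = [li.get_text() for li in soup.find_all('li', class_='item')]

df = pd.DataFrame({'ProductName': names})                            # pandas for manipulation and analysis
print(df)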
8. Why Python for web scraping?
• Ease of use
• Large collection of libraries
• Dynamically typed
• Easily understandable syntax
• Small code, large task
• Community
9. • Step 3: Find the data you want to extract
Let's extract the Price, Name, and Rating, each of which is nested in its own "div" tag.
• Step 4: Write the code (a sketch of these two steps is shown below).
First, create a Python file. To do this, open a terminal and create a file with a .py extension using any text editor, for example gedit <your file name>.py on Linux.
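A hedged sketch of Steps 3 and 4; the URL and every class name below are assumptions for illustration, so the real values must be read from the HTML of the page you are scraping:

import urllib.request
from bs4 import BeautifulSoup

listing_url = 'https://example.com/products'     # hypothetical listing page
page = urllib.request.urlopen(listing_url)       # load the page
soup = BeautifulSoup(page, 'html.parser')        # parse the HTML

products, prices, ratings = [], [], []
for card in soup.findAll('div', attrs={'class': 'product-card'}):      # hypothetical class names
    products.append(card.find('div', attrs={'class': 'name'}).getText())
    prices.append(card.find('div', attrs={'class': 'price'}).getText())
    ratings.append(card.find('div', attrs={'class': 'rating'}).getText())

Lists like these are what Step 6 later stores in a CSV file.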
11. # demo code with the imports it needs; urlpage is the reviews page URL set in the earlier step
import urllib.request
from bs4 import BeautifulSoup

# open the page and parse the HTML using Beautiful Soup, storing it in 'soup'
page = urllib.request.urlopen(urlpage)
soup = BeautifulSoup(page, 'html.parser')

# find the list of reviews within the review table
table = soup.find('div', attrs={'class': 'review_table'}).find('ul', attrs={'class': 'audience-reviews'})

# collect the text of each review
article = []
for item in table.findAll('li', attrs={'class': 'audience-reviews__item'}):
    text = item.find('p', attrs={'class': 'audience-reviews__review--mobile js-review-text clamp clamp-4 js-clamp'}).getText()
    article.append(text)
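The review texts collected in article can be stored in the same way as the next step shows; the column and file names here are only illustrative:

import pandas as pd
pd.DataFrame({'Review': article}).to_csv('reviews.csv', index=False, encoding='utf-8')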
12. Continued...
Step 5: Run the code and extract the data.
Step 6: Store the data in the required format.
Example:
import pandas as pd

# products, prices and ratings are the lists collected while scraping
df = pd.DataFrame({'ProductName': products, 'Price': prices, 'Rating': ratings})
df.to_csv('products.csv', index=False, encoding='utf-8')
14. Limitations of web scraping
• Difficult to analyze
For anybody who is not an expert, scraping processes can be confusing to understand. This is not a major problem, but some errors could be fixed faster if the process were easier for more software developers to follow.
• Time
Web scraping services sometimes take time to become familiar with the core application and need to adjust to the scraping language.
• Speed and protection policies
Scrapers are often slower than dedicated APIs, and many websites enforce protection measures such as rate limits, CAPTCHAs, or IP blocking.