Parse the web using Python + Beautiful Soup

at ncucc
cwebb(dot)tw(at)gmail(dot)com
Agenda

• Python
• Beautiful Soup
Parse the web?
            but how?
Solutions

• C++
• Java
• Perl
• Python
• Others?
Solutions (Cont.)

• Regular expression
• Parser
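The two approaches named above can be compared on a tiny fragment. This is only a sketch: the markup string and the `BoldCollector` class are made up for illustration, and the parser used is the standard library's `html.parser` rather than Beautiful Soup.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup, loosely modelled on the sample page used later in the slides.
html = ('<p id="firstpara">first <b>one</b></p>'
        '<p id="secondpara">second <b>two</b></p>')

# Approach 1: regular expression. Short, but tied to the exact shape of the markup.
bold_by_regex = re.findall(r'<b>(.*?)</b>', html)


# Approach 2: a parser. The parser tracks tags; we only react to events.
class BoldCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_bold = False
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == 'b':
            self.in_bold = True

    def handle_endtag(self, tag):
        if tag == 'b':
            self.in_bold = False

    def handle_data(self, data):
        # Collect text only while inside a <b> element.
        if self.in_bold:
            self.found.append(data)


collector = BoldCollector()
collector.feed(html)
collector.close()
bold_by_parser = collector.found

print(bold_by_regex, bold_by_parser)
```

Both give the same answer here; the regular expression tends to break first when attributes, nesting, or line breaks appear inside the tags.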
So I decided...
Python + Beautiful Soup
Python

• high-level programming language
• scripting language
• Google
• {}
• list, tuple, dictionary
list
• a=['asdf',123,12.01,'abcd']
• a[3] (a[-1])
 • 'abcd'
• a[0:2] (a[:2])
 • ['asdf',123]
• b=['asdf',123,['qwer',12.34]]
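A small runnable sketch of the indexing and slicing shown on this slide, written for Python 3. The variable names `last`, `first_two` and `inner` are only for illustration.

```python
# Indexing and slicing the list from the slide.
a = ['asdf', 123, 12.01, 'abcd']

last = a[3]          # same element as a[-1]
first_two = a[0:2]   # same slice as a[:2]; the end index is excluded

# Lists can nest: the third element of b is itself a list.
b = ['asdf', 123, ['qwer', 12.34]]
inner = b[2][0]

print(last, first_two, inner)
```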
list (Cont.)
• a=['abc',12]
• len(a)
• #2
• a.append(1)
• #['abc',12,1]
• a.insert(1,'def')
• #['abc','def',12,1]
list (Cont.)
• a= [321,456,12,1]
• a.pop()
• #[321,456,12]
• a.index(12)
• #2
• a.sort()
• #[12,321,456]
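The list methods from the last two slides, as one runnable sketch in Python 3. The second list is named `c` here only so both examples can live in one script.

```python
a = ['abc', 12]
n = len(a)             # 2
a.append(1)            # a is now ['abc', 12, 1]
a.insert(1, 'def')     # a is now ['abc', 'def', 12, 1]

c = [321, 456, 12, 1]
popped = c.pop()       # removes and returns the last element (1)
pos = c.index(12)      # position of the first 12
c.sort()               # sorts in place and returns None

print(a, popped, pos, c)
```

Note that `pop()` returns the removed element, while `sort()` changes the list itself and returns nothing.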
tuple

• a=('asdf',123,12.01) or a= 'asdf',123,12.01
• a=(('abc',1),123.1)
• a,b=1,2
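A short sketch of the tuple forms above in Python 3. The swap and the immutability check are additions for illustration, not part of the slide.

```python
a = ('asdf', 123, 12.01)
same = 'asdf', 123, 12.01        # parentheses are optional
nested = (('abc', 1), 123.1)     # tuples can contain tuples

x, y = 1, 2                      # tuple unpacking
x, y = y, x                      # swapping two names with the same mechanism

# Unlike lists, tuples cannot be changed after creation.
try:
    a[0] = 'changed'
except TypeError as exc:
    error = type(exc).__name__

print(a == same, nested[0][1], x, y, error)
```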
Dictionary

• a={123:'abc','cde':456}
• a[123]
• #'abc'
• a['cde']
• #456
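The dictionary lookups above, as a runnable Python 3 sketch. The `get()` call and the `'new'` key are illustrative additions.

```python
a = {123: 'abc', 'cde': 456}

v1 = a[123]                        # keys can be numbers ...
v2 = a['cde']                      # ... or strings

# get() returns a default instead of raising KeyError for a missing key.
missing = a.get('nope', 'default')

a['new'] = 789                     # assignment adds a new key
has_new = 'new' in a

print(v1, v2, missing, has_new, len(a))
```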
if else
if a>10:
   print('a>10')
elif a<5:
   print('a<5')
else:
   print('5<=a<=10')
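To try the branches without editing `a` by hand, the same chain can be wrapped in a function. The name `classify` is made up for this sketch.

```python
def classify(a):
    """Return the label of the branch that a falls into."""
    if a > 10:
        return 'a>10'
    elif a < 5:
        return 'a<5'
    else:
        # Reached only when 5 <= a <= 10.
        return '5<=a<=10'


for value in (11, 4, 5, 10):
    print(value, classify(value))
```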
while loop
while a>2 or a<3:
 pass
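The condition `a>2 or a<3` holds for every number, so with `pass` as its body that loop would never end. A sketch of a loop that does terminate, with made-up starting values:

```python
a = 10
steps = 0

# The body changes a, so the condition eventually becomes false.
while a > 2:
    a -= 3
    steps += 1

print(a, steps)
```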
for loop
a=['abc',123,'def']
for x in a:
  print(x)
#abc
#123
#def

for x in range(3):
  print(x)
#0
#1
#2

for x in range(4,34,10):
  print(x)
#4
#14
#24
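The same three loops in Python 3, collecting the values into lists instead of printing them so the results are easy to check. The list names are illustrative.

```python
collected = []
for x in ['abc', 123, 'def']:
    collected.append(x)

# range(stop) counts from 0 up to, but not including, stop.
first = list(range(3))

# range(start, stop, step) starts at 4 and adds 10 while staying below 34.
stepped = list(range(4, 34, 10))

print(collected, first, stepped)
```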
function
def fib(n):
    if n==0 or n==1:
        return n
    else:
        return fib(n-1)+fib(n-2)
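A runnable version of the recursive Fibonacci function, in which the two recursive calls are added together. The iterative `fib_iter` is an extra for comparison and is not on the slide.

```python
def fib(n):
    """Recursive Fibonacci: fib(0) == 0, fib(1) == 1."""
    if n == 0 or n == 1:
        return n
    return fib(n - 1) + fib(n - 2)


def fib_iter(n):
    """Same sequence without recursion; much faster for large n."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


print([fib(i) for i in range(8)])
print(fib_iter(30))
```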
....
What is Beautiful Soup
(not Beautiful Soap)

• Python module
• HTML/XML parser
• html/xml
Beautiful Soup
<html>
 <head>
  <title>
    page title
  </title>
 </head>
 <body>
  <p id="firstpara" align="center">
    first paragraph
    <b>
     one
    </b>
  </p>
  <p id="secondpara" align="blah">
    second paragraph
    <b>
     two
    </b>
  </p>
 </body>
</html>
See urllib/urllib2 for how to open a URL in Python.

from BeautifulSoup import BeautifulSoup
soup=BeautifulSoup(page)

soup.html.head
#<head><title>page title</title></head>

soup.head
#<head><title>page title</title></head>

soup.body.p
#<p id="firstpara" align="center">first paragraph<b>one</b></p>
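The slides use the BeautifulSoup 3 import. A rough present-day equivalent is sketched below, assuming the third-party `bs4` package (Beautiful Soup 4) is installed and using Python's built-in `html.parser` as the tree builder. The sample page is typed in without whitespace between tags so that no whitespace-only text nodes get in the way.

```python
# Assumption: Beautiful Soup 4 is installed (pip install beautifulsoup4).
from bs4 import BeautifulSoup

page = (
    '<html><head><title>page title</title></head>'
    '<body>'
    '<p id="firstpara" align="center">first paragraph<b>one</b></p>'
    '<p id="secondpara" align="blah">second paragraph<b>two</b></p>'
    '</body></html>'
)

soup = BeautifulSoup(page, 'html.parser')

head_markup = str(soup.html.head)          # the tag rendered back to markup
same_head = soup.head is soup.html.head    # the short path reaches the same tag
first_p_text = soup.body.p.get_text(' ')   # all text inside the first <p>

print(head_markup)
print(same_head, first_p_text)
```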
(Cont.)
• parent         (go to parent node)

    soup.title.parent == soup.head

• next             (go to next node)

    soup.title.next == 'page title'
    soup.title.next.next == soup.body

• previous     (go to previous node)

    soup.title.previous == soup.head
    soup.body.p.b.previous == 'first paragraph'
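The same walk in Beautiful Soup 4, as a sketch. It assumes the `bs4` package is installed; in that version the attributes are spelled `next_element` and `previous_element`. The page is written without whitespace between tags, so every step lands on a tag or on real text.

```python
# Assumption: Beautiful Soup 4 is installed (pip install beautifulsoup4).
from bs4 import BeautifulSoup

page = (
    '<html><head><title>page title</title></head>'
    '<body>'
    '<p id="firstpara" align="center">first paragraph<b>one</b></p>'
    '<p id="secondpara" align="blah">second paragraph<b>two</b></p>'
    '</body></html>'
)
soup = BeautifulSoup(page, 'html.parser')

# parent: one level up in the tree.
parent_is_head = soup.title.parent is soup.head

# next_element: whatever was parsed immediately afterwards.
title_text = soup.title.next_element                  # the string inside <title>
after_text_is_body = title_text.next_element is soup.body

# previous_element: whatever was parsed immediately before.
before_title_is_head = soup.title.previous_element is soup.head
before_b = soup.body.p.b.previous_element             # text just before <b>
before_p_is_body = soup.body.p.previous_element is soup.body

print(title_text, before_b)
```

On real pages the line breaks and indentation between tags are text nodes too, so `next_element` often returns a whitespace string first.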
(Cont.)
• contents         (all content nodes)

     soup.html.contents ==
     [soup.html.head , soup.html.body]

• nextSibling      (go to next sibling)

     soup.html.body.p.nextSibling
     == soup.html.body.contents[1]

• previousSibling (previous sibling)
     soup.html.body.previousSibling
     == soup.html.head
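A sketch of the same sibling checks in Beautiful Soup 4, assuming the `bs4` package is installed. There the attributes are named `next_sibling` and `previous_sibling`; `contents` keeps its name.

```python
# Assumption: Beautiful Soup 4 is installed (pip install beautifulsoup4).
from bs4 import BeautifulSoup

page = (
    '<html><head><title>page title</title></head>'
    '<body>'
    '<p id="firstpara" align="center">first paragraph<b>one</b></p>'
    '<p id="secondpara" align="blah">second paragraph<b>two</b></p>'
    '</body></html>'
)
soup = BeautifulSoup(page, 'html.parser')

# contents: the direct children of a tag, in document order.
html_children = [child.name for child in soup.html.contents]

# next_sibling / previous_sibling: neighbours under the same parent.
second_p = soup.body.p.next_sibling
sibling_matches_contents = second_p is soup.body.contents[1]
body_prev_is_head = soup.body.previous_sibling is soup.head

print(html_children, second_p['id'])
```

With indented markup such as the sample page on the earlier slide, the whitespace between tags would show up in `contents` and as siblings, so these indexes would shift.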
(Cont.)
• tag name
    soup.html.body.name == 'body'

• string
    soup.html.head.title.string
    == soup.html.head.title.contents[0]
    == 'page title'

• Tag attributes
    soup.html.body.p.attrMap
    == {'align' : 'center', 'id' : 'firstpara'}

    soup.html.body.p['id'] == 'firstpara'
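The same properties in Beautiful Soup 4, as a sketch that assumes the `bs4` package is installed. In that version the attribute dictionary is available as `.attrs`. Note that `str()` on a tag gives the markup including the tag itself, not just its text.

```python
# Assumption: Beautiful Soup 4 is installed (pip install beautifulsoup4).
from bs4 import BeautifulSoup

page = (
    '<html><head><title>page title</title></head>'
    '<body>'
    '<p id="firstpara" align="center">first paragraph<b>one</b></p>'
    '<p id="secondpara" align="blah">second paragraph<b>two</b></p>'
    '</body></html>'
)
soup = BeautifulSoup(page, 'html.parser')

tag_name = soup.body.name                 # name of the tag
title_string = soup.title.string          # the single string inside <title>
title_child = soup.title.contents[0]      # same string, reached via contents
title_markup = str(soup.title)            # markup, including the tag itself

p_attrs = soup.body.p.attrs               # attributes as a dictionary
p_id = soup.body.p['id']                  # one attribute, dictionary style
p_class = soup.body.p.get('class')        # get() returns None if it is absent

print(tag_name, title_string, title_markup, p_attrs)
```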
find(name, attrs, recursive, text)



• soup.find('p')
   #<p id="firstpara" align="center">first paragraph<b>one</b></p>
find(name, attrs, recursive, text)


soup.find('p') == soup.html.body.p

soup.find('p',id='secondpara')
  #<p id="secondpara" align="blah">second paragraph<b>two</b></p>

soup.find('p',recursive=False)==None

soup.find(text='one')==soup.b.next
findAll(name, attrs, recursive, text,limit)

soup.findAll('p') == [soup.html.body.p
                     ,soup.p.nextSibling]

soup.findAll('p',id='secondpara')
  #[<p id="secondpara" align="blah">second paragraph<b>two</b></p>]

soup.findAll('p',recursive=False)==[]

soup.findAll(text='one')==[soup.b.next]

soup.findAll(limit=4)
==[soup.html , soup.html.head
   ,soup.html.head.title , soup.html.body]
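A sketch of the searches from the last two slides in Beautiful Soup 4, assuming the `bs4` package (version 4.4 or later) is installed. There `findAll` is spelled `find_all` and the `text` argument is called `string`. Passing `True` as the name matches every tag.

```python
# Assumption: Beautiful Soup 4.4+ is installed (pip install beautifulsoup4).
from bs4 import BeautifulSoup

page = (
    '<html><head><title>page title</title></head>'
    '<body>'
    '<p id="firstpara" align="center">first paragraph<b>one</b></p>'
    '<p id="secondpara" align="blah">second paragraph<b>two</b></p>'
    '</body></html>'
)
soup = BeautifulSoup(page, 'html.parser')

# find returns the first match, or None when nothing matches.
first_p = soup.find('p')
second_p = soup.find('p', id='secondpara')
top_level_p = soup.find('p', recursive=False)   # only direct children of soup
one = soup.find(string='one')                   # searches text instead of tags

# find_all returns a list of every match; limit stops the search early.
all_p_ids = [p['id'] for p in soup.find_all('p')]
only_second = soup.find_all('p', id='secondpara')
strings_one = soup.find_all(string='one')
first_four = [tag.name for tag in soup.find_all(True, limit=4)]

print(all_p_ids, first_four)
```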
Other solutions
• lxml
• html5lib
• HTMLParser
• htmlfill
• Genshi
  http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
Reference
• Python Official Website
  http://www.python.org/ (not http://www.python.com/)


• Beautiful Soup documentation
  http://www.crummy.com/software/BeautifulSoup/


• personal blog
  http://blog.ez2learn.com/2008/10/05/python-is-the-best-choice-to-grab-web/


• Python html parser performance
  http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
