SlideShare ist ein Scribd-Unternehmen logo
1 von 20
REGULAR EXPRESSIONS, EXTRAORDINARY POWER
UNSL
2013
Burdisso Sergio - sergio.burdisso@gmail.com
 I have 20 min to cover all about using REs on
theW3
 HTTP
 Internet bots
 Web Crawler
 Web Scraping
HyperText Transfer Protocol
WWW (The Web)
Web Browser
Request
Response
HTTP
HTTP
 Application layer protocol
 HTTP is the protocol to exchange or transfer hypertext
Http documentation: http://www.w3.org/Protocols/rfc2616/rfc2616.html
sequences of characters
 HTTP Response example
Header
Body
EXTRAORDINARY POWER
 FirstThings First…
Regular ExpressionsAre
Awesome!
 Gather text
 Replace /Transform text
 Search /Validate text
 POSIX regular expressions (standard)
▪ ^. [ ] [^ ] (0) * {m,n} ? +|$
 regex.h
 pattern = "(d{1,3}).(d{1,3}).(d{1,3}).(d{1,3})"
 regcomp(regex_t *regex, pattern, cflags);
 regex.re_nsub = 4 //Number of parenthesized subexpressions
 regexec(regex, text, pmatch[])
 pmatch[nsub].rm_so, pmatch[nsub].rm_eo <= 255
 Making use of RE to parse HTTP responses headers
Great! Now we’re able to parse the http response headers… so what?
-We can properly process the response body!
Ah I see! … and what would I do that for?
-Let me show you!
Just like spiders on the web!
Regular Expressions cartoon from xkcd
Web Scraping
(we will see!)
 Internet bots (web robots,WWW robots or
bots) are software applications that run
automated tasks over the Internet
 A Web crawler is an Internet bot that
systematically browses theWorld Wide Web,
typically for the purpose ofWeb indexing
 Web scraping is a computer software technique
of extracting information from websites
 A Web Crawler Starts with a list of URLs to visit. As the
crawler visits these URLs, it identifies all
the hyperlinks in the page and adds them to the list of
URLs to visit
hyperlinks0 = getAllLexemes(rsp.Body, "href="((http:)?//([^/rn]*))?(/?[^"]*)"");
hyperlinks1= getAllLexemes(rsp.Body, "src="((http:)?//([^/rn]*))?(/?[^"]*)"");
 Web Scraping: A simple yet powerful approach to
extract information from web pages can be based on
regular expression matching facilities of programming
languages (for instance C++, Perl or Python)
Regular Expressions cartoon from xkcd
WebScraping wScraping (8, "http://emails.com/victim");
wScraping.findAll(
"^(?n:(?<address1>(d{1,5}( 1/[234])?(x20[A-Z]([a-z])+)+ )|(P.O. Box
d{1,5}))s{1,2}(?i:(?<address2>(((APT|B
LDG|DEPT|FL|HNGR|LOT|PIER|RM|S(LIP|PC|T(E|OP))|TRLR|UNIT)x20w{
1,5})|(BSMT|FRNT|LBBY|LOWR|OFC|PH|REAR|SIDE|UPPR).?)s{1,2})?)(?<
city>[A-Z]([a-z])+(.?)(x20[A-Z]([a-z])+){0,2}),
x20(?<state>A[LKSZRAP]|C[AOT]|D[EC]|F[LM]|G[AU]|HI|I[ADL
N]|K[SY]|LA|M[ADEHINOPST]|N[CDEHJMVY]|O[HKR]|P[ARW]|RI|S[CD]
|T[NX]|UT|V[AIT]|W[AIVY])x20(?<zipcode>(?!0{5})d{5}(-d {4})?))$"
);
We’ve saved the day!
Everybody stand back!
We know regular expressions
The end
Thank you for your patience!

Weitere ähnliche Inhalte

Was ist angesagt?

CouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 HourCouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 HourPeter Friese
 
Analyse your SEO Data with R and Kibana
Analyse your SEO Data with R and KibanaAnalyse your SEO Data with R and Kibana
Analyse your SEO Data with R and KibanaVincent Terrasi
 
Combinators - Lightning Talk
Combinators - Lightning TalkCombinators - Lightning Talk
Combinators - Lightning TalkMike Harris
 
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)Sammy Fung
 
Python, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and DjangoPython, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and DjangoSammy Fung
 
サービスリニューアルしてわかったRailsのReactとの付き合い方
サービスリニューアルしてわかったRailsのReactとの付き合い方サービスリニューアルしてわかったRailsのReactとの付き合い方
サービスリニューアルしてわかったRailsのReactとの付き合い方Haruhiko Kobayashi
 
Golang slidesaudrey
Golang slidesaudreyGolang slidesaudrey
Golang slidesaudreyAudrey Lim
 
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...Anton
 
01 ElasticSearch : Getting Started
01 ElasticSearch : Getting Started01 ElasticSearch : Getting Started
01 ElasticSearch : Getting StartedOpenThink Labs
 
Reactをproductionに導入して変わったこと
Reactをproductionに導入して変わったことReactをproductionに導入して変わったこと
Reactをproductionに導入して変わったことHaruhiko Kobayashi
 
Downloading the internet with Python + Scrapy
Downloading the internet with Python + ScrapyDownloading the internet with Python + Scrapy
Downloading the internet with Python + ScrapyErin Shellman
 
Do something in 5 minutes with gas 1-use spreadsheet as database
Do something in 5 minutes with gas 1-use spreadsheet as databaseDo something in 5 minutes with gas 1-use spreadsheet as database
Do something in 5 minutes with gas 1-use spreadsheet as databaseBruce McPherson
 
Web Scraping with Python
Web Scraping with PythonWeb Scraping with Python
Web Scraping with PythonPaul Schreiber
 
Do something in 5 with gas 3-simple invoicing app
Do something in 5 with gas 3-simple invoicing appDo something in 5 with gas 3-simple invoicing app
Do something in 5 with gas 3-simple invoicing appBruce McPherson
 
Document Conversion & Retrieve and Rank 一問一答
Document Conversion & Retrieve and Rank 一問一答Document Conversion & Retrieve and Rank 一問一答
Document Conversion & Retrieve and Rank 一問一答Hisashi Komine
 

Was ist angesagt? (20)

CouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 HourCouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 Hour
 
Analyse your SEO Data with R and Kibana
Analyse your SEO Data with R and KibanaAnalyse your SEO Data with R and Kibana
Analyse your SEO Data with R and Kibana
 
Combinators - Lightning Talk
Combinators - Lightning TalkCombinators - Lightning Talk
Combinators - Lightning Talk
 
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
 
Python, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and DjangoPython, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and Django
 
サービスリニューアルしてわかったRailsのReactとの付き合い方
サービスリニューアルしてわかったRailsのReactとの付き合い方サービスリニューアルしてわかったRailsのReactとの付き合い方
サービスリニューアルしてわかったRailsのReactとの付き合い方
 
CouchDB Day NYC 2017: MapReduce Views
CouchDB Day NYC 2017: MapReduce ViewsCouchDB Day NYC 2017: MapReduce Views
CouchDB Day NYC 2017: MapReduce Views
 
Golang slidesaudrey
Golang slidesaudreyGolang slidesaudrey
Golang slidesaudrey
 
hySON - D2Fest
hySON - D2FesthySON - D2Fest
hySON - D2Fest
 
hySON
hySONhySON
hySON
 
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
 
01 ElasticSearch : Getting Started
01 ElasticSearch : Getting Started01 ElasticSearch : Getting Started
01 ElasticSearch : Getting Started
 
Reactをproductionに導入して変わったこと
Reactをproductionに導入して変わったことReactをproductionに導入して変わったこと
Reactをproductionに導入して変わったこと
 
Downloading the internet with Python + Scrapy
Downloading the internet with Python + ScrapyDownloading the internet with Python + Scrapy
Downloading the internet with Python + Scrapy
 
Scrapy.for.dummies
Scrapy.for.dummiesScrapy.for.dummies
Scrapy.for.dummies
 
Do something in 5 minutes with gas 1-use spreadsheet as database
Do something in 5 minutes with gas 1-use spreadsheet as databaseDo something in 5 minutes with gas 1-use spreadsheet as database
Do something in 5 minutes with gas 1-use spreadsheet as database
 
Web Scraping with Python
Web Scraping with PythonWeb Scraping with Python
Web Scraping with Python
 
Do something in 5 with gas 3-simple invoicing app
Do something in 5 with gas 3-simple invoicing appDo something in 5 with gas 3-simple invoicing app
Do something in 5 with gas 3-simple invoicing app
 
Document Conversion & Retrieve and Rank 一問一答
Document Conversion & Retrieve and Rank 一問一答Document Conversion & Retrieve and Rank 一問一答
Document Conversion & Retrieve and Rank 一問一答
 
Routing @ Scuk.cz
Routing @ Scuk.czRouting @ Scuk.cz
Routing @ Scuk.cz
 

Ähnlich wie regular expressions and the world wide web

Software Analysis for the Web: Achievements and Prospects
Software Analysis for the Web: Achievements and ProspectsSoftware Analysis for the Web: Achievements and Prospects
Software Analysis for the Web: Achievements and ProspectsAli Mesbah
 
Ruby On Rails Siddhesh
Ruby On Rails SiddheshRuby On Rails Siddhesh
Ruby On Rails SiddheshSiddhesh Bhobe
 
Top 10 Security Vulnerabilities (2006)
Top 10 Security Vulnerabilities (2006)Top 10 Security Vulnerabilities (2006)
Top 10 Security Vulnerabilities (2006)Susam Pal
 
Wss Object Model
Wss Object ModelWss Object Model
Wss Object Modelmaddinapudi
 
Writing Secure Code for WordPress
Writing Secure Code for WordPressWriting Secure Code for WordPress
Writing Secure Code for WordPressShawn Hooper
 
C#Web Sec Oct27 2010 Final
C#Web Sec Oct27 2010 FinalC#Web Sec Oct27 2010 Final
C#Web Sec Oct27 2010 FinalRich Helton
 
Securing Java EE Web Apps
Securing Java EE Web AppsSecuring Java EE Web Apps
Securing Java EE Web AppsFrank Kim
 
Extracting data from text documents using the regex
Extracting data from text documents using the regexExtracting data from text documents using the regex
Extracting data from text documents using the regexSteve Mylroie
 
Html intake 38 lect1
Html intake 38 lect1Html intake 38 lect1
Html intake 38 lect1ghkadous
 
Taking AJAX to the Next Level
Taking AJAX to the Next LevelTaking AJAX to the Next Level
Taking AJAX to the Next LevelStephen.Walther
 
Microsoft ASP.NET: Taking AJAX to the Next Level
Microsoft ASP.NET: Taking AJAX to the Next LevelMicrosoft ASP.NET: Taking AJAX to the Next Level
Microsoft ASP.NET: Taking AJAX to the Next Levelgoodfriday
 
Introduction to Web Architecture
Introduction to Web ArchitectureIntroduction to Web Architecture
Introduction to Web ArchitectureChamnap Chhorn
 
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharper
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharperGDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharper
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharpergranicz
 

Ähnlich wie regular expressions and the world wide web (20)

Senior Project Documentation.
Senior Project Documentation.Senior Project Documentation.
Senior Project Documentation.
 
Software Analysis for the Web: Achievements and Prospects
Software Analysis for the Web: Achievements and ProspectsSoftware Analysis for the Web: Achievements and Prospects
Software Analysis for the Web: Achievements and Prospects
 
Ruby On Rails Siddhesh
Ruby On Rails SiddheshRuby On Rails Siddhesh
Ruby On Rails Siddhesh
 
Top 10 Security Vulnerabilities (2006)
Top 10 Security Vulnerabilities (2006)Top 10 Security Vulnerabilities (2006)
Top 10 Security Vulnerabilities (2006)
 
Wss Object Model
Wss Object ModelWss Object Model
Wss Object Model
 
Writing Secure Code for WordPress
Writing Secure Code for WordPressWriting Secure Code for WordPress
Writing Secure Code for WordPress
 
Web services intro.
Web services intro.Web services intro.
Web services intro.
 
Os Pruett
Os PruettOs Pruett
Os Pruett
 
C#Web Sec Oct27 2010 Final
C#Web Sec Oct27 2010 FinalC#Web Sec Oct27 2010 Final
C#Web Sec Oct27 2010 Final
 
Api
ApiApi
Api
 
Web Crawler
Web CrawlerWeb Crawler
Web Crawler
 
Mashup
MashupMashup
Mashup
 
Securing Java EE Web Apps
Securing Java EE Web AppsSecuring Java EE Web Apps
Securing Java EE Web Apps
 
Extracting data from text documents using the regex
Extracting data from text documents using the regexExtracting data from text documents using the regex
Extracting data from text documents using the regex
 
Html intake 38 lect1
Html intake 38 lect1Html intake 38 lect1
Html intake 38 lect1
 
Taking AJAX to the Next Level
Taking AJAX to the Next LevelTaking AJAX to the Next Level
Taking AJAX to the Next Level
 
Microsoft ASP.NET: Taking AJAX to the Next Level
Microsoft ASP.NET: Taking AJAX to the Next LevelMicrosoft ASP.NET: Taking AJAX to the Next Level
Microsoft ASP.NET: Taking AJAX to the Next Level
 
Introduction to Web Architecture
Introduction to Web ArchitectureIntroduction to Web Architecture
Introduction to Web Architecture
 
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharper
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharperGDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharper
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharper
 
Switch to Backend 2023
Switch to Backend 2023Switch to Backend 2023
Switch to Backend 2023
 

Kürzlich hochgeladen

Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 

Kürzlich hochgeladen (20)

Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 

regular expressions and the world wide web

  • 1. REGULAR EXPRESSIONS, EXTRAORDINARY POWER UNSL 2013 Burdisso Sergio - sergio.burdisso@gmail.com
  • 2.  I have 20 min to cover all about using REs on theW3
  • 3.  HTTP  Internet bots  Web Crawler  Web Scraping
  • 5.
  • 6. WWW (The Web) Web Browser Request Response HTTP HTTP
  • 7.  Application layer protocol  HTTP is the protocol to exchange or transfer hypertext Http documentation: http://www.w3.org/Protocols/rfc2616/rfc2616.html sequences of characters
  • 8.  HTTP Response example Header Body
  • 10.  FirstThings First… Regular ExpressionsAre Awesome!  Gather text  Replace /Transform text  Search /Validate text
  • 11.  POSIX regular expressions (standard) ▪ ^. [ ] [^ ] (0) * {m,n} ? +|$  regex.h  pattern = "(d{1,3}).(d{1,3}).(d{1,3}).(d{1,3})"  regcomp(regex_t *regex, pattern, cflags);  regex.re_nsub = 4 //Number of parenthesized subexpressions  regexec(regex, text, pmatch[])  pmatch[nsub].rm_so, pmatch[nsub].rm_eo <= 255
  • 12.  Making use of RE to parse HTTP responses headers
  • 13. Great! Now we’re able to parse the http response headers… so what? -We can properly process the response body! Ah I see! … and what would I do that for? -Let me show you!
  • 14. Just like spiders on the web!
  • 15. Regular Expressions cartoon from xkcd Web Scraping (we will see!)
  • 16.  Internet bots (web robots,WWW robots or bots) are software applications that run automated tasks over the Internet  A Web crawler is an Internet bot that systematically browses theWorld Wide Web, typically for the purpose ofWeb indexing  Web scraping is a computer software technique of extracting information from websites
  • 17.  A Web Crawler Starts with a list of URLs to visit. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit hyperlinks0 = getAllLexemes(rsp.Body, "href="((http:)?//([^/rn]*))?(/?[^"]*)""); hyperlinks1= getAllLexemes(rsp.Body, "src="((http:)?//([^/rn]*))?(/?[^"]*)"");
  • 18.  Web Scraping: A simple yet powerful approach to extract information from web pages can be based on regular expression matching facilities of programming languages (for instance C++, Perl or Python)
  • 19. Regular Expressions cartoon from xkcd WebScraping wScraping (8, "http://emails.com/victim"); wScraping.findAll( "^(?n:(?<address1>(d{1,5}( 1/[234])?(x20[A-Z]([a-z])+)+ )|(P.O. Box d{1,5}))s{1,2}(?i:(?<address2>(((APT|B LDG|DEPT|FL|HNGR|LOT|PIER|RM|S(LIP|PC|T(E|OP))|TRLR|UNIT)x20w{ 1,5})|(BSMT|FRNT|LBBY|LOWR|OFC|PH|REAR|SIDE|UPPR).?)s{1,2})?)(?< city>[A-Z]([a-z])+(.?)(x20[A-Z]([a-z])+){0,2}), x20(?<state>A[LKSZRAP]|C[AOT]|D[EC]|F[LM]|G[AU]|HI|I[ADL N]|K[SY]|LA|M[ADEHINOPST]|N[CDEHJMVY]|O[HKR]|P[ARW]|RI|S[CD] |T[NX]|UT|V[AIT]|W[AIVY])x20(?<zipcode>(?!0{5})d{5}(-d {4})?))$" ); We’ve saved the day!
  • 20. Everybody stand back! We know regular expressions The end Thank you for your patience!