Regular Expressions
Powerful string validation and extraction
Ignaz Wanders – Architect @ Archimiddle
@ignazw
Topics
• What are regular expressions?
• Patterns
• Character classes
• Quantifiers
• Capturing groups
• Boundaries
• Internationalization
• Regular expressions in Java
• Quiz
• References
What are regular expressions?
• A regex is a string pattern used to search and manipulate text
• A regex has special syntax
• Very powerful for any type of String manipulation ranging from simple to very
complex structures:
– Input validation
– S(ubs)tring replacement
– ...
• Example:
• [A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4}
History
• Originates from automata and formal-language theories of computer science
• Stephen Kleene, 1950s: Kleene algebra
• Kenneth Thompson, 1969: Unix: qed, ed
• 1970s to 1990s: Unix: grep, awk, sed, emacs
• Programming languages:
– C, Perl
– JavaScript, Java
Patterns
• Regex is based on pattern matching: Strings are searched for certain patterns
• Simplest regex is a string-literal pattern
• Metacharacters: ([{\^$|)?*+.
– Period means “any character”
– To search for a period as a string literal, escape it with a backslash: \.
REGEX: fox
TEXT: The quick brown fox
RESULT: fox
REGEX: fo.
TEXT: The quick brown fox
RESULT: fox
REGEX: .o.
TEXT: The quick brown fox
RESULT: row, fox
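The three searches above can be reproduced with java.util.regex. This is a minimal sketch; the class name PatternBasics and the helper findAll are illustrative and not part of the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PatternBasics {
    // Collects every non-overlapping match of regex in text, left to right.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox";
        System.out.println(findAll("fox", text));   // [fox]
        System.out.println(findAll("fo.", text));   // [fox]
        System.out.println(findAll(".o.", text));   // [row, fox]
        // An escaped period matches only a literal period.
        System.out.println(findAll("\\.", "a.b"));  // [.]
    }
}
```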
Character classes (1/3)
• Syntax: any characters between [ and ]
• A character class matches exactly one character
• Negation: ^
REGEX: [rcb]at
TEXT: bat
RESULT: bat
REGEX: [rcb]at
TEXT: rat
RESULT: rat
REGEX: [rcb]at
TEXT: cat
RESULT: cat
REGEX: [rcb]at
TEXT: hat
RESULT: -
REGEX: [^rcb]at
TEXT: rat
RESULT: -
REGEX: [^rcb]at
TEXT: hat
RESULT: hat
Character classes (2/3)
• Ranges: [a-z], [0-9], [i-n], [a-zA-Z]...
• Unions: [0-4[6-8]], [a-p[r-w]], ...
• Intersections: [a-f&&[efg]], [a-f&&[e-k]], ...
• Subtractions: [a-f&&[^efg]], ...
REGEX: [rcb]at[1-5]
TEXT: bat4 RESULT: bat4
REGEX: [rcb]at[1-5[7-8]]
TEXT: hat7 RESULT: -
REGEX: [rcb]at[1-7&&[78]]
TEXT: rat7 RESULT: rat7
REGEX: [rcb]at[1-5&&[^34]]
TEXT: bat4 RESULT: -
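The character-class examples can be checked in Java, whose regex dialect supports the nested union and && intersection syntax shown above. A small sketch; the class and helper names are made up for illustration.

```java
import java.util.regex.Pattern;

final class CharClassDemo {
    // True when the whole text matches the regex (not just a substring).
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(matches("[rcb]at", "bat"));              // true
        System.out.println(matches("[rcb]at", "hat"));              // false
        System.out.println(matches("[^rcb]at", "hat"));             // true
        System.out.println(matches("[rcb]at[1-5[7-8]]", "bat7"));   // true: union of 1-5 and 7-8
        System.out.println(matches("[rcb]at[1-5[7-8]]", "bat6"));   // false
        System.out.println(matches("[rcb]at[1-7&&[78]]", "rat7"));  // true: the intersection is just 7
        System.out.println(matches("[rcb]at[1-5&&[^34]]", "bat4")); // false: 3 and 4 are subtracted
        System.out.println(matches("[rcb]at[1-5&&[^34]]", "bat5")); // true
    }
}
```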
Character classes (3/3)
Predefined character class   Meaning                          Equivalence
.                            any character
\d                           any digit                        [0-9]
\D                           any non-digit                    [^0-9], [^\d]
\s                           any white-space character        [ \t\n\x0B\f\r]
\S                           any non-white-space character    [^\s]
\w                           any word character               [a-zA-Z_0-9]
\W                           any non-word character           [^\w]
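In Java source code the backslash of a predefined class has to be doubled, because the string literal consumes one level of escaping before the regex engine sees the pattern. A short sketch with illustrative names:

```java
import java.util.regex.Pattern;

final class PredefinedClassDemo {
    // True when the whole text matches the regex.
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // The regex \d\d is written "\\d\\d" in a Java string literal.
        System.out.println(matches("\\d\\d", "42"));       // true
        System.out.println(matches("\\D", "7"));           // false
        System.out.println(matches("\\w\\s\\w", "a b"));   // true
        System.out.println(matches("\\w\\s\\w", "a\tb"));  // true: a tab is white space
        System.out.println(matches("\\W", "_"));           // false: underscore is a word character
        System.out.println(matches("\\.", "x"));           // false: \. matches only a literal period
    }
}
```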
Quantifiers (1/5)
• Quantifiers allow character classes to match more than one character at a time.
Quantifiers for character classes X
X? zero or one time
X* zero or more times
X+ one or more times
X{n} exactly n times
X{n,} at least n times
X{n,m} at least n and at most m times
Quantifiers (2/5)
• Examples of X?, X*, X+
REGEX: “a?”
TEXT: “”
RESULT: “”
REGEX: “a*”
TEXT: “”
RESULT: “”
REGEX: “a+”
TEXT: “”
RESULT: -
REGEX: “a?”
TEXT: “a”
RESULT: “a”
REGEX: “a*”
TEXT: “a”
RESULT: “a”
REGEX: “a+”
TEXT: “a”
RESULT: “a”
REGEX: “a?”
TEXT: “aaa”
RESULT: “a”, “a”, “a”
REGEX: “a*”
TEXT: “aaa”
RESULT: “aaa”
REGEX: “a+”
TEXT: “aaa”
RESULT: “aaa”
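When these examples are run through java.util.regex with repeated find() calls, the quantifiers that allow zero occurrences (a? and a*) also report a zero-length match at the end of the input, which the tables above leave out. A sketch with an illustrative helper:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierDemo {
    // Collects every match found by repeated find() calls, including empty ones.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll("a?", "aaa").size()); // 4: "a", "a", "a" and one empty match
        System.out.println(findAll("a*", "aaa").size()); // 2: "aaa" and one empty match
        System.out.println(findAll("a+", "aaa"));        // [aaa]
        System.out.println(findAll("a?", "").size());    // 1: the empty match
        System.out.println(findAll("a+", "").size());    // 0: a+ needs at least one a
    }
}
```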
Quantifiers (3/5)
REGEX: “[abc]{3}”
TEXT: “abccabaaaccbbbc”
RESULT: “abc”, “cab”, “aaa”, “ccb”, “bbc”
REGEX: “abc{3}”
TEXT: “abccabaaaccbbbc”
RESULT: -
REGEX: “(dog){3}”
TEXT: “dogdogdogdogdogdog”
RESULT: “dogdogdog”, “dogdogdog”
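The three cases differ only in what the {3} applies to: a character class, the single character c, or a whole group. Counting matches in Java makes that visible. The helper name count is illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RepeatCountDemo {
    // Counts the non-overlapping matches of regex in text.
    static int count(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "abccabaaaccbbbc";
        System.out.println(count("[abc]{3}", text));                // 5: any three of a, b, c
        System.out.println(count("abc{3}", text));                  // 0: {3} applies to c only
        System.out.println(count("abc{3}", "abccc"));               // 1: "ab" followed by "ccc"
        System.out.println(count("(dog){3}", "dogdogdogdogdogdog")); // 2: the group repeats as a unit
    }
}
```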
Quantifiers (4/5)
• Greedy quantifiers:
– read complete string
– work backwards until match found
– syntax: X?, X*, X+, ...
• Reluctant quantifiers:
– read one character at a time
– work forward until match found
– syntax: X??, X*?, X+?, ...
• Possessive quantifiers:
– read complete string
– try match only once
– syntax: X?+, X*+, X++, ...
Quantifiers (5/5)
Greedy:
REGEX: “.*foo”
TEXT: “xfooxxxxxxfoo”
RESULT: “xfooxxxxxxfoo”
Reluctant:
REGEX: “.*?foo”
TEXT: “xfooxxxxxxfoo”
RESULT: “xfoo”, “xxxxxxfoo”
Possessive:
REGEX: “.*+foo”
TEXT: “xfooxxxxxxfoo”
RESULT: -
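The same text run against the greedy, reluctant and possessive variants in Java gives three different outcomes. A sketch; the class and helper names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedinessDemo {
    // Collects every non-overlapping match of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "xfooxxxxxxfoo";
        // Greedy: .* takes everything, then backs off until "foo" fits.
        System.out.println(findAll(".*foo", text));   // [xfooxxxxxxfoo]
        // Reluctant: .*? grows one character at a time until "foo" fits.
        System.out.println(findAll(".*?foo", text));  // [xfoo, xxxxxxfoo]
        // Possessive: .*+ takes everything and never gives any of it back.
        System.out.println(findAll(".*+foo", text));  // []
    }
}
```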
Capturing groups (1/2)
• Capturing groups treat multiple characters as a single unit
• Syntax: between parentheses ( and )
• Example: (dog){3}
• Numbering from left to right
– Example: ((A)(B(C)))
• Group 1: ((A)(B(C)))
• Group 2: (A)
• Group 3: (B(C))
• Group 4: (C)
Capturing groups (2/2)
• Backreferences to capturing groups are denoted by \i, where i is the group number
REGEX: “(\d\d)\1”
TEXT: “1212”
RESULT: “1212”
REGEX: “(\d\d)\1”
TEXT: “1234”
RESULT: -
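Group numbering and backreferences can be inspected through Matcher.group(int), where group 0 is always the entire match. A minimal sketch, assuming the backreference example reads (\d\d)\1; the class and method names are illustrative.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupDemo {
    // Returns group 0..n for a full match, or an empty array if the text does not match.
    static String[] groups(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Groups are numbered by their opening parenthesis, left to right.
        System.out.println(Arrays.toString(groups("((A)(B(C)))", "ABC"))); // [ABC, ABC, A, BC, C]
        // \1 must repeat exactly what group 1 captured.
        System.out.println(Pattern.matches("(\\d\\d)\\1", "1212")); // true
        System.out.println(Pattern.matches("(\\d\\d)\\1", "1234")); // false
    }
}
```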
Boundaries (1/2)
Boundary characters
^     beginning of line
$     end of line
\b    a word boundary
\B    a non-word boundary
\A    beginning of input
\G    end of previous match
\z    end of input
\Z    end of input, but before final terminator, if any
Boundaries (2/2)
• Be aware:
• End-of-line marker is $
– Unix EOL is \n
– Windows EOL is \r\n
– JDK uses any of the following as EOL:
• '\n', '\r\n', '\r', '\u0085', '\u2028', '\u2029'
• Always test your regular expressions on the target OS
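The effect of line terminators is easiest to see by matching the same mixed-EOL text under different flags. In java.util.regex, ^ and $ match at line boundaries only when Pattern.MULTILINE is set, and Pattern.UNIX_LINES restricts the recognized terminator to \n. The sketch below uses an illustrative helper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LineBoundaryDemo {
    // Collects every match of regex in text, compiled with the given flags.
    static List<String> findAll(String regex, int flags, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "one\r\ntwo\nthree"; // one Windows EOL, one Unix EOL

        // Default mode: ^ and $ anchor to the whole input, so nothing matches.
        System.out.println(findAll("^\\w+$", 0, text)); // []
        // MULTILINE: both \r\n and \n end a line.
        System.out.println(findAll("^\\w+$", Pattern.MULTILINE, text)); // [one, two, three]
        // UNIX_LINES: only \n ends a line, so the \r after "one" gets in the way.
        System.out.println(findAll("^\\w+$", Pattern.MULTILINE | Pattern.UNIX_LINES, text)); // [two, three]
        // \b matches between a word character and a non-word character.
        System.out.println(findAll("\\bcat\\b", 0, "cat concat cats")); // [cat]
    }
}
```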
Internationalization (1/2)
• Regular expressions were originally designed for the ASCII (Basic Latin) set of characters.
– Thus “België” is not matched by ^\w+$
• Extension to Unicode character sets is denoted by \p{...}
• Character set: [\p{InCharacterSet}]
– Create character classes from symbols in character sets.
– “België” is matched by ^[\w[\p{InLatin-1Supplement}]]+$
Internationalization (2/2)
• Note that there are non-letters in character sets as well:
– Latin-1 Supplement: ¡¢£¤¥¦§¨©ª«-®¯°±²³´µ·¸¹º»¼½¾¿÷
• Categories:
– Letters: \p{L}
– Uppercase letters: \p{Lu}
– “België” is matched by ^\p{L}+$
• Other (POSIX) categories:
– Unicode currency symbols: \p{Sc}
– ASCII punctuation characters: \p{Punct}
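The difference between the ASCII-only \w and the Unicode categories shows up directly in Java. In this sketch the non-ASCII characters are written as \u escapes so the source compiles under any file encoding. Pattern.UNICODE_CHARACTER_CLASS is a newer flag (Java 7 and later) that is not mentioned in the slides. Class and helper names are illustrative.

```java
import java.util.regex.Pattern;

final class UnicodeDemo {
    // True when the whole text matches the regex compiled with the given flags.
    static boolean matches(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).matches();
    }

    public static void main(String[] args) {
        String belgie = "Belgi\u00eb"; // "België"

        System.out.println(matches("\\w+", 0, belgie));              // false: \w is ASCII-only by default
        System.out.println(matches("\\p{L}+", 0, belgie));           // true: any Unicode letter
        System.out.println(matches("\\p{Lu}\\p{L}*", 0, belgie));    // true: starts with an uppercase letter
        System.out.println(matches("\\w+", Pattern.UNICODE_CHARACTER_CLASS, belgie)); // true (Java 7+)
        System.out.println(matches("\\p{Sc}", 0, "\u20ac"));         // true: the euro sign is a currency symbol
        System.out.println(matches("\\p{Punct}", 0, "\u00a1"));      // false: inverted "!" is not ASCII punctuation
    }
}
```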
Regular expressions in Java
• Since JDK 1.4
• Package java.util.regex
– Pattern class
– Matcher class
• Convenience methods in java.lang.String
• Alternative for JDK 1.3
– Jakarta ORO project
java.util.regex.Pattern
• Wrapper class for regular expressions
• Useful methods:
– compile(String regex): Pattern
– matches(String regex, CharSequence text): boolean
– split(CharSequence text): String[]
String regex = "(\\d\\d)\\1";
Pattern p = Pattern.compile(regex);
java.util.regex.Matcher
• Useful methods:
– matches(): boolean
– find(): boolean
– find(int start): boolean
– group(): String
– replaceFirst(String replace): String
– replaceAll(String replace): String
String regex = "(\\d\\d)\\1";
Pattern p = Pattern.compile(regex);
String text = "1212";
Matcher m = p.matcher(text);
boolean matches = m.matches();
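The listed Matcher methods behave differently: matches() requires the whole input to match, while find() scans for the next occurrence. The sketch below reuses one compiled Pattern and also uses start() and end(), which are real Matcher methods not listed above. The class and helper names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatcherDemo {
    // A compiled Pattern is immutable and can be reused for many inputs.
    static final Pattern REPEATED_PAIR = Pattern.compile("(\\d\\d)\\1");

    // Describes each match as text@start-end.
    static List<String> describeMatches(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = REPEATED_PAIR.matcher(text);
        while (m.find()) {
            out.add(m.group() + "@" + m.start() + "-" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "1212 3434 1234";
        System.out.println(describeMatches(text));                          // [1212@0-4, 3434@5-9]
        System.out.println(REPEATED_PAIR.matcher(text).matches());          // false: not the whole input
        System.out.println(REPEATED_PAIR.matcher(text).find());             // true: a substring matches
        System.out.println(REPEATED_PAIR.matcher(text).replaceAll("#"));    // # # 1234
    }
}
```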
java.lang.String
• Pattern and Matcher methods in String:
– matches(String regex): boolean
– split(String regex): String[]
– replaceFirst(String regex, String replace): String
– replaceAll(String regex, String replace): String
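For one-off use the String methods avoid the explicit Pattern and Matcher objects. A brief sketch with made-up helper names; in the replacement string, $1 and $2 refer to captured groups.

```java
import java.util.Arrays;

final class StringRegexDemo {
    // Turns "First Last" into "Last, First" using group references in the replacement.
    static String swapNames(String name) {
        return name.replaceAll("(\\w+) (\\w+)", "$2, $1");
    }

    // Replaces every digit with '#'.
    static String maskDigits(String s) {
        return s.replaceAll("\\d", "#");
    }

    public static void main(String[] args) {
        System.out.println("1212".matches("(\\d\\d)\\1"));              // true
        System.out.println(Arrays.toString("a,b;c".split("[,;]")));     // [a, b, c]
        System.out.println("2024-05-01".replaceFirst("\\d+", "YYYY"));  // YYYY-05-01
        System.out.println(maskDigits("2024-05-01"));                   // ####-##-##
        System.out.println(swapNames("John Smith"));                    // Smith, John
    }
}
```

Each of these calls compiles the regex again, so a precompiled Pattern is usually preferable inside loops.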
Examples
• Validation
• Searching text
• Filtering
• Parsing
• Removing duplicate lines
• On-the-fly editing
Examples: validation
• Validate an e-mail address
[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4}
• Validate a URL
(http|https|ftp)://([a-zA-Z0-9](\w+\.)+\w{2,7}|local\w*)(:\d+)?(/(\w+[\w/\-.]*)?)?
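A sketch of how these two validators might be wired up in Java. It assumes the backslashes lost in the slide text belong before the periods, \w and \d, and before the hyphen in the last character class of the URL pattern. The e-mail pattern lists uppercase letters only, so it is compiled with CASE_INSENSITIVE here. Class and method names are illustrative.

```java
import java.util.regex.Pattern;

final class ValidationDemo {
    // Assumption: the e-mail pattern is meant to be applied case-insensitively.
    static final Pattern EMAIL = Pattern.compile(
            "[A-Z0-9._%-]+@[A-Z0-9._%-]+\\.[A-Z0-9._%-]{2,4}",
            Pattern.CASE_INSENSITIVE);

    static final Pattern URL = Pattern.compile(
            "(http|https|ftp)://([a-zA-Z0-9](\\w+\\.)+\\w{2,7}|local\\w*)"
            + "(:\\d+)?(/(\\w+[\\w/\\-.]*)?)?");

    // matches() requires the entire input to fit the pattern.
    static boolean isEmail(String s) {
        return EMAIL.matcher(s).matches();
    }

    static boolean isUrl(String s) {
        return URL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isEmail("john.doe@example.com"));           // true
        System.out.println(isEmail("john.doe@example"));               // false: no top-level domain
        System.out.println(isUrl("http://www.example.com/index.html")); // true
        System.out.println(isUrl("http://localhost:8080/"));            // true
        System.out.println(isUrl("mailto:john@example.com"));           // false: unsupported scheme
    }
}
```

Patterns like these are pragmatic filters, not full implementations of the e-mail or URL grammars.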
Examples: searching text
• Write an HttpUnit test that submits an HTML form and checks whether the HTTP response is a
confirmation screen containing a generated form number of the form 9xxxxxx-xxxxxx:
9[0-9]{6}-[0-9]{6}
String regexp = "9[0-9]{6}-[0-9]{6}";
Pattern p = Pattern.compile(regexp);
Matcher m = p.matcher(text);
boolean ok = m.find();
String nr = ok ? m.group() : null; // group() throws IllegalStateException without a successful match
Examples: filtering
• Filter e-mails whose subject is in capitals only, optionally with a leading “Re:”
(R[eE]:)*[^a-z]*$
Examples: parsing
• Matches any opening and closing XML tag:
– Note the use of the back reference
<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>
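A sketch of the tag pattern in Java, assuming the closing tag is written with the backreference \1. Since the pattern lists uppercase letters only, it is compiled with CASE_INSENSITIVE so that lowercase tags match too. The class and method names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlTagDemo {
    // Group 1 is the tag name, group 2 the content; \1 forces the closing tag to match the opening one.
    static final Pattern TAG = Pattern.compile(
            "<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\\1>", Pattern.CASE_INSENSITIVE);

    // Returns {tagName, content} of the first element found, or null if there is none.
    static String[] firstElement(String xml) {
        Matcher m = TAG.matcher(xml);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] e = firstElement("<b class=\"x\">bold</b>");
        System.out.println(e[0] + " -> " + e[1]);          // b -> bold
        System.out.println(firstElement("<b>bold</i>"));   // null: the tags do not pair up
    }
}
```

This works for simple fragments only; nested elements with the same name need a real XML parser.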
Examples: duplicate lines
• Suppose you want to remove duplicate lines from a text.
– requirement here is that the lines are sorted alphabetically
^(.*)(\r?\n\1)+$
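In Java this pattern needs multiline mode so that ^ and $ apply to each line. The sketch below enables it with the embedded flag (?m) and keeps one copy of each run by replacing the match with group 1. It assumes the pattern reads ^(.*)(\r?\n\1)+$ with the escapes restored. The class and method names are illustrative.

```java
final class DedupDemo {
    // Collapses each run of identical adjacent lines into a single line.
    static String removeAdjacentDuplicates(String text) {
        return text.replaceAll("(?m)^(.*)(\\r?\\n\\1)+$", "$1");
    }

    public static void main(String[] args) {
        String sorted = "apple\napple\nbanana\ncherry\ncherry\ncherry";
        System.out.println(removeAdjacentDuplicates(sorted));
        // prints:
        // apple
        // banana
        // cherry
    }
}
```

Only adjacent duplicates are removed, which is why the input has to be sorted first.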
Examples: on-the-fly editing
• Suppose you want to edit a file in batch: all occurrences of a certain string pattern
should be replaced with another string.
• In Unix: use the sed command with a regex
• In Java: use string.replaceAll(regex, "mystring")
• In Ant: use replaceregexp optional task to, e.g., edit deployment descriptors
depending on environment
Quiz
• What are the following regular expressions looking for?
\d+                          at least one digit
[-+]?\d+                     any integer
((\d*\.?)?\d+|\d+(\.?\d*))   any positive decimal
[\p{L}']['-.\p{L} ]+         a place name
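The quiz answers can be checked against sample inputs. The sketch assumes the patterns carry the backslashes lost in the slide text (\d, \. and \p{L}). The constant names are illustrative and the sample strings are not from the slides.

```java
import java.util.regex.Pattern;

final class QuizDemo {
    static final String DIGITS  = "\\d+";
    static final String INTEGER = "[-+]?\\d+";
    static final String DECIMAL = "((\\d*\\.?)?\\d+|\\d+(\\.?\\d*))";
    static final String PLACE   = "[\\p{L}']['-.\\p{L} ]+";

    // True when the whole text matches the regex.
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        System.out.println(matches(DIGITS, "2024"));             // true
        System.out.println(matches(INTEGER, "-17"));             // true
        System.out.println(matches(INTEGER, "1.5"));             // false
        System.out.println(matches(DECIMAL, "3.14"));            // true
        System.out.println(matches(DECIMAL, ".5"));              // true
        System.out.println(matches(DECIMAL, "10."));             // true
        System.out.println(matches(PLACE, "'s-Hertogenbosch"));  // true
        System.out.println(matches(PLACE, "St. John's"));        // true
        System.out.println(matches(PLACE, "1234"));              // false
    }
}
```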
Conclusion
• When doing one of the following:
– validating strings
– on-the-fly editing of strings
– searching strings
– filtering strings
• think regex!
References
• http://www.regular-expressions.info/
• http://www.regexlib.com/
• http://developer.java.sun.com/developer/technicalArticles/releases/1.4regex/
• http://java.sun.com/docs/books/tutorial/extra/regex/
• http://www.wellho.net/regex/javare.html
• >JDK 1.4 API
• Mastering Regular Expressions

More Related Content

What's hot

oracle Sql constraint
oracle  Sql constraint oracle  Sql constraint
oracle Sql constraint home
 
Regular Expression
Regular ExpressionRegular Expression
Regular Expressionvaluebound
 
Regular Expressions 101 Introduction to Regular Expressions
Regular Expressions 101 Introduction to Regular ExpressionsRegular Expressions 101 Introduction to Regular Expressions
Regular Expressions 101 Introduction to Regular ExpressionsDanny Bryant
 
Regular expressions
Regular expressionsRegular expressions
Regular expressionsRaj Gupta
 
Regex Presentation
Regex PresentationRegex Presentation
Regex Presentationarnolambert
 
Including Constraints -Oracle Data base
Including Constraints -Oracle Data base Including Constraints -Oracle Data base
Including Constraints -Oracle Data base Salman Memon
 
Regular expression
Regular expressionRegular expression
Regular expressionLarry Nung
 
Introduction to fa and dfa
Introduction to fa  and dfaIntroduction to fa  and dfa
Introduction to fa and dfadeepinderbedi
 
Operators and expressions in C++
Operators and expressions in C++Operators and expressions in C++
Operators and expressions in C++Neeru Mittal
 
Java Web Programming [1/9] : Introduction to Web Application
Java Web Programming [1/9] : Introduction to Web ApplicationJava Web Programming [1/9] : Introduction to Web Application
Java Web Programming [1/9] : Introduction to Web ApplicationIMC Institute
 
Array in c language
Array in c languageArray in c language
Array in c languagehome
 
Regular Expression
Regular ExpressionRegular Expression
Regular ExpressionLambert Lum
 
Loop(for, while, do while) condition Presentation
Loop(for, while, do while) condition PresentationLoop(for, while, do while) condition Presentation
Loop(for, while, do while) condition PresentationBadrul Alam
 
Hashing and Hash Tables
Hashing and Hash TablesHashing and Hash Tables
Hashing and Hash Tablesadil raja
 
Regular expressions
Regular expressionsRegular expressions
Regular expressionsShiraz316
 

What's hot (20)

Regular Expressions
Regular ExpressionsRegular Expressions
Regular Expressions
 
oracle Sql constraint
oracle  Sql constraint oracle  Sql constraint
oracle Sql constraint
 
Regular Expression
Regular ExpressionRegular Expression
Regular Expression
 
Regular Expressions 101 Introduction to Regular Expressions
Regular Expressions 101 Introduction to Regular ExpressionsRegular Expressions 101 Introduction to Regular Expressions
Regular Expressions 101 Introduction to Regular Expressions
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
Regex Presentation
Regex PresentationRegex Presentation
Regex Presentation
 
Including Constraints -Oracle Data base
Including Constraints -Oracle Data base Including Constraints -Oracle Data base
Including Constraints -Oracle Data base
 
Regular expression
Regular expressionRegular expression
Regular expression
 
Break and continue
Break and continueBreak and continue
Break and continue
 
Pointers
PointersPointers
Pointers
 
Introduction to fa and dfa
Introduction to fa  and dfaIntroduction to fa  and dfa
Introduction to fa and dfa
 
Operators and expressions in C++
Operators and expressions in C++Operators and expressions in C++
Operators and expressions in C++
 
Java Web Programming [1/9] : Introduction to Web Application
Java Web Programming [1/9] : Introduction to Web ApplicationJava Web Programming [1/9] : Introduction to Web Application
Java Web Programming [1/9] : Introduction to Web Application
 
Introduction to c#
Introduction to c#Introduction to c#
Introduction to c#
 
Array in c language
Array in c languageArray in c language
Array in c language
 
C Structures And Unions
C  Structures And  UnionsC  Structures And  Unions
C Structures And Unions
 
Regular Expression
Regular ExpressionRegular Expression
Regular Expression
 
Loop(for, while, do while) condition Presentation
Loop(for, while, do while) condition PresentationLoop(for, while, do while) condition Presentation
Loop(for, while, do while) condition Presentation
 
Hashing and Hash Tables
Hashing and Hash TablesHashing and Hash Tables
Hashing and Hash Tables
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
 

Viewers also liked

Lecture: Regular Expressions and Regular Languages
Lecture: Regular Expressions and Regular LanguagesLecture: Regular Expressions and Regular Languages
Lecture: Regular Expressions and Regular LanguagesMarina Santini
 
Learn PHP Lacture1
Learn PHP Lacture1Learn PHP Lacture1
Learn PHP Lacture1ADARSH BHATT
 
Regular Expressions: JavaScript And Beyond
Regular Expressions: JavaScript And BeyondRegular Expressions: JavaScript And Beyond
Regular Expressions: JavaScript And BeyondMax Shirshin
 
Introduction to regular expressions
Introduction to regular expressionsIntroduction to regular expressions
Introduction to regular expressionsBen Brumfield
 
Bitcoin: the future money, or a scam?
Bitcoin: the future money, or a scam?Bitcoin: the future money, or a scam?
Bitcoin: the future money, or a scam?Ignaz Wanders
 
The Service doing "Ping"
The Service doing "Ping"The Service doing "Ping"
The Service doing "Ping"Ignaz Wanders
 
Web Service Versioning
Web Service VersioningWeb Service Versioning
Web Service VersioningIgnaz Wanders
 
Lecture 03 lexical analysis
Lecture 03 lexical analysisLecture 03 lexical analysis
Lecture 03 lexical analysisIffat Anjum
 
Finite Automata
Finite AutomataFinite Automata
Finite AutomataShiraz316
 
Regular expression with DFA
Regular expression with DFARegular expression with DFA
Regular expression with DFAMaulik Togadiya
 
Field Extractions: Making Regex Your Buddy
Field Extractions: Making Regex Your BuddyField Extractions: Making Regex Your Buddy
Field Extractions: Making Regex Your BuddyMichael Wilde
 

Viewers also liked (17)

Lecture: Regular Expressions and Regular Languages
Lecture: Regular Expressions and Regular LanguagesLecture: Regular Expressions and Regular Languages
Lecture: Regular Expressions and Regular Languages
 
Regular expression (compiler)
Regular expression (compiler)Regular expression (compiler)
Regular expression (compiler)
 
Learn PHP Lacture1
Learn PHP Lacture1Learn PHP Lacture1
Learn PHP Lacture1
 
Regular Expressions: JavaScript And Beyond
Regular Expressions: JavaScript And BeyondRegular Expressions: JavaScript And Beyond
Regular Expressions: JavaScript And Beyond
 
Introduction to regular expressions
Introduction to regular expressionsIntroduction to regular expressions
Introduction to regular expressions
 
Bitcoin: the future money, or a scam?
Bitcoin: the future money, or a scam?Bitcoin: the future money, or a scam?
Bitcoin: the future money, or a scam?
 
The Service doing "Ping"
The Service doing "Ping"The Service doing "Ping"
The Service doing "Ping"
 
Reflexive Access List
Reflexive Access ListReflexive Access List
Reflexive Access List
 
Regular Expressions
Regular ExpressionsRegular Expressions
Regular Expressions
 
Tests
TestsTests
Tests
 
Regular expression examples
Regular expression examplesRegular expression examples
Regular expression examples
 
Lecture2 B
Lecture2 BLecture2 B
Lecture2 B
 
Web Service Versioning
Web Service VersioningWeb Service Versioning
Web Service Versioning
 
Lecture 03 lexical analysis
Lecture 03 lexical analysisLecture 03 lexical analysis
Lecture 03 lexical analysis
 
Finite Automata
Finite AutomataFinite Automata
Finite Automata
 
Regular expression with DFA
Regular expression with DFARegular expression with DFA
Regular expression with DFA
 
Field Extractions: Making Regex Your Buddy
Field Extractions: Making Regex Your BuddyField Extractions: Making Regex Your Buddy
Field Extractions: Making Regex Your Buddy
 

Similar to Regular expressions

Regular Expressions: QA Challenge Accepted Conf (March 2015)
Regular Expressions: QA Challenge Accepted Conf (March 2015)Regular Expressions: QA Challenge Accepted Conf (March 2015)
Regular Expressions: QA Challenge Accepted Conf (March 2015)Svetlin Nakov
 
Regular Expressions and You
Regular Expressions and YouRegular Expressions and You
Regular Expressions and YouJames Armes
 
Introduction to Regular Expressions RootsTech 2013
Introduction to Regular Expressions RootsTech 2013Introduction to Regular Expressions RootsTech 2013
Introduction to Regular Expressions RootsTech 2013Ben Brumfield
 
Bioinformatics p2-p3-perl-regexes v2013-wim_vancriekinge
Bioinformatics p2-p3-perl-regexes v2013-wim_vancriekingeBioinformatics p2-p3-perl-regexes v2013-wim_vancriekinge
Bioinformatics p2-p3-perl-regexes v2013-wim_vancriekingeProf. Wim Van Criekinge
 
Course 102: Lecture 13: Regular Expressions
Course 102: Lecture 13: Regular Expressions Course 102: Lecture 13: Regular Expressions
Course 102: Lecture 13: Regular Expressions Ahmed El-Arabawy
 
Regular Expressions grep and egrep
Regular Expressions grep and egrepRegular Expressions grep and egrep
Regular Expressions grep and egrepTri Truong
 
Regular expressions-ada-2018
Regular expressions-ada-2018Regular expressions-ada-2018
Regular expressions-ada-2018Emma Burrows
 
FUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdf
FUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdfFUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdf
FUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdfBryan Alejos
 
Regular Expressions Boot Camp
Regular Expressions Boot CampRegular Expressions Boot Camp
Regular Expressions Boot CampChris Schiffhauer
 
Week-2: Theory & Practice of Data Cleaning: Regular Expressions in Practice
Week-2: Theory & Practice of Data Cleaning: Regular Expressions in PracticeWeek-2: Theory & Practice of Data Cleaning: Regular Expressions in Practice
Week-2: Theory & Practice of Data Cleaning: Regular Expressions in PracticeBertram Ludäscher
 
Js reg正则表达式
Js reg正则表达式Js reg正则表达式
Js reg正则表达式keke302
 
Introduction to Regular Expressions
Introduction to Regular ExpressionsIntroduction to Regular Expressions
Introduction to Regular ExpressionsJesse Anderson
 

Similar to Regular expressions (20)

Regular expression for everyone
Regular expression for everyoneRegular expression for everyone
Regular expression for everyone
 
Regular Expressions: QA Challenge Accepted Conf (March 2015)
Regular Expressions: QA Challenge Accepted Conf (March 2015)Regular Expressions: QA Challenge Accepted Conf (March 2015)
Regular Expressions: QA Challenge Accepted Conf (March 2015)
 
Regular Expressions
Regular ExpressionsRegular Expressions
Regular Expressions
 
Regular Expressions and You
Regular Expressions and YouRegular Expressions and You
Regular Expressions and You
 
Introduction to Regular Expressions RootsTech 2013
Introduction to Regular Expressions RootsTech 2013Introduction to Regular Expressions RootsTech 2013
Introduction to Regular Expressions RootsTech 2013
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
Bioinformatica p2-p3-introduction
Bioinformatica p2-p3-introductionBioinformatica p2-p3-introduction
Bioinformatica p2-p3-introduction
 
Bioinformatics p2-p3-perl-regexes v2013-wim_vancriekinge
Bioinformatics p2-p3-perl-regexes v2013-wim_vancriekingeBioinformatics p2-p3-perl-regexes v2013-wim_vancriekinge
Bioinformatics p2-p3-perl-regexes v2013-wim_vancriekinge
 
Course 102: Lecture 13: Regular Expressions
Course 102: Lecture 13: Regular Expressions Course 102: Lecture 13: Regular Expressions
Course 102: Lecture 13: Regular Expressions
 
JavaScript.pptx
JavaScript.pptxJavaScript.pptx
JavaScript.pptx
 
Regular Expressions grep and egrep
Regular Expressions grep and egrepRegular Expressions grep and egrep
Regular Expressions grep and egrep
 
Json the-x-in-ajax1588
Json the-x-in-ajax1588Json the-x-in-ajax1588
Json the-x-in-ajax1588
 
Regular expressions-ada-2018
Regular expressions-ada-2018Regular expressions-ada-2018
Regular expressions-ada-2018
 
Json demo
Json demoJson demo
Json demo
 
FUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdf
FUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdfFUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdf
FUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdf
 
Regular Expressions Boot Camp
Regular Expressions Boot CampRegular Expressions Boot Camp
Regular Expressions Boot Camp
 
Week-2: Theory & Practice of Data Cleaning: Regular Expressions in Practice
Week-2: Theory & Practice of Data Cleaning: Regular Expressions in PracticeWeek-2: Theory & Practice of Data Cleaning: Regular Expressions in Practice
Week-2: Theory & Practice of Data Cleaning: Regular Expressions in Practice
 
Js reg正则表达式
Js reg正则表达式Js reg正则表达式
Js reg正则表达式
 
Introduction to Regular Expressions
Introduction to Regular ExpressionsIntroduction to Regular Expressions
Introduction to Regular Expressions
 
Quick start reg ex
Quick start reg exQuick start reg ex
Quick start reg ex
 

Recently uploaded

What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Recently uploaded (20)

What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

Regular expressions

  • 1. Regular Expressions Powerful string validation and extraction Ignaz Wanders – Architect @ Archimiddle @ignazw
  • 2. Topics • What are regular expressions? • Patterns • Character classes • Quantifiers • Capturing groups • Boundaries • Internationalization • Regular expressions in Java • Quiz • References
  • 3. What are regular expressions? • A regex is a string pattern used to search and manipulate text • A regex has special syntax • Very powerful for any type of String manipulation ranging from simple to very complex structures: – Input validation – S(ubs)tring replacement – ... • Example: • [A-Z0-9._%-]+@[A-Z0-9._%-]+.[A-Z0-9._%-]{2,4}
  • 4. History • Originates from automata and formal-language theories of computer science • Stephen Kleene, 1950s: Kleene algebra • Kenneth Thompson, 1969: Unix: qed, ed • 1970s to 1990s: Unix: grep, awk, sed, emacs • Programming languages: – C, Perl – JavaScript, Java
  • 5. Patterns • Regex is based on pattern matching: Strings are searched for certain patterns • Simplest regex is a string-literal pattern • Metacharacters: ([{\^$|)?*+. – Period means “any character” – To search for period as string literal, escape with “\” REGEX: fox TEXT: The quick brown fox RESULT: fox REGEX: fo. TEXT: The quick brown fox RESULT: fox REGEX: .o. TEXT: The quick brown fox RESULT: row, fox
  • 6. Character classes (1/3) • Syntax: any characters between [ and ] • A character class matches exactly one character • Negation: ^ directly after [ REGEX: [rcb]at TEXT: bat RESULT: bat REGEX: [rcb]at TEXT: rat RESULT: rat REGEX: [rcb]at TEXT: cat RESULT: cat REGEX: [rcb]at TEXT: hat RESULT: - REGEX: [^rcb]at TEXT: rat RESULT: - REGEX: [^rcb]at TEXT: hat RESULT: hat
  • 7. Character classes (2/3) • Ranges: [a-z], [0-9], [i-n], [a-zA-Z]... • Unions: [0-4[6-8]], [a-p[r-w]], ... • Intersections: [a-f&&[efg]], [a-f&&[e-k]], ... • Subtractions: [a-f&&[^efg]], ... REGEX: [rcb]at[1-5] TEXT: bat4 RESULT: bat4 REGEX: [rcb]at[1-5[7-8]] TEXT: hat7 RESULT: - REGEX: [rcb]at[1-7&&[78]] TEXT: rat7 RESULT: rat7 REGEX: [rcb]at[1-5&&[^34]] TEXT: bat4 RESULT: -
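The union, intersection and subtraction examples from this slide can be checked directly in Java. The following is a minimal sketch; the class name `CharacterClassOps` and the helper `matches` are made up for illustration, and the extra inputs `rat7` (union) and `bat5` (subtraction) are added here as positive counterparts to the slide's negative examples.

```java
import java.util.regex.Pattern;

class CharacterClassOps {
    // True when the whole text matches the regex (hypothetical helper).
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // Range: one digit from 1 to 5
        System.out.println(matches("[rcb]at[1-5]", "bat4"));
        // Union: 1-5 or 7-8; "hat7" fails on the first letter, "rat7" passes
        System.out.println(matches("[rcb]at[1-5[7-8]]", "hat7"));
        System.out.println(matches("[rcb]at[1-5[7-8]]", "rat7"));
        // Intersection: digits that are in 1-7 AND in {7, 8}, so only 7
        System.out.println(matches("[rcb]at[1-7&&[78]]", "rat7"));
        // Subtraction: 1-5 without 3 and 4
        System.out.println(matches("[rcb]at[1-5&&[^34]]", "bat4"));
        System.out.println(matches("[rcb]at[1-5&&[^34]]", "bat5"));
    }
}
```

Nested classes and the `&&` operator are part of the Java syntax; other regex flavours may not accept them.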
  • 8. Character classes (3/3) Predefined character classes and their equivalents: • . any character • \d any digit: [0-9] • \D any non-digit: [^0-9], [^\d] • \s any white-space character: [ \t\n\x0B\f\r] • \S any non-white-space character: [^\s] • \w any word character: [a-zA-Z_0-9] • \W any non-word character: [^\w]
  • 9. Quantifiers (1/5) • Quantifiers allow character classes to match more than one character at a time. Quantifiers for a character class X: • X? zero or one time • X* zero or more times • X+ one or more times • X{n} exactly n times • X{n,} at least n times • X{n,m} at least n and at most m times
  • 10. Quantifiers (2/5) • Examples of X?, X*, X+ REGEX: “a?” TEXT: “” RESULT: “” REGEX: “a*” TEXT: “” RESULT: “” REGEX: “a+” TEXT: “” RESULT: - REGEX: “a?” TEXT: “a” RESULT: “a” REGEX: “a*” TEXT: “a” RESULT: “a” REGEX: “a+” TEXT: “a” RESULT: “a” REGEX: “a?” TEXT: “aaa” RESULT: “a”, “a”, “a” REGEX: “a*” TEXT: “aaa” RESULT: “aaa” REGEX: “a+” TEXT: “aaa” RESULT: “aaa” • Note: with repeated find() calls in Java, “a?” and “a*” also report a zero-length match at the end of “a” and “aaa”
  • 11. Quantifiers (3/5) REGEX: “[abc]{3}” TEXT: “abccabaaaccbbbc” RESULT: “abc”, “cab”, “aaa”, “ccb”, “bbc” REGEX: “abc{3}” TEXT: “abccabaaaccbbbc” RESULT: - REGEX: “(dog){3}” TEXT: “dogdogdogdogdogdog” RESULT: “dogdogdog”, “dogdogdog”
  • 12. Quantifiers (4/5) • Greedy quantifiers: – read complete string – work backwards until match found – syntax: X?, X*, X+, ... • Reluctant quantifiers: – read one character at a time – work forward until match found – syntax: X??, X*?, X+?, ... • Possessive quantifiers: – read complete string – try match only once – syntax: X?+, X*+, X++, ...
  • 13. Quantifiers (5/5) REGEX: “.*foo” TEXT: “xfooxxxxxxfoo” RESULT: “xfooxxxxxxfoo” (greedy) REGEX: “.*?foo” TEXT: “xfooxxxxxxfoo” RESULT: “xfoo”, “xxxxxxfoo” (reluctant) REGEX: “.*+foo” TEXT: “xfooxxxxxxfoo” RESULT: - (possessive)
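The three quantifier modes on this slide can be compared side by side in Java. This is a small sketch; `QuantifierModes` and `findAll` are illustrative names, not part of the deck.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierModes {
    // Collects every successive match of regex in text (hypothetical helper).
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "xfooxxxxxxfoo";
        // Greedy: .* takes everything, then backs off until "foo" fits
        System.out.println(findAll(".*foo", text));
        // Reluctant: .*? takes as little as possible before each "foo"
        System.out.println(findAll(".*?foo", text));
        // Possessive: .*+ takes everything and never gives it back
        System.out.println(findAll(".*+foo", text));
    }
}
```

The possessive variant finds nothing because `.*+` consumes the trailing `foo` and does not backtrack.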
  • 14. Capturing groups (1/2) • Capturing groups treat multiple characters as a single unit • Syntax: between parentheses ( and ) • Example: (dog){3} • Numbering from left to right, by opening parenthesis – Example: ((A)(B(C))) • Group 1: ((A)(B(C))) • Group 2: (A) • Group 3: (B(C)) • Group 4: (C)
  • 15. Capturing groups (2/2) • Backreferences to capturing groups are denoted by \i, with i the group number REGEX: “(\d\d)\1” TEXT: “1212” RESULT: “1212” REGEX: “(\d\d)\1” TEXT: “1234” RESULT: -
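A short Java sketch of the backreference example: the pattern `(\d\d)\1` requires the two digits captured by group 1 to appear again immediately afterwards. The class and method names are invented for this illustration; note the doubled backslashes needed inside a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackreferenceDemo {
    // (\d\d) captures two digits; \1 demands the same two digits again
    static final Pattern REPEATED_PAIR = Pattern.compile("(\\d\\d)\\1");

    // Returns the captured pair when the whole text is "pair + same pair", else null.
    static String repeatedPair(String text) {
        Matcher m = REPEATED_PAIR.matcher(text);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(repeatedPair("1212")); // group 1 holds "12"
        System.out.println(repeatedPair("1234")); // no match, so null
    }
}
```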
  • 16. Boundaries (1/2) Boundary characters: • ^ beginning of line • $ end of line • \b a word boundary • \B a non-word boundary • \A beginning of input • \G end of previous match • \z end of input • \Z end of input, but before final terminator, if any
  • 17. Boundaries (2/2) • Be aware: • End-of-line marker is $ – Unix EOL is \n – Windows EOL is \r\n – JDK uses any of the following as EOL: • '\n', '\r\n', '\r', '\u0085', '\u2028', '\u2029' • Always test your regular expressions on the target OS
  • 18. Internationalization (1/2) • Regular expressions were originally designed for the ASCII Basic Latin set of characters. – Thus “België” is not matched by ^\w+$ • Extension to Unicode character sets denoted by \p{...} • Character set: [\p{InCharacterSet}] – Create character classes from symbols in character sets. – “België” is matched by ^[\w|[\p{InLatin-1Supplement}]]+$
  • 19. Internationalization (2/2) • Note that there are non-letters in character sets as well: – Latin-1 Supplement: ¡¢£¤¥¦§¨©ª«-®¯°±²³´µ·¸¹º»¼½¾¿÷ • Categories: – Letters: \p{L} – Uppercase letters: \p{Lu} – “België” is matched by ^\p{L}+$ • Other (POSIX) categories: – Unicode currency symbols: \p{Sc} – ASCII punctuation characters: \p{Punct}
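The difference between `\w` and `\p{L}` for the “België” example can be sketched in Java as follows. The class and method names are hypothetical, and the sketch assumes the default Pattern flags, under which `\w` covers only `[a-zA-Z_0-9]`. The ë is written as the escape `\u00eb` so the result does not depend on the source-file encoding.

```java
import java.util.regex.Pattern;

class UnicodeLetters {
    // Default \w is ASCII-only: [a-zA-Z_0-9]
    static final Pattern ASCII_WORD = Pattern.compile("^\\w+$");
    // \p{L} matches any Unicode letter
    static final Pattern ANY_LETTERS = Pattern.compile("^\\p{L}+$");

    static boolean isAsciiWord(String s) {
        return ASCII_WORD.matcher(s).find();
    }

    static boolean isLetters(String s) {
        return ANY_LETTERS.matcher(s).find();
    }

    public static void main(String[] args) {
        String belgie = "Belgi\u00eb"; // "België"
        System.out.println(isAsciiWord(belgie)); // ë is outside [a-zA-Z_0-9]
        System.out.println(isLetters(belgie));   // ë is a Unicode letter
    }
}
```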
  • 20. Regular expressions in Java • Since JDK 1.4 • Package java.util.regex – Pattern class – Matcher class • Convenience methods in java.lang.String • Alternative for JDK 1.3 – Jakarta ORO project
  • 21. java.util.regex.Pattern • Wrapper class for regular expressions • Useful methods: – compile(String regex): Pattern – matches(String regex, CharSequence text): boolean – split(String text): String[] String regex = "(\\d\\d)\\1"; Pattern p = Pattern.compile(regex);
  • 22. java.util.regex.Matcher • Useful methods: – matches(): boolean – find(): boolean – find(int start): boolean – group(): String – replaceFirst(String replace): String – replaceAll(String replace): String String regex = "(\\d\\d)\\1"; Pattern p = Pattern.compile(regex); String text = "1212"; Matcher m = p.matcher(text); boolean matches = m.matches();
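To make the difference between the Matcher methods concrete, here is a small sketch that reuses the slide's pattern: `matches()` tests the whole input, `find()` looks for the next occurrence anywhere, and `replaceAll()` rewrites every occurrence. The class name, method names and the sample text are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherBasics {
    static final Pattern REPEATED_PAIR = Pattern.compile("(\\d\\d)\\1");

    // matches(): the entire input must fit the pattern
    static boolean wholeInputMatches(String text) {
        return REPEATED_PAIR.matcher(text).matches();
    }

    // find() + group(): first occurrence anywhere in the input, or null
    static String firstMatch(String text) {
        Matcher m = REPEATED_PAIR.matcher(text);
        return m.find() ? m.group() : null;
    }

    // replaceAll(): substitute every occurrence
    static String maskAll(String text) {
        return REPEATED_PAIR.matcher(text).replaceAll("#");
    }

    public static void main(String[] args) {
        String text = "id 1212 and 3434"; // made-up sample input
        System.out.println(wholeInputMatches(text));
        System.out.println(firstMatch(text));
        System.out.println(maskAll(text));
    }
}
```

Compiling the Pattern once and reusing it, as done with the constant here, avoids recompiling the regex on every call.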
  • 23. java.lang.String • Pattern and Matcher methods in String: – matches(String regex): boolean – split(String regex): String[] – replaceFirst(String regex, String replace): String – replaceAll(String regex, String replace): String
  • 24. Examples • Validation • Searching text • Filtering • Parsing • Removing duplicate lines • On-the-fly editing
  • 25. Examples: validation • Validate an e-mail address: [A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z0-9._%-]{2,4} • A URL: (http|https|ftp)://([a-zA-Z0-9](\w+\.)+\w{2,7}|local\w*)(:\d+)?(/(\w+[\w/\-.]*)?)?
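A hedged sketch of the e-mail check in Java. It assumes the dot before the last part is meant as a literal dot (`\.`), and it adds the CASE_INSENSITIVE flag because the character classes only list upper-case letters. The class and method names and the sample addresses are illustrative; the pattern is a rough plausibility check, not a full address validator.

```java
import java.util.regex.Pattern;

class EmailCheck {
    // Slide pattern, with the separator dot escaped and case ignored
    static final Pattern EMAIL = Pattern.compile(
            "[A-Z0-9._%-]+@[A-Z0-9._%-]+\\.[A-Z0-9._%-]{2,4}",
            Pattern.CASE_INSENSITIVE);

    // True when the whole input has the shape local@domain.tld
    static boolean looksLikeEmail(String input) {
        return EMAIL.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail("user@example.com"));
        System.out.println(looksLikeEmail("user@example"));
        System.out.println(looksLikeEmail("not-an-email"));
    }
}
```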
  • 26. Examples: searching text • Write HttpUnit test to submit HTML form and check whether HTTP response is a confirmation screen containing a generated form number of the form 9xxxxxx-xxxxxx: 9[0-9]{6}-[0-9]{6} Pattern p = Pattern.compile(regexp); Matcher m = p.matcher(text); boolean ok = m.find(); String nr = ok ? m.group() : null;
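The search for the form number can be wrapped in a small helper. One detail worth making explicit: `group()` throws an IllegalStateException when the preceding `find()` did not succeed, so the sketch below only calls it after a successful `find()`. The class name, method name and the response text are made up; no HttpUnit calls are shown.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FormNumberFinder {
    // A 9, six more digits, a hyphen, six digits
    static final Pattern FORM_NUMBER = Pattern.compile("9[0-9]{6}-[0-9]{6}");

    // Returns the first form number in the text, or null when there is none.
    static String findFormNumber(String responseText) {
        Matcher m = FORM_NUMBER.matcher(responseText);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical confirmation-screen text
        String response = "Thank you. Your form number is 9123456-654321.";
        System.out.println(findFormNumber(response));
        System.out.println(findFormNumber("An error occurred."));
    }
}
```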
  • 27. Examples: filtering • Filter e-mail with subjects with capitals only, and including a leading “Re:” (R[eE]:)*[^a-z]*$
  • 28. Examples: parsing • Matches a pair of opening and closing XML tags with the same name: – Note the use of the back reference <([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>
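  • 29. Examples: duplicate lines • Suppose you want to remove duplicate lines from a text. – requirement here is that the lines are sorted alphabetically, so that duplicates are adjacent – in Java, compile with Pattern.MULTILINE so that ^ and $ match at each line ^(.*)(\r?\n\1)+$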
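A minimal Java sketch of the duplicate-line removal: each run of identical adjacent lines is replaced by the single line captured in group 1. It assumes MULTILINE mode, so that `^` and `$` anchor at line boundaries; the class name, method name and sample text are illustrative.

```java
import java.util.regex.Pattern;

class DuplicateLines {
    // One line, followed by one or more copies of itself on the next lines
    static final Pattern REPEATED_LINE =
            Pattern.compile("^(.*)(\\r?\\n\\1)+$", Pattern.MULTILINE);

    // Collapses each run of identical adjacent lines into one line.
    static String removeAdjacentDuplicates(String sortedText) {
        return REPEATED_LINE.matcher(sortedText).replaceAll("$1");
    }

    public static void main(String[] args) {
        String text = "apple\napple\nbanana\ncherry\ncherry\ncherry";
        System.out.println(removeAdjacentDuplicates(text));
    }
}
```

Because only adjacent repeats are collapsed, unsorted input would keep duplicates that are separated by other lines.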
  • 30. Examples: on-the-fly editing • Suppose you want to edit a file in batch: all occurrences of a certain string pattern should be replaced with another string. • In Unix: use the sed command with a regex • In Java: use string.replaceAll(regex, "mystring") • In Ant: use replaceregexp optional task to, e.g., edit deployment descriptors depending on environment
  • 31. Quiz • What are the following regular expressions looking for? • \d+ : at least one digit • [-+]?\d+ : any integer • ((\d*\.?)?\d+|\d+(\.?\d*)) : any positive decimal • [\p{L}']['-.\p{L} ]+ : a place name
  • 32. Conclusion • When doing one of the following: – validating strings – on-the-fly editing of strings – searching strings – filtering strings • think regex!
  • 33. References • http://www.regular-expressions.info/ • http://www.regexlib.com/ • http://developer.java.sun.com/developer/technicalArticles/releases/1.4regex/ • http://java.sun.com/docs/books/tutorial/extra/regex/ • http://www.wellho.net/regex/javare.html • JDK 1.4 API • Mastering Regular Expressions