Memorable uses for a Regular Expression library
Learning the syntax by examples
Alex Perry
SRE, Google, Los Angeles
April 2014
Outline
● Simple Regular Expressions
● import re
○ http://docs.python.org/2/library/re.html
● Parsing
● import sre
● Formatting
● import sre_yield
● Arithmetic
● Performance uncertainty
● import re2
Basic Regular Expressions
abc        “abc”
[abc]      “a” “b” “c”
abc?       “ab” “abc”
abc*       “ab” “abc” “abcc” ...
abc+       “abc” “abcc” “abccc” ...
abc{3,4}   “abccc” “abcccc”
ab|c+      “ab” “c” “cc” ...
ab.        “ab.” “ab1” ... and “ab\n” (only with DOTALL)
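The table can be checked mechanically. A minimal sketch, assuming Python 3.4 or later for `re.fullmatch`; the `EXAMPLES` dictionary, the `check` helper and the non-matching strings are illustrative additions, not part of the slides:

```python
import re

# Pattern -> (strings that should match in full, strings that should not).
EXAMPLES = {
    r"abc":      (["abc"], ["ab", "abcc"]),
    r"[abc]":    (["a", "b", "c"], ["ab", "d"]),
    r"abc?":     (["ab", "abc"], ["abcc"]),
    r"abc*":     (["ab", "abc", "abcc"], ["a"]),
    r"abc+":     (["abc", "abcc"], ["ab"]),
    r"abc{3,4}": (["abccc", "abcccc"], ["abcc", "abccccc"]),
    r"ab|c+":    (["ab", "c", "cc"], ["abc", "c+"]),
    r"ab.":      (["ab.", "ab1"], ["ab", "ab\n"]),
}

def check(pattern, flags=0):
    """True if every 'good' string matches in full and no 'bad' string does."""
    good, bad = EXAMPLES[pattern]
    compiled = re.compile(pattern, flags)
    return (all(compiled.fullmatch(s) for s in good)
            and not any(compiled.fullmatch(s) for s in bad))

results = {p: check(p) for p in EXAMPLES}

# The dot only matches a newline when DOTALL is set.
dot_matches_newline = bool(re.fullmatch(r"ab.", "ab\n", re.DOTALL))
```

Note that the quantifiers bind to the last character only, so `abc*` means "ab" followed by any number of "c", and `ab|c+` is an alternation between "ab" and a run of "c".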
The standard library - compiling
>>> import re
>>> o = re.compile("abc?")
>>> [bool(o.match(s)) for s in
["a", "ab", "abc", "abcc", "aabcc"]]
[False, True, True, True, False]
>>> [bool(o.search(s)) for s in
["a", "ab", "abc", "abcc", "aabcc"]]
[False, True, True, True, True]
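For comparison, a short sketch assuming Python 3.4 or later, where compiled patterns also offer `fullmatch`. It reuses the pattern and sample strings from the slide; the variable names are illustrative:

```python
import re

pattern = re.compile("abc?")
samples = ["a", "ab", "abc", "abcc", "aabcc"]

at_start = [bool(pattern.match(s)) for s in samples]    # anchored at position 0
anywhere = [bool(pattern.search(s)) for s in samples]   # first match at any position
whole = [bool(pattern.fullmatch(s)) for s in samples]   # must consume the entire string

print(at_start)   # [False, True, True, True, False]
print(anywhere)   # [False, True, True, True, True]
print(whole)      # [False, True, True, False, False]
```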
The standard library - endings
>>> o = re.compile("^abc?$")
>>> [bool(o.search(s)) for s in
["a", "ab", "abc", "abcc", "aabcc"]]
[False, True, True, False, False]
>>> s = re.compile("i*") # yes, that s matches “”
>>> s.split("oiooiioooiii") # split ignores that silliness
['o', 'oo', 'ooo', '']
>>> s.sub("x", "oiooiioooiii") # but sub does not
'xoxoxoxoxoxox'
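The outputs above are from Python 2. The handling of empty matches in `split` and `sub` changed in Python 3.7, so a pattern that can match the empty string gives different results there. A minimal sketch of one way to sidestep the question, assuming the intent is to treat runs of "i" as separators: use `i+`, which never matches the empty string.

```python
import re

text = "oiooiioooiii"
one_or_more = re.compile("i+")   # cannot match the empty string

pieces = one_or_more.split(text)
replaced = one_or_more.sub("x", text)

print(pieces)     # ['o', 'oo', 'ooo', '']
print(replaced)   # oxooxooox
```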
Parsing strings easily
>>> import re
>>> cell = re.compile(r"(?P<row>[$]?[a-z]+)"
r"(?P<col>[$]?[0-9]+)")
>>> m = cell.search("Spreadsheet cell aa$15")
>>> m
<_sre.SRE_Match object at 0x7f220a8e9360>
>>> m.groupdict()
{'col': '$15', 'row': 'aa'}
Formatting after parsing using a regular expression
>>> rc = m.groupdict()
>>> rc
{'col': '$15', 'row': 'aa'}
>>> 'It was row %(row)s and column %(col)s' % rc
'It was row aa and column $15'
>>> txt = "from a1 2 b$22 as well as 4 $c4"
>>> f = r"<%(col)s,%(row)s>"
>>> ";".join(f % m.groupdict() for m in cell.finditer(txt))
'<1,a>;<$22,b>;<4,$c>'
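The same parse-then-format round trip can be written with `str.format_map`. This is a sketch assuming Python 3; the `template` and `formatted` names are illustrative, while the pattern and text are the ones from the slides:

```python
import re

cell = re.compile(r"(?P<row>[$]?[a-z]+)(?P<col>[$]?[0-9]+)")
txt = "from a1 2 b$22 as well as 4 $c4"

# Each match contributes one dictionary; the template picks fields by group name.
template = "<{col},{row}>"
formatted = ";".join(template.format_map(m.groupdict()) for m in cell.finditer(txt))

print(formatted)   # <1,a>;<$22,b>;<4,$c>
```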
Secret (labs) RE engine - internals
● Originally separate from module “re”
○ As of version 2.0 onwards they’re equivalent
○ Call it “sre” in any backward compatible code
>>> import sre_parse
>>> sre_parse.parse("ab|c")
[('branch', (None, [
[('literal', 97), ('literal', 98)],
[('literal', 99)]
])
)]
Secret Regular Expression Yield
● New module called sre_yield
○ https://github.com/google/sre_yield
● def Values(regex, flags=0, charset=CHARSET)
○ Examines output from sre_parse.parse()
○ Returns a convenient sequence-like object
● Sequence has an efficient membership test
○ We were given a regex describing its content
● Some features (lookahead, etc) still missing
○ Easy to add if sequence can contain None
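The membership test is cheap because the regex itself already describes the contents. The following is a minimal sketch of that idea only, assuming Python 3; `RegexSet` is a hypothetical class written for illustration and is not how sre_yield is implemented:

```python
import re

class RegexSet:
    """Hypothetical container: membership is a full-match test, nothing is enumerated."""

    def __init__(self, pattern, flags=0):
        self._compiled = re.compile(pattern, flags)

    def __contains__(self, candidate):
        # A string belongs to the set exactly when the whole string matches.
        return (isinstance(candidate, str)
                and self._compiled.fullmatch(candidate) is not None)

bits = RegexSet("[01]*")
print("001001" in bits, "002001" in bits)   # True False
```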
Iterating over all matching strings
>>> import sre_yield
>>> sre_yield.Values(r'1(?P<x>234?|49?)')[:]
['123', '1234', '14', '149']
>>> len(sre_yield.Values('.'))
256
>>> sre_yield.Values('a*')[5:10]
['aaaaa', 'aaaaaa', 'aaaaaaa', 'aaaaaaaa', 'aaaaaaaaa']
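To see where such a list comes from, here is a toy enumerator built from four combinators. It is a sketch under stated assumptions (Python 3, hand-translated pattern, eager lists); the helper names `lit`, `alt`, `opt` and `cat` are invented for illustration, and sre_yield itself works from the output of `sre_parse.parse()` rather than from code like this:

```python
from itertools import product

def lit(s):
    """A literal string: exactly one value."""
    return [s]

def alt(*options):
    """Alternation: the values of each option, in order."""
    return [s for option in options for s in option]

def opt(values):
    """Optional part: the empty string first, then the values."""
    return [""] + list(values)

def cat(*parts):
    """Concatenation: every combination, leftmost part varying slowest."""
    return ["".join(combo) for combo in product(*parts)]

# Hand translation of r'1(234?|49?)'
values = cat(lit("1"),
             alt(cat(lit("23"), opt(lit("4"))),
                 cat(lit("4"), opt(lit("9")))))
print(values)   # ['123', '1234', '14', '149']
```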
What do we do about infinite repetitions
>>> len(sre_yield.Values('0*'))
65536
# Yes, really. sre library can only specify 65534 max
>>> a77k = 'a' * 77000
>>> len(re.compile(r'.{,65534}').match(a77k).group(0))
65534
>>> len(re.compile(r'.{,65535}').match(a77k).group(0))
77000
>>> len(re.compile(r'.{60000}.{,6000}|.{,60000}')
.match(a77k).group(0))
66000
How many matching strings
>>> import sre_yield
>>> bits = sre_yield.Values('[01]*') # All binary nums
>>> len(bits) # how many are there?
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OverflowError: long int too large to convert to int
>>> bits.__len__() == 2**65536 - 1 # check the answer
True
>>> len(str(bits.__len__())) # Is the number that big?
19729
>>> "001001" in bits, "002001" in bits
(True, False)
Python does understand working with large numbers
>>> import sre_yield
>>> anything = sre_yield.Values('.*')
>>> a = 1
>>> for _ in xrange(65535): a = a * 256 + 1
>>> anything.__len__() == a
True
>>> str_a = str(a) # This does take a while
>>> len(str_a)
157825
>>> str_a[:9], str_a[-9:]
('101818453', '945826561')
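Both counts are geometric series: with an alphabet of k characters and at most N repetitions there are 1 + k + k^2 + ... + k^N strings, which equals (k^(N+1) - 1) / (k - 1) for k > 1. A sketch assuming Python 3 and the 65535 repetition cap shown earlier; `count_star` is a hypothetical helper, not part of sre_yield. It avoids `str()` on the result because recent Python 3 releases may restrict decimal conversion of very large integers by default:

```python
def count_star(alphabet_size, max_repeat=65535):
    """Number of strings of length 0..max_repeat over an alphabet of the given size."""
    if alphabet_size == 1:
        return max_repeat + 1
    # Closed form of the geometric series; the division is exact.
    return (alphabet_size ** (max_repeat + 1) - 1) // (alphabet_size - 1)

single = count_star(1)       # corresponds to '0*'
binary = count_star(2)       # corresponds to '[01]*'
anything = count_star(256)   # corresponds to '.*' over a 256-character set

print(single)                 # 65536
print(binary == 2**65536 - 1) # True
print(binary.bit_length())    # 65536
```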
But why bother yielding from a regex
● It can be more compact than a literal list, for example:
ap-northeast-1|ap-southeast-1|ap-southeast-2|eu-west-1|sa-east-1|us-east-1|us-west-1|us-west-2
● That doesn’t get much shorter when rewritten:
(ap-(nor|sou)th|sa-|us-)east-1|(eu|us)-west-1|(us-we|ap-southea)st-2
● On the other hand, others are more convenient:
www-(?P<replica>[1-8])[.]((?P<fleet>canary|beta)[.])widget[.](?P<domain>com|co[.]uk|ch|de)
● Some things are better machine generated:
192\.168(?:\.(?:[1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])){2}
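A hand-compressed pattern is easy to get wrong, so it is worth checking against the literal list. A minimal sketch, assuming Python 3; the `candidates` set of nearby names is an illustrative test fixture, not a list of real regions:

```python
import re

regions = ("ap-northeast-1|ap-southeast-1|ap-southeast-2|eu-west-1|"
           "sa-east-1|us-east-1|us-west-1|us-west-2").split("|")
compact = re.compile(
    r"(ap-(nor|sou)th|sa-|us-)east-1|(eu|us)-west-1|(us-we|ap-southea)st-2")

# Recombine the pieces seen in the real names to get plausible near misses.
candidates = {"%s-%s-%d" % (area, direction, n)
              for area in ("ap", "eu", "sa", "us")
              for direction in ("northeast", "southeast", "east", "west")
              for n in (1, 2)}

accepted = sorted(c for c in candidates if compact.fullmatch(c))
print(accepted == sorted(regions))   # True
```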
How fast is the “re” library
● Implementation uses backtracking, like PCRE
○ So it is fast provided it never guesses wrong
○ Trivial to write an expression that is … slow
import re
import timeit

def test(n):
    # n optional "a"s followed by n required "a"s, matched against exactly n "a"s
    t = "a" * n
    r = "a?" * n + t
    return bool(re.match(r, t))

timeit.timeit(stmt="test(6)", setup="from __main__ import test")
The RE2 library
● https://code.google.com/p/re2
● https://github.com/axiak/pyre2
● RE2 tries all possible code paths in parallel
○ never backtracks, so omits features that need it
● drops support for backreferences
○ and generalized zero-width assertions
● Predictable worst case performance for any input
○ Safe to accept untrusted regular expressions
Test(10) takes 4 milliseconds instead of one minute
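The "all paths in parallel" idea can be shown on a toy scale. The sketch below, assuming Python 3, handles only a tiny pattern language (literal characters, each optionally followed by `?`) and tracks the set of reachable positions in the pattern instead of backtracking. `match_optional_literals` is a hypothetical function written for illustration; it is not RE2's implementation or API:

```python
def match_optional_literals(pattern, text):
    """Full-match for a toy language: literal characters, each optionally followed by '?'."""
    # Parse into (character, is_optional) items.
    items = []
    i = 0
    while i < len(pattern):
        optional = i + 1 < len(pattern) and pattern[i + 1] == "?"
        items.append((pattern[i], optional))
        i += 2 if optional else 1

    def closure(states):
        # Add every state reachable by skipping optional items.
        result = set(states)
        stack = list(states)
        while stack:
            s = stack.pop()
            if s < len(items) and items[s][1] and s + 1 not in result:
                result.add(s + 1)
                stack.append(s + 1)
        return result

    current = closure({0})
    for ch in text:
        # Advance every live state at once; work per character is bounded
        # by the pattern length, so there is nothing to backtrack into.
        advanced = {s + 1 for s in current if s < len(items) and items[s][0] == ch}
        current = closure(advanced)
        if not current:
            return False
    return len(items) in current

n = 25
print(match_optional_literals("a?" * n + "a" * n, "a" * n))   # True
```

The price of this approach is the one named on the slide: features that depend on remembering a particular path, such as backreferences, do not fit the model.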
Summary
● Regular expressions are built into Python
○ re_obj = re.compile(pattern)
○ print re_obj.pattern
● They can parse strings into a dictionary
○ Or iteratively many dictionaries
● They can compactly represent large lists
○ Without expanding the whole iterator out
● For reliable performance, use RE2
○ Especially if users are supplying patterns
Questions?
● mail -s us.pycon.org/2014
○ Alex.Perry@Google.com
● Nothing to do with me, but pretty good:
○ http://qntm.org/files/re/re.html


Regular expressions, Alex Perry, Google, PyCon2014
