Regular expressions, Alex Perry, Google, PyCon2014

Memorable uses for a Regular Expression library
Learning the syntax by examples
Alex Perry
SRE, Google, Los Angeles
April 2014

Outline
● Simple Regular Expressions
● import re
○ http://docs.python.org/2/library/re.html
● Parsing
● import sre
● Formatting
● import sre_yield
● Arithmetic
● Performance uncertainty
● import re2

Basic Regular Expressions
abc “abc”
[abc] “a” “b” “c”
abc? “ab” “abc”
abc* “ab” “abc” “abcc” ...
abc+ “abc” “abcc” “abccc” ...
abc{3,4} “abccc” “abcccc”
ab|c+ “ab” “c” “cc” “ccc” ...
ab. “ab.” “ab1” … “ab\n” (the last only with DOTALL)
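
The table can be checked mechanically. A minimal sketch, assuming Python 3.4 or later for `re.fullmatch` (the slides themselves use Python 2); the `examples` mapping is just the table retyped, and `all_match` is a helper name chosen here.

```python
import re

# Each pattern from the table, paired with strings it should match in full.
examples = {
    r"abc":      ["abc"],
    r"[abc]":    ["a", "b", "c"],
    r"abc?":     ["ab", "abc"],
    r"abc*":     ["ab", "abc", "abcc"],
    r"abc+":     ["abc", "abcc", "abccc"],
    r"abc{3,4}": ["abccc", "abcccc"],
    r"ab|c+":    ["ab", "c", "cc"],
    r"ab.":      ["ab.", "ab1"],
}

def all_match(pattern, strings, flags=0):
    """True if every string is matched in full by the pattern."""
    return all(re.fullmatch(pattern, s, flags) for s in strings)

table_ok = all(all_match(p, strings) for p, strings in examples.items())

# "." skips the newline character unless DOTALL is set.
newline_without_flag = bool(re.fullmatch(r"ab.", "ab\n"))
newline_with_dotall = bool(re.fullmatch(r"ab.", "ab\n", re.DOTALL))
```

Using `fullmatch` rather than `match` keeps the check strict: `abc?` would otherwise also "match" the start of “abcc”.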

The standard library - compiling
>>> import re
>>> o = re.compile("abc?")
>>> [bool(o.match(s)) for s in
...  ["a", "ab", "abc", "abcc", "aabcc"]]
[False, True, True, True, False]
>>> [bool(o.search(s)) for s in
...  ["a", "ab", "abc", "abcc", "aabcc"]]
[False, True, True, True, True]

The standard library - endings
>>> o = re.compile("^abc?$")
>>> [bool(o.search(s)) for s in
...  ["a", "ab", "abc", "abcc", "aabcc"]]
[False, True, True, False, False]
>>> s = re.compile("i*") # yes, that s matches “”
>>> s.split("oiooiioooiii") # split ignores that silliness
['o', 'oo', 'ooo', '']
>>> s.sub("x", "oiooiioooiii") # but sub does not
'xoxoxoxoxoxox'
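
The `i*` results above are from Python 2. Newer interpreters (3.7 onwards, as far as I know) handle empty matches differently in `split` and `sub`, so that output may not reproduce. A minimal sketch of the usual way around it, assuming the intent is to split on runs of one or more `i`: use `+`, so the pattern can never match the empty string.

```python
import re

text = "oiooiioooiii"
runs = re.compile("i+")  # one or more: cannot match the empty string

pieces = runs.split(text)       # ['o', 'oo', 'ooo', '']
replaced = runs.sub("x", text)  # 'oxooxooox'
```

With no empty matches possible, `split` and `sub` agree on where the separators are.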

Parsing strings easily
>>> import re
>>> cell = re.compile(r"(?P<row>[$]?[a-z]+)"
...                   r"(?P<col>[$]?[0-9]+)")
>>> m = cell.search("Spreadsheet cell aa$15")
>>> m
<_sre.SRE_Match object at 0x7f220a8e9360>
>>> m.groupdict()
{'col': '$15', 'row': 'aa'}

Formatting after parsing using a regular expression
>>> rc = m.groupdict()
>>> rc
{'col': '$15', 'row': 'aa'}
>>> 'It was row %(row)s and column %(col)s' % rc
'It was row aa and column $15'
>>> txt = "from a1 2 b$22 as well as 4 $c4"
>>> f = r"<%(col)s,%(row)s>"
>>> ";".join(f % m.groupdict() for m in cell.finditer(txt))
'<1,a>;<$22,b>;<4,$c>'
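
The same parse-then-format idea with `str.format`, as a minimal self-contained sketch for Python 3; `cells` and `formatted` are names chosen here, not from the talk.

```python
import re

cell = re.compile(r"(?P<row>[$]?[a-z]+)(?P<col>[$]?[0-9]+)")
txt = "from a1 2 b$22 as well as 4 $c4"

# One dictionary per cell reference found in the text.
cells = [m.groupdict() for m in cell.finditer(txt)]

# str.format accepts the same dictionaries through keyword unpacking.
formatted = ";".join("<{col},{row}>".format(**c) for c in cells)
```

Keeping the intermediate list of dictionaries makes it easy to filter or sort the cells before formatting them.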

Secret (labs) RE engine - internals
● Originally separate from module “re”
○ From version 2.0 onwards they’re equivalent
○ Call it “sre” in any backward-compatible code
>>> import sre_parse
>>> sre_parse.parse("ab|c")
[('branch', (None, [
[('literal', 97), ('literal', 98)],
[('literal', 99)]
])
)]
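
The numbers in the tree are code points. A minimal sketch that decodes the output printed above, retyped as plain data because newer Python versions may print the opcodes differently; `alternatives` is a hypothetical helper and handles only this one shape.

```python
# The parse tree as printed on the slide, retyped as plain Python data.
tree = [("branch", (None, [
    [("literal", 97), ("literal", 98)],
    [("literal", 99)],
]))]

def alternatives(tree):
    """Return the literal strings of a single top-level branch node."""
    opcode, (_, branches) = tree[0]
    if opcode != "branch":
        raise ValueError("sketch only handles a top-level branch")
    # Each 'literal' argument is a code point, so chr() recovers the character.
    return ["".join(chr(code) for _, code in branch) for branch in branches]

decoded = alternatives(tree)  # ['ab', 'c']
```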

Secret Regular Expression Yield
● New module called sre_yield
○ https://github.com/google/sre_yield
● def Values(regex, flags=0, charset=CHARSET)
○ Examines output from sre_parse.parse()
○ Returns a convenient sequence-like object
● Sequence has an efficient membership test
○ We were given a regex describing its content
● Some features (lookahead, etc.) are still missing
○ Easy to add if sequence can contain None
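
Why the membership test is cheap can be sketched without sre_yield at all. The class below is a hypothetical stand-in, not sre_yield's code, and assumes Python 3.4 or later for `fullmatch`.

```python
import re

class ToyValues:
    """Hypothetical stand-in for a regex-backed sequence (not sre_yield's code)."""

    def __init__(self, regex, flags=0):
        self._compiled = re.compile(regex, flags)

    def __contains__(self, candidate):
        # Membership is just "does the whole string match?", so no
        # enumeration of the sequence is needed.
        return self._compiled.fullmatch(candidate) is not None

bits = ToyValues("[01]*")
```

So `"001001" in bits` costs one regex match, however many strings the sequence would contain.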

Iterating over all matching strings
>>> import sre_yield
>>> sre_yield.Values(r'1(?P<x>234?|49?)')[:]
['123', '1234', '14', '149']
>>> len(sre_yield.Values('.'))
256
>>> sre_yield.Values('a*')[5:10]
['aaaaa', 'aaaaaa', 'aaaaaaa', 'aaaaaaaa', 'aaaaaaaaa']

What do we do about infinite repetitions
>>> len(sre_yield.Values('0*'))
65536
# Yes, really. The sre library can specify a repeat count of at most 65534
>>> a77k = 'a' * 77000
>>> len(re.compile(r'.{,65534}').match(a77k).group(0))
65534
>>> len(re.compile(r'.{,65535}').match(a77k).group(0))
77000
>>> len(re.compile(r'.{60000}.{,6000}|.{,60000}')
...     .match(a77k).group(0))
66000

How many matching strings
>>> bits = sre_yield.Values('[01]*') # All binary nums
>>> len(bits) # how many are there?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: long int too large to convert to int
>>> bits.__len__() == 2**65536 - 1 # check the answer
True
>>> len(str(bits.__len__())) # Is the number that big?
19729
>>> "001001" in bits, "002001" in bits
(True, False)
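
The arithmetic can be reproduced without sre_yield. A minimal sketch for Python 3, where `Huge` is a hypothetical container that only reports a length. The digit count comes from `math.log10` because recent Python 3 releases cap int-to-string conversion by default.

```python
import math

class Huge:
    """Hypothetical container that merely reports a very large length."""

    def __init__(self, size):
        self._size = size

    def __len__(self):
        return self._size

# Strings over {0, 1} of length 0..65535: one geometric series.
count = sum(2 ** k for k in range(65536))
closed_form_ok = (count == 2 ** 65536 - 1)

try:
    len(Huge(count))             # len() must fit a machine-sized index
    len_overflowed = False
except OverflowError:
    len_overflowed = True

direct = Huge(count).__len__()   # calling the method directly is fine

# Decimal digits without building the string.
digits = math.floor(math.log10(count)) + 1
```

This is why the slide calls `bits.__len__()` rather than `len(bits)`.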

Python does understand working with large numbers
>>> anything = sre_yield.Values('.*')
>>> a = 1
>>> for _ in xrange(65535): a = a * 256 + 1
>>> anything.__len__() == a
True
>>> str_a = str(a) # This does take a while
>>> len(str_a)
157825
>>> str_a[:9], str_a[-9:]
('101818453', '945826561')
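
The loop above builds a geometric series, so a closed form gives the same number much faster. A minimal sketch for Python 3 (`range` instead of `xrange`); `repunit` and `repunit_closed` are names chosen here.

```python
import math

def repunit(base, terms):
    """Sum of base**k for k in range(terms), built like the loop on the slide."""
    a = 1
    for _ in range(terms - 1):
        a = a * base + 1
    return a

def repunit_closed(base, terms):
    """Same sum via the geometric-series closed form."""
    return (base ** terms - 1) // (base - 1)

# Check the closed form against the loop on small cases.
small_ok = all(repunit(256, n) == repunit_closed(256, n) for n in range(1, 50))

a = repunit_closed(256, 65536)          # the count for '.*'
digits = math.floor(math.log10(a)) + 1  # decimal digits, no str() needed
```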

But why bother yielding from a regex
● It can be more compact than a literal list, for example:
ap-northeast-1|ap-southeast-1|ap-southeast-2|eu-west-1|sa-east-1|us-east-1|us-west-1|us-west-2
● That doesn’t get much shorter when rewritten:
(ap-(nor|sou)th|sa-|us-)east-1|(eu|us)-west-1|(us-we|ap-southea)st-2
● On the other hand, others are more convenient:
www-(?P<replica>[1-8])[.]((?P<fleet>canary|beta)[.])widget[.](?P<domain>com|co[.]uk|ch|de)
● Some are better machine generated:
192\.168(?:\.(?:[1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])){2}
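
Both claims on this slide can be checked with the standard library. A minimal sketch for Python 3: it confirms the rewritten region pattern covers the literal list, then builds the octet alternation piece by piece. `octet_pattern` and `private_net` are names chosen here.

```python
import re

regions = ("ap-northeast-1|ap-southeast-1|ap-southeast-2|eu-west-1|"
           "sa-east-1|us-east-1|us-west-1|us-west-2").split("|")
compact = re.compile(r"(ap-(nor|sou)th|sa-|us-)east-1|(eu|us)-west-1|"
                     r"(us-we|ap-southea)st-2")

# Every literal region must be matched in full by the rewritten pattern.
regions_ok = all(compact.fullmatch(r) for r in regions)

def octet_pattern():
    """Build the 0..255 alternation piece by piece instead of typing it."""
    parts = [r"[1-9]?\d",   # 0..99
             r"1\d{2}",     # 100..199
             r"2[0-4]\d",   # 200..249
             r"25[0-5]"]    # 250..255
    return "(?:" + "|".join(parts) + ")"

octet = re.compile(octet_pattern())
# Exhaustive check over every number of up to three digits.
octet_ok = all(bool(octet.fullmatch(str(n))) == (n <= 255) for n in range(1000))

private_net = re.compile(r"192\.168(?:\." + octet_pattern() + r"){2}")
```

Generating the pattern keeps each numeric range next to a comment saying what it covers, which is hard to do in a hand-typed one-liner.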

How fast is the “re” library
● Implementation uses backtracking, like PCRE
○ So it is fast provided it never guesses wrong
○ Trivial to write an expression that is … slow
import re, timeit
def test(n):
    t = "a" * n
    r = "a?" * n + t
    return bool(re.match(r, t))
timeit.timeit(stmt="test(6)",
              setup="from __main__ import test")

The RE2 library
● https://code.google.com/p/re2
● https://github.com/axiak/pyre2
● RE2 tries all possible code paths in parallel
○ never backtracks, so omits features that need it
● drops support for backreferences
○ and generalized zero-width assertions
● Predictable worst case performance for any input
○ Safe to accept untrusted regular expressions
test(10) takes 4 milliseconds instead of one minute

Summary
● Regular expressions are built into Python
○ re_obj = re.compile(pattern)
○ print re_obj.pattern
● They can parse strings into a dictionary
○ Or iteratively many dictionaries
● They can compactly represent large lists
○ Without expanding the whole iterator out
● For reliable performance, use RE2
○ Especially if users are supplying patterns

Questions?
● mail -s us.pycon.org/2014
○ Alex.Perry@Google.com
● Nothing to do with me, but pretty good:
○ http://qntm.org/files/re/re.html
