4. The standard library - compiling
>>> import re
>>> o = re.compile("abc?")
>>> [bool(o.match(s)) for s in
...  ["a", "ab", "abc", "abcc", "aabcc"]]
[False, True, True, True, False]
>>> [bool(o.search(s)) for s in
...  ["a", "ab", "abc", "abcc", "aabcc"]]
[False, True, True, True, True]
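The difference between match and search shown above can be sketched as a standalone script. This is a minimal illustration written for Python 3 (an assumption, since the slides show a Python 2 session); `fullmatch` does not appear on the slide and is included only for contrast.

```python
import re

pattern = re.compile("abc?")
samples = ["a", "ab", "abc", "abcc", "aabcc"]

# match() only tries the pattern at the start of the string.
at_start = [bool(pattern.match(s)) for s in samples]

# search() scans forward and accepts a match at any position.
anywhere = [bool(pattern.search(s)) for s in samples]

# fullmatch() (Python 3.4+) requires the whole string to match.
whole = [bool(pattern.fullmatch(s)) for s in samples]

print(at_start)
print(anywhere)
print(whole)
```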
5. The standard library - endings
>>> o = re.compile("^abc?$")
>>> [bool(o.search(s)) for s in
...  ["a", "ab", "abc", "abcc", "aabcc"]]
[False, True, True, False, False]
>>> s = re.compile("i*") # yes, s also matches ""
>>> s.split("oiooiioooiii") # split ignores that silliness
['o', 'oo', 'ooo', '']
>>> s.sub("x", "oiooiioooiii") # but sub does not
'xoxoxoxoxoxox'
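The results above depend on how the engine treats empty matches, and that handling changed in later Python releases (from Python 3.7, `split` also splits on empty matches). A hedged sketch: if the intent is simply to split on runs of `i`, using `i+` instead of `i*` avoids empty matches altogether, so the result should not depend on the interpreter version. The name `runs` is made up for this example.

```python
import re

# "i+" needs at least one "i", so it can never match the empty string.
runs = re.compile("i+")

parts = runs.split("oiooiioooiii")
replaced = runs.sub("x", "oiooiioooiii")

print(parts)
print(replaced)
```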
6. Parsing strings easily
>>> import re
>>> cell = re.compile(r"(?P<row>[$]?[a-z]+)"
...                   r"(?P<col>[$]?[0-9]+)")
>>> m = cell.search("Spreadsheet cell aa$15")
>>> m
<_sre.SRE_Match object at 0x7f220a8e9360>
>>> m.groupdict()
{'col': '$15', 'row': 'aa'}
7. Formatting after parsing using a regular expression
>>> rc = m.groupdict()
>>> rc
{'col': '$15', 'row': 'aa'}
>>> 'It was row %(row)s and column %(col)s' % rc
'It was row aa and column $15'
>>> txt = "from a1 2 b$22 as well as 4 $c4"
>>> f = r"<%(col)s,%(row)s>"
>>> ";".join(f % m.groupdict() for m in cell.finditer(txt))
'<1,a>;<$22,b>;<4,$c>'
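The parse-then-format idea from the last two slides can be wrapped in a small helper. This is only a sketch written for Python 3; `parse_cells` and `labels` are hypothetical names, and the pattern is copied from the slide unchanged (letters go into `row`, digits into `col`).

```python
import re

# Same pattern as on the slides: letters go into "row", digits into "col".
cell = re.compile(r"(?P<row>[$]?[a-z]+)"
                  r"(?P<col>[$]?[0-9]+)")

def parse_cells(text):
    """Return one dict per cell reference found in text."""
    return [m.groupdict() for m in cell.finditer(text)]

cells = parse_cells("from a1 2 b$22 as well as 4 $c4")

# Each dict can be fed straight into a format string.
labels = ["{row}/{col}".format(**c) for c in cells]

print(cells)
print(labels)
```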
8. Secret (labs) RE engine - internals
● Originally separate from module “re”
○ From version 2.0 onwards they’re equivalent
○ Call it “sre” in any backward compatible code
>>> import sre_parse
>>> sre_parse.parse("ab|c")
[('branch', (None, [
[('literal', 97), ('literal', 98)],
[('literal', 99)]
])
)]
9. Secret Regular Expression Yield
● New module called sre_yield
○ https://github.com/google/sre_yield
● def Values(regex, flags=0, charset=CHARSET)
○ Examines output from sre_parse.parse()
○ Returns a convenient sequence-like object
● The sequence has an efficient membership test
○ It was given a regex describing its contents, so it can match against that
● Some features (lookahead, etc.) are still missing
○ Easy to add if sequence can contain None
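Why the membership test can be cheap is easy to sketch: a container built from a regex never has to enumerate its contents to answer `in`, it can ask the regex engine. The class below is a hypothetical stand-in written for Python 3, not sre_yield's actual code, and `RegexSet` is an invented name.

```python
import re

class RegexSet(object):
    """Hypothetical stand-in for a set of strings described by a regex."""

    def __init__(self, regex, flags=0):
        self._compiled = re.compile(regex, flags)

    def __contains__(self, candidate):
        # No enumeration needed: ask the regex engine directly.
        return self._compiled.fullmatch(candidate) is not None

bits = RegexSet("[01]*")
print("001001" in bits, "002001" in bits)
```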
11. What do we do about infinite repetitions?
>>> len(sre_yield.Values('0*'))
65536
# Yes, really. The sre library can only specify a maximum of 65534 repetitions
>>> a77k = 'a' * 77000
>>> len(re.compile(r'.{,65534}').match(a77k).group(0))
65534
>>> len(re.compile(r'.{,65535}').match(a77k).group(0))
77000
>>> len(re.compile(r'.{60000}.{,6000}|.{,60000}')
...     .match(a77k).group(0))
66000
12. How many matching strings?
>>> import sre_yield
>>> bits = sre_yield.Values('[01]*') # All binary nums
>>> len(bits) # how many are there?
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OverflowError: long int too large to convert to int
>>> bits.__len__() == 2**65536 - 1 # check the answer
True
>>> len(str(bits.__len__())) # Is the number that big?
19729
>>> "001001" in bits, "002001" in bits
(True, False)
13. Python does understand working with large numbers
>>> import sre_yield
>>> anything = sre_yield.Values('.*')
>>> a = 1
>>> for _ in xrange(65535): a = a * 256 + 1
>>> anything.__len__() == a
True
>>> str_a = str(a) # This does take a while
>>> len(str_a)
157825
>>> str_a[:9], str_a[-9:]
('101818453', '945826561')
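The counts on these slides follow from a geometric series: with an alphabet of k characters and lengths from 0 to n, there are 1 + k + k**2 + ... + k**n strings. The sketch below assumes n = 65535, which is what the `2**65536 - 1` result implies; `count_strings` is a hypothetical helper, not part of sre_yield. It avoids converting the big totals to decimal strings because recent Python 3 releases limit that conversion by default.

```python
import itertools

def count_strings(alphabet_size, max_repeat):
    """Number of strings of length 0..max_repeat over the alphabet."""
    # Geometric series: 1 + k + k**2 + ... + k**max_repeat
    if alphabet_size == 1:
        return max_repeat + 1
    return (alphabet_size ** (max_repeat + 1) - 1) // (alphabet_size - 1)

# Brute-force check on a small case: binary strings up to length 4.
brute = sum(1 for n in range(5) for _ in itertools.product("01", repeat=n))

single_total = count_strings(1, 65535)    # corresponds to '0*'
binary_total = count_strings(2, 65535)    # corresponds to '[01]*'
byte_total = count_strings(256, 65535)    # corresponds to '.*' over 256 characters

print(brute == count_strings(2, 4))
print(single_total)
print(binary_total == 2 ** 65536 - 1)
print(byte_total.bit_length())
```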
14. But why bother yielding from a regex?
● It can be more compact than a literal list, for example:
ap-northeast-1|ap-southeast-1|ap-southeast-2|eu-west-1|
sa-east-1|us-east-1|us-west-1|us-west-2
● That doesn’t get much shorter when rewritten:
(ap-(nor|sou)th|sa-|us-)east-1|(eu|us)-west-1|
(us-we|ap-southea)st-2
● On the other hand, other lists are more convenient as a regex:
www-(?P<replica>[1-8])[.]((?P<fleet>canary|beta)[.])
widget[.](?P<domain>com|co[.]uk|ch|de)
● Some patterns are better machine generated:
192\.168(?:\.(?:[1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])){2}
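A quick way to gain confidence in hand-compressed patterns like these is to check them against the literal list. The sketch below, written for Python 3, does that for the region names and for the address pattern. It assumes the address pattern is meant to have escaped dots and `\d` digit classes, and `eu-east-1` is a made-up name used only as a negative case.

```python
import re

regions = ("ap-northeast-1|ap-southeast-1|ap-southeast-2|eu-west-1|"
           "sa-east-1|us-east-1|us-west-1|us-west-2").split("|")

compact = re.compile(r"(ap-(nor|sou)th|sa-|us-)east-1|(eu|us)-west-1|"
                     r"(us-we|ap-southea)st-2")

# Every listed region should match in full; a made-up one should not.
all_match = all(compact.fullmatch(r) for r in regions)
rejects_other = compact.fullmatch("eu-east-1") is None

# Assumed reading of the address pattern: escaped dots, \d for digits.
octet = r"(?:[1-9]?\d|1\d{2}|2[0-4]\d|25[0-5])"
private = re.compile(r"192\.168(?:\." + octet + r"){2}")

# Each octet must be a number from 0 to 255.
octets_ok = all(private.fullmatch("192.168.0.%d" % i) for i in range(256))
rejects_256 = private.fullmatch("192.168.1.256") is None

print(all_match, rejects_other, octets_ok, rejects_256)
```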
15. How fast is the “re” library?
● Implementation uses backtracking, like PCRE
○ So it is fast provided it never guesses wrong
○ Trivial to write an expression that is … slow
import re, timeit
def test(n):
    t = "a" * n
    r = "a?" * n + t
    return bool(re.match(r, t))
timeit.timeit(
    stmt="test(6)", setup="from __main__ import test")
16. The RE2 library
● https://code.google.com/p/re2
● https://github.com/axiak/pyre2
● RE2 tries all possible code paths in parallel
○ never backtracks, so omits features that need it
● drops support for backreferences
○ and generalized zero-width assertions
● Predictable worst case performance for any input
○ Safe to accept untrusted regular expressions
test(10) takes 4 milliseconds instead of one minute
17. Summary
● Regular expressions are built into Python
○ re_obj = re.compile(pattern)
○ print re_obj.pattern
● They can parse strings into a dictionary
○ Or iteratively many dictionaries
● They can compactly represent large lists
○ Without expanding the whole iterator out
● For reliable performance, use RE2
○ Especially if users are supplying patterns