2. About A3Sec
● AlienVault's spin-off
● Professional Services, SIEM deployments
● Alienvault's Authorized Training Center (ATC)
for Spain and LATAM
● Team of more than 25 Security Experts
● Own developments and tool integrations
● Advanced Health Check Monitoring
● Web: www.a3sec.com, Twitter: @a3sec
3. About Me
● David Gil <dgil@a3sec.com>
● Developer, Sysadmin, Project Manager
● Really believes in Open Source model
● Programming since he was 9 years old
● Ossim developer at its early stage
● Agent core engine (full regex) and first plugins
● Python lover :-)
● Debian package maintainer (a long, long time ago)
● Sci-Fi books reader and mountain bike rider
4. Summary
1. What is a regexp?
2. When to use regexp?
3. Regex basics
4. Performance Tests
5. Writing regexp (Performance Strategies)
6. Writing plugins (Performance Strategies)
7. Tools
7. Regular Expressions
What is a regex?
Regular expression:
(bb|[^b]{2})dd
Input strings:
bb445, 2ac3357bb, bb3aa2c7,
a2ab64b, abb83fh6l3hi22ui
8. Regular Expressions
What is a regex?
Regular expression:
(bb|[^b]{2})dd
Input strings:
bb445, 2ac3357bb, bb3aa2c7,
a2ab64b, abb83fh6l3hi22ui
9. Summary
1. What is a regexp?
2. When to use regexp?
3. Regex basics
4. Performance Tests
5. Writing regexp (Performance Strategies)
6. Writing plugins (Performance Strategies)
7. Tools
10. Regular Expressions
To RE or not to RE
● Regular expressions are almost never the
right answer
○ Difficult to debug and maintain
○ Performance reasons, slower for simple matching
○ Learning curve
11. Regular Expressions
To RE or not to RE
● Regular expressions are almost never the
right answer
○ Difficult to debug and maintain
○ Performance reasons, slower for simple matching
○ Learning curve
● Python string functions are small C loops:
super fast!
○ beginswith(), endswith(), split(), etc.
12. Regular Expressions
To RE or not to RE
● Regular expressions are almost never the
right answer
○ Difficult to debug and maintain
○ Performance reasons, slower for simple matching
○ Learning curve
● Python string functions are small C loops:
super fast!
○ beginswith(), endswith(), split(), etc.
● Use standard parsing libraries!
Formats: JSON, HTML, XML, CSV, etc.
13. Regular Expressions
To RE or not to RE
Example: URL parsing
● regex:
^(https?://)?([da-z.-]+).([a-z.]{2,6})([/w .-]*)*/?$
● parse_url() php method:
$url = "http://username:password@hostname/path?arg=value#anchor";
print_r(parse_url($url));
(
[scheme] => http
[host] => hostname
[user] => username
[pass] => password
[path] => /path
[query] => arg=value
[fragment] => anchor
)
14. Regular Expressions
To RE or not to RE
But, there are a lot of reasons to use regex:
● powerful
● portable
● fast (with performance in mind)
● useful for complex patterns
● save development time
● short code
● fun :-)
● beautiful?
15. Summary
1. What is a regexp?
2. When to use regexp?
3. Regex basics
4. Performance Tests
5. Writing regexp (Performance Strategies)
6. Writing plugins (Performance Strategies)
7. Tools
20. Regular Expressions
Greedy & Lazy quantifiers: *?, +?
● Greedy vs non-greedy (lazy)
>>> re.findall('A+', 'AAAA')
['AAAA']
>>> re.findall('A+?', 'AAAA')
['A', 'A', 'A', 'A']
● An overall match takes precedence over and
overall non-match
>>> re.findall('<.*>.*</.*>', '<B>i am bold</B>')
>>> re.findall('<(.*)>.*</(.*)>', '<B>i am bold</B>')
21. Regular Expressions
Greedy & Lazy quantifiers: *?, +?
● Greedy vs non-greedy (lazy)
>>> re.findall('A+', 'AAAA')
['AAAA']
>>> re.findall('A+?', 'AAAA')
['A', 'A', 'A', 'A']
● An overall match takes precedence over and
overall non-match
>>> re.findall('<.*>.*</.*>', '<B>i am bold</B>')
>>> re.findall('<(.*)>.*</(.*)>', '<B>i am bold</B>')
● Minimal matching, non-greedy
>>> re.findall('<(.*)>.*', '<B>i am bold</B>')
>>> re.findall('<(.*?)>.*', '<B>i am bold</B>')
22. Summary
1. What is a regexp?
2. When to use regexp?
3. Regex basics
4. Performance Tests
5. Writing regexp (Performance Strategies)
6. Writing plugins (Performance Strategies)
7. Tools
25. Regular Expressions
Performance Test #1
def is_a_word(word):
CHARS = string.uppercase + string.lowercase
regexp = r'^[%s]+$' % CHARS
if re.search(regexp, word) return "YES" else "NOP"
timeit.timeit(s, 'is_a_word(%s)' %(w))
1.49650502205
YES len=4
word
1.65614509583
YES len=25
wordlongerthanpreviousone..
1.92520785332
YES len=60
wordlongerthanpreviosoneplusan..
2.38850092888
YES len=120
wordlongerthanpreviosoneplusan..
1.55924701691
NOP len=10
not a word
1.7087020874
NOP len=25
not a word, just a phrase..
1.92521882057
NOP len=50
not a word, just a phrase bigg..
2.39075493813
NOP len=102
not a word, just a phrase bigg..
26. Regular Expressions
Performance Test #1
def is_a_word(word):
CHARS = string.uppercase + string.lowercase
regexp = r'^[%s]+$' % CHARS
if re.search(regexp, word) return "YES" else "NOP"
timeit.timeit(s, 'is_a_word(%s)' %(w))
1.49650502205
YES len=4
word
1.65614509583
YES len=25
wordlongerthanpreviousone..
1.92520785332
YES len=60
wordlongerthanpreviosoneplusan..
2.38850092888
YES len=120
wordlongerthanpreviosoneplusan..
1.55924701691
NOP len=10
not a word
1.7087020874
NOP len=25
not a word, just a phrase..
1.92521882057
NOP len=50
not a word, just a phrase bigg..
2.39075493813
NOP len=102
not a word, just a phrase bigg..
If the target string is longer, the regex matching
is slower. No matter if success or fail.
28. Regular Expressions
Performance Test #2
def is_a_word(word):
for char in word:
if not char in (CHARS): return "NOP"
return "YES"
timeit.timeit(s, 'is_a_word(%s)' %(w))
0.687522172928 YES len=4
word
1.0725839138
YES len=25
wordlongerthanpreviousone..
2.34717106819
YES len=60
wordlongerthanpreviosoneplusan..
4.31543898582
YES len=120
wordlongerthanpreviosoneplusan..
0.54797577858
NOP len=10
not a word
0.547253847122 NOP len=25
not a word, just a phrase..
0.546499967575 NOP len=50
not a word, just a phrase bigg..
0.553755998611 NOP len=102
not a word, just a phrase bigg..
29. Regular Expressions
Performance Test #2
def is_a_word(word):
for char in word:
if not char in (CHARS): return "NOP"
return "YES"
timeit.timeit(s, 'is_a_word(%s)' %(w))
0.687522172928 YES len=4
word
1.0725839138
YES len=25
wordlongerthanpreviousone..
2.34717106819
YES len=60
wordlongerthanpreviosoneplusan..
4.31543898582
YES len=120
wordlongerthanpreviosoneplusan..
0.54797577858
NOP len=10
not a word
0.547253847122 NOP len=25
not a word, just a phrase..
0.546499967575 NOP len=50
not a word, just a phrase bigg..
0.553755998611 NOP len=102
not a word, just a phrase bigg..
2 python nested loops if success (slow)
But fails at the same point&time (first space)
31. Regular Expressions
Performance Test #3
def is_a_word(word):
return "YES" if word.isalpha() else "NOP"
timeit.timeit(s, 'is_a_word(%s)' %(w))
0.146447896957 YES len=4
word
0.212563037872 YES len=25
wordlongerthanpreviousone..
0.318686008453 YES len=60
wordlongerthanpreviosoneplusan..
0.493942975998 YES len=120
wordlongerthanpreviosoneplusan..
0.14647102356 NOP len=10
not a word
0.146160840988 NOP len=25
not a word, just a phrase..
0.147103071213 NOP len=50
not a word, just a phrase bigg..
0.146239995956 NOP len=102
not a word, just a phrase bigg..
32. Regular Expressions
Performance Test #3
def is_a_word(word):
return "YES" if word.isalpha() else "NOP"
timeit.timeit(s, 'is_a_word(%s)' %(w))
0.146447896957 YES len=4
word
0.212563037872 YES len=25
wordlongerthanpreviousone..
0.318686008453 YES len=60
wordlongerthanpreviosoneplusan..
0.493942975998 YES len=120
wordlongerthanpreviosoneplusan..
0.14647102356 NOP len=10
not a word
0.146160840988 NOP len=25
not a word, just a phrase..
0.147103071213 NOP len=50
not a word, just a phrase bigg..
0.146239995956 NOP len=102
not a word, just a phrase bigg..
Python string functions (fast and small C loops)
33. Summary
1. What is a regexp?
2. When to use regexp?
3. Regex basics
4. Performance Tests
5. Writing regexp (Performance Strategies)
6. Writing plugins (Performance Strategies)
7. Tools
40. Regular Expressions
Performance Strategies
Writing regex
● Use the non-capturing group when no need
to capture and save text to a variable
(?:abc|def|ghi) instead of (abc|def|ghi)
● Pattern most likely to match first
(TRAFFIC_ALLOW|TRAFFIC_DROP|TRAFFIC_DENY)
41. Regular Expressions
Performance Strategies
Writing regex
● Use the non-capturing group when no need
to capture and save text to a variable
(?:abc|def|ghi) instead of (abc|def|ghi)
● Pattern most likely to match first
(TRAFFIC_ALLOW|TRAFFIC_DROP|TRAFFIC_DENY)
TRAFFIC_(ALLOW|DROP|DENY)
42. Regular Expressions
Performance Strategies
Writing regex
● Use the non-capturing group when no need
to capture and save text to a variable
(?:abc|def|ghi) instead of (abc|def|ghi)
● Pattern most likely to match first
(TRAFFIC_ALLOW|TRAFFIC_DROP|TRAFFIC_DENY)
TRAFFIC_(ALLOW|DROP|DENY)
● Use anchors (^ and $) to limit the score
re.findall(r'(ab){2}', 'abcabcabc')
re.findall(r'^(ab){2}','abcabcabc') #failures occur faster
43. Summary
1. What is a regexp?
2. When to use regexp?
3. Regex basics
4. Performance Tests
5. Writing regexp (Performance Strategies)
6. Writing plugins (Performance Strategies)
7. Tools
45. Regular Expressions
Performance Strategies
Writing Agent plugins
● A new process is forked for each loaded
plugin
○ Use the plugins that you really need!
● A plugin is a set of rules (regexp operations)
for matching log lines
○ If a plugin doesn't match a log entry, it fails in ALL its
rules!
○ Reduce the number of rules, use a [translation] table
47. Regular Expressions
Performance Strategies
Writing Agent plugins
● Alphabetical order for rule matching
○ Order your rules by priority, pattern most likely to
match first
● Divide and conquer
○ A plugin is configured to read from a source file, use
dedicated source files per technology
○ Also, use dedicated plugins for each technology
50. Summary
1. What is a regexp?
2. When to use regexp?
3. Regex basics
4. Performance Tests
5. Writing regexp (Performance Strategies)
6. Writing plugins (Performance Strategies)
7. Tools