Regular Expressions
Performance
Optimizing event capture by building
better Ossim Agent plugins
About A3Sec
● AlienVault's spin-off
● Professional Services, SIEM deployments
● AlienVault's Authorized Training Center (ATC)
for Spain and LATAM
● Team of more than 25 Security Experts
● Own developments and tool integrations
● Advanced Health Check Monitoring
● Web: www.a3sec.com, Twitter: @a3sec
About Me
● David Gil <dgil@a3sec.com>
● Developer, Sysadmin, Project Manager
● Really believes in the Open Source model
● Programming since he was 9 years old
● Ossim developer at its early stage
● Agent core engine (full regex) and first plugins
● Python lover :-)
● Debian package maintainer (a long, long time ago)
● Sci-Fi books reader and mountain bike rider
Summary
1. What is a regexp?
2. When to use regexp?
3. Regex basics
4. Performance Tests
5. Writing regexp (Performance Strategies)
6. Writing plugins (Performance Strategies)
7. Tools
Regular Expressions
What is a regex?
Regular expression:

(bb|[^b]{2})\d\d
Input strings:

bb445, 2ac3357bb, bb3aa2c7,
a2ab64b, abb83fh6l3hi22ui
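
One way to check which of the input strings match is a short Python session. This is a minimal sketch that assumes the pattern on the slide reads `(bb|[^b]{2})\d\d` (two `b`s, or any two characters other than `b`, followed by two digits) and that a match anywhere in the string counts. The names `PATTERN`, `INPUTS` and `matched_text` are illustrative and not part of the talk.

```python
import re

# Assumed reading of the slide's pattern (backslashes restored).
PATTERN = re.compile(r'(bb|[^b]{2})\d\d')

INPUTS = ['bb445', '2ac3357bb', 'bb3aa2c7', 'a2ab64b', 'abb83fh6l3hi22ui']


def matched_text(text):
    """Return the first substring matched by PATTERN, or None."""
    match = PATTERN.search(text)
    return match.group(0) if match else None


results = {text: matched_text(text) for text in INPUTS}

for text, hit in results.items():
    print('%-18s -> %s' % (text, hit))
```

Under these assumptions three of the five strings match: bb445, 2ac3357bb and abb83fh6l3hi22ui.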
Regular Expressions
To RE or not to RE
● Regular expressions are almost never the
right answer
○ Difficult to debug and maintain
○ Performance reasons, slower for simple matching
○ Learning curve

● Python string functions are small C loops:
super fast!
○ startswith(), endswith(), split(), etc.

● Use standard parsing libraries!
Formats: JSON, HTML, XML, CSV, etc.
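
As an illustration of the two points above, a simple `key=value` log line can often be parsed with string methods alone, and a structured format such as JSON is better left to its standard library parser. This is a hedged sketch: the log line, the `sshd: ` prefix and the function name `parse_fields` are made up for the example.

```python
import json

# Hypothetical log line, used only for illustration.
line = 'sshd: user=root action=login'


def parse_fields(line, prefix='sshd: '):
    """Parse 'key=value' pairs with str methods only, no regex."""
    if not line.startswith(prefix):
        return None
    pairs = line[len(prefix):].split()
    return dict(pair.split('=', 1) for pair in pairs)


fields = parse_fields(line)

# For a standard format, use the standard parser instead of a regex.
event = json.loads('{"user": "root", "action": "login"}')

print(fields)
print(event)
```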
Regular Expressions
To RE or not to RE
Example: URL parsing
● regex:
^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$

● parse_url() PHP function:
$url = "http://username:password@hostname/path?arg=value#anchor";
print_r(parse_url($url));
Array
(
[scheme] => http
[host] => hostname
[user] => username
[pass] => password
[path] => /path
[query] => arg=value
[fragment] => anchor
)
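
Since the rest of the talk uses Python, a rough equivalent of the PHP example can be sketched with the standard library's `urllib.parse.urlsplit`. The dictionary keys below simply mirror the PHP output; they are chosen for the comparison and are not a Python convention.

```python
from urllib.parse import urlsplit

url = "http://username:password@hostname/path?arg=value#anchor"
parts = urlsplit(url)

# Collect the same fields that parse_url() reports in the PHP example.
parsed = {
    'scheme': parts.scheme,
    'host': parts.hostname,
    'user': parts.username,
    'pass': parts.password,
    'path': parts.path,
    'query': parts.query,
    'fragment': parts.fragment,
}

for key, value in parsed.items():
    print('[%s] => %s' % (key, value))
```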
Regular Expressions
To RE or not to RE
But, there are a lot of reasons to use regex:
● powerful
● portable
● fast (with performance in mind)
● useful for complex patterns
● save development time
● short code
● fun :-)
● beautiful?
Regular Expressions
Basics - Characters
● \d, \D: digits / non-digits. \w, \W: word / non-word characters. \s, \S: whitespace / non-whitespace
>>> re.findall(r'\d\d\d\d-(\d\d)-\d\d', '2013-07-21')
>>> re.findall(r'(\S+)\s+(\S+)', 'foo bar')

● ^, $: Begin/End of string
>>> re.findall(r'(\d+)', 'cba3456csw')
>>> re.findall(r'^(\d+)$', 'cba3456csw')

● . (dot): Any character:
>>> re.findall('foo(.)bar', 'foo=bar')
>>> re.findall('(...)=(...)', 'foo=bar')
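
The slide leaves the results to the live session. The sketch below runs the same calls, assuming the character classes are written with their backslashes (`\d`, `\S`, `\s`), and records what each one returns; the variable names are only for the example.

```python
import re

# Character classes: with a group, findall returns only the captured text.
month = re.findall(r'\d\d\d\d-(\d\d)-\d\d', '2013-07-21')   # ['07']
words = re.findall(r'(\S+)\s+(\S+)', 'foo bar')             # [('foo', 'bar')]

# Anchors: the unanchored pattern finds the digits inside the string,
# the anchored one requires the whole string to be digits.
digits_anywhere = re.findall(r'(\d+)', 'cba3456csw')        # ['3456']
digits_only = re.findall(r'^(\d+)$', 'cba3456csw')          # []

# Dot: any single character (except a newline by default).
separator = re.findall('foo(.)bar', 'foo=bar')              # ['=']
triples = re.findall('(...)=(...)', 'foo=bar')              # [('foo', 'bar')]

print(month, words, digits_anywhere, digits_only, separator, triples)
```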
Regular Expressions
Basics - Repetitions
● *, +: 0 or more, 1 or more repetitions
>>> re.findall('FO+', 'FOOOOOOOOO')
>>> re.findall('BA*R', 'BR')

● ?: 0 or 1 repetitions
>>> re.findall('colou?r', 'color')
>>> re.findall('colou?r', 'colour')

● {n}, {n,m}: exactly n, or between n and m, repetitions:
>>> re.findall(r'\d{2}', '2013-07-21')
>>> re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}','192.168.1.25')
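
A short sketch of what these quantifier examples return, again assuming the digit class is written `\d`. The last two lines are an addition, not from the slide: they show why the dots of an IP address are worth escaping, using a made-up input string.

```python
import re

shout = 'FOOOOOOOOO'
plus = re.findall('FO+', shout)                 # the whole string: + is greedy
star = re.findall('BA*R', 'BR')                 # ['BR']: * also allows zero A's

us = re.findall('colou?r', 'color')             # ['color']
uk = re.findall('colou?r', 'colour')            # ['colour']

pairs = re.findall(r'\d{2}', '2013-07-21')      # ['20', '13', '07', '21']

IP_ESCAPED = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
IP_UNESCAPED = r'\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}'
ip = re.findall(IP_ESCAPED, '192.168.1.25')     # ['192.168.1.25']

# An unescaped dot matches any character, so this non-address also matches.
loose = re.findall(IP_UNESCAPED, '192x168y1z25')   # ['192x168y1z25']
strict = re.findall(IP_ESCAPED, '192x168y1z25')    # []

print(plus, star, us, uk, pairs, ip, loose, strict)
```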
Regular Expressions
Basics - Groups
[...]: Set of characters
>>> re.findall('[a-z]+=[a-z]+', 'foo=bar')

...|...: Alternation
>>> re.findall('(foo|bar)=(foo|bar)', 'foo=bar')

(...) and \1, \2, ...: Group and backreference
>>> re.findall(r'(\w+)=(\1)', 'foo=bar')
>>> re.findall(r'(\w+)=(\1)', 'foo=foo')

(?P<name>...): Named group
>>> re.findall(r'\d{4}-\d{2}-(?P<day>\d{2})', '2013-07-23')
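
The following sketch shows what the group examples return, assuming the backreference is written `\1` and the named-group call has its closing parenthesis inside the pattern. The final `re.search` call is an extra illustration of reading a named group by name.

```python
import re

char_set = re.findall('[a-z]+=[a-z]+', 'foo=bar')           # ['foo=bar']
either = re.findall('(foo|bar)=(foo|bar)', 'foo=bar')       # [('foo', 'bar')]

# \1 must repeat exactly what group 1 captured.
different = re.findall(r'(\w+)=(\1)', 'foo=bar')            # []
same = re.findall(r'(\w+)=(\1)', 'foo=foo')                 # [('foo', 'foo')]

DATE = r'\d{4}-\d{2}-(?P<day>\d{2})'
day_list = re.findall(DATE, '2013-07-23')                   # ['23']

# With search(), the named group can be read back by its name.
day = re.search(DATE, '2013-07-23').group('day')            # '23'

print(char_set, either, different, same, day_list, day)
```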
Regular Expressions
Greedy & Lazy quantifiers: *?, +?
● Greedy vs non-greedy (lazy)
>>> re.findall('A+', 'AAAA')
['AAAA']
>>> re.findall('A+?', 'AAAA')
['A', 'A', 'A', 'A']

● An overall match takes precedence over an
overall non-match
>>> re.findall('<.*>.*</.*>', '<B>i am bold</B>')
>>> re.findall('<(.*)>.*</(.*)>', '<B>i am bold</B>')

● Minimal matching, non-greedy
>>> re.findall('<(.*)>.*', '<B>i am bold</B>')
>>> re.findall('<(.*?)>.*', '<B>i am bold</B>')
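
The results of the tag examples are not shown on the slide, so here is a small sketch of what they return. It illustrates both points: a greedy `.*` gives characters back only as far as needed for the whole pattern to match, and the lazy `.*?` stops at the first `>` it can.

```python
import re

TEXT = '<B>i am bold</B>'

# Greedy quantifiers backtrack just enough for the overall match to succeed.
whole = re.findall('<.*>.*</.*>', TEXT)          # ['<B>i am bold</B>']
tags = re.findall('<(.*)>.*</(.*)>', TEXT)       # [('B', 'B')]

# Greedy: the group runs up to the LAST '>' in the string.
greedy = re.findall('<(.*)>.*', TEXT)            # ['B>i am bold</B']

# Lazy: the group stops at the FIRST '>' it can.
lazy = re.findall('<(.*?)>.*', TEXT)             # ['B']

print(whole, tags, greedy, lazy)
```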
Regular Expressions
Performance Tests
Different implementations of a custom
is_a_word() function:
● #1 Regexp
● #2 Char iteration
● #3 String functions
Regular Expressions
Performance Test #1
import re
import string

def is_a_word(word):
    CHARS = string.ascii_uppercase + string.ascii_lowercase  # A-Z, then a-z
    regexp = r'^[%s]+$' % CHARS  # one or more ASCII letters and nothing else
    return "YES" if re.search(regexp, word) else "NOP"

timeit.timeit(s, 'is_a_word(%s)' %(w))

time (s)        result  len  input
1.49650502205   YES     4    word
1.65614509583   YES     25   wordlongerthanpreviousone..
1.92520785332   YES     60   wordlongerthanpreviosoneplusan..
2.38850092888   YES     120  wordlongerthanpreviosoneplusan..
1.55924701691   NOP     10   not a word
1.7087020874    NOP     25   not a word, just a phrase..
1.92521882057   NOP     50   not a word, just a phrase bigg..
2.39075493813   NOP     102  not a word, just a phrase bigg..

The longer the target string, the slower the regex matching,
whether the match succeeds or fails.
Regular Expressions
Performance Test #2
import string

CHARS = string.ascii_uppercase + string.ascii_lowercase  # A-Z, then a-z

def is_a_word(word):
    for char in word:
        if char not in CHARS:  # "in" scans CHARS: a second, inner loop
            return "NOP"
    return "YES"

timeit.timeit(s, 'is_a_word(%s)' %(w))

time (s)        result  len  input
0.687522172928  YES     4    word
1.0725839138    YES     25   wordlongerthanpreviousone..
2.34717106819   YES     60   wordlongerthanpreviosoneplusan..
4.31543898582   YES     120  wordlongerthanpreviosoneplusan..
0.54797577858   NOP     10   not a word
0.547253847122  NOP     25   not a word, just a phrase..
0.546499967575  NOP     50   not a word, just a phrase bigg..
0.553755998611  NOP     102  not a word, just a phrase bigg..

On success it runs 2 nested Python loops (slow, and slower as the word grows).
On failure it always stops at the same point, and so after the same time: the first space.
Regular Expressions
Performance Test #3
def is_a_word(word):
    return "YES" if word.isalpha() else "NOP"

timeit.timeit(s, 'is_a_word(%s)' %(w))

time (s)        result  len  input
0.146447896957  YES     4    word
0.212563037872  YES     25   wordlongerthanpreviousone..
0.318686008453  YES     60   wordlongerthanpreviosoneplusan..
0.493942975998  YES     120  wordlongerthanpreviosoneplusan..
0.14647102356   NOP     10   not a word
0.146160840988  NOP     25   not a word, just a phrase..
0.147103071213  NOP     50   not a word, just a phrase bigg..
0.146239995956  NOP     102  not a word, just a phrase bigg..

Python string functions (fast and small C loops)
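
To repeat the comparison on your own machine, the three variants can be timed side by side. This is a sketch rather than the harness used for the talk: the function names, the sample strings and the `benchmark` helper are assumptions, and the regex is compiled once up front, which differs from test #1, where the pattern string is rebuilt on every call. Absolute timings will not match the slides.

```python
import re
import string
import timeit

CHARS = string.ascii_uppercase + string.ascii_lowercase
WORD_RE = re.compile(r'^[%s]+$' % CHARS)   # compiled once (a change from test #1)


def by_regex(word):
    return "YES" if WORD_RE.search(word) else "NOP"


def by_loop(word):
    for char in word:
        if char not in CHARS:
            return "NOP"
    return "YES"


def by_isalpha(word):
    return "YES" if word.isalpha() else "NOP"


# Illustrative inputs, not the exact strings from the slides.
SAMPLES = ['word', 'wordlongerthanpreviousone',
           'not a word', 'not a word, just a phrase']


def benchmark(number=20000):
    """Return {(function name, sample): seconds for `number` calls}."""
    timings = {}
    for func in (by_regex, by_loop, by_isalpha):
        for sample in SAMPLES:
            timings[(func.__name__, sample)] = timeit.timeit(
                lambda: func(sample), number=number)
    return timings
```

Calling `benchmark()` and printing the dictionary gives one timing per function and sample. Note that the three variants only agree for non-empty ASCII input: `by_loop('')` returns "YES" while the other two return "NOP", and in Python 3 `str.isalpha()` also accepts non-ASCII letters.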
Regular Expressions
Performance Strategies
Writing regex
● Be careful with repetitions (+, *, {n,m})
(abc|def){2,4} produces (abc|def)(abc|def)((abc|def)(abc|def)?)?
(abc|def){2,1000} produces ...

● Be careful with wildcards
re.findall(r'(ab).*(cd).*(ef)', 'ab cd ef') # slower
re.findall(r'(ab)\s(cd)\s(ef)', 'ab cd ef') # faster

● Longer target string -> slower regex matching
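
The wildcard advice can be tried out with a small timing sketch. Both patterns return the same groups on the slide's input; the difference shows up on lines that start like a match but never complete it, where each `.*` first consumes the rest of the line and then backtracks. The `near_miss` string and the `time_pattern` helper are made up for this illustration, and no particular speed-up is claimed.

```python
import re
import timeit

WILDCARD = re.compile(r'(ab).*(cd).*(ef)')
EXPLICIT = re.compile(r'(ab)\s(cd)\s(ef)')

target = 'ab cd ef'

# Hypothetical long line that begins like a match but has no 'ef'.
near_miss = 'ab cd ' + 'x' * 2000


def time_pattern(pattern, text, number=1000):
    """Seconds taken to run pattern.findall(text) `number` times."""
    return timeit.timeit(lambda: pattern.findall(text), number=number)


if __name__ == '__main__':
    print('wildcard:', time_pattern(WILDCARD, near_miss))
    print('explicit:', time_pattern(EXPLICIT, near_miss))
```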
Regular Expressions
Performance Strategies
Writing regex
● Use a non-capturing group when there is no need
to capture the text and save it to a variable
(?:abc|def|ghi) instead of (abc|def|ghi)

● Put the pattern most likely to match first, and factor out the common prefix
(TRAFFIC_ALLOW|TRAFFIC_DROP|TRAFFIC_DENY)
TRAFFIC_(ALLOW|DROP|DENY)

● Use anchors (^ and $) to limit the scope
re.findall(r'(ab){2}', 'abcabcabc')
re.findall(r'^(ab){2}','abcabcabc') #failures occur faster
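
A short sketch of how these three choices change what `re.findall` returns. The firewall-style log line and the extra input `'xxabab'` are invented for the example; only the patterns come from the slide.

```python
import re

# Hypothetical log line.
line = 'fw01 TRAFFIC_DROP src=10.0.0.5'

# Capturing group: findall returns only the captured part.
captured = re.findall(r'TRAFFIC_(ALLOW|DROP|DENY)', line)          # ['DROP']

# Non-capturing group: nothing is saved, findall returns the whole match.
uncaptured = re.findall(r'TRAFFIC_(?:ALLOW|DROP|DENY)', line)      # ['TRAFFIC_DROP']

# Unfactored alternation: same text matched, the prefix repeated per branch.
unfactored = re.findall(r'(TRAFFIC_ALLOW|TRAFFIC_DROP|TRAFFIC_DENY)', line)

# Anchors: neither pattern matches 'abcabcabc', but the anchored one
# only has to try position 0 before giving up.
free = re.findall(r'(ab){2}', 'abcabcabc')                         # []
anchored = re.findall(r'^(ab){2}', 'abcabcabc')                    # []

# The anchor also changes results: 'abab' exists, but not at the start.
free_hit = re.findall(r'(ab){2}', 'xxabab')                        # ['ab']
anchored_miss = re.findall(r'^(ab){2}', 'xxabab')                  # []

print(captured, uncaptured, unfactored, free, anchored, free_hit, anchored_miss)
```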
Regular Expressions
Performance Strategies
Writing Agent plugins
● A new process is forked for each loaded
plugin
○ Load only the plugins that you really need!

● A plugin is a set of rules (regexp operations)
for matching log lines
○ A log entry that the plugin does not match is tested against ALL of its
rules before it fails!
○ Reduce the number of rules: use a [translation] table
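
The idea behind a translation table can be sketched in plain Python: one rule captures the varying token, and a lookup table maps it to an event ID, instead of one rule per event type. This is not the Ossim Agent's code or its plugin syntax; the log format, `RULE`, `TRANSLATION`, `parse_event` and the ID numbers are all hypothetical.

```python
import re

# One rule instead of three: capture the action, translate it afterwards.
RULE = re.compile(r'TRAFFIC_(?P<action>ALLOW|DROP|DENY) src=(?P<src>\S+)')

# Hypothetical translation table: captured token -> event ID.
TRANSLATION = {'ALLOW': 1, 'DROP': 2, 'DENY': 3}


def parse_event(line):
    """Return a small event dict, or None when the line does not match."""
    match = RULE.search(line)
    if match is None:
        return None
    return {
        'sid': TRANSLATION[match.group('action')],
        'src': match.group('src'),
    }


print(parse_event('fw01 TRAFFIC_DROP src=10.0.0.5'))
print(parse_event('unrelated log line'))
```

With this shape, a line that does not belong to the plugin costs one failed regex instead of one per event type.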
Regular Expressions
Performance Strategies
Writing Agent plugins
● Rules are matched in alphabetical order
○ Order your rules by priority: put the pattern most likely to
match first

● Divide and conquer
○ A plugin is configured to read from one source file: use
a dedicated source file per technology
○ Also, use a dedicated plugin for each technology
Regular Expressions
Performance Strategies
[Diagram] Tool1 to Tool5 each write 20 logs/sec into the shared file /var/log/syslog (100 logs/sec in total).

5 plugins with 1 rule each, all reading /var/log/syslog
5x100 = 500 total regex/sec
Regular Expressions
Performance Strategies
[Diagram] Tool1 to Tool5 each write 20 logs/sec into their own file, /var/log/tool1 to /var/log/tool5 (100 logs/sec in total).

5 plugins with 1 rule each, each reading its own /var/log/tool{1-5}
5x20 = 100 total regex/sec (5x fewer, so faster)
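
The arithmetic behind the two diagrams can be written down as a small function. It is a simplified model that assumes every rule of a plugin is evaluated once against every line the plugin reads; the function and variable names are illustrative.

```python
def regex_evaluations_per_sec(rules_per_plugin, lines_per_sec_read):
    """Sum of rules x lines/sec over all plugins (simplified cost model)."""
    return sum(rules * rate
               for rules, rate in zip(rules_per_plugin, lines_per_sec_read))


# Shared file: each of the 5 plugins reads all 100 logs/sec from syslog.
shared = regex_evaluations_per_sec([1] * 5, [100] * 5)

# Dedicated files: each plugin reads only its own tool's 20 logs/sec.
dedicated = regex_evaluations_per_sec([1] * 5, [20] * 5)

print(shared, dedicated, shared // dedicated)
```

With more than one rule per plugin the gap grows in the same proportion, since the rule count multiplies both totals.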
Regular Expressions
Tools for testing Regex
Python:
>>> import re
>>> re.findall(r'(\S+) (\S+)', 'foo bar')
[('foo', 'bar')]
>>> result = re.search(
...     r'(?P<key>\w+)\s*=\s*(?P<value>\w+)',
...     'foo=bar'
... )
>>> result.groupdict()
{'key': 'foo', 'value': 'bar'}
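
Beyond the interactive prompt, a plugin regex can be checked against a list of sample lines and the fields each should yield. This is a minimal sketch of that idea: the `check` helper and the sample cases are hypothetical, and the pattern is the key/value one from the session above.

```python
import re

KEY_VALUE = re.compile(r'(?P<key>\w+)\s*=\s*(?P<value>\w+)')

# (input line, expected groupdict or None when no match is expected)
CASES = [
    ('foo=bar', {'key': 'foo', 'value': 'bar'}),
    ('foo = bar', {'key': 'foo', 'value': 'bar'}),
    ('no pairs here', None),
]


def check(pattern, cases):
    """Return a list of (text, expected, actual) for every failing case."""
    failures = []
    for text, expected in cases:
        match = pattern.search(text)
        actual = match.groupdict() if match else None
        if actual != expected:
            failures.append((text, expected, actual))
    return failures


print(check(KEY_VALUE, CASES) or 'all cases pass')
```

Keeping real log samples in such a list makes it easier to notice when a later change to the regex breaks a case that used to work.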
Regular Expressions
Tools for testing Regex
Regex debuggers:
● Kiki
● Kodos
Online regex testers:
● http://gskinner.com/RegExr/ (java)
● http://regexpal.com/ (javascript)
● http://rubular.com/ (ruby)
● http://www.pythonregex.com/ (python)
Online regex visualization:
● http://www.regexper.com/ (javascript)
any (?:question|doubt|comment)+?
A3Sec
web: www.a3sec.com
email: training@a3sec.com
twitter: @a3sec
Spain Head Office
C/ Aravaca, 6, Piso 2
28040 Madrid
Tlf. +34 533 09 78
México Head Office
Avda. Paseo de la Reforma, 389 Piso 10
México DF
Tlf. +52 55 5980 3547

Weitere ähnliche Inhalte

Was ist angesagt?

From typing the test to testing the type
From typing the test to testing the typeFrom typing the test to testing the type
From typing the test to testing the typeWim Godden
 
Perl 6 for Concurrency and Parallel Computing
Perl 6 for Concurrency and Parallel ComputingPerl 6 for Concurrency and Parallel Computing
Perl 6 for Concurrency and Parallel ComputingAndrew Shitov
 
Presentation aalpert v6а+++
Presentation aalpert v6а+++Presentation aalpert v6а+++
Presentation aalpert v6а+++Natalia Gorak
 
Benchmarking Perl Lightning Talk (NPW 2007)
Benchmarking Perl Lightning Talk (NPW 2007)Benchmarking Perl Lightning Talk (NPW 2007)
Benchmarking Perl Lightning Talk (NPW 2007)brian d foy
 
Testing Code and Assuring Quality
Testing Code and Assuring QualityTesting Code and Assuring Quality
Testing Code and Assuring QualityKent Cowgill
 
Designing with Groovy Traits - Gr8Conf India
Designing with Groovy Traits - Gr8Conf IndiaDesigning with Groovy Traits - Gr8Conf India
Designing with Groovy Traits - Gr8Conf IndiaNaresha K
 
Power shell voor developers
Power shell voor developersPower shell voor developers
Power shell voor developersDennis Vroegop
 
groovy rules
groovy rulesgroovy rules
groovy rulesPaul King
 

Was ist angesagt? (10)

From typing the test to testing the type
From typing the test to testing the typeFrom typing the test to testing the type
From typing the test to testing the type
 
What's New in ZF 1.10
What's New in ZF 1.10What's New in ZF 1.10
What's New in ZF 1.10
 
Perl 6 for Concurrency and Parallel Computing
Perl 6 for Concurrency and Parallel ComputingPerl 6 for Concurrency and Parallel Computing
Perl 6 for Concurrency and Parallel Computing
 
Presentation aalpert v6а+++
Presentation aalpert v6а+++Presentation aalpert v6а+++
Presentation aalpert v6а+++
 
Benchmarking Perl Lightning Talk (NPW 2007)
Benchmarking Perl Lightning Talk (NPW 2007)Benchmarking Perl Lightning Talk (NPW 2007)
Benchmarking Perl Lightning Talk (NPW 2007)
 
Testing Code and Assuring Quality
Testing Code and Assuring QualityTesting Code and Assuring Quality
Testing Code and Assuring Quality
 
Refresher
RefresherRefresher
Refresher
 
Designing with Groovy Traits - Gr8Conf India
Designing with Groovy Traits - Gr8Conf IndiaDesigning with Groovy Traits - Gr8Conf India
Designing with Groovy Traits - Gr8Conf India
 
Power shell voor developers
Power shell voor developersPower shell voor developers
Power shell voor developers
 
groovy rules
groovy rulesgroovy rules
groovy rules
 

Ähnlich wie A3 sec -_regular_expressions

Regular expressions, Alex Perry, Google, PyCon2014
Regular expressions, Alex Perry, Google, PyCon2014Regular expressions, Alex Perry, Google, PyCon2014
Regular expressions, Alex Perry, Google, PyCon2014alex_perry
 
Performance Optimization of Rails Applications
Performance Optimization of Rails ApplicationsPerformance Optimization of Rails Applications
Performance Optimization of Rails ApplicationsSerge Smetana
 
Unit Test Your Database
Unit Test Your DatabaseUnit Test Your Database
Unit Test Your DatabaseDavid Wheeler
 
Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.Аліна Шепшелей
 
SE2016 Exotic Valerii Vasylkov "Erlang. Measurements and benefits"
SE2016 Exotic Valerii Vasylkov "Erlang. Measurements and benefits"SE2016 Exotic Valerii Vasylkov "Erlang. Measurements and benefits"
SE2016 Exotic Valerii Vasylkov "Erlang. Measurements and benefits"Inhacking
 
Automated Frontend Testing
Automated Frontend TestingAutomated Frontend Testing
Automated Frontend TestingNeil Crosby
 
Coder Presentation
Coder  PresentationCoder  Presentation
Coder PresentationDoug Green
 
Benchmarking and PHPBench
Benchmarking and PHPBenchBenchmarking and PHPBench
Benchmarking and PHPBenchdantleech
 
"How keep normal blood pressure using TDD" By Roman Loparev
"How keep normal blood pressure using TDD" By Roman Loparev"How keep normal blood pressure using TDD" By Roman Loparev
"How keep normal blood pressure using TDD" By Roman LoparevCiklum Ukraine
 
Developer Tests - Things to Know
Developer Tests - Things to KnowDeveloper Tests - Things to Know
Developer Tests - Things to KnowVaidas Pilkauskas
 
Review unknown code with static analysis Zend con 2017
Review unknown code with static analysis  Zend con 2017Review unknown code with static analysis  Zend con 2017
Review unknown code with static analysis Zend con 2017Damien Seguy
 
Fundamentals of computer programming by Dr. A. Charan Kumari
Fundamentals of computer programming by Dr. A. Charan KumariFundamentals of computer programming by Dr. A. Charan Kumari
Fundamentals of computer programming by Dr. A. Charan KumariTHE NORTHCAP UNIVERSITY
 
Php Code Audits (PHP UK 2010)
Php Code Audits (PHP UK 2010)Php Code Audits (PHP UK 2010)
Php Code Audits (PHP UK 2010)Damien Seguy
 
Object Orientation vs. Functional Programming in Python
Object Orientation vs. Functional Programming in PythonObject Orientation vs. Functional Programming in Python
Object Orientation vs. Functional Programming in PythonPython Ireland
 
AOP in Python API design
AOP in Python API designAOP in Python API design
AOP in Python API designmeij200
 
How To Test Everything
How To Test EverythingHow To Test Everything
How To Test Everythingnoelrap
 
Ranges calendar-novosibirsk-2015-08
Ranges calendar-novosibirsk-2015-08Ranges calendar-novosibirsk-2015-08
Ranges calendar-novosibirsk-2015-08Platonov Sergey
 

Ähnlich wie A3 sec -_regular_expressions (20)

Regular expressions, Alex Perry, Google, PyCon2014
Regular expressions, Alex Perry, Google, PyCon2014Regular expressions, Alex Perry, Google, PyCon2014
Regular expressions, Alex Perry, Google, PyCon2014
 
Performance Optimization of Rails Applications
Performance Optimization of Rails ApplicationsPerformance Optimization of Rails Applications
Performance Optimization of Rails Applications
 
Unit Test Your Database
Unit Test Your DatabaseUnit Test Your Database
Unit Test Your Database
 
Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.Valerii Vasylkov Erlang. measurements and benefits.
Valerii Vasylkov Erlang. measurements and benefits.
 
SE2016 Exotic Valerii Vasylkov "Erlang. Measurements and benefits"
SE2016 Exotic Valerii Vasylkov "Erlang. Measurements and benefits"SE2016 Exotic Valerii Vasylkov "Erlang. Measurements and benefits"
SE2016 Exotic Valerii Vasylkov "Erlang. Measurements and benefits"
 
Automated Frontend Testing
Automated Frontend TestingAutomated Frontend Testing
Automated Frontend Testing
 
Coder Presentation
Coder  PresentationCoder  Presentation
Coder Presentation
 
Benchmarking and PHPBench
Benchmarking and PHPBenchBenchmarking and PHPBench
Benchmarking and PHPBench
 
"How keep normal blood pressure using TDD" By Roman Loparev
"How keep normal blood pressure using TDD" By Roman Loparev"How keep normal blood pressure using TDD" By Roman Loparev
"How keep normal blood pressure using TDD" By Roman Loparev
 
Developer Tests - Things to Know
Developer Tests - Things to KnowDeveloper Tests - Things to Know
Developer Tests - Things to Know
 
Review unknown code with static analysis Zend con 2017
Review unknown code with static analysis  Zend con 2017Review unknown code with static analysis  Zend con 2017
Review unknown code with static analysis Zend con 2017
 
Fundamentals of computer programming by Dr. A. Charan Kumari
Fundamentals of computer programming by Dr. A. Charan KumariFundamentals of computer programming by Dr. A. Charan Kumari
Fundamentals of computer programming by Dr. A. Charan Kumari
 
Php Code Audits (PHP UK 2010)
Php Code Audits (PHP UK 2010)Php Code Audits (PHP UK 2010)
Php Code Audits (PHP UK 2010)
 
R meetup talk
R meetup talkR meetup talk
R meetup talk
 
Object Orientation vs. Functional Programming in Python
Object Orientation vs. Functional Programming in PythonObject Orientation vs. Functional Programming in Python
Object Orientation vs. Functional Programming in Python
 
Headless Js Testing
Headless Js TestingHeadless Js Testing
Headless Js Testing
 
AOP in Python API design
AOP in Python API designAOP in Python API design
AOP in Python API design
 
How To Test Everything
How To Test EverythingHow To Test Everything
How To Test Everything
 
Ruby on Rails
Ruby on RailsRuby on Rails
Ruby on Rails
 
Ranges calendar-novosibirsk-2015-08
Ranges calendar-novosibirsk-2015-08Ranges calendar-novosibirsk-2015-08
Ranges calendar-novosibirsk-2015-08
 

A3 sec -_regular_expressions

  • 1. Regular Expressions Performance Optimizing event capture building better Ossim Agent plugins
  • 2. About A3Sec ● AlienVault's spin-off ● Professional Services, SIEM deployments ● Alienvault's Authorized Training Center (ATC) for Spain and LATAM ● Team of more than 25 Security Experts ● Own developments and tool integrations ● Advanced Health Check Monitoring ● Web: www.a3sec.com, Twitter: @a3sec
  • 3. About Me ● David Gil <dgil@a3sec.com> ● Developer, Sysadmin, Project Manager ● Really believes in Open Source model ● Programming since he was 9 years old ● Ossim developer at its early stage ● Agent core engine (full regex) and first plugins ● Python lover :-) ● Debian package maintainer (a long, long time ago) ● Sci-Fi books reader and mountain bike rider
  • 4. Summary 1. What is a regexp? 2. When to use regexp? 3. Regex basics 4. Performance Tests 5. Writing regexp (Performance Strategies) 6. Writing plugins (Performance Strategies) 7. Tools
  • 5.
  • 6. Regular Expressions What is a regex? Regular expression: (bb|[^b]{2})
  • 7. Regular Expressions What is a regex? Regular expression: (bb|[^b]{2})dd Input strings: bb445, 2ac3357bb, bb3aa2c7, a2ab64b, abb83fh6l3hi22ui
  • 8. Regular Expressions What is a regex? Regular expression: (bb|[^b]{2})dd Input strings: bb445, 2ac3357bb, bb3aa2c7, a2ab64b, abb83fh6l3hi22ui
  • 9. Summary 1. What is a regexp? 2. When to use regexp? 3. Regex basics 4. Performance Tests 5. Writing regexp (Performance Strategies) 6. Writing plugins (Performance Strategies) 7. Tools
  • 10. Regular Expressions To RE or not to RE ● Regular expressions are almost never the right answer ○ Difficult to debug and maintain ○ Performance reasons, slower for simple matching ○ Learning curve
  • 11. Regular Expressions To RE or not to RE ● Regular expressions are almost never the right answer ○ Difficult to debug and maintain ○ Performance reasons, slower for simple matching ○ Learning curve ● Python string functions are small C loops: super fast! ○ beginswith(), endswith(), split(), etc.
  • 12. Regular Expressions To RE or not to RE ● Regular expressions are almost never the right answer ○ Difficult to debug and maintain ○ Performance reasons, slower for simple matching ○ Learning curve ● Python string functions are small C loops: super fast! ○ beginswith(), endswith(), split(), etc. ● Use standard parsing libraries! Formats: JSON, HTML, XML, CSV, etc.
  • 13. Regular Expressions To RE or not to RE Example: URL parsing ● regex: ^(https?://)?([da-z.-]+).([a-z.]{2,6})([/w .-]*)*/?$ ● parse_url() php method: $url = "http://username:password@hostname/path?arg=value#anchor"; print_r(parse_url($url)); ( [scheme] => http [host] => hostname [user] => username [pass] => password [path] => /path [query] => arg=value [fragment] => anchor )
  • 14. Regular Expressions To RE or not to RE But, there are a lot of reasons to use regex: ● powerful ● portable ● fast (with performance in mind) ● useful for complex patterns ● save development time ● short code ● fun :-) ● beautiful?
  • 15. Summary 1. What is a regexp? 2. When to use regexp? 3. Regex basics 4. Performance Tests 5. Writing regexp (Performance Strategies) 6. Writing plugins (Performance Strategies) 7. Tools
  • 16. Regular Expressions Basics - Characters ● d, D: digits. w, W: words. s, S: spaces >>> re.findall('dddd-(dd)-dd', '2013-07-21') >>> re.findall('(S+)s+(S+)', 'foo bar') ● ^, $: Begin/End of string >>> re.findall('(d+)', 'cba3456csw') >>> re.findall('^(d+)$', 'cba3456csw') ● . (dot): Any character: >>> re.findall('foo(.)bar', 'foo=bar') >>> re.findall('(...)=(...)', 'foo=bar')
  • 17. Regular Expressions Basics - Repetitions ● *, +: 0-1 or more repetitions >>> re.findall('FO+', 'FOOOOOOOOO') >>> re.findall('BA*R', 'BR') ● ?: 0 or 1 repetitions >>> re.findall('colou?r', 'color') >>> re.findall('colou?r', 'colour') ● {n}, {n,m}: N repetitions: >>> re.findall('d{2}', '2013-07-21') >>> re.findall('d{1,3}.d{1,3}.d{1,3}.d{1,3}','192.168.1.25')
  • 18. Regular Expressions Basics - Groups [...]: Set of characters >>> re.findall('[a-z]+=[a-z]+', 'foo=bar') ...|...: Alternation >>> re.findall('(foo|bar)=(foo|bar)', 'foo=bar') (...) and 1, 2, ...: Group >>> re.findall(r'(w+)=(1)', 'foo=bar') >>> re.findall(r'(w+)=(1)', 'foo=foo') (?P<name>...): Named group >>> re.findall('d{4}-d{2}-(?P<day>d{2}'), '2013-07-23')
  • 19. Regular Expressions Greedy & Lazy quantifiers: *?, +? ● Greedy vs non-greedy (lazy) >>> re.findall('A+', 'AAAA') ['AAAA'] >>> re.findall('A+?', 'AAAA') ['A', 'A', 'A', 'A']
  • 20. Regular Expressions Greedy & Lazy quantifiers: *?, +? ● Greedy vs non-greedy (lazy) >>> re.findall('A+', 'AAAA') ['AAAA'] >>> re.findall('A+?', 'AAAA') ['A', 'A', 'A', 'A'] ● An overall match takes precedence over and overall non-match >>> re.findall('<.*>.*</.*>', '<B>i am bold</B>') >>> re.findall('<(.*)>.*</(.*)>', '<B>i am bold</B>')
  • 21. Regular Expressions Greedy & Lazy quantifiers: *?, +? ● Greedy vs non-greedy (lazy) >>> re.findall('A+', 'AAAA') ['AAAA'] >>> re.findall('A+?', 'AAAA') ['A', 'A', 'A', 'A'] ● An overall match takes precedence over and overall non-match >>> re.findall('<.*>.*</.*>', '<B>i am bold</B>') >>> re.findall('<(.*)>.*</(.*)>', '<B>i am bold</B>') ● Minimal matching, non-greedy >>> re.findall('<(.*)>.*', '<B>i am bold</B>') >>> re.findall('<(.*?)>.*', '<B>i am bold</B>')
  • 22. Summary 1. What is a regexp? 2. When to use regexp? 3. Regex basics 4. Performance Tests 5. Writing regexp (Performance Strategies) 6. Writing plugins (Performance Strategies) 7. Tools
  • 23. Regular Expressions Performance Tests Different implementations of a custom is_a_word() function: ● #1 Regexp ● #2 Char iteration ● #3 String functions
  • 24. Regular Expressions Performance Test #1 def is_a_word(word): CHARS = string.uppercase + string.lowercase regexp = r'^[%s]+$' % CHARS if re.search(regexp, word) return "YES" else "NOP"
  • 25. Regular Expressions Performance Test #1 def is_a_word(word): CHARS = string.uppercase + string.lowercase regexp = r'^[%s]+$' % CHARS if re.search(regexp, word) return "YES" else "NOP" timeit.timeit(s, 'is_a_word(%s)' %(w)) 1.49650502205 YES len=4 word 1.65614509583 YES len=25 wordlongerthanpreviousone.. 1.92520785332 YES len=60 wordlongerthanpreviosoneplusan.. 2.38850092888 YES len=120 wordlongerthanpreviosoneplusan.. 1.55924701691 NOP len=10 not a word 1.7087020874 NOP len=25 not a word, just a phrase.. 1.92521882057 NOP len=50 not a word, just a phrase bigg.. 2.39075493813 NOP len=102 not a word, just a phrase bigg..
  • 26. Regular Expressions Performance Test #1 def is_a_word(word): CHARS = string.uppercase + string.lowercase regexp = r'^[%s]+$' % CHARS if re.search(regexp, word) return "YES" else "NOP" timeit.timeit(s, 'is_a_word(%s)' %(w)) 1.49650502205 YES len=4 word 1.65614509583 YES len=25 wordlongerthanpreviousone.. 1.92520785332 YES len=60 wordlongerthanpreviosoneplusan.. 2.38850092888 YES len=120 wordlongerthanpreviosoneplusan.. 1.55924701691 NOP len=10 not a word 1.7087020874 NOP len=25 not a word, just a phrase.. 1.92521882057 NOP len=50 not a word, just a phrase bigg.. 2.39075493813 NOP len=102 not a word, just a phrase bigg.. If the target string is longer, the regex matching is slower. No matter if success or fail.
  • 27. Regular Expressions Performance Test #2 def is_a_word(word): for char in word: if not char in (CHARS): return "NOP" return "YES"
  • 28. Regular Expressions Performance Test #2 def is_a_word(word): for char in word: if not char in (CHARS): return "NOP" return "YES" timeit.timeit(s, 'is_a_word(%s)' %(w)) 0.687522172928 YES len=4 word 1.0725839138 YES len=25 wordlongerthanpreviousone.. 2.34717106819 YES len=60 wordlongerthanpreviosoneplusan.. 4.31543898582 YES len=120 wordlongerthanpreviosoneplusan.. 0.54797577858 NOP len=10 not a word 0.547253847122 NOP len=25 not a word, just a phrase.. 0.546499967575 NOP len=50 not a word, just a phrase bigg.. 0.553755998611 NOP len=102 not a word, just a phrase bigg..
  • 29. Regular Expressions Performance Test #2 def is_a_word(word): for char in word: if not char in (CHARS): return "NOP" return "YES" timeit.timeit(s, 'is_a_word(%s)' %(w)) 0.687522172928 YES len=4 word 1.0725839138 YES len=25 wordlongerthanpreviousone.. 2.34717106819 YES len=60 wordlongerthanpreviosoneplusan.. 4.31543898582 YES len=120 wordlongerthanpreviosoneplusan.. 0.54797577858 NOP len=10 not a word 0.547253847122 NOP len=25 not a word, just a phrase.. 0.546499967575 NOP len=50 not a word, just a phrase bigg.. 0.553755998611 NOP len=102 not a word, just a phrase bigg.. 2 python nested loops if success (slow) But fails at the same point&time (first space)
  • 32. Regular Expressions Performance Test #3
      def is_a_word(word):
          return "YES" if word.isalpha() else "NOP"

      # s is assumed to be the setup string that defines is_a_word; w is the input
      timeit.timeit('is_a_word(%r)' % w, setup=s)

      seconds          result  length   input
      0.146447896957   YES     len=4    word
      0.212563037872   YES     len=25   wordlongerthanpreviousone..
      0.318686008453   YES     len=60   wordlongerthanpreviosoneplusan..
      0.493942975998   YES     len=120  wordlongerthanpreviosoneplusan..
      0.14647102356    NOP     len=10   not a word
      0.146160840988   NOP     len=25   not a word, just a phrase..
      0.147103071213   NOP     len=50   not a word, just a phrase bigg..
      0.146239995956   NOP     len=102  not a word, just a phrase bigg..

      Python string functions are fast: they are small C loops.
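The three tests can be run side by side. What follows is a minimal Python 3 sketch, not the original benchmark script: the function names, the sample inputs and the repeat count are assumptions, the regex is precompiled once, and `string.ascii_letters` stands in for the Python 2 `string.uppercase + string.lowercase`. Note that on Python 3 `str.isalpha()` also accepts non-ASCII letters, so the three functions only agree on ASCII input. Absolute timings will differ from the slides.

```python
import re
import string
import timeit

CHARS = string.ascii_letters  # Python 3 stand-in for uppercase + lowercase
WORD_RE = re.compile(r'^[%s]+$' % CHARS)


def is_a_word_regex(word):
    """Test #1: regular expression."""
    return "YES" if WORD_RE.search(word) else "NOP"


def is_a_word_loop(word):
    """Test #2: explicit Python loop over the characters."""
    for char in word:
        if char not in CHARS:
            return "NOP"
    return "YES"


def is_a_word_isalpha(word):
    """Test #3: built-in string method."""
    return "YES" if word.isalpha() else "NOP"


def benchmark(samples, number=100000):
    """Return {(function name, sample): seconds} for every combination."""
    results = {}
    for func in (is_a_word_regex, is_a_word_loop, is_a_word_isalpha):
        for sample in samples:
            timer = timeit.Timer(lambda: func(sample))
            results[(func.__name__, sample)] = timer.timeit(number=number)
    return results


if __name__ == "__main__":
    # Hypothetical inputs: short and long, matching and non-matching.
    samples = ["word", "word" * 30, "not a word", "not a word, " * 8]
    for (name, sample), seconds in benchmark(samples).items():
        print("%-18s len=%-4d %.4f s" % (name, len(sample), seconds))
```

Running the script prints one timing per function and input, which makes it easy to check whether the ordering reported on the slides holds on a given machine.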
  • 38. Regular Expressions Performance Strategies Writing regex
      ● Be careful with repetitions (+, *, {n,m})
          (abc|def){2,4} produces (abc|def)(abc|def)((abc|def)(abc|def)?)?
          (abc|def){2,1000} produces ...
      ● Be careful with wildcards
          re.findall(r'(ab).*(cd).*(ef)', 'ab cd ef')    # slower
          re.findall(r'(ab)\s(cd)\s(ef)', 'ab cd ef')  # faster
      ● Longer target string -> slower regex matching
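The two points above can be checked with a short script. This is a sketch under stated assumptions: the constant and function names are made up for illustration, the groups are written as non-capturing to keep the patterns short, and the code only verifies that the counted repetition and the expansion shown on the slide accept the same inputs. It does not claim that the `re` engine performs that expansion literally.

```python
import re
import timeit

TARGET = 'ab cd ef'

# .* first consumes the rest of the line, then backtracks to find 'cd' and 'ef'.
GREEDY = re.compile(r'(ab).*(cd).*(ef)')
# \s consumes exactly one whitespace character, so there is nothing to undo.
SPECIFIC = re.compile(r'(ab)\s(cd)\s(ef)')

# Counted repetition and the expanded form described on the slide.
COUNTED = re.compile(r'(?:abc|def){2,4}')
EXPANDED = re.compile(r'(?:abc|def)(?:abc|def)(?:(?:abc|def)(?:abc|def)?)?')


def same_verdict(text):
    """True if both repetition forms agree on a full match of text."""
    return bool(COUNTED.fullmatch(text)) == bool(EXPANDED.fullmatch(text))


def compare(number=100000):
    """Time both wildcard variants; returns (greedy_seconds, specific_seconds)."""
    greedy = timeit.timeit(lambda: GREEDY.findall(TARGET), number=number)
    specific = timeit.timeit(lambda: SPECIFIC.findall(TARGET), number=number)
    return greedy, specific


if __name__ == "__main__":
    print(GREEDY.findall(TARGET), SPECIFIC.findall(TARGET))
    print(all(same_verdict(t) for t in ('abc', 'abcdef', 'abc' * 4, 'abc' * 5)))
    print("greedy=%.4f s, specific=%.4f s" % compare())
```

Both patterns return the same groups for this input; the difference is only in how much work the engine does to get there, and it grows with the length of the text that `.*` has to give back.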
  • 42. Regular Expressions Performance Strategies Writing regex
      ● Use a non-capturing group when you do not need to capture the text and save it to a variable
          (?:abc|def|ghi) instead of (abc|def|ghi)
      ● Put the pattern most likely to match first, and factor out common prefixes
          (TRAFFIC_ALLOW|TRAFFIC_DROP|TRAFFIC_DENY)
          TRAFFIC_(ALLOW|DROP|DENY)
      ● Use anchors (^ and $) to limit the scope
          re.findall(r'(ab){2}', 'abcabcabc')
          re.findall(r'^(ab){2}', 'abcabcabc')  # failures occur faster
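A small sketch of these three strategies in Python follows. The log line and the variable names are hypothetical and chosen only for illustration.

```python
import re

LINE = 'fw01 TRAFFIC_DROP src=10.0.0.1'  # hypothetical log line

# Capturing group: the matched action is stored and can be read back.
capturing = re.search(r'TRAFFIC_(ALLOW|DROP|DENY)', LINE)

# Non-capturing group: same overall match, nothing stored for the group.
non_capturing = re.search(r'TRAFFIC_(?:ALLOW|DROP|DENY)', LINE)

# Neither pattern can match, because 'abab' never occurs in the target.
# With the ^ anchor, every attempt past the start of the string fails
# at its first step instead of retrying the group.
unanchored = re.findall(r'(ab){2}', 'abcabcabc')
anchored = re.findall(r'^(ab){2}', 'abcabcabc')

if __name__ == "__main__":
    print(capturing.group(1), non_capturing.groups())
    print(unanchored, anchored)
```

The factored form `TRAFFIC_(ALLOW|DROP|DENY)` also means the shared prefix is compared once per attempt rather than once per alternative.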
  • 45. Regular Expressions Performance Strategies Writing Agent plugins
      ● A new process is forked for each loaded plugin
          ○ Load only the plugins that you really need!
      ● A plugin is a set of rules (regexp operations) for matching log lines
          ○ If a plugin doesn't match a log entry, the entry has been tried against ALL of its rules!
          ○ Reduce the number of rules; use a [translation] table
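The idea behind a translation table can be sketched in plain Python: one rule captures the varying keyword, and a lookup table maps it to an event sub-ID, instead of one rule per keyword. This is an illustration of the principle only, not the agent's plugin syntax; the log format, the table contents, the field names and the `parse` helper are all assumptions.

```python
import re

# Hypothetical translation table: log keyword -> event sub-ID.
TRANSLATION = {
    'ALLOW': 1,
    'DROP': 2,
    'DENY': 3,
}

# A single rule captures the keyword, replacing three near-identical rules.
RULE = re.compile(r'TRAFFIC_(?P<action>ALLOW|DROP|DENY)\s+src=(?P<src_ip>\S+)')


def parse(line):
    """Return an event dict for a matching line, or None if the rule fails."""
    match = RULE.search(line)
    if match is None:
        return None
    event = match.groupdict()
    event['plugin_sid'] = TRANSLATION[event['action']]
    return event


if __name__ == "__main__":
    print(parse('fw01 TRAFFIC_DROP src=10.0.0.1'))
    print(parse('unrelated log line'))
```

With this layout a non-matching line costs one regex evaluation instead of three.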
  • 47. Regular Expressions Performance Strategies Writing Agent plugins
      ● Rules are matched in alphabetical order
          ○ Order your rules by priority, with the pattern most likely to match first
      ● Divide and conquer
          ○ A plugin is configured to read from one source file; use a dedicated source file per technology
          ○ Also, use a dedicated plugin for each technology
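The effect of alphabetical rule order can be simulated with a few lines of Python. This is a sketch, not the agent's matching code: the rule names, the patterns and the `first_match` helper are hypothetical, and the numeric prefix is simply one way to make the alphabetical order follow the intended priority.

```python
import re

# Hypothetical rules keyed by name; the numeric prefix sets the priority.
RULES = {
    '0003 - traffic deny': re.compile(r'TRAFFIC_DENY'),
    '0001 - traffic allow': re.compile(r'TRAFFIC_ALLOW'),  # assumed most frequent
    '0002 - traffic drop': re.compile(r'TRAFFIC_DROP'),
}


def first_match(line):
    """Try rules in alphabetical name order; return (rule name, attempts)."""
    for attempts, name in enumerate(sorted(RULES), start=1):
        if RULES[name].search(line):
            return name, attempts
    return None, len(RULES)


if __name__ == "__main__":
    for line in ('TRAFFIC_ALLOW src=10.0.0.1', 'TRAFFIC_DENY src=10.0.0.2', 'noise'):
        print(line, '->', first_match(line))
```

The most frequent event is matched on the first attempt, while a line that matches nothing pays for every rule, which is why unrelated logs should be kept out of the plugin's source file.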
  • 48. Regular Expressions Performance Strategies
      [Diagram] Tool1, Tool2, Tool3, Tool4, Tool5 (20 logs/sec each) all write to /var/log/syslog (100 logs/sec)
      5 plugins with 1 rule each, all reading /var/log/syslog
      5 x 100 = 500 total regex/sec
  • 49. Regular Expressions Performance Strategies
      [Diagram] Tool1, Tool2, Tool3, Tool4, Tool5 (20 logs/sec each, 100 logs/sec in total) write to /var/log/tool1, /var/log/tool2, /var/log/tool3, /var/log/tool4, /var/log/tool5
      5 plugins with 1 rule each, reading /var/log/tool{1-5}
      5 x 20 = 100 total regex/sec (5x faster)
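The arithmetic behind the two layouts can be written out as a short calculation. The function name and parameters below are hypothetical; the figures are the ones from the slides.

```python
def regex_evaluations_per_sec(plugins, rules_per_plugin, lines_seen_per_plugin):
    """Every plugin runs each of its rules on every line it reads."""
    return plugins * rules_per_plugin * lines_seen_per_plugin


# Shared file: each of the 5 plugins reads all 100 lines/sec from syslog.
shared = regex_evaluations_per_sec(5, 1, 100)

# Dedicated files: each plugin reads only its own tool's 20 lines/sec.
dedicated = regex_evaluations_per_sec(5, 1, 20)

if __name__ == "__main__":
    print(shared, dedicated, shared // dedicated)
```

The ratio equals the number of tools sharing the file, so the saving grows as more technologies are split into their own log files.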
  • 51. Regular Expressions Tools for testing Regex
      Python:
      >>> import re
      >>> re.findall(r'(\S+) (\S+)', 'foo bar')
      [('foo', 'bar')]
      >>> result = re.search(
      ...     r'(?P<key>\w+)\s*=\s*(?P<value>\w+)',
      ...     'foo=bar'
      ... )
      >>> result.groupdict()
      {'key': 'foo', 'value': 'bar'}
  • 52. Regular Expressions Tools for testing Regex
      Regex debuggers:
      ● Kiki
      ● Kodos
      Online regex testers:
      ● http://gskinner.com/RegExr/ (java)
      ● http://regexpal.com/ (javascript)
      ● http://rubular.com/ (ruby)
      ● http://www.pythonregex.com/ (python)
      Online regex visualization:
      ● http://www.regexper.com/ (javascript)
  • 54. A3Sec web: www.a3sec.com email: training@a3sec.com twitter: @a3sec Spain Head Office C/ Aravaca, 6, Piso 2 28040 Madrid Tlf. +34 533 09 78 México Head Office Avda. Paseo de la Reforma, 389 Piso 10 México DF Tlf. +52 55 5980 3547