Virtual Machine
for Regular Expressions
Alexander Yakushev
@unlog1c
JEEConf 2018
Theory
Stephen Cole Kleene - Inventor of regular expressions
^ This guy
What's a regular expression?
A text-matching automaton.
/^a+b?(cd*|e)$/
"aaaaabcddd" ✓
"aabbdd" ❌
Regular expression as FSM
^a+b?(cd*|e)$
Why implement regular
expressions yourself?
All modern programming languages provide
regular expressions in their core libraries.
But those are only character-level regexps.
At Grammarly, we need to define token-level rules
that are describable with regex semantics.
<person> = (<honorific>? <first-name-dict>+ <last-name-dict>) |
(<determiner>? <profession-dict>)
Ways of implementing regex engines
1. Backtracking
○ Runtime: exponential
Backtracking implementation (by Rob Pike)
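/* Forward declaration: match() and matchstar() call each other. */
int matchstar(char c, char *regexp, char *text);

/* match: does regexp match at the beginning of text? */
int match(char *regexp, char *text) {
    if (regexp[0] == '\0')
        return 1;
    if (regexp[1] == '*')
        return matchstar(regexp[0], regexp+2, text);
    if (*text != '\0' && (regexp[0] == '.' || regexp[0] == *text))
        return match(regexp+1, text+1);
    return 0;
}

/* matchstar: match c* followed by regexp at the beginning of text */
int matchstar(char c, char *regexp, char *text) {
    do {  /* a * matches zero or more instances of c */
        if (match(regexp, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}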
Backtracking implementation
● Good: simple and short.
○ Fits in 15 lines!
● Bad: exponential complexity in the worst case.
How to kill a snake
$ python
>>> import re
>>> s = "a" * 50
>>> re.match("(a|aa)*b", s)
# crickets… chirp chirp
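To see where the time goes without waiting for it, here is a minimal sketch that counts the attempts a naive backtracker makes for (a|aa)*b on a string of n a's. It is a hand-written stand-in for this one pattern, not Python's actual re engine, and the name backtracking_steps is made up for this illustration.

```python
def backtracking_steps(n: int) -> int:
    """Count loop entries a naive backtracker makes for (a|aa)*b on "a" * n."""
    text = "a" * n
    steps = 0

    def loop(i: int) -> bool:
        nonlocal steps
        steps += 1
        # Alternative 1: consume "a", then try the loop again.
        if text.startswith("a", i) and loop(i + 1):
            return True
        # Alternative 2: consume "aa", then try the loop again.
        if text.startswith("aa", i) and loop(i + 2):
            return True
        # Leave the loop and require the final "b" (never present here).
        return text.startswith("b", i)

    loop(0)
    return steps


for n in (5, 10, 20, 25):
    print(n, backtracking_steps(n))
```

Each position tries both alternatives before giving up, so the count follows the recurrence f(k) = 1 + f(k-1) + f(k-2) and grows like the Fibonacci numbers, which is exponential in n.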
Ways of implementing regex engines
1. Backtracking
○ Runtime: exponential
2. Full FSM unroll (static NFA->DFA)
○ Compilation: exponential time and memory
○ Runtime: linear O(n)
3. Dynamic FSM unroll (“lazy” DFA construction)
○ Runtime: O(nm), linear in the input length n for a regex of size m
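A minimal sketch of the third approach for the same pattern, (a|aa)*b, assuming a hand-built NFA with three made-up state names. Each input character moves a whole set of NFA states at once, and every (state set, character) transition is computed only the first time it is needed and then cached, which is the "lazy" part. The names NFA, dfa_cache, step and lazy_dfa_match are illustrative only.

```python
# Hand-built NFA for (a|aa)*b; the state names are arbitrary labels.
NFA = {
    ("loop", "a"): {"loop", "mid"},  # "a" on its own, or the first half of "aa"
    ("mid", "a"): {"loop"},          # the second half of "aa"
    ("loop", "b"): {"done"},         # the final "b"
}

# Lazily built DFA: (frozenset of NFA states, char) -> frozenset of NFA states.
dfa_cache = {}


def step(states: frozenset, ch: str) -> frozenset:
    key = (states, ch)
    if key not in dfa_cache:
        nxt = set()
        for s in states:
            nxt |= NFA.get((s, ch), set())
        dfa_cache[key] = frozenset(nxt)
    return dfa_cache[key]


def lazy_dfa_match(text: str) -> bool:
    """True if some prefix of text matches (a|aa)*b (like re.match)."""
    states = frozenset({"loop"})
    for ch in text:
        states = step(states, ch)
        if "done" in states:
            return True
        if not states:  # every alternative has failed
            return False
    return False


print(lazy_dfa_match("a" * 50))  # answers immediately, no backtracking
print(lazy_dfa_match("aaab"))
```

The state set can never hold more than m states, so each character costs at most O(m) work, which gives the O(nm) bound.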
Dynamic FSM unroll
1. Google RE2*
2. Rust’s regex library†
3. Virtual machine approach‡
* github.com/google/re2
† github.com/rust-lang/regex
‡ swtch.com/~rsc/regexp/regexp2.html
Virtual machines
Machine code (actually, assembly)
    MOV CX, 10
    MOV AX, [BP]
L:  CMP CX, 0
    JZ E
    SHL AX, 1
    DEC CX
    JMP L
E:  RET
Registers
    AX 123
    BX 234
    CX 9
    ...
    IP 0
Crude example of an x86 machine
Thread registers (all threads run the same machine code)
Reg  Thread 1  Thread 2  Thread 3
AX   123       321       879
BX   234       432       567
CX   9         0         4
...
IP   0         7         3
Crude example of an x86 machine
Thread registers (same machine code)
Reg  Thread 1  Thread 2
AX   123       321
BX   234       432
CX   9         0
...
IP   0         7
Crude example of a virtual machine
Examples of virtual machines
1. VirtualBox/VMware/KVM
2. Java Virtual Machine
3. Domain-specific VMs
TrexVM (Token RegEX Virtual Machine)
● Consumes input sequence token by token.
○ Never goes back to previous tokens.
● IP (instruction pointer) tracks the currently executed instruction.
● Instructions:
○ CMP x
Compare current token to x, increment IP if equal, fail if not.
○ JUMP label
Unconditionally set IP to the point designated by label.
○ FORK label
Increment IP and spawn additional thread that jumps to label.
TrexVM (Token RegEX Virtual Machine)
● All threads are executed in lock-step.
○ Execute all CMP instructions simultaneously.
○ If a thread points at a non-CMP instruction, unroll it statically
(follow its JUMPs and FORKs) until it reaches a CMP.
● Execution continues until one thread reaches the end of the
program (successful match) or all threads are dead (failed
match).
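The rules above can be sketched in a few lines of Python. This is an illustration, not the TrexVM source. It assumes a program is a list of (opcode, argument) tuples, that labels are already resolved to instruction indexes, and that CMP compares tokens with ==. The name run_trex_basic is made up.

```python
def run_trex_basic(program, tokens):
    """Return True if some thread reaches the end of the program."""
    end = len(program)

    def expand(ip, out, seen):
        # Follow JUMP/FORK until the thread rests on a CMP or on the end.
        if ip in seen:          # at most one thread per instruction
            return
        seen.add(ip)
        if ip == end:
            out.append(ip)
            return
        op, arg = program[ip]
        if op == "JUMP":
            expand(arg, out, seen)
        elif op == "FORK":
            expand(ip + 1, out, seen)  # this thread increments its IP
            expand(arg, out, seen)     # the spawned thread jumps to the label
        else:                          # CMP: wait here for the next token
            out.append(ip)

    threads = []
    expand(0, threads, set())
    for token in tokens:
        if end in threads:
            return True
        nxt, seen = [], set()
        for ip in threads:             # lock-step: every thread sees this token
            if program[ip][1] == token:
                expand(ip + 1, nxt, seen)
        threads = nxt
        if not threads:                # all threads are dead
            return False
    return end in threads


# a+b, with label L1 resolved to index 0
A_PLUS_B = [("CMP", "a"), ("FORK", 0), ("CMP", "b")]
print(run_trex_basic(A_PLUS_B, "aaab"))
```

The seen set keeps at most one thread per instruction, so no step ever handles more threads than the program has instructions, and the input is never re-read.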
TrexVM sample run
Program (Regex: a+b)
L1: CMP a
    FORK L1
    CMP b
Input string: aaab
1. Token a: thread 1 passes CMP a and reaches FORK L1. Thread 1 moves on to CMP b; new thread 2 jumps to L1.
2. Token a: thread 1 fails CMP b and dies. Thread 2 passes CMP a and forks: thread 2 moves on to CMP b; new thread 3 jumps to L1.
3. Token a: thread 2 fails CMP b and dies. Thread 3 passes CMP a and forks: thread 3 moves on to CMP b; new thread 4 jumps to L1.
4. Token b: thread 4 fails CMP a and dies. Thread 3 passes CMP b and reaches the end of the program.
Success!
TrexVM
Guess the regular expression:
    FORK L1
    CMP a
L1: FORK L2
    CMP b
    JUMP L1
L2: CMP c
Regex: a?b*c
TrexVM: next iteration
Added support for match groups.
Each thread now has a register bank (beyond just IP register).
New instructions:
● SAVEL group — save current position in input as beginning of group.
● SAVER group — save current position in input as ending of group.
● FORKSTAY label — FORK which gives the staying thread a higher
priority.
● FORKJUMP label — FORK which gives the jumping thread a higher priority.
Matching stops when the thread with the highest priority succeeds.
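A minimal Python sketch of these additions, under stated assumptions: the thread list is kept in priority order, every thread carries its own copy of the group registers, a CMP argument is a set of accepted tokens, and labels are already resolved to instruction indexes. The function name run_trex and the tuple encoding are made up here; this is not the Grammarly implementation.

```python
def run_trex(program, tokens):
    """Return the winning thread's group registers, or None if nothing matches."""
    end = len(program)

    def expand(ip, regs, pos, out, seen):
        # Run non-consuming instructions until the thread rests on a CMP
        # or on the end of the program. Higher priority is expanded first.
        if ip in seen:
            return
        seen.add(ip)
        if ip == end or program[ip][0] == "CMP":
            out.append((ip, regs))
            return
        op, arg = program[ip]
        if op == "JUMP":
            expand(arg, regs, pos, out, seen)
        elif op == "SAVEL":
            expand(ip + 1, {**regs, arg: (pos, None)}, pos, out, seen)
        elif op == "SAVER":
            expand(ip + 1, {**regs, arg: (regs[arg][0], pos)}, pos, out, seen)
        elif op == "FORKSTAY":
            expand(ip + 1, regs, pos, out, seen)
            expand(arg, regs, pos, out, seen)
        elif op == "FORKJUMP":
            expand(arg, regs, pos, out, seen)
            expand(ip + 1, regs, pos, out, seen)
        else:
            raise ValueError(f"unknown instruction {op!r}")

    threads, best = [], None
    expand(0, {}, 0, threads, set())
    for pos in range(len(tokens) + 1):
        waiting = []
        for ip, regs in threads:
            if ip == end:      # a finished thread beats everything below it
                best = regs
                break
            waiting.append((ip, regs))
        if not waiting or pos == len(tokens):
            break
        threads, seen = [], set()
        for ip, regs in waiting:
            if tokens[pos] in program[ip][1]:
                expand(ip + 1, regs, pos + 1, threads, seen)
    return best


# ([ab]+)(b), with label L1 resolved to index 1
GREEDY = [("SAVEL", "fst"), ("CMP", {"a", "b"}), ("FORKJUMP", 1),
          ("SAVER", "fst"), ("SAVEL", "snd"), ("CMP", {"b"}), ("SAVER", "snd")]
print(run_trex(GREEDY, "aab"))
```

A finished thread only cuts off threads with lower priority. Threads above it keep running and overwrite the result if they finish later, so the highest-priority match wins.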
TrexVM1.1 sample run
Program (Regex: ([ab]+)(b))
    SAVEL fst
L1: CMP [ab]
    FORKJUMP L1
    SAVER fst
    SAVEL snd
    CMP b
    SAVER snd
Input string: aab
Thread registers hold input positions (L = left edge, R = right edge of a group). Threads are numbered in priority order, T1 being the highest.
1. Start: T1 runs SAVEL fst (fst L=0) and stops at CMP [ab].
2. Token a: T1 passes CMP [ab]; FORKJUMP L1 splits it. T1 jumps back to L1 (fst L=0). T2 stays, runs SAVER fst and SAVEL snd (fst L=0 R=1, snd L=1), and stops at CMP b.
3. Token a: T2 fails CMP b and dies. T1 passes CMP [ab] and forks again: T1 jumps back to L1 (fst L=0); the new T2 stays (fst L=0 R=2, snd L=2) and stops at CMP b.
4. Token b: both threads pass their CMP. The thread that was at CMP b runs SAVER snd (fst L=0 R=2, snd L=2 R=3) and reaches the end of the program; it is now T3. T1 forks once more: T1 jumps back to L1 (fst L=0); the new T2 stays (fst L=0 R=3, snd L=3) and stops at CMP b.
5. End of input: T1 and T2 are still waiting for a token and die. T3 remains.
Success! Match groups: fst: aa snd: b
Another run
TrexVM1.1 second run
Program (Regex: ([ab]+?)(b?))
    SAVEL fst
L1: CMP [ab]
    FORKSTAY L1
    SAVER fst
    SAVEL snd
    FORKSTAY L2
    CMP b
L2: SAVER snd
Input string: ab
1. Start: T1 runs SAVEL fst (fst L=0) and stops at CMP [ab].
2. Token a: T1 passes CMP [ab]; FORKSTAY L1 splits it. T1 stays (fst L=0); T2 jumps back to L1 (fst L=0).
3. Still before token b: T1 runs SAVER fst and SAVEL snd (fst L=0 R=1, snd L=1). FORKSTAY L2 splits it again: T1 stays and stops at CMP b; a new T2 jumps to L2; the thread waiting at L1 becomes T3.
4. T2 runs SAVER snd (fst L=0 R=1, snd L=1 R=1) and reaches the end of the program with an empty snd group. T3 has a lower priority than the finished T2 and is dropped.
5. Token b: T1 passes CMP b, runs SAVER snd (fst L=0 R=1, snd L=1 R=2), and reaches the end of the program. T1 has the highest priority, so its result wins.
Success! Match groups: fst: a snd: b
Implementation
Concatenation
(cat regex1 regex2 … regexN)
    #INCLUDE regex1
    #INCLUDE regex2
    ...
    #INCLUDE regexN
Quantifiers
(+ regex)
    #LABEL _LOOP
    #INCLUDE regex
    FORKJUMP _LOOP
(+? regex)
    #LABEL _LOOP
    #INCLUDE regex
    FORKSTAY _LOOP
(* regex)
    #LABEL _START
    FORKSTAY _END
    #INCLUDE regex
    JUMP _START
    #LABEL _END
(*? regex)
    #LABEL _START
    FORKJUMP _END
    #INCLUDE regex
    JUMP _START
    #LABEL _END
(? regex)
    FORKSTAY _SKIP
    #INCLUDE regex
    #LABEL _SKIP
(?? regex)
    FORKJUMP _SKIP
    #INCLUDE regex
    #LABEL _SKIP
Alternation
(| regex1 regex2)
    FORKSTAY _ALT
    #INCLUDE regex1
    JUMP _END
    #LABEL _ALT
    #INCLUDE regex2
    #LABEL _END
Match groups
(as group regex)
    SAVEL name
    #INCLUDE regex
    SAVER name
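The templates above translate almost line for line into a recursive compiler. The sketch below is in Python rather than the Clojure of the real implementation. It assumes expressions are nested tuples such as ("cat", x, y), treats any non-tuple value as a CMP argument, and resolves labels straight to absolute instruction indexes. compile_rx and the base parameter are names invented for this illustration.

```python
def compile_rx(expr, base=0):
    """Compile expr to a list of (opcode, arg) tuples.

    Jump targets are absolute indexes, assuming the first emitted
    instruction ends up at index `base` in the final program.
    """
    if not isinstance(expr, tuple):
        return [("CMP", expr)]
    op, *args = expr
    if op == "cat":
        code = []
        for sub in args:
            code += compile_rx(sub, base + len(code))
        return code
    if op in ("+", "+?"):
        fork = "FORKJUMP" if op == "+" else "FORKSTAY"
        return compile_rx(args[0], base) + [(fork, base)]         # _LOOP = base
    if op in ("*", "*?"):
        fork = "FORKSTAY" if op == "*" else "FORKJUMP"
        body = compile_rx(args[0], base + 1)
        end = base + len(body) + 2                                # _END
        return [(fork, end)] + body + [("JUMP", base)]            # _START = base
    if op in ("?", "??"):
        fork = "FORKSTAY" if op == "?" else "FORKJUMP"
        body = compile_rx(args[0], base + 1)
        return [(fork, base + len(body) + 1)] + body              # _SKIP
    if op == "|":
        left = compile_rx(args[0], base + 1)
        alt = base + len(left) + 2                                # _ALT
        right = compile_rx(args[1], alt)
        return ([("FORKSTAY", alt)] + left
                + [("JUMP", alt + len(right))] + right)           # _END
    if op == "as":
        name, sub = args
        return ([("SAVEL", name)] + compile_rx(sub, base + 1)
                + [("SAVER", name)])
    raise ValueError(f"unknown operator {op!r}")


# ([ab]+)(b); "[ab]" is just an opaque CMP argument here
for instruction in compile_rx(("cat", ("as", "fst", ("+", "[ab]")),
                                      ("as", "snd", "b"))):
    print(instruction)
```

Passing base down the recursion avoids a separate label-resolution pass, at the cost of compiling each subexpression for one fixed position.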
Possible CMP arguments
● String (check if token is equal)
● Char-level regex (check if token matches)
● Hashset/map (check if it contains the token)
● Arbitrary predicate (check if token satisfies)
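One way to picture the four kinds of CMP argument is a small dispatch function. This is a Python sketch, not the Clojure implementation. It assumes tokens are plain strings and that "matches" means the whole token matches the character-level regex. The name cmp_matches is made up.

```python
import re


def cmp_matches(arg, token: str) -> bool:
    """Decide whether token satisfies the CMP argument arg."""
    if isinstance(arg, str):                     # string: equality
        return token == arg
    if isinstance(arg, re.Pattern):              # char-level regex: whole token
        return arg.fullmatch(token) is not None
    if isinstance(arg, (set, frozenset, dict)):  # hashset/map: membership
        return token in arg
    if callable(arg):                            # arbitrary predicate
        return bool(arg(token))
    raise TypeError(f"unsupported CMP argument: {arg!r}")


print(cmp_matches("Dr.", "Dr."))
print(cmp_matches(re.compile(r"\d+"), "221"))
print(cmp_matches({"Court", "Drive"}, "Drive"))
print(cmp_matches(str.istitle, "Baker"))
```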
TrexVM syntax
Now we can write regular expressions like this:
(as :person
  (| (cat (? is-honorific) (+ first-name-dict) last-name-dict)
     (cat (? is-determiner) profession-dict)))
Extra features
Negative and positive (nested) lookaheads
(cat (+ name-dict) (?! is-stop-word))
Implemented as a sub-virtual machine inside each thread.
Start/end of sequence anchors (^ and $)
(cat < is-number (? is-word) >)
Cheap anchor-free matching
(cat (*? .) regex…)
Loop detection
(rx-find (cat (* (| a (cat a a))) b)
         (repeat 300 a))
=> nil
Extra features
Composability
(def numbered-street-name
  (+ (| is-ordinal-number street-name-dict)))
(def street
  (| (cat (? #"\d+") numbered-street-name street-marker-dict)
     (cat #"\d+" numbered-street-name #{"Count" "Drive"})))
Shortcomings
● No look-behinds
● No backreferences
● Can’t find all matches (only first match)
● Overkill complexity for trivial cases
Under the hood
Naive implementation
● 300 lines of Clojure
● Immutable VM and Thread objects
● Very concise and debuggable code
● Slow!
○ Real-world regex scan takes ~1ms
per sentence
Optimizations
Inline caching of compiled regexes.
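Bad Java regexp usage
// String.matches() recompiles the pattern on every call.
for (String sentence : sentences) {
    sentence.matches("^((?i)[rea]lly* hard+ regex)$");
}
Good Java regexp usage
import java.util.regex.Pattern;
// Compile once, then reuse the Pattern for every sentence.
Pattern p = Pattern.compile("^((?i)[rea]lly* hard+ regex)$");
for (String sentence : sentences) {
    p.matcher(sentence).matches();
}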
Clojure: the power of macros
(for [sentence-tokens all-sentences]
(find (trex (+ (? company-name)) ...)
sentence-tokens))
Optimizations
Inline caching of compiled regexes.
3x performance improvement.
Optimizations
● Started rewriting parts of the
implementation in Java.
● Made some objects mutable.
● Made everything mutable.
● Made some things immutable again.
Final version
● 300 lines of Java
○ VM completely rewritten in Java.
● 300 lines of Clojure
○ Function definitions, compiler, API (find, matches, …)
● Mix of mutable and immutable objects with copy-on-write
fields.
● 20x the performance of the initial version.
○ The same real-world regex now takes ~50μs per sentence.
Future work
● Improve the performance for trivial cases.
○ More static analysis and optimization in
regex compilation phase.
● JIT compiler and branch prediction
○ Leverage runtime knowledge about
often-failed CMPs in the regex.
○ Make it vulnerable to Meltdown/Spectre.
● Investigate if look-behinds are possible
(through hacks and reduced perf)
Conclusions
● Old papers contain great ideas
● Knowledge from university can be useful
● Make it work, then make it fast
References
● Original paper:
https://swtch.com/~rsc/regexp/regexp2.html
● Open-source implementation (in Clojure):
https://github.com/cgrand/seqexp
● Synacor Challenge
https://challenge.synacor.com
.+the end$

Weitere ähnliche Inhalte

Was ist angesagt?

Staring into the eBPF Abyss
Staring into the eBPF AbyssStaring into the eBPF Abyss
Staring into the eBPF Abyss
Sasha Goldshtein
 

Was ist angesagt? (20)

Good news, everybody! Guile 2.2 performance notes (FOSDEM 2016)
Good news, everybody! Guile 2.2 performance notes (FOSDEM 2016)Good news, everybody! Guile 2.2 performance notes (FOSDEM 2016)
Good news, everybody! Guile 2.2 performance notes (FOSDEM 2016)
 
Introduction to gdb
Introduction to gdbIntroduction to gdb
Introduction to gdb
 
Performance evaluation with Arm HPC tools for SVE
Performance evaluation with Arm HPC tools for SVEPerformance evaluation with Arm HPC tools for SVE
Performance evaluation with Arm HPC tools for SVE
 
C Under Linux
C Under LinuxC Under Linux
C Under Linux
 
Porting and Optimization of Numerical Libraries for ARM SVE
Porting and Optimization of Numerical Libraries for ARM SVEPorting and Optimization of Numerical Libraries for ARM SVE
Porting and Optimization of Numerical Libraries for ARM SVE
 
Berkeley Packet Filters
Berkeley Packet FiltersBerkeley Packet Filters
Berkeley Packet Filters
 
ocelot
ocelotocelot
ocelot
 
Arm tools and roadmap for SVE compiler support
Arm tools and roadmap for SVE compiler supportArm tools and roadmap for SVE compiler support
Arm tools and roadmap for SVE compiler support
 
Staring into the eBPF Abyss
Staring into the eBPF AbyssStaring into the eBPF Abyss
Staring into the eBPF Abyss
 
Javascript Secrets - Front in Floripa 2015
Javascript Secrets - Front in Floripa 2015Javascript Secrets - Front in Floripa 2015
Javascript Secrets - Front in Floripa 2015
 
Pragmatic Optimization in Modern Programming - Mastering Compiler Optimizations
Pragmatic Optimization in Modern Programming - Mastering Compiler OptimizationsPragmatic Optimization in Modern Programming - Mastering Compiler Optimizations
Pragmatic Optimization in Modern Programming - Mastering Compiler Optimizations
 
Gnu debugger
Gnu debuggerGnu debugger
Gnu debugger
 
Low pause GC in HotSpot
Low pause GC in HotSpotLow pause GC in HotSpot
Low pause GC in HotSpot
 
Code GPU with CUDA - Optimizing memory and control flow
Code GPU with CUDA - Optimizing memory and control flowCode GPU with CUDA - Optimizing memory and control flow
Code GPU with CUDA - Optimizing memory and control flow
 
LTO plugin
LTO pluginLTO plugin
LTO plugin
 
Debugging node in prod
Debugging node in prodDebugging node in prod
Debugging node in prod
 
不深不淺,帶你認識 LLVM (Found LLVM in your life)
不深不淺,帶你認識 LLVM (Found LLVM in your life)不深不淺,帶你認識 LLVM (Found LLVM in your life)
不深不淺,帶你認識 LLVM (Found LLVM in your life)
 
Knowing your Garbage Collector / Python Madrid
Knowing your Garbage Collector / Python MadridKnowing your Garbage Collector / Python Madrid
Knowing your Garbage Collector / Python Madrid
 
Introduction to RevKit
Introduction to RevKitIntroduction to RevKit
Introduction to RevKit
 
Exercice.docx
Exercice.docxExercice.docx
Exercice.docx
 

Ähnlich wie Virtual Machine for Regular Expressions

Implement an MPI program to perform matrix-matrix multiplication AB .pdf
Implement an MPI program to perform matrix-matrix multiplication AB .pdfImplement an MPI program to perform matrix-matrix multiplication AB .pdf
Implement an MPI program to perform matrix-matrix multiplication AB .pdf
meerobertsonheyde608
 

Ähnlich wie Virtual Machine for Regular Expressions (20)

Advanced procedures in assembly language Full chapter ppt
Advanced procedures in assembly language Full chapter pptAdvanced procedures in assembly language Full chapter ppt
Advanced procedures in assembly language Full chapter ppt
 
Lecture 3 RE NFA DFA
Lecture 3   RE NFA DFA Lecture 3   RE NFA DFA
Lecture 3 RE NFA DFA
 
Computer Architecture Assignment Help
Computer Architecture Assignment HelpComputer Architecture Assignment Help
Computer Architecture Assignment Help
 
lec15_x86procedure_4up.pdf
lec15_x86procedure_4up.pdflec15_x86procedure_4up.pdf
lec15_x86procedure_4up.pdf
 
07 140430-ipp-languages used in llvm during compilation
07 140430-ipp-languages used in llvm during compilation07 140430-ipp-languages used in llvm during compilation
07 140430-ipp-languages used in llvm during compilation
 
Implement an MPI program to perform matrix-matrix multiplication AB .pdf
Implement an MPI program to perform matrix-matrix multiplication AB .pdfImplement an MPI program to perform matrix-matrix multiplication AB .pdf
Implement an MPI program to perform matrix-matrix multiplication AB .pdf
 
Exploitation Crash Course
Exploitation Crash CourseExploitation Crash Course
Exploitation Crash Course
 
R/C++ talk at earl 2014
R/C++ talk at earl 2014R/C++ talk at earl 2014
R/C++ talk at earl 2014
 
System Hacking Tutorial #2 - Buffer Overflow - Overwrite EIP
System Hacking Tutorial #2 - Buffer Overflow - Overwrite EIPSystem Hacking Tutorial #2 - Buffer Overflow - Overwrite EIP
System Hacking Tutorial #2 - Buffer Overflow - Overwrite EIP
 
Continuation Passing Style and Macros in Clojure - Jan 2012
Continuation Passing Style and Macros in Clojure - Jan 2012Continuation Passing Style and Macros in Clojure - Jan 2012
Continuation Passing Style and Macros in Clojure - Jan 2012
 
Make ARM Shellcode Great Again - HITB2018PEK
Make ARM Shellcode Great Again - HITB2018PEKMake ARM Shellcode Great Again - HITB2018PEK
Make ARM Shellcode Great Again - HITB2018PEK
 
Node.js - Advanced Basics
Node.js - Advanced BasicsNode.js - Advanced Basics
Node.js - Advanced Basics
 
Assembly class
Assembly classAssembly class
Assembly class
 
other-architectures.ppt
other-architectures.pptother-architectures.ppt
other-architectures.ppt
 
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the CompilerPragmatic Optimization in Modern Programming - Demystifying the Compiler
Pragmatic Optimization in Modern Programming - Demystifying the Compiler
 
Assembly language programming_fundamentals 8086
Assembly language programming_fundamentals 8086Assembly language programming_fundamentals 8086
Assembly language programming_fundamentals 8086
 
[FT-11][suhorng] “Poor Man's” Undergraduate Compilers
[FT-11][suhorng] “Poor Man's” Undergraduate Compilers[FT-11][suhorng] “Poor Man's” Undergraduate Compilers
[FT-11][suhorng] “Poor Man's” Undergraduate Compilers
 
Visual Studio를 이용한 어셈블리어 학습 part 2
Visual Studio를 이용한 어셈블리어 학습 part 2Visual Studio를 이용한 어셈블리어 학습 part 2
Visual Studio를 이용한 어셈블리어 학습 part 2
 
Al2ed chapter17
Al2ed chapter17Al2ed chapter17
Al2ed chapter17
 
C Programming Homework Help
C Programming Homework HelpC Programming Homework Help
C Programming Homework Help
 

Kürzlich hochgeladen

TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 

Kürzlich hochgeladen (20)

Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 

Virtual Machine for Regular Expressions

  • 1. Virtual Machine for Regular Expressions Alexander Yakushev @unlog1c JEEConf 2018
  • 3. Stephen Cole Kleene - Inventor of regular expressions ^ This guy
  • 4. What's a regular expression? A text-matching automaton. /^a+b?(cd*|e)$/ "aaaaabcddd" ✓ "aabbdd" ❌
  • 5. Regular expression as FSM ^a+b?(cd*|e)$
  • 6. Regular expression as FSM ^a+b?(cd*|e)$
  • 7. Why implement regular expressions yourself? All modern programming languages provide regular expressions with core library.
  • 8. Why implement regular expressions yourself? All modern programming languages provide regular expressions with core library. But those are only character-level regexps.
  • 9. At Grammarly, we need to define token-level rules that are describable with regex semantics. Why implement regular expressions yourself? <person> = (<honorific>? <first-name-dict>+ <last-name-dict>) | (<determiner>? <profession-dict>)
  • 10. Ways of implementing regex engines 1. Backtracking ○ Runtime: exponential
  • 11. Backtracking implementation (by Rob Pike) int match(char *regexp, char *text) { if (regexp[0] == '0') return 1; if (regexp[1] == '*') return matchstar(regexp[0], regexp+2, text); if (*text!='0' && (regexp[0]=='.' || regexp[0]==*text)) return match(regexp+1, text+1); return 0; } int matchstar(char c, char *regexp, char *text) { do { if (match(regexp, text)) return 1; } while (*text != '0' && (*text++ == c || c == '.')); return 0; }
  • 12. Backtracking implementation ● Good: simple and short. ○ Fit into 15 lines! ● Bad: exponential complexity.
  • 13. How to kill a snake $ python >>> import re >>> s = "a" * 50 >>> re.match("(a|aa)*b", s) # crickets… chirp chirp
  • 14. Ways of implementing regex engines 1. Backtracking ○ Runtime: exponential 2. Full FSM unroll (static NFA->DFA) ○ Compilation: exponential time and memory ○ Runtime: linear O(n)
  • 15. Ways of implementing regex engines 1. Backtracking ○ Runtime: exponential 2. Full FSM unroll (static NFA->DFA) ○ Compilation: exponential time and memory ○ Runtime: linear O(n) 3. Dynamic FSM unroll (“lazy” DFA construction) ○ Runtime: linear O(nm)
  • 16. Dynamic FSM unroll 1. Google RE2* 2. Rust’s regex library† 3. Virtual machine approach‡ * github.com/google/re2 † github.com/rust-lang/regex ‡ swtch.com/~rsc/regexp/regexp2.html
  • 18. MOV CX, 10 MOV AX, [BP] L: CMP CX, 0 JZ E SHL AX, 1 DEC CX JMP L E: RET Machine code (actually, assembly) AX 123 BX 234 CX 9 ... IP 0 Registers Crude example of an x86 machine
  • 19. MOV CX, 10 MOV AX, [BP] L: CMP CX, 0 JZ E SHL AX, 1 DEC CX JMP L E: RET Machine code (actually, assembly) AX 123 BX 234 CX 9 ... IP 0 Thread 1 Registers AX 321 BX 432 CX 0 ... IP 7 Thread 2 Registers AX 879 BX 567 CX 4 ... IP 3 Thread 3 Registers Crude example of an x86 machine
  • 20. MOV CX, 10 MOV AX, [BP] L: CMP CX, 0 JZ E SHL AX, 1 DEC CX JMP L E: RET Machine code (actually, assembly) AX 123 BX 234 CX 9 ... IP 0 Thread 1 Registers Thread 2 Registers AX 321 BX 432 CX 0 ... IP 7 Crude example of a virtual machine
  • 21. Examples of virtual machines 1. VirtualBox/VMWare/KVM 2. Java Virtual Machine 3. Domain-specific VMs
  • 22. TrexVM (Token RegEX Virtual Machine) ● Consumes input sequence token by token. ○ Never goes back to previous tokens. ● IP (instruction pointer) tracks the currently executed instruction. ● Instructions: ○ CMP x Compare current token to x, increment IP if equal, fail if not. ○ JUMP label Unconditionally set IP to the point designated by label. ○ FORK label Increment IP and spawn additional thread that jumps to label.
  • 23. TrexVM (Token RegEX Virtual Machine) ● All threads are executed in a lock-step. ○ Execute all CMP instructions simultaneously. ○ If some threads point not to CMP, statically unroll them. ● Execution continues until one thread reaches the end of the program (successful match) or all threads are dead (failed match)
  • 24. TrexVM sample run aaab Input string ↑ 1→ L1: CMP a FORK L1 CMP b Program
  • 25. TrexVM sample run aaab Input string ↑1→ L1: CMP a FORK L1 CMP b Program
  • 26. TrexVM sample run aaab Input string ↑ 1→ 2→ L1: CMP a FORK L1 CMP b Program
  • 27. TrexVM sample run aaab Input string ↑2→ L1: CMP a FORK L1 CMP b Program
  • 28. TrexVM sample run aaab Input string ↑ 2→ 3→ L1: CMP a FORK L1 CMP b Program
  • 29. TrexVM sample run aaab Input string ↑3→ L1: CMP a FORK L1 CMP b Program
  • 30. TrexVM sample run L1: CMP a FORK L1 CMP b Program aaab Input string ↑ 3→ 4→
  • 31. TrexVM sample run aaab Input string ↑ 3→ Success! L1: CMP a FORK L1 CMP b Program
  • 32. TrexVM sample run aaab Input string ↑ 3→ Success! Regex: a+b L1: CMP a FORK L1 CMP b Program
  • 33. TrexVM FORK L1 CMP a L1: FORK L2 CMP b JUMP L1 L2: CMP c Guess the regular expression:
  • 34. TrexVM FORK L1 CMP a L1: FORK L2 CMP b JUMP L1 L2: CMP c Regex: a?b*c Guess the regular expression:
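One way to check the answer is to run the program on every short string over {a, b, c} and compare against Python's `re` module. The sketch below assumes a hypothetical tuple encoding of the instructions with labels resolved to indexes (L1 = 2, L2 = 5); `run` reports whether some prefix of the input matches, which is what `re.match` reports as well.

```python
import itertools
import re


def add_thread(program, ip, threads, seen):
    """Follow JUMP/FORK until the thread rests on a CMP or the program end."""
    if ip in seen:
        return
    seen.add(ip)
    if ip < len(program):
        op, arg = program[ip]
        if op == "JUMP":
            add_thread(program, arg, threads, seen)
            return
        if op == "FORK":
            add_thread(program, ip + 1, threads, seen)
            add_thread(program, arg, threads, seen)
            return
    threads.append(ip)


def run(program, tokens):
    """Return True if the program matches a prefix of `tokens`."""
    end = len(program)
    threads = []
    add_thread(program, 0, threads, set())
    for token in tokens:
        if end in threads:
            return True
        next_threads, seen = [], set()
        for ip in threads:
            if program[ip][1] == token:
                add_thread(program, ip + 1, next_threads, seen)
        threads = next_threads
    return end in threads


GUESS = [
    ("FORK", 2),   #     FORK L1
    ("CMP", "a"),  #     CMP a
    ("FORK", 5),   # L1: FORK L2
    ("CMP", "b"),  #     CMP b
    ("JUMP", 2),   #     JUMP L1
    ("CMP", "c"),  # L2: CMP c
]


def disagreements(max_len=4):
    """Strings on which the VM program and the regex a?b*c disagree."""
    reference = re.compile(r"a?b*c")
    return [
        "".join(chars)
        for n in range(max_len + 1)
        for chars in itertools.product("abc", repeat=n)
        if run(GUESS, chars) != (reference.match("".join(chars)) is not None)
    ]


print(disagreements())  # []
```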
  • 35. TrexVM: next iteration Added support for match groups. Each thread now has a register bank (beyond just IP register). New instructions: ● SAVEL group — save current position in input as beginning of group. ● SAVER group — save current position in input as ending of group. ● FORKSTAY label — FORK which gives the staying thread a higher priority. ● FORKJUMP label — FORK which gives the jumping thread a higher priority. Matching stops when the thread with the highest priority succeeds.
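A sketch of how register banks and fork priorities could fit together, assuming the thread list is kept in priority order and each thread carries its own copy of the group registers. The tuple encoding, the use of Python sets as CMP arguments, and the names `add_thread` and `run` are illustrative assumptions, not the TrexVM implementation.

```python
def add_thread(program, ip, regs, pos, threads, seen):
    """Append runnable threads in priority order, following non-CMP instructions."""
    if ip in seen:  # a higher-priority thread already owns this instruction
        return
    seen.add(ip)
    if ip < len(program):
        op, arg = program[ip]
        if op == "JUMP":
            add_thread(program, arg, regs, pos, threads, seen)
            return
        if op in ("FORKSTAY", "FORKJUMP"):
            first, second = (ip + 1, arg) if op == "FORKSTAY" else (arg, ip + 1)
            add_thread(program, first, regs, pos, threads, seen)
            add_thread(program, second, regs, pos, threads, seen)
            return
        if op in ("SAVEL", "SAVER"):
            left, right = regs.get(arg, (None, None))
            span = (pos, right) if op == "SAVEL" else (left, pos)
            # Copy the register bank so sibling threads are not affected.
            add_thread(program, ip + 1, {**regs, arg: span}, pos, threads, seen)
            return
    threads.append((ip, regs))


def run(program, tokens):
    """Return the group registers of the highest-priority match, or None."""
    threads, matched = [], None
    add_thread(program, 0, {}, 0, threads, set())
    for pos in range(len(tokens) + 1):
        next_threads, seen = [], set()
        for ip, regs in threads:
            if ip == len(program):
                matched = regs  # remember it and drop lower-priority threads
                break
            if pos < len(tokens) and tokens[pos] in program[ip][1]:
                add_thread(program, ip + 1, regs, pos + 1, next_threads, seen)
        threads = next_threads
        if not threads:
            break
    return matched


GREEDY = [  # ([ab]+)(b), with L1 = 1
    ("SAVEL", "fst"), ("CMP", {"a", "b"}), ("FORKJUMP", 1), ("SAVER", "fst"),
    ("SAVEL", "snd"), ("CMP", {"b"}), ("SAVER", "snd"),
]

print(run(GREEDY, "aab"))  # {'fst': (0, 2), 'snd': (2, 3)}
```

A lower-priority thread that reaches the end of the program first is only remembered; threads with higher priority keep running and may replace its result.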
  • 36. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst snd T1 1→
  • 37. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst 0 snd T1 1→
  • 39. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst 0 snd T1 1→ 2→ Reg L R fst 0 snd T2
  • 40. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst 0 snd T1 1→ 2→ Reg L R fst 0 1 snd T2
  • 41. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst 0 snd T1 1→ 2→ Reg L R fst 0 1 snd 1 T2
  • 42. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst 0 snd T1 1→
  • 43. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst 0 snd T1 1→ 2→ Reg L R fst 0 snd T2
  • 44. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst 0 snd T1 1→ Reg L R fst 0 2 snd T2 2→
  • 45. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst 0 snd T1 1→ Reg L R fst 0 2 snd 2 T2 2→
  • 46. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst 0 snd T1 Reg L R fst 0 2 snd 2 T2 2→ 1→
  • 47. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst 0 snd T1 Reg L R fst 0 snd T2 1→ 2→ 3→ Reg L R fst 0 2 snd 2 3 T3
  • 48. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst 0 snd T1 Reg L R fst 0 3 snd T2 1→ 3→ Reg L R fst 0 2 snd 2 3 T3 2→
  • 49. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers Reg L R fst 0 snd T1 Reg L R fst 0 3 snd 3 T2 1→ 3→ Reg L R fst 0 2 snd 2 3 T3 2→
  • 50. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers 3→ Reg L R fst 0 2 snd 2 3 T3 Success! Match groups: fst: aa snd: b
  • 51. SAVEL fst L1: CMP [ab] FORKJUMP L1 SAVER fst SAVEL snd CMP b SAVER snd Program aab Input string ↑ TrexVM1.1 sample run Thread registers 3→ Reg L R fst 0 2 snd 2 3 T3 Success! Match groups: fst: aa snd: b Regex: ([ab]+)(b)
  • 53. SAVEL fst L1: CMP [ab] FORKSTAY L1 SAVER fst SAVEL snd FORKSTAY L2 CMP b L2: SAVER snd Program ab Input string ↑ TrexVM1.1 second run Thread registers Reg L R fst snd T1 1→
  • 54. SAVEL fst L1: CMP [ab] FORKSTAY L1 SAVER fst SAVEL snd FORKSTAY L2 CMP b L2: SAVER snd Program ab Input string ↑ TrexVM1.1 second run Thread registers Reg L R fst 0 snd T1 1→
  • 56. SAVEL fst L1: CMP [ab] FORKSTAY L1 SAVER fst SAVEL snd FORKSTAY L2 CMP b L2: SAVER snd Program ab Input string TrexVM1.1 second run Thread registers Reg L R fst 0 snd T1 1→ 2→ ↑ Reg L R fst 0 snd T2
  • 57. SAVEL fst L1: CMP [ab] FORKSTAY L1 SAVER fst SAVEL snd FORKSTAY L2 CMP b L2: SAVER snd Program ab Input string TrexVM1.1 second run Thread registers Reg L R fst 0 1 snd T1 1→ 2→ ↑ Reg L R fst 0 snd T2
  • 58. SAVEL fst L1: CMP [ab] FORKSTAY L1 SAVER fst SAVEL snd FORKSTAY L2 CMP b L2: SAVER snd Program ab Input string TrexVM1.1 second run Thread registers Reg L R fst 0 1 snd 1 T1 1→ 2→ ↑ Reg L R fst 0 snd T2
  • 59. SAVEL fst L1: CMP [ab] FORKSTAY L1 SAVER fst SAVEL snd FORKSTAY L2 CMP b L2: SAVER snd Program ab Input string TrexVM1.1 second run Thread registers Reg L R fst 0 1 snd 1 T1 1→ 2→ ↑ Reg L R fst 0 1 snd 1 T2 3→ Reg L R fst 0 snd T3
  • 60. SAVEL fst L1: CMP [ab] FORKSTAY L1 SAVER fst SAVEL snd FORKSTAY L2 CMP b L2: SAVER snd Program ab Input string TrexVM1.1 second run Thread registers Reg L R fst 0 1 snd 1 T1 1→ 2→ ↑ Reg L R fst 0 1 snd 1 1 T2
  • 62. SAVEL fst L1: CMP [ab] FORKSTAY L1 SAVER fst SAVEL snd FORKSTAY L2 CMP b L2: SAVER snd Program ab Input string TrexVM1.1 second run Thread registers Reg L R fst 0 1 snd 1 2 T1 1→ ↑
  • 63. SAVEL fst L1: CMP [ab] FORKSTAY L1 SAVER fst SAVEL snd FORKSTAY L2 CMP b L2: SAVER snd Program ab Input string TrexVM1.1 second run Thread registers Reg L R fst 0 1 snd 1 2 T1 1→ ↑ Success! Match groups: fst: a snd: b
  • 64. SAVEL fst L1: CMP [ab] FORKSTAY L1 SAVER fst SAVEL snd FORKSTAY L2 CMP b L2: SAVER snd Program ab Input string TrexVM1.1 second run Thread registers Reg L R fst 0 1 snd 1 2 T1 1→ ↑ Success! Match groups: fst: a snd: b Regex: ([ab]+?)(b?)
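The group boundaries from both sample runs agree with what a conventional character-level engine reports for the same two expressions. A quick cross-check with Python's standard `re` module (used here only as a reference point, not as part of TrexVM):

```python
import re

# First run: greedy + keeps as much as it can while still letting (b) match.
greedy = re.match(r"([ab]+)(b)", "aab")
print(greedy.groups())  # ('aa', 'b')

# Second run: lazy +? stops after one token, then the greedy b? takes the b.
lazy = re.match(r"([ab]+?)(b?)", "ab")
print(lazy.groups())  # ('a', 'b')
```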
  • 66. Concatenation #INCLUDE regex1 #INCLUDE regex2 ... #INCLUDE regexN (cat regex1 regex2 … regexN)
  • 67. Quantifiers #LABEL _LOOP #INCLUDE regex FORKJUMP _LOOP (+ regex) #LABEL _LOOP #INCLUDE regex FORKSTAY _LOOP (+? regex)
  • 68. Quantifiers #LABEL _START FORKSTAY _END #INCLUDE regex JUMP _START #LABEL _END (* regex) #LABEL _START FORKJUMP _END #INCLUDE regex JUMP _START #LABEL _END (*? regex)
  • 69. Quantifiers FORKSTAY _SKIP #INCLUDE regex #LABEL _SKIP (? regex) FORKJUMP _SKIP #INCLUDE regex #LABEL _SKIP (?? regex)
  • 70. Alternation FORKSTAY _ALT #INCLUDE regex1 JUMP _END #LABEL _ALT #INCLUDE regex2 #LABEL _END (| regex1 regex2)
  • 71. Match groups SAVEL name #INCLUDE regex SAVER name (as group regex)
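The templates above translate directly into a small recursive compiler. The sketch below assumes expressions are written as nested Python tuples such as `("cat", ("+", "a"), "b")` and that labels are resolved to instruction indexes while emitting; `compile_rx` and this encoding are hypothetical, chosen only to mirror the slides.

```python
def compile_rx(expr, out=None):
    """Emit [opcode, argument] instructions for a nested-tuple expression."""
    out = [] if out is None else out
    if not isinstance(expr, tuple):  # a leaf is a CMP argument
        out.append(["CMP", expr])
        return out
    head, *args = expr
    if head == "cat":
        for sub in args:
            compile_rx(sub, out)
    elif head in ("+", "+?"):
        loop = len(out)  # _LOOP
        compile_rx(args[0], out)
        out.append(["FORKJUMP" if head == "+" else "FORKSTAY", loop])
    elif head in ("*", "*?"):
        start = len(out)  # _START
        fork = ["FORKSTAY" if head == "*" else "FORKJUMP", None]
        out.append(fork)
        compile_rx(args[0], out)
        out.append(["JUMP", start])
        fork[1] = len(out)  # _END
    elif head in ("?", "??"):
        fork = ["FORKSTAY" if head == "?" else "FORKJUMP", None]
        out.append(fork)
        compile_rx(args[0], out)
        fork[1] = len(out)  # _SKIP
    elif head == "|":
        fork = ["FORKSTAY", None]
        out.append(fork)
        compile_rx(args[0], out)
        jump = ["JUMP", None]
        out.append(jump)
        fork[1] = len(out)  # _ALT
        compile_rx(args[1], out)
        jump[1] = len(out)  # _END
    elif head == "as":
        name, sub = args
        out.append(["SAVEL", name])
        compile_rx(sub, out)
        out.append(["SAVER", name])
    else:
        raise ValueError(f"unknown operator: {head!r}")
    return out


for instruction in compile_rx(("cat", ("?", "a"), ("*", "b"), "c")):
    print(instruction)
```

For `a?b*c` this emits the same six instructions as the "guess the regular expression" program, with FORKSTAY in place of the plain FORK.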
  • 72. Possible CMP arguments ● String (check if token is equal) ● Char-level regex (check if token matches) ● Hashset/map (check if it contains the token) ● Arbitrary predicate (check if token satisfies)
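The four kinds of CMP argument amount to a dispatch on the argument's type. A rough Python equivalent, where `token_matches` is a hypothetical helper and requiring the char-level regex to cover the whole token is an assumption:

```python
import re


def token_matches(arg, token):
    """Decide whether `token` satisfies a CMP argument of any supported kind."""
    if isinstance(arg, str):  # string: plain equality
        return token == arg
    if isinstance(arg, re.Pattern):  # char-level regex (assumed: full match)
        return arg.fullmatch(token) is not None
    if isinstance(arg, (set, frozenset, dict)):  # hashset/map: membership
        return token in arg
    if callable(arg):  # arbitrary predicate
        return bool(arg(token))
    raise TypeError(f"unsupported CMP argument: {arg!r}")


print(token_matches("Dr.", "Dr."))                  # True
print(token_matches(re.compile(r"\d+"), "2018"))    # True
print(token_matches({"Court", "Drive"}, "Avenue"))  # False
print(token_matches(str.istitle, "Kyiv"))           # True
```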
  • 73. TrexVM syntax Now we can write regular expressions like this: (as :person (| (cat (? is-honorific) (+ first-name-dict) last-name-dict) (cat (? is-determiner) profession-dict)))
  • 75. Extra features (cat (+ name-dict) (?! is-stop-word)) Negative and positive (nested) look aheads Implemented as a subvirtual machine inside each thread.
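A sketch of what a negative look-ahead as a nested VM could look like. It assumes a hypothetical `NOTAHEAD` instruction whose argument is a complete sub-program: the thread survives only if that sub-program fails to match at the current position, and no input is consumed either way. CMP arguments are plain sets here, and all names are illustrative.

```python
def add_thread(program, ip, tokens, pos, threads, seen):
    """Follow JUMP/FORK/NOTAHEAD until the thread rests on a CMP or the end."""
    if ip in seen:
        return
    seen.add(ip)
    if ip < len(program):
        op, arg = program[ip]
        if op == "JUMP":
            add_thread(program, arg, tokens, pos, threads, seen)
            return
        if op == "FORK":
            add_thread(program, ip + 1, tokens, pos, threads, seen)
            add_thread(program, arg, tokens, pos, threads, seen)
            return
        if op == "NOTAHEAD":
            # Run a separate VM from the current position; its threads never
            # mix with ours and it does not move our input pointer.
            if not run(arg, tokens, pos):
                add_thread(program, ip + 1, tokens, pos, threads, seen)
            return
    threads.append(ip)


def run(program, tokens, start=0):
    """Return True if the program matches a prefix of tokens[start:]."""
    end = len(program)
    threads = []
    add_thread(program, 0, tokens, start, threads, set())
    for pos in range(start, len(tokens)):
        if end in threads:
            return True
        next_threads, seen = [], set()
        for ip in threads:
            if tokens[pos] in program[ip][1]:
                add_thread(program, ip + 1, tokens, pos + 1, next_threads, seen)
        threads = next_threads
        if not threads:
            return False
    return end in threads


NAMES, STOP_WORDS = {"john", "smith"}, {"the"}
# (cat (+ name-dict) (?! is-stop-word))
PROGRAM = [("CMP", NAMES), ("FORK", 0), ("NOTAHEAD", [("CMP", STOP_WORDS)])]

print(run(PROGRAM, ["john", "smith", "said"]))  # True
print(run(PROGRAM, ["john", "the"]))            # False
```

A positive look-ahead would be the same instruction with the condition inverted.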
  • 76. Extra features Start/end of sequence anchors (^ and $) (cat < is-number (? is-word) >)
  • 77. Extra features Cheap anchor-free matching (cat (*? .) regex…)
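The same trick works in any engine that only matches from the start of the input: prefixing a lazy "any token" loop turns an anchored match into a search. A character-level analogy using Python's standard `re` module (the sample text is made up for the example):

```python
import re

text = "order 66 shipped"

anchored = re.match(r"\d+", text)           # fails: text does not start with a digit
prefixed = re.match(r"(?s).*?(\d+)", text)  # lazy prefix skips as little as possible
searched = re.search(r"\d+", text)          # the engine's built-in search

print(anchored)          # None
print(prefixed.span(1))  # (6, 8)
print(searched.span())   # (6, 8)
```

The lazy quantifier matters: a greedy prefix would still find a match, but the captured part would start as late as possible rather than at the first occurrence.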
  • 78. Extra features Loop detection (rx-find (cat (* (| a (cat a a))) b) (repeat 300 a)) => nil
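One plausible reading of loop detection is that the VM refuses to schedule the same instruction twice within a step. That keeps the thread count bounded by the program size and stops empty loops such as `(a*)*` from recursing forever. The sketch below assumes a hypothetical tuple encoding with resolved labels and is not taken from the TrexVM source.

```python
def add_thread(program, ip, threads, seen):
    if ip in seen:  # loop detection: this instruction is already scheduled
        return
    seen.add(ip)
    if ip < len(program):
        op, arg = program[ip]
        if op == "JUMP":
            add_thread(program, arg, threads, seen)
            return
        if op == "FORK":
            add_thread(program, ip + 1, threads, seen)
            add_thread(program, arg, threads, seen)
            return
    threads.append(ip)


def run(program, tokens):
    """Return (matched, peak_thread_count) for a prefix match."""
    end = len(program)
    threads = []
    add_thread(program, 0, threads, set())
    peak = len(threads)
    for token in tokens:
        if end in threads:
            return True, peak
        next_threads, seen = [], set()
        for ip in threads:
            if program[ip][1] == token:
                add_thread(program, ip + 1, next_threads, seen)
        threads = next_threads
        peak = max(peak, len(threads))
    return end in threads, peak


# (cat (* (| a (cat a a))) b)
A_OR_AA_STAR_B = [
    ("FORK", 7), ("FORK", 4), ("CMP", "a"), ("JUMP", 6),
    ("CMP", "a"), ("CMP", "a"), ("JUMP", 0), ("CMP", "b"),
]
# (* (* a)): the inner loop can finish without consuming any input
NESTED_STAR = [("FORK", 5), ("FORK", 4), ("CMP", "a"), ("JUMP", 1), ("JUMP", 0)]

print(run(A_OR_AA_STAR_B, ["a"] * 300))  # (False, 4)
print(run(NESTED_STAR, list("aaa"))[0])  # True
```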
  • 79. Extra features Composability (def numbered-street-name (+ (| is-ordinal-number street-name-dict))) (def street (| (cat (? #"\d+") numbered-street-name street-marker-dict) (cat #"\d+" numbered-street-name #{"Count" "Drive"})))
  • 80. Shortcomings ● No look-behinds ● No backreferences ● Can’t find all matches (only first match) ● Overkill complexity for trivial cases
  • 83. Naive implementation ● 300 lines of Clojure ● Immutable VM and Thread objects ● Very concise and debuggable code ● Slow! ○ Real-world regex scan takes ~1ms per sentence
  • 84. Optimizations Inline caching of compiled regexes.
  • 85. Bad Java regexp usage (the pattern is recompiled on every iteration) for (String sentence : sentences) { sentence.matches("^(?i:[rea]lly* hard+ regex)$"); }
  • 86. Good Java regexp usage (the pattern is compiled once, outside the loop) Pattern p = Pattern.compile("^(?i:[rea]lly* hard+ regex)$"); for (String sentence : sentences) { p.matcher(sentence).matches(); }
  • 87. Clojure: the power of macros (for [sentence-tokens all-sentences] (find (trex (+ (? company-name)) ...) sentence-tokens))
  • 88. Optimizations Inline caching of compiled regexes. 3x performance improvement.
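The Clojure macro does its caching at the call site when the code is compiled. Python has no macros, so the closest runtime approximation is to memoize the compile step by pattern. The sketch below uses the standard `re` module as a stand-in for the token-level compiler; `cached_compile`, `count_full_matches`, and the `compile_calls` counter are hypothetical names for the example.

```python
import functools
import re

compile_calls = 0  # counts real compilations, for illustration only


@functools.lru_cache(maxsize=None)
def cached_compile(pattern):
    """Compile a pattern once; later calls with the same pattern hit the cache."""
    global compile_calls
    compile_calls += 1
    return re.compile(pattern)


def count_full_matches(pattern, sentences):
    # The pattern is written inline in the loop, as in the macro version,
    # yet it is compiled only on first use.
    return sum(1 for s in sentences if cached_compile(pattern).fullmatch(s))


print(count_full_matches(r"a+b", ["aab", "abc", "b", "ab"]))  # 2
print(compile_calls)  # 1
```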
  • 92. Optimizations ● Started rewriting parts of the implementation into Java. ● Made some objects mutable. ● Made everything mutable. ● Made some things immutable again.
  • 95. Final version ● 300 lines of Java ○ VM completely rewritten in Java. ● 300 lines of Clojure ○ Function definitions, compiler, API (find, matches, …) ● Mix of mutable and immutable objects with copy-on-write fields. ● 20x faster than the initial version. ○ The same real-world regex now takes ~50μs per sentence instead of ~1ms.
  • 99. Future work ● Improve the performance for trivial cases. ○ More static analysis and optimization in regex compilation phase. ● JIT compiler and branch prediction ○ Leverage runtime knowledge about often-failed CMPs in the regex. ○ Make it vulnerable to Meltdown/Spectre. ● Investigate if look-behinds are possible (through hacks and reduced perf)
  • 100. Conclusions ● Old papers contain great ideas ● Knowledge from university can be useful ● Make it work, then make it fast
  • 101. References ● Original paper: https://swtch.com/~rsc/regexp/regexp2.html ● Open-source implementation (in Clojure): https://github.com/cgrand/seqexp ● Synacor Challenge: https://challenge.synacor.com