Slides from JEEConf 2018 talk "Virtual Machine for Regular Expressions". It describes how and why to implement a custom regular expression engine for matching arbitrary sequences.
8. Why implement regular expressions yourself?
All modern programming languages provide
regular expressions in their core library.
But those are only character-level regexps.
9. Why implement regular expressions yourself?
At Grammarly, we need to define token-level rules
that are describable with regex semantics.
<person> = (<honorific>? <first-name-dict>+ <last-name-dict>) |
           (<determiner>? <profession-dict>)
18. Crude example of an x86 machine
Machine code (actually, assembly):
    MOV CX, 10
    MOV AX, [BP]
L:  CMP CX, 0
    JZ E
    SHL AX, 1
    DEC CX
    JMP L
E:  RET
Registers:
AX   123
BX   234
CX   9
...
IP   0
19. Crude example of an x86 machine
Machine code (actually, assembly):
    MOV CX, 10
    MOV AX, [BP]
L:  CMP CX, 0
    JZ E
    SHL AX, 1
    DEC CX
    JMP L
E:  RET
Registers:
     Thread 1  Thread 2  Thread 3
AX   123       321       879
BX   234       432       567
CX   9         0         4
...
IP   0         7         3
20. Crude example of a virtual machine
Machine code (actually, assembly):
    MOV CX, 10
    MOV AX, [BP]
L:  CMP CX, 0
    JZ E
    SHL AX, 1
    DEC CX
    JMP L
E:  RET
Registers:
     Thread 1  Thread 2
AX   123       321
BX   234       432
CX   9         0
...
IP   0         7
22. TrexVM (Token RegEX Virtual Machine)
● Consumes input sequence token by token.
○ Never goes back to previous tokens.
● IP (instruction pointer) tracks the currently executed instruction.
● Instructions:
○ CMP x
Compare current token to x, increment IP if equal, fail if not.
○ JUMP label
Unconditionally set IP to the point designated by label.
○ FORK label
Increment IP and spawn additional thread that jumps to label.
23. TrexVM (Token RegEX Virtual Machine)
● All threads are executed in lock-step.
○ Execute all CMP instructions simultaneously.
○ If some threads do not point to a CMP, statically unroll them.
● Execution continues until one thread reaches the end of the program (successful match) or all threads are dead (failed match).
34. TrexVM
Guess the regular expression:
    FORK L1
    CMP a
L1: FORK L2
    CMP b
    JUMP L1
L2: CMP c
Regex: a?b*c
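A program like the one on this slide can be produced mechanically from a regex tree. The sketch below assumes a hypothetical tuple-based AST (`("tok", x)`, `("cat", ...)`, `("?", e)`, `("*", e)`) and emits absolute instruction indices instead of labels; the names `compile_node`, `compile_regex`, `AST` and `PROGRAM` are illustrative and not taken from the talk.

```python
def compile_node(node, out):
    """Append the instructions for `node` to the list `out`."""
    kind = node[0]
    if kind == "tok":
        out.append(["CMP", node[1]])
    elif kind == "cat":
        for child in node[1:]:
            compile_node(child, out)
    elif kind == "?":
        fork = len(out)
        out.append(["FORK", None])      # target is patched below
        compile_node(node[1], out)
        out[fork][1] = len(out)         # jumping thread skips the optional part
    elif kind == "*":
        fork = len(out)
        out.append(["FORK", None])
        compile_node(node[1], out)
        out.append(["JUMP", fork])      # loop back to the FORK
        out[fork][1] = len(out)         # jumping thread leaves the loop
    else:
        raise ValueError(f"unknown node kind: {kind!r}")


def compile_regex(ast):
    """Compile an AST into a list of (opcode, argument) tuples."""
    out = []
    compile_node(ast, out)
    return [tuple(instr) for instr in out]


# a?b*c written as a tree
AST = ("cat", ("?", ("tok", "a")), ("*", ("tok", "b")), ("tok", "c"))
PROGRAM = compile_regex(AST)
```

With L1 at index 2 and L2 at index 5, `PROGRAM` has the same shape as the six instructions on the slide.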
35. TrexVM: next iteration
Added support for match groups.
Each thread now has a register bank (beyond just IP register).
New instructions:
● SAVEL group: save the current input position as the beginning of group.
● SAVER group: save the current input position as the end of group.
● FORKSTAY label: FORK that gives the staying thread a higher priority.
● FORKJUMP label: FORK that gives the jumping thread a higher priority.
Matching stops when the thread with the highest priority succeeds.
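One way to realise these instructions is to keep the thread list ordered by priority and to let a finished thread cut off everything ranked below it. The Python sketch below is an assumption-laden illustration, not the TrexVM code: the tuple encoding, the dict-based register bank, the predicate arguments for CMP, and the names `add_thread`, `run`, `group_text`, `GREEDY` and `LAZY` are all hypothetical. The two programs are the ones used in the sample runs of this deck.

```python
def add_thread(program, ip, regs, pos, ready, seen):
    """Run non-consuming instructions; append (ip, regs) to `ready` in
    priority order once the thread needs a token or has finished."""
    if ip in seen:                      # a higher-priority thread got here first
        return
    seen.add(ip)
    if ip == len(program):
        ready.append((ip, regs))
        return
    op, arg = program[ip]
    if op in ("SAVEL", "SAVER"):
        left, right = regs.get(arg, (None, None))
        saved = (pos, right) if op == "SAVEL" else (left, pos)
        add_thread(program, ip + 1, {**regs, arg: saved}, pos, ready, seen)
    elif op == "FORKSTAY":              # staying thread ranks higher
        add_thread(program, ip + 1, regs, pos, ready, seen)
        add_thread(program, arg, regs, pos, ready, seen)
    elif op == "FORKJUMP":              # jumping thread ranks higher
        add_thread(program, arg, regs, pos, ready, seen)
        add_thread(program, ip + 1, regs, pos, ready, seen)
    elif op == "JUMP":
        add_thread(program, arg, regs, pos, ready, seen)
    else:                               # CMP: wait for the next token
        ready.append((ip, regs))


def run(program, tokens):
    """Return the register bank of the winning thread, or None."""
    threads = []
    add_thread(program, 0, {}, 0, threads, set())
    best = None
    for pos in range(len(tokens) + 1):
        next_threads, seen = [], set()
        for ip, regs in threads:        # highest priority first
            if ip == len(program):
                best = regs             # outranks any match recorded earlier
                break                   # drop all lower-priority threads
            if pos < len(tokens) and program[ip][1](tokens[pos]):
                add_thread(program, ip + 1, regs, pos + 1, next_threads, seen)
        threads = next_threads
        if not threads:
            break
    return best


def group_text(tokens, regs, name):
    """Slice the input covered by a saved group."""
    left, right = regs[name]
    return tokens[left:right]


def is_ab(token):
    return token in ("a", "b")


def is_b(token):
    return token == "b"


# ([ab]+)(b)
GREEDY = [("SAVEL", "fst"), ("CMP", is_ab), ("FORKJUMP", 1), ("SAVER", "fst"),
          ("SAVEL", "snd"), ("CMP", is_b), ("SAVER", "snd")]
# ([ab]+?)(b?)
LAZY = [("SAVEL", "fst"), ("CMP", is_ab), ("FORKSTAY", 1), ("SAVER", "fst"),
        ("SAVEL", "snd"), ("FORKSTAY", 7), ("CMP", is_b), ("SAVER", "snd")]
```

Calling `run(GREEDY, "aab")` and `run(LAZY, "ab")` yields register banks from which `group_text` extracts the groups. Copying the whole dict on every save keeps the sketch short; it is the simplest choice, not the fastest one.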
36-51. TrexVM1.1 sample run
Program:
    SAVEL fst
L1: CMP [ab]
    FORKJUMP L1
    SAVER fst
    SAVEL snd
    CMP b
    SAVER snd
Input string: aab
Thread registers on each slide (group: L R; threads are numbered in priority order; "-" means no such thread):
Slide  T1        T2               T3
36     (empty)   -                -
37     fst 0     -                -
38     fst 0     -                -
39     fst 0     fst 0            -
40     fst 0     fst 0 1          -
41     fst 0     fst 0 1, snd 1   -
42     fst 0     -                -
43     fst 0     fst 0            -
44     fst 0     fst 0 2          -
45     fst 0     fst 0 2, snd 2   -
46     fst 0     fst 0 2, snd 2   -
47     fst 0     fst 0            fst 0 2, snd 2 3
48     fst 0     fst 0 3          fst 0 2, snd 2 3
49     fst 0     fst 0 3, snd 3   fst 0 2, snd 2 3
50-51  -         -                fst 0 2, snd 2 3
Success! Match groups: fst: aa snd: b
Regex: ([ab]+)(b)
53-64. TrexVM1.1 second run
Program:
    SAVEL fst
L1: CMP [ab]
    FORKSTAY L1
    SAVER fst
    SAVEL snd
    FORKSTAY L2
    CMP b
L2: SAVER snd
Input string: ab
Thread registers on each slide (group: L R; threads are numbered in priority order; "-" means no such thread):
Slide  T1                 T2                 T3
53     (empty)            -                  -
54     fst 0              -                  -
55     fst 0              -                  -
56     fst 0              fst 0              -
57     fst 0 1            fst 0              -
58     fst 0 1, snd 1     fst 0              -
59     fst 0 1, snd 1     fst 0 1, snd 1     fst 0
60     fst 0 1, snd 1     fst 0 1, snd 1 1   -
61     fst 0 1, snd 1     fst 0 1, snd 1 1   -
62-64  fst 0 1, snd 1 2   -                  -
Success! Match groups: fst: a snd: b
Regex: ([ab]+?)(b?)
72. Possible CMP arguments
● String (check if token is equal)
● Char-level regex (check if token matches)
● Hashset/map (check if it contains the token)
● Arbitrary predicate (check if token satisfies)
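The four kinds of CMP argument can be normalised into a single token predicate before the program runs. The sketch below is a hypothetical helper (`make_predicate` is not a name from the talk), it assumes tokens are plain strings, and it assumes a character-level regex must match the whole token.

```python
import re


def make_predicate(arg):
    """Turn a CMP argument into a function token -> bool."""
    if isinstance(arg, str):                       # string: equality check
        return lambda token: token == arg
    if isinstance(arg, re.Pattern):                # char-level regex
        return lambda token: arg.fullmatch(token) is not None
    if isinstance(arg, (set, frozenset, dict)):    # hash set / map: membership
        return lambda token: token in arg
    if callable(arg):                              # arbitrary predicate
        return arg
    raise TypeError(f"unsupported CMP argument: {arg!r}")


# Illustrative arguments, not taken from the talk.
is_honorific = make_predicate({"Mr.", "Mrs.", "Dr."})
is_number = make_predicate(re.compile(r"\d+"))
is_capitalized = make_predicate(str.istitle)
```

Resolving the argument once at compile time keeps the per-token work in the VM down to a single function call.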
73. TrexVM syntax
Now we can write regular expressions like this:
(as :person
  (| (cat (? is-honorific) (+ first-name-dict) last-name-dict)
     (cat (? is-determiner) profession-dict)))
75. Extra features
Negative and positive (nested) lookaheads:
(cat (+ name-dict) (?! is-stop-word))
Implemented as a sub-virtual machine inside each thread.
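The idea of a nested machine can be illustrated with a simplified Python sketch. It is not how TrexVM does it: the version here runs the nested program eagerly on the remaining tokens, which assumes the whole sequence is already in memory, whereas the talk does not say how the sub-machine is stepped. The opcodes `LOOK` and `NLOOK`, the word sets, and all function names are assumptions made for the example.

```python
def add_thread(program, ip, tokens, pos, ready, seen):
    """Follow non-consuming instructions, including lookaheads."""
    if ip in seen:
        return
    seen.add(ip)
    if ip == len(program):
        ready.append(ip)
        return
    op, arg = program[ip]
    if op == "JUMP":
        add_thread(program, arg, tokens, pos, ready, seen)
    elif op == "FORK":
        add_thread(program, ip + 1, tokens, pos, ready, seen)
        add_thread(program, arg, tokens, pos, ready, seen)
    elif op in ("LOOK", "NLOOK"):
        # Sub-machine: run the nested program from the current position.
        found = matches(arg, tokens, pos)
        if found == (op == "LOOK"):     # thread survives only if the check passes
            add_thread(program, ip + 1, tokens, pos, ready, seen)
    else:                               # CMP
        ready.append(ip)


def matches(program, tokens, start=0):
    """True if the program matches a prefix of tokens[start:]."""
    threads = []
    add_thread(program, 0, tokens, start, threads, set())
    for pos in range(start, len(tokens)):
        if len(program) in threads:
            return True
        next_threads, seen = [], set()
        for ip in threads:
            if program[ip][1](tokens[pos]):
                add_thread(program, ip + 1, tokens, pos + 1, next_threads, seen)
        threads = next_threads
        if not threads:
            return False
    return len(program) in threads


# Hypothetical dictionaries standing in for name-dict and is-stop-word.
NAMES = {"John", "Smith"}
STOP_WORDS = {"the", "of"}

# (cat (+ name-dict) (?! is-stop-word))
NAME_PROGRAM = [
    ("CMP", lambda t: t in NAMES),
    ("FORK", 3),
    ("JUMP", 0),
    ("NLOOK", [("CMP", lambda t: t in STOP_WORDS)]),
]
```

Since `matches` calls itself for the nested program, lookaheads inside lookaheads need no extra machinery in this sketch.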
76. Extra features
Start/end of sequence anchors (^ and $):
(cat < is-number (? is-word) >)
83. Naive implementation
● 300 lines of Clojure
● Immutable VM and Thread objects
● Very concise and debuggable code
● Slow!
○ Real-world regex scan takes ~1 ms per sentence
92. Optimizations
● Started rewriting parts of the implementation in Java.
● Made some objects mutable.
● Made everything mutable.
● Made some things immutable again.
95. Final version
● 300 lines of Java
○ VM completely rewritten in Java.
● 300 lines of Clojure
○ Function definitions, compiler, API (find, matches, …)
● Mix of mutable and immutable objects with copy-on-write fields.
● 20x the performance of the initial version.
○ The same regex as before now takes 50 μs per sentence.
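Copy-on-write fields can be pictured with a small register bank that forked threads share until one of them writes. This Python sketch only illustrates the general technique; the class `Registers` and its methods are hypothetical, and the talk does not show how the Java version is structured.

```python
class Registers:
    """Register bank shared between forked threads and copied lazily,
    on the first write after a fork."""

    __slots__ = ("_slots", "_shared")

    def __init__(self, slots=None):
        self._slots = slots if slots is not None else {}
        self._shared = False

    def fork(self):
        """Return a bank for the spawned thread without copying anything."""
        self._shared = True
        child = Registers(self._slots)
        child._shared = True
        return child

    def save(self, group, side, pos):
        """Store pos as the left ("L") or right ("R") border of group."""
        if self._shared:                # first write since the fork: copy now
            self._slots = dict(self._slots)
            self._shared = False
        left, right = self._slots.get(group, (None, None))
        self._slots[group] = (pos, right) if side == "L" else (left, pos)

    def get(self, group):
        return self._slots.get(group, (None, None))


parent = Registers()
parent.save("fst", "L", 0)
child = parent.fork()                   # no copy yet
child.save("fst", "R", 2)               # child copies, parent is untouched
```

Threads that die before saving anything never pay for a copy, which matters because most forked threads fail at their next CMP.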
99. Future work
● Improve the performance for trivial cases.
○ More static analysis and optimization in the regex compilation phase.
● JIT compiler and branch prediction
○ Leverage runtime knowledge about often-failed CMPs in the regex.
○ Make it vulnerable to Meltdown/Spectre.
● Investigate whether look-behinds are possible (through hacks and reduced perf).
100. Conclusions
● Old papers contain great ideas
● Knowledge from university can be useful
● Make it work, then make it fast