4. Routing
• We send data to ~210 different destinations
• Filters on the data specify which data should go where
  • often very detailed conditions on many fields
• Full routing tree has ~600 filter/transform/sink nodes
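The filter-to-destination idea can be sketched in a few lines of Java. This is only an illustration, not the Data Platform's implementation: the `Router` and `Route` classes, the destination names, and the use of `Map<String, Object>` as a stand-in for a JSON event are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: each route pairs a filter with a destination name.
class Router {
    static final class Route {
        final Predicate<Map<String, Object>> filter;
        final String destination;

        Route(Predicate<Map<String, Object>> filter, String destination) {
            this.filter = filter;
            this.destination = destination;
        }
    }

    private final List<Route> routes = new ArrayList<>();

    void addRoute(Predicate<Map<String, Object>> filter, String destination) {
        routes.add(new Route(filter, destination));
    }

    // Returns every destination whose filter accepts the event.
    List<String> destinationsFor(Map<String, Object> event) {
        List<String> result = new ArrayList<>();
        for (Route route : routes) {
            if (route.filter.test(event)) {
                result.add(route.destination);
            }
        }
        return result;
    }
}
```

With hundreds of nodes, the filters themselves become the hard part, which is what motivates a dedicated expression language later in the deck.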
5. Transforms
• Because of GDPR, we need to anonymize most incoming data formats
• Some data has quality issues that cannot be fixed at the source and require transforms to solve
• In many cases data needs to be transformed from one format to another
  • Pulse to Amplitude
  • Pulse to Adobe Analytics
  • ClickMeter to Pulse
  • Convert data to match database structures
  • …
6. Who configures?
• Schibsted has >100 business units
  • for Data Platform to do detailed configuration for all of these isn't going to scale
  • for sites to do it themselves saves lots of time
• Configuration requires domain knowledge
  • each site has its own specialities in Pulse tracking
  • to transform and route these correctly requires knowing all this
9. What if?
• We had an expression language for JSON
  • something like, say, XPath for JSON
  • could write routing filters using that
• We had a transformation language for JSON
  • write as JSON template, using expression language to compute values to insert
• A custom routing language for both batch and streaming, based on this language
  • designed to be expressive and easy to deploy
10. jq
• Already existing query language for JSON
  • https://stedolan.github.io/jq/
• Originally implemented in C
  • there is a Java implementation, too
• Can do things like
  • .foo
  • .foo.bar
  • .foo.bar > 25
  • …
14. Proof-of-concept
• Implement real-world transforms in this language
  • before it was implemented
• Helped improve and solidify the design
• Verified that the language could do what we needed
• Transforms looked quite reasonable
15. A simple language
• JSON is written in JSON syntax
  • evaluates to itself
• if <expr> <expr> else <expr>
• [for <expr> <expr>]
• let <name> = <expr>
• ${ … jq … }
17. Stunt prototype
• Most of it implemented in two days
• Implemented in Scala
  • using Antlr 3 to generate the parser
  • jackson-jq for jq
  • Jackson for JSON
• A simple object tree interpreter
  • Constructor.construct(Context, JsonNode) => JsonNode
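An object tree interpreter of this kind can be sketched as follows. The sketch is in Java rather than the prototype's Scala, and it uses plain Java objects instead of Jackson's `JsonNode` to stay dependency-free. The names `Expr`, `Literal`, `Variable`, `DotKey`, and `IfExpr` are hypothetical, and the truthiness rule (null and false are falsy) is an assumption.

```java
import java.util.Map;

// Hypothetical sketch: every node in the tree knows how to evaluate itself
// given the variables in scope and the input value.
interface Expr {
    Object evaluate(Map<String, Object> variables, Object input);
}

// A literal evaluates to itself.
class Literal implements Expr {
    private final Object value;
    Literal(Object value) { this.value = value; }
    public Object evaluate(Map<String, Object> variables, Object input) {
        return value;
    }
}

// A variable reference looks the name up in the current scope.
class Variable implements Expr {
    private final String name;
    Variable(String name) { this.name = name; }
    public Object evaluate(Map<String, Object> variables, Object input) {
        return variables.get(name);
    }
}

// A key lookup; a null parent means "look in the input itself".
class DotKey implements Expr {
    private final Expr parent;
    private final String key;
    DotKey(Expr parent, String key) { this.parent = parent; this.key = key; }
    public Object evaluate(Map<String, Object> variables, Object input) {
        Object source = parent == null ? input : parent.evaluate(variables, input);
        if (source instanceof Map) {
            return ((Map<?, ?>) source).get(key);
        }
        return null; // missing data simply yields null
    }
}

// A conditional evaluates only the branch that is selected.
class IfExpr implements Expr {
    private final Expr test;
    private final Expr then;
    private final Expr orElse;
    IfExpr(Expr test, Expr then, Expr orElse) {
        this.test = test; this.then = then; this.orElse = orElse;
    }
    public Object evaluate(Map<String, Object> variables, Object input) {
        Object result = test.evaluate(variables, input);
        boolean truthy = result != null && !Boolean.FALSE.equals(result);
        return (truthy ? then : orElse).evaluate(variables, input);
    }
}
```

Evaluation is just a recursive walk over the tree, which is why this style is quick to build.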
22. The parser
• Code that takes a character stream and builds the expression tree
• Use a parser generator to handle the difficult part
  • requires writing a grammar
• Parser generator produces Abstract Syntax Tree
  • basically corresponds to the grammar structure
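To make "takes a character stream and builds the expression tree" concrete, here is a hand-written parser for dotted paths such as `.foo.bar`. Note that this is deliberately not a generated parser, which is what the slides recommend; it only illustrates the input and output of the parsing step. `PathParser` and `Dot` are hypothetical names, and the restriction of keys to letters and digits is an assumption.

```java
// Hypothetical sketch: parses ".foo.bar" into a small tree of key lookups.
class PathParser {
    // Tree node: a key lookup applied to the result of an optional parent lookup.
    static final class Dot {
        final Dot parent; // null for the first step in the path
        final String key;

        Dot(Dot parent, String key) {
            this.parent = parent;
            this.key = key;
        }

        @Override
        public String toString() {
            return (parent == null ? "" : parent.toString()) + "." + key;
        }
    }

    static Dot parse(String source) {
        int pos = 0;
        Dot tree = null;
        while (pos < source.length()) {
            if (source.charAt(pos) != '.') {
                throw new IllegalArgumentException("Expected '.' at position " + pos);
            }
            pos++;
            int start = pos;
            while (pos < source.length() && Character.isLetterOrDigit(source.charAt(pos))) {
                pos++;
            }
            if (pos == start) {
                throw new IllegalArgumentException("Expected a key at position " + pos);
            }
            // Each new step wraps the tree built so far.
            tree = new Dot(tree, source.substring(start, pos));
        }
        if (tree == null) {
            throw new IllegalArgumentException("Empty expression");
        }
        return tree;
    }
}
```

Even this tiny grammar needs explicit position tracking and error handling, which is the tedium a parser generator takes over.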
25. Language in use
• Implemented Data Quality Tooling using jq
  • filters done in jq
• Implemented routing using jq filters
  • and transforms in JSLT
• Wrote some transforms using the language
  • anonymization of tracking data
  • cleanup transforms to handle bad data
  • …
26. The good
• The language works
  • proven by DQT, routing, and transforms
• Minimal implementation effort required
• Users approved of the language
  • general agreement it was a major improvement
  • people started writing their own transforms
27. The bad
• Performance could be better
  • not horrible, but not great, either
• The ${ … } wrappers are really ugly
• jq
  • does not handle missing data well
  • has dangerous features
  • has weird and difficult syntax for some things
• Too many dependencies
  • Scala runtime (with versioning issues)
  • Antlr runtime
28. 2.0
• Implement the complete language ourselves
  • goodbye ${ … }
• Get rid of the jq strangeness
• Add some new functionality
• Implement in pure Java with JavaCC
  • JavaCC has no runtime dependencies
  • only dependency is Jackson
29. JSLT expressions
.foo                          Get "foo" key from input object
.foo.bar                      Get "foo", then ".bar" on that
.foo == 231                   Comparison
.foo and .bar < 12            Boolean operator
$baz.foo                      Variable reference
test(.foo, "^[a-z0-9]+$")     Functions (& regexps)
42. Turing-complete?
• Means that the language can express any computation
• It's known that all that's required is
  • conditionals (we have if tests)
  • recursion (our functions can call themselves)
• But can this really be true?
43. N-queens
• Write a function that takes the size of the chessboard and returns the board with queens placed so that none attack each other
• queens(4) =>
  [
    [ 0, 1, 0, 0 ],
    [ 0, 0, 0, 1 ],
    [ 1, 0, 0, 0 ],
    [ 0, 0, 1, 0 ]
  ]
https://github.com/schibsted/jslt/blob/master/examples/queens.jslt
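For readers who want to see the algorithm without reading JSLT, here is a Java sketch that uses only conditionals, loops, and recursion. It is not a translation of queens.jslt; the class name `Queens`, the helper methods, and the row-by-row backtracking strategy are assumptions chosen for illustration.

```java
// Hypothetical sketch: row-by-row backtracking for the N-queens problem.
class Queens {
    // Returns an n x n board with 1 marking a queen, or null if no solution exists.
    static int[][] queens(int n) {
        int[] columns = new int[n]; // columns[row] = column of the queen in that row
        if (!place(columns, 0)) {
            return null;
        }
        int[][] board = new int[n][n];
        for (int row = 0; row < n; row++) {
            board[row][columns[row]] = 1;
        }
        return board;
    }

    // Recursively tries to place queens in rows row..n-1.
    private static boolean place(int[] columns, int row) {
        if (row == columns.length) {
            return true; // every row has a queen
        }
        for (int col = 0; col < columns.length; col++) {
            if (isSafe(columns, row, col)) {
                columns[row] = col;
                if (place(columns, row + 1)) {
                    return true;
                }
            }
        }
        return false; // no column works in this row, so backtrack
    }

    // A square is safe if no earlier queen shares its column or a diagonal.
    private static boolean isSafe(int[] columns, int row, int col) {
        for (int r = 0; r < row; r++) {
            int c = columns[r];
            if (c == col || Math.abs(c - col) == row - r) {
                return false;
            }
        }
        return true;
    }
}
```

Trying columns from left to right, `Queens.queens(4)` finds the same board as shown on the slide.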
45. Danger?
• It's possible to implement operations that run forever
• But in practice the stack quickly gets too deep
• The JVM will then terminate the transform
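The underlying JVM behaviour is easy to reproduce in plain Java. The sketch below is not JSLT code: `RunawayRecursion` and its methods are hypothetical, and it only shows that recursion without a base case ends in a `StackOverflowError` rather than running forever.

```java
// Hypothetical sketch: recursion without a base case on the JVM.
class RunawayRecursion {
    static int depth = 0;

    // No base case, and the "+ 1" after the call means every call needs its own stack frame.
    static int forever(int n) {
        depth++;
        return forever(n + 1) + 1;
    }

    // Returns true if the JVM stopped the recursion with a StackOverflowError.
    static boolean terminatedByStackOverflow() {
        depth = 0;
        try {
            forever(0);
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }
}
```

How deep the recursion gets before the error depends on the thread's stack size, so the depth reached varies between runs and JVM settings.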
46. Performance
• 5-10 times faster than 1.0
• The main difference: no more jackson-jq
• jackson-jq is not very efficient
  • internal model is List<JsonNode>
  • creates many unnecessary objects during evaluation
  • does work at run-time that should be done at compile-time
47. JSLT improvements
• Value model is JsonNode
  • can usually just return data from input object or from code
• Efficient internal structures
  • all collections are arrays
  • very fast to traverse
• Boolean short-circuiting
  • once we know the result, stop evaluating
• Cache regular expressions to avoid recompiling
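The regular expression cache is the easiest of these to illustrate. The sketch below is an assumption about how such a cache could look, not JSLT's actual code: `RegexCache` is a hypothetical class, and `test` is assumed to succeed when the pattern matches anywhere in the value.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Hypothetical sketch: compile each distinct regular expression only once.
class RegexCache {
    private static final Map<String, Pattern> CACHE = new ConcurrentHashMap<>();

    // Returns the cached Pattern, compiling it on first use.
    static Pattern get(String regex) {
        return CACHE.computeIfAbsent(regex, Pattern::compile);
    }

    // Assumed semantics: true if the regex matches anywhere in the value.
    static boolean test(String value, String regex) {
        return get(regex).matcher(value).find();
    }
}
```

A transform that applies the same pattern to millions of events then pays the compilation cost once instead of once per event.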
48. The optimizer
• An optimizer is a function that takes an expression and outputs an expression such that
  • the new expression is at least as fast, and
  • always outputs the same value as the original
• Improves performance quite substantially even with very simple techniques
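Constant folding is one example of a very simple technique that fits this definition. The slides do not say which optimizations JSLT applies, so the choice of constant folding, the arithmetic-only expression tree, and the names `Ast`, `Num`, `InputValue`, and `Add` are all assumptions made for illustration.

```java
// Hypothetical sketch: an optimizer as a tree-to-tree function.
abstract class Ast {
    abstract long evaluate(long input);

    // By default there is nothing to optimize.
    Ast optimize() {
        return this;
    }
}

// A constant.
class Num extends Ast {
    final long value;
    Num(long value) { this.value = value; }
    long evaluate(long input) { return value; }
}

// The input value, unknown until run time.
class InputValue extends Ast {
    long evaluate(long input) { return input; }
}

class Add extends Ast {
    final Ast left;
    final Ast right;
    Add(Ast left, Ast right) { this.left = left; this.right = right; }

    long evaluate(long input) {
        return left.evaluate(input) + right.evaluate(input);
    }

    @Override
    Ast optimize() {
        Ast l = left.optimize();
        Ast r = right.optimize();
        if (l instanceof Num && r instanceof Num) {
            // Both sides are known up front, so do the addition once, ahead of time.
            return new Num(((Num) l).value + ((Num) r).value);
        }
        return new Add(l, r);
    }
}
```

The optimized tree has fewer nodes to visit per event and still returns the same value for every input, which are the two conditions the slide lists.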
51. Performance
• Test case: pulse-cleanup.jslt, real data, my laptop
  • a complicated transform: 165 lines
• Transforms 132,000 events/second in one thread
  • 1.0 did about 20,000 events/second, so 2.0 is roughly 6.6 times faster on this test
52. Three strategies
• Syntax tree interpreter
  • known to be the slowest approach
• Bytecode compiler with virtual machine
  • C version of jq does this
  • Java does this (until the JIT kicks in)
  • Python does this
• Native compilation
  • what the JIT compiler in Java does
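53. Designing a VM
Opcode   Param
DUP
MKOBJ
CALL     <func>

int[] bytecode;     // the compiled program
JsonNode[] stack;   // operand stack
int top;            // index of the topmost stack element

switch (opcode) {
  case DUP:
    // Java evaluates ++top before the right-hand side,
    // so stack[top-1] is the element that was on top
    stack[++top] = stack[top-1];
    break;
  case MKOBJ:
    stack[++top] = mapper.createObj…
    break;
  // …
}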
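A complete, runnable version of such a dispatch loop might look like the sketch below. It is not the prototype's code: `TinyVm` is a hypothetical class, plain Java objects replace `JsonNode`, and the `PUSH` opcode, the constant pool, and the fixed stack size are assumptions. `DUP` appears in the slide's opcode table; `MKARR` and `ARRADD` are named in the deck's compiler example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a stack-based VM with four opcodes.
class TinyVm {
    static final int PUSH = 0;   // parameter: index into the constant pool
    static final int DUP = 1;    // duplicates the top of the stack
    static final int MKARR = 2;  // pushes a new empty array
    static final int ARRADD = 3; // pops a value and appends it to the array below it

    @SuppressWarnings("unchecked")
    static Object run(int[] bytecode, Object[] constants) {
        Object[] stack = new Object[64]; // fixed size keeps the sketch short
        int top = -1;
        int pc = 0;
        while (pc < bytecode.length) {
            int opcode = bytecode[pc++];
            switch (opcode) {
                case PUSH:
                    stack[++top] = constants[bytecode[pc++]];
                    break;
                case DUP: {
                    Object value = stack[top];
                    stack[++top] = value;
                    break;
                }
                case MKARR:
                    stack[++top] = new ArrayList<Object>();
                    break;
                case ARRADD: {
                    Object value = stack[top--];
                    ((List<Object>) stack[top]).add(value);
                    break;
                }
                default:
                    throw new IllegalStateException("Unknown opcode " + opcode);
            }
        }
        return stack[top]; // the result is whatever is left on top
    }
}
```

Running the sequence MKARR, PUSH 0, ARRADD, PUSH 1, ARRADD against a two-element constant pool leaves a two-element list on the stack.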
54. Compiler
• Traverse down the object tree
  • emit bytecode as you go
• Stack nature of the VM matches object tree structure
  • each Expression produces code that leaves the value of that expression on the stack
• Example:
  • MKARR, <first value>, ARRADD, <second>, ARRADD, …
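The array example can be turned into a small compiler sketch. Instructions are emitted as readable strings instead of integers, the `PUSH` instruction is an assumption, and `Emitter`, `NumberLiteral`, and `ArrayLiteral` are hypothetical names; only `MKARR` and `ARRADD` come from the slide.

```java
import java.util.List;

// Hypothetical sketch: each tree node emits code that leaves its value on the stack.
interface Emitter {
    void compile(List<String> code);
}

class NumberLiteral implements Emitter {
    private final int value;
    NumberLiteral(int value) { this.value = value; }

    public void compile(List<String> code) {
        code.add("PUSH " + value); // PUSH is an assumed instruction
    }
}

class ArrayLiteral implements Emitter {
    private final List<Emitter> elements;
    ArrayLiteral(List<Emitter> elements) { this.elements = elements; }

    public void compile(List<String> code) {
        code.add("MKARR");            // leaves a new array on the stack
        for (Emitter element : elements) {
            element.compile(code);    // leaves the element's value on top of the array
            code.add("ARRADD");       // pops the value and appends it to the array
        }
    }
}
```

Because every node follows the same contract, nesting works without special cases: an inner array compiles to its own MKARR/ARRADD sequence, and the outer ARRADD then appends the finished inner array.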
55. Prototype
• Stunt implemented over a couple of days
• Depressing result: object tree interpreter ~40% faster
• Anthony Sparks tried the same thing
  • original VM implementation 5x slower
  • eventually managed to achieve performance parity
• So far: performance does not justify complexity
56. Java bytecode?
• The JVM is actually a stack-based VM
  • can simply compile to Java bytecode instead
• Tricky to learn tools for generating bytecode
  • no examples, very little documentation
• In the end decided to use the ASM library
  • not very nice to use
  • very primitive API
  • crashes with NullPointerException on bad bytecode
59. Results
• Hard work to build
  • many surprising issues in Java bytecode
• Performance boost of 15-25%
  • code lives on jvm-bytecode branch in GitHub
• Ideas for how it could be even faster…
  • through type inference
60. Type inference benefits
"sdrn:" + $namespace + ":" + $rType + ":" + $rId

Expression tree:
Plus(
  Plus("sdrn:" $namespace)
  Plus(
    Plus(":" $rType)
    Plus(":" $rId)
  )
)

Each Plus does:
  JsonNode -> String
  JsonNode -> String
  String + String
  new String -> new JsonNode

Will make 4 unnecessary TextNode objects
Will wrap and unwrap String repeatedly
Will check types unnecessarily
61. Solution
• + operator can ask both sides: what type will you produce?
  • If one side says "string" then the result will be a string
• When compiling, do compile(generator, String)
  • will compile code that produces a Java String object
• + operator will make a new String if that's what's wanted
  • or turn it into a TextNode if the context wants Any
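The saving can be demonstrated without generating any bytecode. In the sketch below, `TextBox` is a hypothetical stand-in for Jackson's `TextNode` that counts how many wrappers are created, and `StrExpr`, `Text`, and `Concat` are made-up names. The "naive" path wraps every intermediate result, while the "typed" path passes raw Java strings and wraps only the final value.

```java
// Hypothetical stand-in for TextNode that counts allocations.
class TextBox {
    static int created = 0;
    final String text;

    TextBox(String text) {
        this.text = text;
        created++;
    }
}

interface StrExpr {
    TextBox evalNaive();  // untyped path: always returns a wrapped value
    String evalString();  // typed path: returns a raw Java String
}

// A string value that is already wrapped, like a literal or a variable's value.
class Text implements StrExpr {
    private final TextBox boxed;
    Text(String value) { this.boxed = new TextBox(value); }
    public TextBox evalNaive() { return boxed; }
    public String evalString() { return boxed.text; }
}

class Concat implements StrExpr {
    private final StrExpr left;
    private final StrExpr right;
    Concat(StrExpr left, StrExpr right) { this.left = left; this.right = right; }

    // Unwraps both sides, concatenates, and wraps the result again.
    public TextBox evalNaive() {
        return new TextBox(left.evalNaive().text + right.evalNaive().text);
    }

    // Knows that both sides produce strings, so no wrapper is needed here.
    public String evalString() {
        return left.evalString() + right.evalString();
    }
}
```

For an expression with five `+` operators, such as the one on the previous slide, the naive path creates five wrappers, four of them for intermediate results. The typed path creates one, when the final string is wrapped for a context that wants Any.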
62. Freedom from Jackson
• The current codebase is bound to Jackson
• JVM bytecode compilation might be a way to escape that
• Could build compilers that can interface with different JSON representations
• Have ideas for a more efficient JSON representation
  • basically encode everything as arrays of ints
  • should save memory, reduce GC, and produce faster code
63. Freedom from JSON
• If we aren't bound to Jackson, why should we be bound to JSON?
• Could support Avro, too
• Perhaps also other formats
65. Internal status
• JSLT now used in
  • Data Quality Tooling (to express tests on data)
  • routing filters
  • transforms
• In Schibsted we have
  • 52 transforms, 2370 lines of code
  • written by many people in different parts of the company
• Data Platform runs ~11 billion transforms/day
66. Open source status
• Released in June
• People are using it for real
  • one certain case, several more examples
  • details unknown
• Useful contributions from outsiders
  • several bug fixes to datetime/number handling
• Two alternative implementations being worked on
  • one in .NET
  • one virtual machine-based, in Java
67. Lessons learned
• A custom language can make life much simpler
  • if it fits the use case well
• Implementing a language is easier than it seems
  • basically doable in a week
• Designing a language is not easy
  • unfortunately