JSLT: JSON query & transform
Lars Marius Garshol, lars.marius.garshol@schibsted.com
http://twitter.com/larsga 2018–09–12, JavaZone 2018
Data Platform
[Diagram: Batch, Streaming, Pulse; data volume]
Routing
• We send data to ~210 different destinations
• Filters on the data specify which data should go where
• often very detailed conditions on many fields
• Full routing tree has ~600 filter/transform/sink nodes
Transforms
• Because of GDPR, we need to anonymize most incoming data formats
• Some data has quality issues that cannot be fixed at the source and requires transforms to solve
• In many cases data needs to be transformed from one format to another
• Pulse to Amplitude
• Pulse to Adobe Analytics
• ClickMeter to Pulse
• Convert data to match database structures
• …
Who congures?
• Schibsted has >100 business units
• for Data Platform to do detailed configuration for all of
these isn’t going to scale
• for sites to do it themselves saves lots of time
• Configuration requires domain knowledge
• each site has its own specialities in Pulse tracking
• to transform and route these correctly requires knowing all
this
6
Batch cong: 1 sink
{
"driver": "anyoflter",
"name": "image-classication",
"rules": [
{ "name": "ImageClassication", "key": "provider.component", "value": "ImageClassication" },
{ "name": "ImageSimilarity", "key": "provider.component", "value": "ImageSimilarity" }
],
"onmatch": [
{
"driver": "cache",
"name": "image-classication",
"level": "memory+disk"
},
{
"driver": "demuxer",
"name": "image-classication",
"rules": "${pulseSdrnFilterUri}",
"parallel": true,
"onmatch": {
"driver": "textlewriter",
"uri": "${imageSiteUri}",
"numFiles": {
"eventsPerFile": 500000,
"max": ${numExecutors}
}
}
}
],
"consume": true
7
Early cong was 1838 lines
} JSON matching
A real transform
8
What if?
• We had an expression language for JSON
• something like, say, XPath for JSON
• could write routing filters using that
• We had a transformation language for JSON
• write as JSON template, using expression language to compute values to insert
• A custom routing language for both batch and streaming, based on this language
• designed for easy expressivity & deploy
jq
• Already existing query language for JSON
• https://stedolan.github.io/jq/
• Originally implemented in C
• there is a Java implementation, too
• Can do things like
• .foo
• .foo.bar
• .foo.bar > 25
• …
First, fumbling attempt
{
  "event_type" : "View",
  "insert_id" : {"__expr__" : ".object.id"},
  "source" : {"__if__" : {
    "test" : ".source",
    "then" : ".source.id",
    "else" : ".src"
  }}
}
Second, fumbling attempt
{
  "event_type" : "View",
  "insert_id" : "${.object.id}",
  "source" : "${if (.source) .source.id else .src}"
}
Third attempt
{
  "event_type" : "View",
  "insert_id" : ${ .object.id },
  "source" : if ${ .source } ${ .source.id } else ${ .src }
}
Proof-of-concept
• Implement real-world transforms in this language
• before it was implemented
• Helped improve and solidify the design
• Verified that the language could do what we needed
• Transforms looked quite reasonable
A simple language
• JSON is written in JSON syntax
• evaluates to itself
• if <expr> <expr> else <expr>
• [for <expr> <expr>]
• let <name> = <expr>
• ${ … jq … }
Matchers
{
  "event_type" : "View",
  "insert_id" : ${ .object.id },
  * : ${ . }
}
{
  "event_type" : "View",
  "insert_id" : ${ .object.id },
  * - "object" : ${ . }
}
Stunt prototype
• Most of it implemented in two days
• Implemented in Scala
• using Antlr 3 to generate the parser
• jackson-jq for jq
• jackson for JSON
• A simple object tree interpreter
• Constructor.construct(Context, JsonNode) => JsonNode
Object tree
{
  "event_type" : "View",
  "insert_id" : ${ .object.id },
  "source" : if ${ .source } ${ .source.id } else ${ .src }
}
ObjectConstructor
  PairConstructor(LiteralConstructor)
  PairConstructor(JqConstructor)
  PairConstructor(IfConstructor(Jq, Jq, Jq))
Literal expression
Object expression
• Create object
• Construct value
• Add to object
If
• Evaluate condition
• Construct then
• Construct else
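The three node types above (literal, object, if) can be sketched as a small object tree interpreter. This is only an illustration and not the original Scala code: plain Java maps and strings stand in for Jackson's JsonNode, the `path` helper stands in for a jq expression, and all class and method names here are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: plain Java values (Map, String, Number, null) stand in for JsonNode.
final class InterpreterSketch {

    // Every node of the object tree knows how to construct its own output value.
    interface Constructor {
        Object construct(Map<String, Object> input);
    }

    // Literal expression: evaluates to itself.
    static Constructor literal(Object value) {
        return input -> value;
    }

    // Stand-in for a jq expression such as .source.id: walk the keys, yield null if one is missing.
    static Constructor path(String... keys) {
        return input -> {
            Object current = input;
            for (String key : keys) {
                if (!(current instanceof Map)) {
                    return null;
                }
                current = ((Map<?, ?>) current).get(key);
            }
            return current;
        };
    }

    // If: evaluate the condition, then construct either the then-branch or the else-branch.
    static Constructor ifElse(Constructor test, Constructor then, Constructor orElse) {
        return input -> {
            if (isTrue(test.construct(input))) {
                return then.construct(input);
            }
            return orElse.construct(input);
        };
    }

    // Object expression: create the object, construct each value, add it under its key.
    static Constructor object(Map<String, Constructor> pairs) {
        return input -> {
            Map<String, Object> result = new LinkedHashMap<>();
            for (Map.Entry<String, Constructor> pair : pairs.entrySet()) {
                result.put(pair.getKey(), pair.getValue().construct(input));
            }
            return result;
        };
    }

    // Assumed truthiness rule for this sketch: null and false are false, everything else is true.
    static boolean isTrue(Object value) {
        return value != null && !Boolean.FALSE.equals(value);
    }

    // Builds the tree for the "third attempt" transform and applies it to the input.
    static Object demo(Map<String, Object> input) {
        Map<String, Constructor> pairs = new LinkedHashMap<>();
        pairs.put("event_type", literal("View"));
        pairs.put("insert_id", path("object", "id"));
        pairs.put("source", ifElse(path("source"), path("source", "id"), path("src")));
        return object(pairs).construct(input);
    }

    public static void main(String[] args) {
        Map<String, Object> object = new LinkedHashMap<>();
        object.put("id", 21323);
        Map<String, Object> input = new LinkedHashMap<>();
        input.put("object", object);
        input.put("src", "App123");
        System.out.println(demo(input));
    }
}
```

Each node only needs a single construct method, which is presumably why a prototype like this can be built in a couple of days.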
The parser
• Code that takes a character stream and builds the expression tree
• Use a parser generator to handle the difficult part
• requires writing a grammar
• Parser generator produces Abstract Syntax Tree
• basically corresponds to the grammar structure
Antlr Grammar
grammar Jstl;
WS : [ \t\r\n]+ -> skip ; // ignore whitespace
COMMENT : '//' ~[\r\n]* [\r\n] -> skip ; // ignore comments
STRING : '"' ((~["]) | ('\\"'))+ '"' ;
INT: '-'? [0-9]+ ;
FLOAT: '-'? [0-9]+ '.' [0-9]+ ;
NULL: 'null';
BOOL: 'true' | 'false';
IDENT: [A-Za-z] [_A-Za-z]* ;
JQ : '$' '{' (~[}"] | '"' (~["] | '\\' .)* '"')+ '}' ;
jstl : let* expr EOF ;
expr : object | array | STRING | INT | FLOAT | NULL | BOOL |
       JQ | ifvalue | forexpr | parenthesis;
Parser
Language in use
• Implemented Data Quality Tooling using jq
• filters done in jq
• Implemented routing using jq filters
• and transforms in JSLT
• Wrote some transforms using the language
• anonymization of tracking data
• cleanup transforms to handle bad data
• …
The good
• The language works
• proven by DQT, routing, and transforms
• Minimal implementation effort required
• Users approved of the language
• general agreement it was a major improvement
• people started writing their own transforms
The bad
• Performance could be better
• not horrible, but not great, either
• The ${ … } wrappers are really ugly
• jq
• does not handle missing data well
• has dangerous features
• has weird and difficult syntax for some things
• Too many dependencies
• Scala runtime (with versioning issues)
• Antlr runtime
2.0
• Implement the complete language ourselves
• goodbye ${ … }
• Get rid of the jq strangeness
• Add some new functionality
• Implement in pure Java with JavaCC
• JavaCC has no runtime dependencies
• only dependency is Jackson
JSLT expressions
.foo                        Get "foo" key from input object
.foo.bar                    Get "foo", then ".bar" on that
.foo == 231                 Comparison
.foo and .bar < 12          Boolean operator
$baz.foo                    Variable reference
test(.foo, "^[a-z0-9]+$")   Functions (& regexps)
JSLT transforms
Anonymization
Sinks
VG-FrontExperimentsEngagement-1:
  eventType: PulseAnonymized
  filter: get-client(.) == "vg" and ."@type" == "Engagement" and
    contains(.object."@type", ["Article", "SalesPoster"]) and
    (contains("df-86-", .origin.terms) or
    contains("df-86-", .object."spt:custom".terms))
  transform: transforms/vg-article-views.jslt
  kinesis:
    arn: arn:aws:kinesis:eu-west-1:070941167498:stream/vg_front_experiments_engagement
    role: arn:aws:iam::070941167498:role/data-platform-kinesis-write
Expressions
• + - / *
• and or
• > < <= != == >=
• literals
• variable references
• function calls (+ function library)
Dealing with missing data
• 2 + null => null
• size(null) => null
• number("12") => 12
• number(null) => null
Operators
• Evaluate left & right
• Do the operation
Numeric operators
• Null handling
• Convert to numbers or error
• Decide if int or float
• The actual operations
Minus
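The steps listed for numeric operators (null handling, number check, int-or-float decision, the operation itself) might look roughly like this for minus. It is a sketch under assumptions, not the JSLT source: plain Java objects replace JsonNode, and the `minus` and `isIntegral` helpers are hypothetical names.

```java
// Hypothetical sketch of a numeric operator; plain Java objects stand in for JsonNode.
final class NumericMinusSketch {

    static Number minus(Object left, Object right) {
        // Null handling: missing data propagates instead of failing.
        if (left == null || right == null) {
            return null;
        }
        // Convert to numbers or error (this sketch only accepts values that already are numbers).
        if (!(left instanceof Number) || !(right instanceof Number)) {
            throw new IllegalArgumentException("minus needs numbers, got " + left + " and " + right);
        }
        Number l = (Number) left;
        Number r = (Number) right;
        // Decide if int or float: the result is integral only if both operands are.
        if (isIntegral(l) && isIntegral(r)) {
            return l.longValue() - r.longValue();
        }
        // The actual operation, in floating point.
        return l.doubleValue() - r.doubleValue();
    }

    static boolean isIntegral(Number n) {
        return n instanceof Integer || n instanceof Long;
    }

    public static void main(String[] args) {
        System.out.println(minus(5, 3));
        System.out.println(minus(5, 1.5));
        System.out.println(minus(5, null));
    }
}
```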
Object for expressions
Transform:
{
  "event_type" : "View",
  "insert_id" : .object.id,
  "source" : if (.source) .source.id else .src
} + {
  for (.custom)
    "custom_" + .key : .value
}
Input:
{
  "object" : {"id" : 21323, … },
  "src" : "App123",
  "custom" : {
    "time" : 4592593492,
    "event": 3433
  }
}
Output:
{
  "event_type" : "View",
  "insert_id" : 21323,
  "source" : "App123",
  "custom_time": 4592593492,
  "custom_event" : 3433
}
Function declarations
def <name>(<param1>, <param2>)
<let>*
<expr>
Very easy to implement
Very useful
But means going Turing-complete…
A real function
Implementation
Turing-complete?
• Means that the language can express any computation
• It’s known that all that’s required is
• conditionals (we have if tests)
• recursion (our functions can call themselves)
• But can this really be true?
N-queens
• Write a function that takes the size of the chessboard and returns it with queens
• queens(4) =>
[
  [ 0, 1, 0, 0 ],
  [ 0, 0, 0, 1 ],
  [ 1, 0, 0, 0 ],
  [ 0, 0, 1, 0 ]
]
https://github.com/schibsted/jslt/blob/master/examples/queens.jslt
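To illustrate that conditionals plus recursion are enough, here is a Java sketch of N-queens that uses no loops at all. It is not a translation of queens.jslt; the row-by-row backtracking strategy and all names are assumptions made for this example.

```java
import java.util.Arrays;

// Hypothetical sketch: N-queens using only conditionals and recursion (no loops).
final class QueensSketch {

    // Returns an n x n board (1 = queen), or null if no placement exists.
    static int[][] queens(int n) {
        int[] cols = place(new int[n], 0, 0);
        if (cols == null) {
            return null;
        }
        return toBoard(cols, 0, new int[n][n]);
    }

    // Try to place a queen in `row`, starting at column `col`; cols[r] is the queen's column in row r.
    static int[] place(int[] cols, int row, int col) {
        int n = cols.length;
        if (row == n) {
            return cols;                      // every row has a queen
        }
        if (col == n) {
            return null;                      // no column works in this row: backtrack
        }
        if (safe(cols, row, col, 0)) {
            int[] next = cols.clone();
            next[row] = col;
            int[] solved = place(next, row + 1, 0);
            if (solved != null) {
                return solved;
            }
        }
        return place(cols, row, col + 1);     // try the next column
    }

    // True if a queen at (row, col) is not attacked by the queens in rows prev..row-1.
    static boolean safe(int[] cols, int row, int col, int prev) {
        if (prev == row) {
            return true;
        }
        if (cols[prev] == col || Math.abs(cols[prev] - col) == row - prev) {
            return false;                     // same column or same diagonal
        }
        return safe(cols, row, col, prev + 1);
    }

    // Marks each queen on the board, one row per recursive call.
    static int[][] toBoard(int[] cols, int row, int[][] board) {
        if (row == cols.length) {
            return board;
        }
        board[row][cols[row]] = 1;
        return toBoard(cols, row + 1, board);
    }

    public static void main(String[] args) {
        // prints [[0, 1, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 1, 0]]
        System.out.println(Arrays.deepToString(queens(4)));
    }
}
```

With this search order, queens(4) yields the same board as the one shown on the slide.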
Getting started
Danger?
• It’s possible to implement operations that run forever
• But in practice the stack quickly gets too deep
• The JVM will then terminate the transform
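A small sketch of what that termination can look like from the host application's side. It assumes the runaway transform recurses on the Java stack, so the JVM raises StackOverflowError; the `forever` and `runGuarded` names are hypothetical and not part of JSLT.

```java
import java.util.function.Supplier;

// Hypothetical sketch: a host guarding against a transform that recurses without end.
final class RunawayGuardSketch {

    // Stands in for a user-defined function that keeps calling itself.
    static long forever(long n) {
        return forever(n + 1) + 1;
    }

    // Runs the transform; unbounded recursion ends in StackOverflowError, which is caught here.
    static String runGuarded(Supplier<Object> transform) {
        try {
            return String.valueOf(transform.get());
        } catch (StackOverflowError e) {
            return "terminated: stack too deep";
        }
    }

    public static void main(String[] args) {
        System.out.println(runGuarded(() -> forever(0)));
        System.out.println(runGuarded(() -> 42));
    }
}
```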
Performance
• 5-10 times faster than 1.0
• The main difference: no more jackson-jq
• jackson-jq is not very efficient
• internal model is List<JsonNode>
• creates many unnecessary objects during evaluation
• does work at run-time that should be done at compile-time
JSLT improvements
• Value model is JsonNode
• can usually just return data from input object or from code
• Efficient internal structures
• all collections are arrays
• very fast to traverse
• Boolean short-circuiting
• once we know the result, stop evaluating
• Cache regular expressions to avoid recompiling
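Two of these techniques, regular expression caching and Boolean short-circuiting, can be sketched in a few lines. This is an illustration only: the cache layout, the null-gives-false rule, and the method names are assumptions, not the JSLT implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BooleanSupplier;
import java.util.regex.Pattern;

// Hypothetical sketch of two evaluation-time optimizations.
final class EvalTricksSketch {

    // Compiled patterns are kept by source string, so each regexp is compiled only once.
    private static final Map<String, Pattern> CACHE = new ConcurrentHashMap<>();

    static Pattern compiled(String regex) {
        return CACHE.computeIfAbsent(regex, Pattern::compile);
    }

    // In the spirit of test(.foo, "^[a-z0-9]+$"); null input gives false (assumption of this sketch).
    static boolean test(String value, String regex) {
        return value != null && compiled(regex).matcher(value).find();
    }

    // Short-circuiting "and": the right side is never evaluated once the left side is false.
    static boolean and(BooleanSupplier left, BooleanSupplier right) {
        return left.getAsBoolean() && right.getAsBoolean();
    }

    public static void main(String[] args) {
        System.out.println(test("abc123", "^[a-z0-9]+$"));
        System.out.println(and(() -> false, () -> test("abc123", "^[a-z0-9]+$")));
    }
}
```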
The optimizer
• An optimizer is a function that takes an expression and outputs an expression such that
• the new expression is at least as fast, and
• always outputs the same value
• Improves performance quite substantially even with very simple techniques
Constant folding
contains(get-client(.), ["vg", "aftenposten", "bt"])
Before:
FunctionExpression
  FunctionExpression
    DotExpression
  ArrayExpression
    LiteralExpression
    LiteralExpression
    LiteralExpression
After:
FunctionExpression
  FunctionExpression
    DotExpression
  LiteralExpression
Implementation
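A minimal sketch of constant folding over an expression tree, assuming a tiny hypothetical node model (Literal, Dot, ArrayExpr, Contains) rather than the real JSLT classes. To keep it short, the example folds contains(., [...]) instead of contains(get-client(.), [...]).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: fold an array made only of literals into one precomputed literal.
final class ConstantFoldingSketch {

    interface Expr {
        Object eval(Object input);
    }

    static final class Literal implements Expr {
        final Object value;
        Literal(Object value) { this.value = value; }
        public Object eval(Object input) { return value; }
    }

    // The "." expression: returns the input itself.
    static final class Dot implements Expr {
        public Object eval(Object input) { return input; }
    }

    static final class ArrayExpr implements Expr {
        final List<Expr> items;
        ArrayExpr(Expr... items) { this.items = Arrays.asList(items); }
        public Object eval(Object input) {
            List<Object> out = new ArrayList<>();
            for (Expr item : items) {
                out.add(item.eval(input));
            }
            return out;
        }
    }

    static final class Contains implements Expr {
        final Expr element;
        final Expr collection;
        Contains(Expr element, Expr collection) { this.element = element; this.collection = collection; }
        public Object eval(Object input) {
            Object c = collection.eval(input);
            return c instanceof List && ((List<?>) c).contains(element.eval(input));
        }
    }

    // Returns an equivalent expression in which constant sub-expressions are already evaluated.
    static Expr optimize(Expr e) {
        if (e instanceof ArrayExpr) {
            List<Expr> folded = new ArrayList<>();
            boolean allLiteral = true;
            for (Expr item : ((ArrayExpr) e).items) {
                Expr optimized = optimize(item);
                folded.add(optimized);
                allLiteral &= optimized instanceof Literal;
            }
            ArrayExpr rebuilt = new ArrayExpr(folded.toArray(new Expr[0]));
            if (allLiteral) {
                return new Literal(rebuilt.eval(null));   // input is irrelevant for constants
            }
            return rebuilt;
        }
        if (e instanceof Contains) {
            Contains c = (Contains) e;
            return new Contains(optimize(c.element), optimize(c.collection));
        }
        return e;
    }

    public static void main(String[] args) {
        Expr expr = new Contains(new Dot(),
                new ArrayExpr(new Literal("vg"), new Literal("aftenposten"), new Literal("bt")));
        Expr fast = optimize(expr);
        System.out.println(fast.eval("vg"));
    }
}
```

The optimized tree builds the array once at compile time instead of once per event, while returning the same value for every input.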
Performance
• Test case: pulse-cleanup.jslt, real data, my laptop
• a complicated transform: 165 lines
• Transforms 132,000 events/second in one thread
• 1.0 did about 20,000 events/second
Three strategies
• Syntax tree interpreter
• known to be the slowest approach
• Bytecode compiler with virtual machine
• C version of jq does this
• Java does that (until the JIT kicks in)
• Python does this
• Native compilation
• what JIT compiler in Java does
Designing a VM
Opcode   Param
DUP
MKOBJ
CALL     <func>

int[] bytecode;
JsonNode[] stack;
int top;
switch (opcode) {
  case DUP:
    stack[++top] = stack[top-1];
    break;
  case MKOBJ:
    stack[++top] = mapper.createObj…
    break;
  // …
}
Compiler
• Traverse down the object tree
• emit bytecode as you go
• Stack nature of the VM matches object tree structure
• each Expression produces code that leaves the value of that expression on the stack
• Example:
• MKARR, <first value>, ARRADD, <second>, ARRADD, …
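The MKARR / ARRADD example can be made concrete with a toy compiler and VM. This is a sketch under assumptions: the PUSH opcode, the constant table, and the numeric opcode values are invented for the illustration, and plain Java objects replace JsonNode.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a stack VM: compile an array of constants, then run the bytecode.
final class StackVmSketch {

    static final int PUSH = 0;    // PUSH <constant index>: push constants[index]
    static final int MKARR = 1;   // push a new, empty array
    static final int ARRADD = 2;  // pop a value and append it to the array below it

    // Emits MKARR, then PUSH i, ARRADD for each of the valueCount constants.
    static int[] compileArray(int valueCount) {
        int[] code = new int[1 + 3 * valueCount];
        int pc = 0;
        code[pc++] = MKARR;
        for (int i = 0; i < valueCount; i++) {
            code[pc++] = PUSH;
            code[pc++] = i;
            code[pc++] = ARRADD;
        }
        return code;
    }

    // Interprets the bytecode; the result is whatever is left on top of the stack.
    @SuppressWarnings("unchecked")
    static Object run(int[] bytecode, Object[] constants) {
        Object[] stack = new Object[16];
        int top = -1;
        for (int pc = 0; pc < bytecode.length; pc++) {
            switch (bytecode[pc]) {
                case PUSH:
                    stack[++top] = constants[bytecode[++pc]];
                    break;
                case MKARR:
                    stack[++top] = new ArrayList<Object>();
                    break;
                case ARRADD: {
                    Object value = stack[top--];
                    ((List<Object>) stack[top]).add(value);
                    break;
                }
                default:
                    throw new IllegalStateException("unknown opcode " + bytecode[pc]);
            }
        }
        return stack[top];
    }

    public static void main(String[] args) {
        Object[] constants = {"vg", "aftenposten", "bt"};
        System.out.println(run(compileArray(constants.length), constants));
    }
}
```

Every instruction leaves the stack in the state the next one expects, which is the property the compiler relies on when it emits code while walking the tree.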
Prototype
• Stunt implemented over a couple of days
• Depressing result: object tree interpreter ~40% faster
• Anthony Sparks tried the same thing
• original VM implementation 5x slower
• eventually managed to achieve performance parity
• So far: performance does not justify complexity
Java bytecode?
• The JVM is actually a stack-based VM
• can simply compile to Java bytecode instead
• Tricky to learn tools for generating bytecode
• no examples, very little documentation
• In the end decided to use the Asm library
• not very nice to use
• very primitive API
• crashes with NullPointerException on bad bytecode
Compiler
Compile dot accessor
Results
• Hard work to build
• many surprising issues in Java bytecode
• Performance boost of 15-25%
• code lives on jvm-bytecode branch in GitHub
• Ideas for how it could be even faster…
• through type inference
Type inference benets
"sdrn:" + $namespace + ":" + $rType + ":" + $rId
Plus
Plus(“sdrn” $namespace)
Plus(
Plus(“:” $rType)
Plus(“:” $rId)
)
60
Plus:
JsonNode -> String
JsonNode -> String
String + String
new String -> new JsonNode
Will make 4 unnecessary TextNode objects
Will wrap and unwrap String repeatedly
Will check types unnecessarily
Solution
• + operator can ask both sides: what type will you produce?
• If one side says “string” then the result will be a string
• When compiling, do compile(generator, String)
• will compile code that produces a Java String object
• + operator will make a new String if that’s what’s wanted
• or turn it into a TextNode if the context wants Any
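The saving can be illustrated with an interpreter-style sketch rather than a bytecode compiler. Everything here is hypothetical: `Text` stands in for Jackson's TextNode, `wrap` counts how many wrappers are allocated during evaluation, and `evalString` plays the role of code compiled with String as the wanted type.

```java
// Hypothetical sketch: untyped vs. type-directed evaluation of string concatenation.
final class TypedPlusSketch {

    static int wrapperCount = 0;

    // Stand-in for TextNode.
    static final class Text {
        final String value;
        Text(String value) { this.value = value; }
    }

    // Every wrapper created during evaluation goes through here so it can be counted.
    static Text wrap(String s) {
        wrapperCount++;
        return new Text(s);
    }

    interface Expr {
        Text eval();           // untyped path: always produces a wrapped value
        String evalString();   // typed path: the context asked for a Java String
    }

    // A literal or variable whose wrapped value already exists before evaluation starts.
    static final class Str implements Expr {
        final Text node;
        Str(String s) { this.node = new Text(s); }
        public Text eval() { return node; }
        public String evalString() { return node.value; }
    }

    static final class Plus implements Expr {
        final Expr left;
        final Expr right;
        Plus(Expr left, Expr right) { this.left = left; this.right = right; }
        // Untyped: unwrap both sides, concatenate, wrap the result again.
        public Text eval() { return wrap(left.eval().value + right.eval().value); }
        // Typed: both sides hand over Strings directly, no intermediate wrappers.
        public String evalString() { return left.evalString() + right.evalString(); }
    }

    // Same tree shape as on the slide: five Plus nodes.
    static Expr sdrn(String namespace, String rType, String rId) {
        return new Plus(
                new Plus(new Str("sdrn:"), new Str(namespace)),
                new Plus(new Plus(new Str(":"), new Str(rType)),
                         new Plus(new Str(":"), new Str(rId))));
    }

    public static void main(String[] args) {
        wrapperCount = 0;
        sdrn("vg", "article", "42").eval();
        System.out.println("untyped wrappers: " + wrapperCount);
        wrapperCount = 0;
        wrap(sdrn("vg", "article", "42").evalString());
        System.out.println("typed wrappers: " + wrapperCount);
    }
}
```

In the untyped path each of the five Plus nodes allocates a wrapper, four of which are thrown away immediately; in the typed path only the final result is wrapped.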
Freedom from Jackson
• The current codebase is bound to Jackson
• JVM bytecode compilation might be a way to escape that
• Could build compilers that can interface with different JSON representations
• Have ideas for a more efficient JSON representation
• basically encode everything as arrays of ints
• should save memory, GC, and produce faster code
Freedom from JSON
• If we aren’t bound to Jackson, why should we be bound to JSON?
• Could support Avro, too
• Perhaps also other formats
Conclusion
Internal status
• JSLT now used in
• Data Quality Tooling (to express tests on data)
• routing filters
• transforms
• In Schibsted we have
• 52 transforms, 2370 lines of code
• written by many people in different parts of the company
• Data Platform runs ~11 billion transforms/day
Open source status
• Released in June
• People are using it for real
• one certain case, several more examples
• details unknown
• Useful contributions from outsiders
• several bug fixes to datetime/number handling
• Two alternative implementations being worked on
• one in .NET
• one is virtual machine-based in Java
Lessons learned
• A custom language can make life much simpler
• if it fits the use case well
• Implementing a language is easier than it seems
• basically doable in a week
• Designing a language is not easy
• unfortunately
https://github.com/schibsted/jslt
Slides at https://www.slideshare.net/larsga

 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 

KĂźrzlich hochgeladen (20)

Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 

JSLT: JSON querying and transformation

  • 1. JSLT: JSON query & transform Lars Marius Garshol, lars.marius.garshol@schibsted.com http://twitter.com/larsga 2018–09–12, JavaZone 2018
  • 4. Routing • We send data to ~210 different destinations • Filters on the data specify which data should go where • often very detailed conditions on many fields • Full routing tree has ~600 filter/transform/sink nodes
  • 5. Transforms • Because of GDPR we need to anonymize most incoming data formats • Some data has data quality issues that cannot be fixed at source and requires transforms to solve • In many cases data needs to be transformed from one format to another • Pulse to Amplitude • Pulse to Adobe Analytics • ClickMeter to Pulse • Convert data to match database structures • …
  • 6. Who configures? • Schibsted has >100 business units • for Data Platform to do detailed configuration for all of these isn’t going to scale • for sites to do it themselves saves lots of time • Configuration requires domain knowledge • each site has its own specialities in Pulse tracking • to transform and route these correctly requires knowing all this
  • 7. Batch cong: 1 sink { "driver": "anyoflter", "name": "image-classication", "rules": [ { "name": "ImageClassication", "key": "provider.component", "value": "ImageClassication" }, { "name": "ImageSimilarity", "key": "provider.component", "value": "ImageSimilarity" } ], "onmatch": [ { "driver": "cache", "name": "image-classication", "level": "memory+disk" }, { "driver": "demuxer", "name": "image-classication", "rules": "${pulseSdrnFilterUri}", "parallel": true, "onmatch": { "driver": "textlewriter", "uri": "${imageSiteUri}", "numFiles": { "eventsPerFile": 500000, "max": ${numExecutors} } } } ], "consume": true 7 Early cong was 1838 lines } JSON matching
  • 9. What if? • We had an expression language for JSON • something like, say, XPath for JSON • could write routing filters using that • We had a transformation language for JSON • write as JSON template, using expression language to compute values to insert • A custom routing language for both batch and streaming, based on this language • designed for easy expressivity & deploy
  • 10. jq • Already existing query language for JSON • https://stedolan.github.io/jq/ • Originally implemented in C • there is a Java implementation, too • Can do things like • .foo • .foo.bar • .foo.bar > 25 • …
  • 11. First, fumbling attempt { "event_type" : "View", "insert_id" : {"__expr__" : ".object.id"}, "source" : {"__if__" : { "test" : ".source", "then" : ".source.id", "else" : ".src" } } }
  • 12. Second, fumbling attempt { "event_type" : "View", "insert_id" : "${.object.id}", "source" : "${if (.source) .source.id else .src}" }
  • 13. Third attempt { "event_type" : "View", "insert_id" : ${ .object.id }, "source" : if ${ .source } ${ .source.id } else ${ .src } }
  • 14. Proof-of-concept • Implement real-world transforms in this language • before it was implemented • Helped improve and solidify the design • Verified that the language could do what we needed • Transforms looked quite reasonable.
  • 15. A simple language • JSON is written in JSON syntax • evaluates to itself • if <expr> <expr> else <expr> • [for <expr> <expr>] • let <name> = <expr> • ${ … jq … }
  • 16. Matchers • { "event_type" : "View", "insert_id" : ${ .object.id }, * : ${ . } } • { "event_type" : "View", "insert_id" : ${ .object.id }, * - "object" : ${ . } }
  • 17. Stunt prototype • Most of it implemented in two days • Implemented in Scala • using Antlr 3 to generate the parser • jackson-jq for jq • jackson for JSON • A simple object tree interpreter • Constructor.construct(Context, JsonNode) => JsonNode
  • 18. Object tree • { "event_type" : "View", "insert_id" : ${ .object.id }, "source" : if ${ .source } ${ .source.id } else ${ .src } } • ObjectConstructor • PairConstructor(LiteralConstructor) • PairConstructor(JqConstructor) • PairConstructor(IfConstructor(Jq, Jq, Jq))
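The object tree on slide 18 can be sketched as a small interpreter. This is only an illustration, not the 1.0 code (which was Scala on top of Jackson and jackson-jq): plain Java maps stand in for JsonNode, the Context parameter is left out, and the hypothetical Path class handles only simple .a.b lookups in place of a real jq expression.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Every node in the object tree knows how to build its output from the input.
interface Constructor {
    Object construct(Map<String, Object> input);
}

// A JSON literal evaluates to itself.
final class Literal implements Constructor {
    private final Object value;
    Literal(Object value) { this.value = value; }
    public Object construct(Map<String, Object> input) { return value; }
}

// Hypothetical stand-in for JqConstructor: follows keys, yields null if any is missing.
final class Path implements Constructor {
    private final String[] keys;
    Path(String... keys) { this.keys = keys; }
    public Object construct(Map<String, Object> input) {
        Object current = input;
        for (String key : keys) {
            if (!(current instanceof Map)) return null;
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }
}

// if <test> <then> else <otherwise>; null and false count as false in this sketch.
final class If implements Constructor {
    private final Constructor test, then, otherwise;
    If(Constructor test, Constructor then, Constructor otherwise) {
        this.test = test; this.then = then; this.otherwise = otherwise;
    }
    public Object construct(Map<String, Object> input) {
        Object t = test.construct(input);
        boolean truthy = t != null && !Boolean.FALSE.equals(t);
        return (truthy ? then : otherwise).construct(input);
    }
}

// One "key" : <expr> pair inside an object template.
final class Pair {
    final String key;
    final Constructor value;
    Pair(String key, Constructor value) { this.key = key; this.value = value; }
}

// Builds the output object by evaluating each pair against the input.
final class ObjectConstructor implements Constructor {
    private final List<Pair> pairs;
    ObjectConstructor(List<Pair> pairs) { this.pairs = pairs; }
    public Object construct(Map<String, Object> input) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Pair p : pairs) out.put(p.key, p.value.construct(input));
        return out;
    }
}
```

Building the template from slide 18 then amounts to nesting these objects: an ObjectConstructor holding three Pair entries, the last of which wraps an If over three Path nodes.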
  • 22. The parser • Code that takes a character stream and builds the expression tree • Use a parser generator to handle the difficult part • requires writing a grammar • Parser generator produces Abstract Syntax Tree • basically corresponds to the grammar structure
  • 23. Antlr Grammar grammar Jstl; WS : [ \t\r\n]+ -> skip ; // ignore whitespace COMMENT : '//' ~[\r\n]* [\r\n] -> skip ; // ignore comments STRING : '"' ((~["]) | ('\\"'))+ '"' ; INT: '-'? [0-9]+ ; FLOAT: '-'? [0-9]+ '.' [0-9]+ ; NULL: 'null'; BOOL: 'true' | 'false'; IDENT: [A-Za-z] [_A-Za-z]* ; JQ : '$' '{' (~[}"] | '"' (~["] | '\\' .)* '"')+ '}' ; jstl : let* expr EOF ; expr : object | array | STRING | INT | FLOAT | NULL | BOOL | JQ | ifvalue | forexpr | parenthesis;
  • 25. Language in use • Implemented Data Quality Tooling using jq • filters done in jq • Implemented routing using jq filters • and transforms in JSLT • Wrote some transforms using the language • anonymization of tracking data • cleanup transforms to handle bad data • …
  • 26. The good • The language works • proven by DQT, routing, and transforms • Minimal implementation effort required • Users approved of the language • general agreement it was a major improvement • people started writing their own transforms
  • 27. The bad • Performance could be better • not horrible, but not great, either • The ${ … } wrappers are really ugly • jq • does not handle missing data well • has dangerous features • has weird and difficult syntax for some things • Too many dependencies • Scala runtime (with versioning issues) • Antlr runtime
  • 28. 2.0 • Implement the complete language ourselves • goodbye ${ … } • Get rid of the jq strangeness • Add some new functionality • Implement in pure Java with JavaCC • JavaCC has no runtime dependencies • only dependency is Jackson
  • 29. JSLT expressions • .foo: get "foo" key from input object • .foo.bar: get "foo", then ".bar" on that • .foo == 231: comparison • .foo and .bar < 12: Boolean operator • $baz.foo: variable reference • test(.foo, "^[a-z0-9]+$"): functions (& regexps)
  • 32. Sinks VG-FrontExperimentsEngagement-1: eventType: PulseAnonymized filter: get-client(.) == "vg" and ."@type" == "Engagement" and contains(.object."@type", ["Article", "SalesPoster"]) and (contains("df-86-", .origin.terms) or contains("df-86-", .object."spt:custom".terms)) transform: transforms/vg-article-views.jslt kinesis: arn: arn:aws:kinesis:eu-west-1:070941167498:stream/vg_front_experiments_engagement role: arn:aws:iam::070941167498:role/data-platform-kinesis-write
  • 33. Expressions • + - / * • and or • > < <= != == >= • literals • variable references • function calls (+ function library)
  • 34. Dealing with missing data • 2 + null => null • size(null) => null • number("12") => 12 • number(null) => null
  • 35. Operators • Evaluate left & right • Do the operation
  • 36. Numeric operators • Null handling • Convert to numbers or error • Decide if int or float • The actual operations
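The four steps listed for numeric operators can be sketched as follows. This is a minimal illustration, not the JSLT source: it assumes the operands are already evaluated, uses boxed Java numbers instead of JsonNode, and the NumericOps class with its helper names is hypothetical. Multiplication is used because it has no string case to consider.

```java
final class NumericOps {
    // Multiplies two already-evaluated operands following the steps on the slide.
    static Number multiply(Object left, Object right) {
        // 1. Null handling: a missing operand makes the whole result null.
        if (left == null || right == null) return null;

        // 2. Convert to numbers, or fail with an error.
        Number l = toNumber(left);
        Number r = toNumber(right);

        // 3. Decide if int or float, then 4. do the actual operation.
        if (isIntegral(l) && isIntegral(r)) {
            return l.longValue() * r.longValue();
        }
        return l.doubleValue() * r.doubleValue();
    }

    private static Number toNumber(Object value) {
        if (value instanceof Number) return (Number) value;
        throw new IllegalArgumentException("Not a number: " + value);
    }

    private static boolean isIntegral(Number n) {
        return n instanceof Integer || n instanceof Long;
    }
}
```

Returning null instead of throwing on a missing operand is what lets a transform keep running when a field is absent from the input.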
  • 38. Object for expressions • Template: { "event_type" : "View", "insert_id" : .object.id, "source" : if (.source) .source.id else .src } + { for (.custom) "custom_" + .key : .value } • Input: { "object" : {"id" : 21323, … }, "src" : "App123", "custom" : { "time" : 4592593492, "event": 3433 } } • Output: { "event_type" : "View", "insert_id" : 21323, "source" : "App123", "custom_time": 4592593492, "custom_event" : 3433 }
  • 39. Function declarations • def <name>(<param1>, <param2>) <let>* <expr> • Very easy to implement • Very useful • But means going Turing-complete…
  • 42. Turing-complete? • Means that the language can express any computation • It’s known that all that’s required is • conditionals (we have if tests) • recursion (our functions can call themselves) • But can this really be true?
  • 43. N-queens • Write a function that takes the size of the chessboard and returns it with queens • queens(4) => [ [ 0, 1, 0, 0 ], [ 0, 0, 0, 1 ], [ 1, 0, 0, 0 ], [ 0, 0, 1, 0 ] ] • https://github.com/schibsted/jslt/blob/master/examples/queens.jslt
  • 45. Danger? • It’s possible to implement operations that run forever • But in practice the stack quickly gets too deep • The JVM will then terminate the transform
  • 46. Performance • 5-10 times faster than 1.0 • The main difference: no more jackson-jq • jackson-jq is not very efficient • internal model is List<JsonNode> • creates many unnecessary objects during evaluation • does work at run-time that should be done at compile-time
  • 47. JSLT improvements • Value model is JsonNode • can usually just return data from input object or from code • Efficient internal structures • all collections are arrays • very fast to traverse • Boolean short-circuiting • once we know the result, stop evaluating • Cache regular expressions to avoid recompiling
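The last point on slide 47, caching regular expressions, could look roughly like this. It is a sketch of the general technique using only java.util.regex and ConcurrentHashMap; the RegexCache class and its method names are hypothetical, and JSLT's actual cache may be organised differently (for example with a size bound).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

final class RegexCache {
    // Compiled patterns keyed by their source text; unbounded in this sketch.
    private static final Map<String, Pattern> CACHE = new ConcurrentHashMap<>();

    // Compiles the regex only the first time it is seen.
    static Pattern compile(String regex) {
        return CACHE.computeIfAbsent(regex, Pattern::compile);
    }

    // Rough analogue of test(<input>, <regex>): true if the regex matches anywhere.
    static boolean test(String input, String regex) {
        return compile(regex).matcher(input).find();
    }
}
```

Because a transform typically applies the same few patterns to every event, the compile cost is paid once per pattern rather than once per event.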
  • 48. The optimizer • An optimizer is a function that takes an expression and outputs an expression such that • the new expression is at least as fast, and • always outputs the same value • Improves performance quite substantially even with very simple techniques
  • 49. Constant folding • contains(get-client(.), ["vg", "aftenposten", "bt"]) • Before: FunctionExpression(FunctionExpression(DotExpression), ArrayExpression(LiteralExpression, LiteralExpression, LiteralExpression)) • After: FunctionExpression(FunctionExpression(DotExpression), LiteralExpression)
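The constant-folding step on slide 49, where an ArrayExpression of literals collapses into a single LiteralExpression, can be sketched as below. The class names are shortened, hypothetical versions of the ones on the slide, values are plain Java objects rather than JsonNode, and only arrays are folded here; function calls and operators are left out.

```java
import java.util.ArrayList;
import java.util.List;

interface Expr {
    Object eval(Object input);
}

// A constant: ignores the input entirely.
final class LiteralExpr implements Expr {
    final Object value;
    LiteralExpr(Object value) { this.value = value; }
    public Object eval(Object input) { return value; }
}

// "." : returns the input, so it can never be folded.
final class DotExpr implements Expr {
    public Object eval(Object input) { return input; }
}

// [ <expr>, <expr>, ... ] : builds a new list on every evaluation.
final class ArrayExpr implements Expr {
    final List<Expr> items;
    ArrayExpr(List<Expr> items) { this.items = items; }
    public Object eval(Object input) {
        List<Object> out = new ArrayList<>();
        for (Expr item : items) out.add(item.eval(input));
        return out;
    }
}

final class Optimizer {
    // Returns an expression that yields the same value but does less work at run time.
    static Expr fold(Expr expr) {
        if (!(expr instanceof ArrayExpr)) return expr;

        List<Expr> folded = new ArrayList<>();
        boolean allLiteral = true;
        for (Expr item : ((ArrayExpr) expr).items) {
            Expr f = fold(item);            // fold children first
            folded.add(f);
            allLiteral &= f instanceof LiteralExpr;
        }
        ArrayExpr rebuilt = new ArrayExpr(folded);
        // If nothing depends on the input, evaluate once now and keep the result.
        return allLiteral ? new LiteralExpr(rebuilt.eval(null)) : rebuilt;
    }
}
```

This matches the optimizer contract from slide 48: the folded expression always produces the same value, and it is at least as fast because the list is built once instead of once per event.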
  • 51. Performance • Test case: pulse-cleanup.jslt, real data, my laptop • a complicated transform: 165 lines • Transforms 132,000 events/second in one thread • 1.0 did about 20,000 events/second
  • 52. Three strategies • Syntax tree interpreter • known to be the slowest approach • Bytecode compiler with virtual machine • C version of jq does this • Java does this (until the JIT kicks in) • Python does this • Native compilation • what JIT compiler in Java does
  • 53. Designing a VM • Opcodes (with parameter, if any): DUP, MKOBJ, CALL <func> • int[] bytecode; JsonNode[] stack; int top; switch (opcode) { case DUP: stack[++top] = stack[top-1]; break; case MKOBJ: stack[++top] = mapper.createObj… break; // … }
  • 54. Compiler • Traverse down the object tree • emit bytecode as you go • Stack nature of the VM matches object tree structure • each Expression produces code that leaves the value of that expression on the stack • Example: • MKARR, <first value>, ARRADD, <second>, ARRADD, …
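The MKARR / ARRADD example on slide 54 can be made concrete with a toy compiler and stack VM. This is a sketch under several assumptions: the PUSH opcode, the constant pool and the TinyVm class are invented for illustration, values are plain Java objects rather than JsonNode, and only flat arrays of constants are compiled.

```java
import java.util.ArrayList;
import java.util.List;

final class TinyVm {
    static final int PUSH = 0;    // operand: index into the constant pool
    static final int MKARR = 1;   // push a new empty array
    static final int ARRADD = 2;  // pop a value and append it to the array below it

    // Emits MKARR, then PUSH <i>, ARRADD for each value, as on the slide.
    static int[] compileArray(List<Object> values, List<Object> constants) {
        int[] code = new int[1 + values.size() * 3];
        int pc = 0;
        code[pc++] = MKARR;
        for (Object value : values) {
            constants.add(value);
            code[pc++] = PUSH;
            code[pc++] = constants.size() - 1;
            code[pc++] = ARRADD;
        }
        return code;
    }

    // Runs the bytecode and returns whatever is left on top of the stack.
    @SuppressWarnings("unchecked")
    static Object run(int[] code, List<Object> constants) {
        Object[] stack = new Object[16];   // fixed size is enough for this sketch
        int top = -1;
        for (int pc = 0; pc < code.length; pc++) {
            switch (code[pc]) {
                case MKARR:
                    stack[++top] = new ArrayList<Object>();
                    break;
                case PUSH:
                    stack[++top] = constants.get(code[++pc]);
                    break;
                case ARRADD: {
                    Object value = stack[top--];
                    ((List<Object>) stack[top]).add(value);
                    break;
                }
                default:
                    throw new IllegalStateException("Bad opcode " + code[pc]);
            }
        }
        return stack[top];
    }
}
```

Each compiled fragment leaves exactly one value on the stack, which is the property that lets fragments for nested expressions be concatenated without bookkeeping.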
  • 55. Prototype • Stunt implemented over a couple of days • Depressing result: object tree interpreter ~40% faster • Anthony Sparks tried the same thing • original VM implementation 5x slower • eventually managed to achieve performance parity • So far: performance does not justify complexity
  • 56. Java bytecode? • The JVM is actually a stack-based VM • can simply compile to Java bytecode instead • Tricky to learn tools for generating bytecode • no examples, very little documentation • In the end decided to use the Asm library • not very nice to use • very primitive API • crashes with NullPointerException on bad bytecode
  • 59. Results • Hard work to build • many surprising issues in Java bytecode • Performance boost of 15-25% • code lives on jvm-bytecode branch in GitHub • Ideas for how it could be even faster… • through type inference
  • 60. Type inference benets "sdrn:" + $namespace + ":" + $rType + ":" + $rId Plus Plus(“sdrn” $namespace) Plus( Plus(“:” $rType) Plus(“:” $rId) ) 60 Plus: JsonNode -> String JsonNode -> String String + String new String -> new JsonNode Will make 4 unnecessary TextNode objects Will wrap and unwrap String repeatedly Will check types unnecessarily
  • 61. Solution • + operator can ask both sides: what type will you produce? • If one side says “string” then the result will be a string • When compiling, do compile(generator, String) • will compile code that produces a Java String object • + operator will make a new String if that’s what’s wanted • or turn it into a TextNode if the context wants Any
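The idea on slide 61 can be illustrated without generating any bytecode. In this sketch each node reports the type it will produce and offers two evaluation paths, one returning a raw Java String and one returning a wrapped value. Everything here is hypothetical: Text stands in for Jackson's TextNode, its static counter exists only to make the saving visible, and numeric + is omitted.

```java
// Stand-in for TextNode; counts allocations so the effect can be observed.
final class Text {
    static int created = 0;
    final String value;
    Text(String value) { this.value = value; created++; }
}

enum Type { STRING, ANY }

interface Node {
    Type type();          // what type will you produce?
    Object eval();        // context wants Any: result is wrapped
    String evalString();  // context wants a String: no wrapper needed
}

// A string literal: statically known to be a string.
final class Str implements Node {
    private final String value;
    Str(String value) { this.value = value; }
    public Type type() { return Type.STRING; }
    public Object eval() { return new Text(value); }
    public String evalString() { return value; }
}

// A variable reference: type unknown at compile time.
final class Var implements Node {
    private final Text bound;
    Var(Text bound) { this.bound = bound; }
    public Type type() { return Type.ANY; }
    public Object eval() { return bound; }
    public String evalString() { return bound.value; }
}

final class Plus implements Node {
    private final Node left, right;
    Plus(Node left, Node right) { this.left = left; this.right = right; }

    // If one side says "string", the result will be a string.
    public Type type() {
        return (left.type() == Type.STRING || right.type() == Type.STRING)
            ? Type.STRING : Type.ANY;
    }

    // Typed path: concatenate raw Strings all the way down.
    public String evalString() { return left.evalString() + right.evalString(); }

    // Untyped path: wrap once, at the boundary where Any is wanted.
    public Object eval() {
        if (type() == Type.STRING) return new Text(evalString());
        throw new UnsupportedOperationException("numeric + is omitted from this sketch");
    }
}
```

With this shape, a chain of concatenations creates a single wrapper for the final result, where an untyped evaluator would wrap every intermediate string.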
  • 62. Freedom from Jackson • The current codebase is bound to Jackson • JVM bytecode compilation might be a way to escape that • Could build compilers that can interface with different JSON representations • Have ideas for a more efficient JSON representation • basically encode everything as arrays of ints • should save memory, GC, and produce faster code
  • 63. Freedom from JSON • If we aren’t bound to Jackson, why should we be bound to JSON? • Could support Avro, too • Perhaps also other formats
  • 65. Internal status • JSLT now used in • Data Quality Tooling (to express tests on data) • routing filters • transforms • In Schibsted we have • 52 transforms, 2370 lines of code • written by many people in different parts of the company • Data Platform runs ~11 billion transforms/day
  • 66. Open source status • Released in June • People are using it for real • one certain case, several more examples • details unknown • Useful contributions from outsiders • several bug fixes to datetime/number handling • Two alternative implementations being worked on • one in .NET • one is virtual machine-based in Java
  • 67. Lessons learned • A custom language can make life much simpler • if it fits the use case well • Implementing a language is easier than it seems • basically doable in a week • Designing a language is not easy • unfortunately