2. 12/4/2018 Speed bumps ahead
http://localhost:3002/clojurex-2018/?print-pdf 2/75
WHENCE?
NLP Infrastructure Technical Lead @ Grammarly
Clojure, Common Lisp, Java
Services that improve the writing of 30 million users (15 million daily)
WHY SHOULD YOU CARE ABOUT PERFORMANCE?
premature optimization is the root of all evil.
— Donald Knuth
WHY SHOULD YOU CARE ABOUT PERFORMANCE?
"We should forget about small efficiencies, say about 97% of the
time: premature optimization is the root of all evil. Yet we should
not pass up our opportunities in that critical 3%." — Donald Knuth
PERFORMANCE OPTIMIZATION FALLACIES
"Hardware is cheap, programmers are expensive."
A.K.A. "Just throw more machines at it."
PERFORMANCE OPTIMIZATION FALLACIES
3 c5.9xlarge EC2 instances: $3,351 monthly
10 c5.9xlarge EC2 instances: $11,170 monthly
Is it worth spending one person-month to get from 10 instances down to 3?
Probably.
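The arithmetic behind that "probably" can be sketched in a few lines. The map and function names below are made up for illustration; the dollar figures are the ones quoted above. Whether the saving beats the cost of a person-month depends on your own salary numbers, which are not part of this example.

```clojure
;; Monthly cost in USD, keyed by number of c5.9xlarge instances
;; (figures taken from the slide above).
(def monthly-cost {3 3351, 10 11170})

(defn monthly-saving
  "Hypothetical helper: USD saved per month by shrinking the fleet."
  [from to]
  (- (monthly-cost from) (monthly-cost to)))

(monthly-saving 10 3)        ;; => 7819
(* 12 (monthly-saving 10 3)) ;; => 93828 per year
```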
PERFORMANCE OPTIMIZATION FALLACIES
"Hardware is cheap, programmers are expensive."
A.K.A. "Just throw more machines at it."
"Docker/Kubernetes/microservices/cloud/whatever allows you to
scale horizontally."
PERFORMANCE OPTIMIZATION FALLACIES
There's no such thing as effortless horizontal scaling.
Each new order of magnitude brings new headaches:
More infrastructure (balancers, service discovery, queues, …)
Configuration management
Observability
Deployment story
Debugging story
Complexity of setting up testing environments
A whole bunch of second-order effects
Mental tax
You hire more devops/platform engineers/SREs to deal with this.
WHY SHOULD YOU CARE ABOUT PERFORMANCE?
The ability to tell the 97% from the 3% is crucial to
building effective software.
That ability requires:
Knowledge
Tools
Experience
Experience comes from practice.
WHAT DOES CLOJURE HAVE TO DO WITH ANY OF THIS?
CLOJURE IS FAST
Dynamically compiled language
World-class JVM JIT for free
Data structures designed with performance in mind
Conservative polymorphism features
Ability to drop down to Java where necessary
CLOJURE IS VERSATILE
The REPL is the best software design tool you can get.
That applies to performance work too.
Hundreds of people work on tools for measuring and
improving performance on the JVM.
These tools are easily usable from Clojure.
WAYS TO MEASURE HOW FAST/SLOW SOMETHING IS
1. "Feels slow"
2. Wrist stopwatch
3. (time ...)
4. (time (dotimes [_ 10000] ...))
5. Criterium
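To show what steps 3 and 4 amount to, here is a minimal sketch of a hand-rolled measurement loop with a warm-up phase. The helper name `mean-ns` and the iteration counts are assumptions made for illustration. This is not how Criterium works internally; Criterium goes further, for example by reporting statistics over many samples.

```clojure
(defn mean-ns
  "Hypothetical helper: rough mean time per call of the zero-argument
  function f, in nanoseconds. Runs f warmup-iters times first so the
  JIT has a chance to compile it, then times measured-iters calls."
  [f warmup-iters measured-iters]
  (dotimes [_ warmup-iters] (f))
  (let [start (System/nanoTime)]
    (dotimes [_ measured-iters] (f))
    (/ (double (- (System/nanoTime) start)) measured-iters)))

;; Usage: time a small reduction after 10000 warm-up calls.
(mean-ns #(reduce + (range 100)) 10000 10000)
```

A single `(time ...)` call measures one cold run; the loop above averages many warm ones, which is closer to what the numbers on the following slides report.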
REFLECTION
(require '[criterium.core :as crit])
(def s "This gotta be good")
(crit/quick-bench (.substring s 5 18))
;; Execution time mean : 2.760464 µs
(crit/quick-bench (.substring ^String s 5 18))
;; Execution time mean : 14.897897 ns
185x speedup! But why?
REFLECTION
Reflection is Java's introspection mechanism for resolving and
calling the program's building blocks (classes, fields, methods) at
runtime.
In the same spirit as Clojure's resolve, ns-publics, apply.
The common explanation is "reflection is slow".
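The parallel between Clojure's own introspection and Java Reflection can be sketched as follows. Both helper functions are hypothetical names introduced for this example; they look a callable up by name at runtime and then invoke it.

```clojure
(import 'java.lang.reflect.Method)

(defn call-by-name-clj
  "Clojure side: resolve a var from a symbol at runtime, then apply it."
  [sym & args]
  (apply (resolve sym) args))

(defn call-by-name-java
  "Java side: look up a public zero-argument method by name on the
  target's class, then invoke it reflectively."
  [target ^String method-name]
  (let [^Method m (.getMethod (class target) method-name
                              (into-array Class []))]
    (.invoke m target (object-array 0))))

(call-by-name-clj 'clojure.core/inc 1) ;; => 2
(call-by-name-java "abc" "length")     ;; => 3
(contains? (ns-publics 'clojure.core) 'map) ;; => true
```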
REFLECTION
We can use Java Reflection directly from Clojure.
(import 'java.lang.reflect.Method)
(def m (.getDeclaredMethod String "substring"
                           (into-array Class [Integer/TYPE Integer/TYPE])))
;; returns a java.lang.reflect.Method object
(crit/quick-bench (.invoke ^Method m s
                           (object-array [(Integer. 5) (Integer. 18)])))
;; Execution time mean : 107.801748 ns
Turns out the reflective call itself is not that slow. Maybe it's the
resolution of the method?
(crit/quick-bench
  (let [^Method m (.getDeclaredMethod
                    String "substring"
                    (into-array Class [Integer/TYPE Integer/TYPE]))]
    (.invoke m s (object-array [(Integer. 5) (Integer. 18)]))))
;; Execution time mean : 648.579085 ns
REFLECTION
What's really going on when Clojure performs a reflective call?
One way is to dig into clojure.lang.Compiler (9k SLOC).
Another way is to use the clj-java-decompiler library.
INSIDE CLOJURE/LANG/REFLECTOR.JAVA
static Object invokeInstanceMethod(Object target, String methodName,
                                   Object[] args) {
    Class c = target.getClass();
    List<Method> methods = getMethods(c, args.length, methodName, false);
    return invokeMatchingMethod(methodName, methods, target, args);
}

// Collect every public method with a matching name (linear scan).
static List<Method> getMethods(Class c, int arity, String name,
                               boolean getStatics) {
    ArrayList<Method> methods = new ArrayList<>();
    for (Method m : c.getMethods())
        if (name.equals(m.getName()))
            methods.add(m);
    return methods;
}

// Walk the candidates and keep one whose parameter types fit the arguments.
static Object invokeMatchingMethod(String methodName, List<Method> methods,
                                   Object target, Object[] args) {
    Method foundm = null;
    for (Method m : methods) {
        Class[] params = m.getParameterTypes();
        if (isCongruent(params, args))
            foundm = m;
    }
    return foundm.invoke(target, args);
}
REFLECTION
On a reflective call, Clojure looks through all methods of the class
linearly, at runtime.
No wonder reflective calls are so slow!
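The size of that linear scan is easy to inspect from the REPL. This is a minimal sketch, with `methods-named` being a hypothetical helper that mimics the name filter, not Clojure's actual implementation.

```clojure
(import 'java.lang.reflect.Method)

(defn methods-named
  "Hypothetical helper: every public method of class c called method-name,
  found by scanning the full method list."
  [^Class c ^String method-name]
  (filterv (fn [^Method m] (= method-name (.getName m)))
           (.getMethods c)))

;; Every public method of String is visited on each reflective call...
(count (.getMethods String))

;; ...just to find the overloads that share the requested name.
(count (methods-named String "substring")) ;; => 2
```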
WAYS TO COMBAT REFLECTION
Enable *warn-on-reflection*
Use type hints
And occasionally check with clj-java-decompiler.
(set! *warn-on-reflection* true)
(.substring s 5 18)
;; Reflection warning, .../slides.clj:114:12 - call to
;; method substring can't be resolved (target class is unknown).
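A type hint can go on a function parameter or on a local binding. The sketch below shows both next to the unhinted version; the function names are made up for illustration. All three return the same value, but only the unhinted one triggers the reflection warning when `*warn-on-reflection*` is enabled.

```clojure
(def s "This gotta be good")

;; No hint: the compiler cannot know the class of x, so the call is reflective.
(defn tail-reflective [x]
  (.substring x 5 18))

;; Hint on the parameter: the method is resolved at compile time.
(defn tail-hinted [^String x]
  (.substring x 5 18))

;; Hint on a local binding, useful when the argument itself cannot be hinted.
(defn tail-hinted-local [x]
  (let [^String y x]
    (.substring y 5 18)))

(tail-hinted s) ;; => "gotta be good"
```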
SHOULD REFLECTION BE WEEDED OUT EVERYWHERE?
There's nothing wrong with having a zero-reflection policy.
But a few stray reflective calls won't hurt if they aren't made
often.
You should profile to know for sure.
CLJ-ASYNC-PROFILER
The most convenient profiler-as-a-library for Clojure.
https://github.com/clojure-goes-fast/clj-async-profiler
(require '[clj-async-profiler.core :as prof])
(prof/profile
(crit/quick-bench (.substring s 5 18)))
CLJ-ASYNC-PROFILER
Profiler that is controllable from your code.
Instant feedback without leaving the REPL.
Flamegraphs are a great representation.
Intuitive and portable.
BOXING
Boxing means wrapping primitive types into objects.
(let [nums (vec (range 1e6))]
  (crit/quick-bench (reduce + nums)))
;; Execution time mean : 18.384708 ms
(let [^longs nums (into-array Long/TYPE (range 1e6))]
  (crit/quick-bench
    (areduce nums i acc 0
             (+ acc (aget nums i)))))
;; Execution time mean : 971.487253 µs
19x difference, not bad!
BOXING
(require '[clj-java-decompiler.core :refer [decompile]])
(decompile
  (let [^longs nums (into-array Long/TYPE (range 1e6))]
    (areduce nums i acc 0
             (+ acc (aget nums i)))))
Decompiled output:
final Object nums = core$into_array.invokeStatic(
    (Object)Long.TYPE, core$range.invokeStatic(100000));
final int lng = ((long[])nums).length;
long i = 0L;
long acc = 0L;
while (i < lng) {
    final long n = RT.intCast(i) + 1;
    acc = Numbers.add(acc, ((long[])nums)[RT.intCast(i)]);
    i = n;
}
return Numbers.num(acc);
WAYS TO COMBAT BOXING
Profile to ensure that boxing is really a problem.
Arrays instead of lists and vectors.
Primitive type hints and casts.
(set! *unchecked-math* :warn-on-boxed)
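As a minimal sketch of the "arrays" and "primitive type hints and casts" items, the two functions below keep their arithmetic on primitive longs. The function names are hypothetical. With `(set! *unchecked-math* :warn-on-boxed)` evaluated first, the compiler would warn about any arithmetic in them that still falls back to boxed math.

```clojure
;; Primitive array in, primitive long out: the accumulator never gets boxed.
(defn sum-longs
  ^long [^longs xs]
  (areduce xs i acc 0
           (+ acc (aget xs i))))

;; Elements of an ordinary collection are boxed, so cast each one to a
;; primitive long before comparing; the counter n stays primitive.
(defn count-below
  ^long [^long limit coll]
  (loop [xs (seq coll), n 0]
    (if xs
      (recur (next xs)
             (if (< (long (first xs)) limit) (inc n) n))
      n)))

(sum-longs (long-array (range 1000000))) ;; => 499999500000
(count-below 3 [0 1 2 3 4])              ;; => 3
```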
WAYS TO COMBAT BOXING
Profile to ensure that boxing is really a problem.
Arrays instead of lists and vectors.
Primitive type hints.
(set! *unchecked-math* :warn-on-boxed)
clj-java-decompiler
CLJ-JAVA-DECOMPILER
(decompile
  (let [init (fn [] 1)]
    (loop [i (init), res (init)]
      (if (< i 10)
        (recur (inc i) (* res i))
        res))))
Decompiled output:
Object init = new slides$fn__17198$init__17199();
Object i = ((IFn)init).invoke();
Object res = ((IFn)init).invoke();
while (Numbers.lt(i, 10L)) {
    final Object i2 = Numbers.inc(i);
    res = Numbers.multiply(res, i);
    i = i2;
}
return res;
CLJ-JAVA-DECOMPILER
(decompile
  (let [init (fn [] 1)]
    (loop [i (long (init)), res (long (init))]
      (if (< i 10)
        (recur (inc i) (* res i))
        res))))
Decompiled output:
Object init = new slides$fn__17198$init__17199();
long i = ((IFn)init).invoke();
long res = ((IFn)init).invoke();
while (Numbers.lt(i, 10L)) {
    final Object i2 = Numbers.inc(i);
    res = Numbers.multiply(res, i);
    i = i2;
}
return res;
WAYS TO COMBAT BOXING
Profile to ensure that boxing is really a problem.
Arrays instead of lists and vectors.
Primitive type hints.
(set! *unchecked-math* :warn-on-boxed)
clj-java-decompiler
Write number crunching in Java.
WRITE JAVA IN STYLE
Compile Java code without leaving or restarting your REPL.
Use new classes immediately in your Clojure code.
You still have access to all Clojure development tools.
https://github.com/ztellman/virgil
WAYS TO DETECT MEMORY SHORTAGE
(In development) VisualVM
(In production) VisualVM over JMX, jstat, …
clj-memory-meter to understand what occupies memory.
IMMUTABILITY
We love immutability, but sometimes, it is unnecessary.
(crit/quick-bench
  (let [obj (Object.)]
    (loop [i 0, res []]
      (if (< i 1e6)
        (recur (inc i) (conj res obj))
        res))))
;; Execution time mean : 31.455536 ms
WAYS TO COMBAT IMMUTABILITY
Profiler
Transients
(crit/quick-bench
  (let [obj (Object.)]
    (loop [i 0, res (transient [])]
      (if (< i 1e6)
        (recur (inc i) (conj! res obj))
        (persistent! res)))))
;; Execution time mean : 14.115719 ms
2.2x speedup.
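The transient pattern is easiest to keep safe when it is wrapped in a function, so the mutable intermediate never escapes and callers only ever see a persistent vector. A minimal sketch, with hypothetical function names:

```clojure
(defn fill-vector
  "Builds a vector of n copies of x using a transient internally."
  [n x]
  (loop [i 0, res (transient [])]
    (if (< i n)
      (recur (inc i) (conj! res x))
      (persistent! res))))

;; clojure.core/into already uses transients for vectors, so the same
;; result is often available without writing the loop by hand.
(defn fill-vector-into [n x]
  (into [] (repeat n x)))

(fill-vector 3 :x) ;; => [:x :x :x]
```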
WAYS TO COMBAT IMMUTABILITY
Profiler
Transients
Mutable Java collections
(import 'java.util.ArrayList)
(crit/quick-bench
  (let [obj (Object.)
        res (ArrayList.)]
    (loop [i 0]
      (when (< i 1e6)
        (.add res obj)
        (recur (inc i))))
    res))
;; Execution time mean : 6.344132 ms
5x speedup.
CAVEAT EMPTOR
If you need the resulting collection to be a Clojure structure,
transients are more efficient than mutable Java collections.
(import 'java.util.ArrayList)
(crit/quick-bench
  (let [obj (Object.)
        res (ArrayList.)]
    (loop [i 0]
      (when (< i 1e6)
        (.add res obj)
        (recur (inc i))))
    (vec res)))
;; Execution time mean : 19.435359 ms
LAZINESS
Increases allocation pressure -> more work for GC
Worse memory locality
Harder to debug and profile
LAZINESS
Everyone has done this at least once in their career:
(time (dotimes [_ 1e6]
        (map inc (range 1e6))))
;; Elapsed time: 30.931708 msecs
Wow, Clojure is fast! /s
The lazy sequence returned by map is never realized, so almost no work is measured.
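Whether a lazy sequence has actually been computed can be checked with `realized?`. The sketch below contrasts an unrealized `map` with one forced by `doall`; the function names are made up for illustration. When benchmarking, wrapping the expression in `doall` (or `dorun`, if the result is not needed) makes sure the work really happens inside the measurement.

```clojure
;; Returns immediately: no element has been computed yet.
(defn lazy-work [n]
  (map inc (range n)))

;; doall walks the whole sequence, forcing every element.
(defn eager-work [n]
  (doall (map inc (range n))))

(realized? (lazy-work 10))  ;; => false
(realized? (eager-work 10)) ;; => true
```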
TOP SPEED BUMPS SO FAR
Reflection
Boxing
Insufficient memory
Immutability
Laziness
Redundant allocations
Coarsely-synchronized data structures
Context switching overhead
…
GC pauses
Megamorphic callsites
Heap fragmentation
…
Cache incoherence
TLB misses (page walks)
Branch misprediction
NUMA foreign access
…
Magnetic disturbances
CPU overheating
…
PERFORMANCE IS HARD
Abstractions are constantly leaking.
The more you learn, the less you know.
Your assumptions are constantly getting invalidated.
PERFORMANCE IS FUN AND USEFUL
You learn things behind those leaky abstractions.
You get a more holistic view of the system.
You save money and the environment.
PERFORMANCE PROBLEMS ARE NOT UNIQUE TO CLOJURE
But we are in a great position to solve them.
There is plenty of prior art, especially for the JVM.
Tools, blog posts, experiments, reports.
The REPL lets us use all of this much more easily.
INSTEAD OF A CONCLUSION
First, make it work.
Then, make it right.
Then, make it fast.
But please, don't stop at the first.