8. Subjects Covered in this Talk
• Background – lambdas and streams
• Performance of our example
• Effect of parallelizing
• Splitting input data efficiently
• When to go parallel
• Parallel streams in the real world
10. What is a Lambda?
Predicate<Matcher> matches = new Predicate<Matcher>() {
    @Override
    public boolean test(Matcher matcher) {
        return matcher.find();
    }
};
16. What is a Lambda?
Predicate<Matcher> matches = matcher -> matcher.find();
A lambda is a function
from arguments to result
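A minimal runnable sketch of that predicate in use; the sample pattern and input lines are illustrative, borrowed from the log-parsing example later in the talk:

```java
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LambdaDemo {
    public static void main(String[] args) {
        // Short form of the anonymous Predicate<Matcher> above.
        Predicate<Matcher> matches = matcher -> matcher.find();

        Pattern p = Pattern.compile("Application time: (\\d+\\.\\d+)");
        Matcher hit = p.matcher("Application time: 0.5 seconds");
        Matcher miss = p.matcher("no timing here");

        System.out.println(matches.test(hit));  // true
        System.out.println(matches.test(miss)); // false
    }
}
```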
19. Old School Code
DoubleSummaryStatistics summary = new DoubleSummaryStatistics();
Pattern stoppedTimePattern =
    Pattern.compile("Application time: (\\d+\\.\\d+)");
String logRecord;
while ((logRecord = logFileReader.readLine()) != null) {
    Matcher matcher = stoppedTimePattern.matcher(logRecord);
    if (matcher.find()) {
        double value = Double.parseDouble(matcher.group(1));
        summary.accept(value);
    }
}
20. Old School Code
Let’s look at the features in this code:
21. Data Source: the while loop pulling lines from logFileReader
22. Map to Matcher: stoppedTimePattern.matcher(logRecord)
23. Filter: if (matcher.find())
24. Map to Double: Double.parseDouble(matcher.group(1))
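The four features just identified map directly onto a stream pipeline. A self-contained sketch of the stream rewrite; the talk's measured version reads the file with Files.lines, so the in-memory list and sample lines here are illustrative stand-ins:

```java
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StreamVersion {
    public static void main(String[] args) {
        Pattern stoppedTimePattern = Pattern.compile("Application time: (\\d+\\.\\d+)");
        List<String> logRecords = List.of(          // stands in for Files.lines(path)
                "Application time: 0.5 seconds",
                "some other GC line",
                "Application time: 1.25 seconds");

        DoubleSummaryStatistics summary = logRecords.stream()
                .map(stoppedTimePattern::matcher)   // map to Matcher
                .filter(Matcher::find)              // filter
                .map(m -> m.group(1))               // extract the captured time
                .mapToDouble(Double::parseDouble)   // map to double
                .summaryStatistics();               // terminal operation

        System.out.println(summary.getSum());       // 1.75
    }
}
```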
26. Java 8 Streams
• A sequence of values, “in motion”
• source and intermediate operations set the stream up lazily
• a terminal operation “pulls” values eagerly down the stream
collection.stream()
.intermediateOp
⋮
.intermediateOp
.terminalOp
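The lazy/eager split can be observed directly: an intermediate operation runs nothing until a terminal operation pulls values through the pipeline. A small sketch (the counter is just instrumentation for the demo):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class LazyDemo {
    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();

        // Intermediate op: sets the stage lazily, nothing executes yet.
        Stream<Integer> pipeline = List.of(1, 2, 3, 4).stream()
                .map(n -> { calls.incrementAndGet(); return n * n; });
        System.out.println("after setup: " + calls.get());    // 0

        // Terminal op: pulls values eagerly down the stream.
        long count = pipeline.filter(n -> n > 1).count();
        System.out.println("after terminal: " + calls.get()); // 4
        System.out.println("count: " + count);                // 3
    }
}
```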
27. Stream Sources
• New method Collection.stream()
• Many other sources:
• Arrays.stream(Object[])
• Stream.of(Object...)
• Stream.iterate(Object,UnaryOperator)
• Files.lines()
• BufferedReader.lines()
• Random.ints()
• JarFile.stream()
• …
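Two of the sources listed above, in a runnable sketch; the values are arbitrary:

```java
import java.util.stream.Stream;

public class SourcesDemo {
    public static void main(String[] args) {
        // Stream.of: a stream over fixed values
        System.out.println(Stream.of("a", "b", "c").count()); // 3

        // Stream.iterate: seed plus a UnaryOperator; limit() makes it finite
        long sum = Stream.iterate(1, n -> n + 1)
                .limit(5)
                .mapToLong(Integer::longValue)
                .sum();
        System.out.println(sum); // 15  (1+2+3+4+5)
    }
}
```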
37. Old School: 13.3 secs
Sequential: 13.8 secs
- Should be the same workload
- Stream code is cleaner, easier to read
How Does It Perform?
24M line file, MacBook Pro, Haswell i7, 4 cores, hyperthreaded, Java 9.0
38. Can We Do Better?
• We might be able to if the workload is parallelizable
• split stream into many segments
• process each segment
• combine results
• Requirements exactly match Fork/Join workflow
54. About Fork/Join
• Introduced in Java 7
• draws from a common pool of ForkJoinWorkerThread
• default pool size == HW cores – 1
• assumes workload will be CPU bound
• On its own, not an easy coding idiom
• parallel streams provide an abstraction layer
• Spliterator defines how to split stream
• framework code submits sub-tasks to the common Fork/Join pool
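The abstraction layer in action: parallel() asks the framework to split the source, process segments on common-pool workers, and combine the partial results. A minimal sketch (the range and sum are illustrative):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.stream.LongStream;

public class ParallelSketch {
    public static void main(String[] args) {
        // Default common-pool size is (available processors - 1), as noted above.
        System.out.println("parallelism: " + ForkJoinPool.commonPool().getParallelism());

        // The framework splits the range, sums segments on worker threads,
        // and combines the partial sums.
        long sum = LongStream.rangeClosed(1, 1_000_000).parallel().sum();
        System.out.println(sum); // 500000500000
    }
}
```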
55. Old School: 13.3 secs
Sequential: 13.8 secs
Parallel: 9.5 secs
- 1.45x faster
- but not 8x faster (????)
How Does That Perform?
24M lines, 2.8GHz 8-core i7, 16GB, OS X, Java 9.0
56. In Fact!!!!
• Different benchmarks yield a mixed bag of results
• some were better
• some were the same
• some were worse!
57. Open Questions
• Under what conditions are things better
• or worse?
• When should we parallelize
• and when is serial better?
Answer depends upon where the bottleneck is
59. Where is Our Bottleneck?
• I/O operations
• not a surprise, we’re reading from a file
• Java 9 uses FileChannelLinesSpliterator
• 2x better than Java 8’s implementation
76.0% 0 + 5941 sun.nio.ch.FileDispatcherImpl.pread0
60. Poorly Splitting Sources
• Some sources split worse than others
• LinkedList vs ArrayList
• Streaming I/O is problematic
• more threads == more pressure on contended resource
• thrashing and other ill effects
• Workload size doesn’t cover the overheads
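The ArrayList/LinkedList difference is visible in their spliterators. ArrayList splits its index range exactly in half in O(1); LinkedList has no random access, so its spliterator must copy nodes out into batches just to split at all, which is slower and tends to be lopsided. A sketch (the 8-element lists are arbitrary):

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Spliterator;

public class SplitDemo {
    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 8; i++) data.add(i);

        // ArrayList: trySplit halves the index range exactly.
        Spliterator<Integer> right = data.spliterator();
        Spliterator<Integer> left = right.trySplit();
        System.out.println("ArrayList:  " + left.estimateSize() + " / " + right.estimateSize()); // 4 / 4

        // LinkedList: trySplit copies a batch of nodes into an array;
        // the two halves are generally not balanced.
        Spliterator<Integer> rest = new LinkedList<>(data).spliterator();
        Spliterator<Integer> batch = rest.trySplit();
        long batchSize = (batch == null) ? 0 : batch.estimateSize();
        System.out.println("LinkedList: " + batchSize + " / " + rest.estimateSize());
    }
}
```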
68. LineSpliterator
[timeline figure: spliterator coverage vs. new spliterator coverage, splitting a MappedByteBuffer at the mid point]
Included in JDK9 as FileChannelLinesSpliterator
70. Old School: 9.4 secs
Sequential: 9.9 secs
Parallel: 2.7 secs
- 4.25x faster
- better but still not 8x faster
In-memory Comparison
24M lines, 2.8GHz 8-core i7, 16GB, OS X, JDK 9.0
71. Justifying the Overhead
CPNQ performance model:
C - number of submitters
P - number of CPUs
N - number of elements
Q - cost of the operation
cost of intermediate operations is N * Q
overhead of setting up F/J framework is ~100µs
72. Amortizing Setup Costs
• N*Q needs to be large
• Q can often only be estimated
• N may only be known at run time
• Rule of thumb, N > 10,000
• P is the number of processors
• P == number of cores for CPU bound
• P < number of cores otherwise
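Plugging numbers into the model shows where the N > 10,000 rule of thumb comes from. This is back-of-the-envelope arithmetic only: the ~100 µs setup overhead is the figure stated above, while the 10 ns per-element cost Q is an assumed value for illustration:

```java
public class BreakEven {
    public static void main(String[] args) {
        double setupOverheadMicros = 100.0; // F/J setup cost from the talk (~100 µs)
        double qMicros = 0.01;              // assumed per-element cost Q: 10 ns

        // Parallelism only pays off once total work N*Q dwarfs the setup cost;
        // the break-even point is where they are equal.
        double breakEvenN = setupOverheadMicros / qMicros;
        System.out.println("N*Q == overhead at N = " + (long) breakEvenN); // 10000
    }
}
```

With a cheaper Q the break-even N grows proportionally, which is why Q can only be estimated and N is the lever you actually control.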
73. Other Gotchas
• Frequent hand-offs place pressure on thread schedulers
• effect is magnified when a hypervisor is involved
• estimated 80,000 cycles to handoff data between threads
• you can do a lot of processing in 80,000 cycles
• Too many threads places pressure on thread schedulers
• responsible for other ill effects (TTSP)
• too few threads may leave hardware under-utilized
74. Simulated Server Environment
ExecutorService threadPool = Executors.newFixedThreadPool(10);
threadPool.execute(() -> {
    try {
        long timer = System.currentTimeMillis();
        double value = Files.lines(new File("gc.log").toPath()).parallel()
            .map(applicationStoppedTimePattern::matcher)
            .filter(Matcher::find)
            .map(matcher -> matcher.group(2))
            .mapToDouble(Double::parseDouble)
            .summaryStatistics().getSum();
    } catch (Exception ex) {}
});
75. Work Flow and Results
• First task to arrive will consume all ForkJoinWorkerThreads
• downstream tasks wait for a ForkJoinWorkerThread
• downstream tasks start intermixing with initial task
• Initial task collects dead time as it competes for threads
• all other tasks collect dead time as they either
• compete or wait for a ForkJoinWorkerThread
System is stressed beyond capacity
78. Intermediate Operation Bottleneck
• Bottleneck is in pattern matching
• but, streaming infrastructure isn’t far behind!
68.6% 1384 + 0 java.util.regex.Pattern$Curly.match
26.6% 521 + 15 java.util.stream.ReferencePipeline$3$1.accept
79. Tragedy of the Commons
Garrett Hardin, ecologist (1968):
Imagine the grazing of animals on a common ground. Each
flock owner gains if they add to their own flock. But
every animal added to the total degrades the commons a
small amount.
81. Tragedy of the Commons
You have a finite amount of hardware
– it might be in your best interest to grab it all
– but if everyone behaves the same way…
83. Simulated Server Environment
• Submit 10 tasks to Fork-Join (via Executor thread-pool)
• first result comes out in 32 seconds
• compared to 9.5 seconds for individually submitted task
• high system time reflects that the task is I/O bound
86. In-Memory Variation
• Preload log file
• Submit 10 tasks to Fork-Join (via Executor thread-pool)
• first result comes out in 23 seconds
• compared to 4.5 seconds for individually submitted task
• task is CPU bound
87. Conclusions
Sequential stream performance comparable to imperative code
Going parallel is worthwhile IF
- task is suitable
- expensive enough to amortize setup costs
- no inter-task communication needed
- data source is suitable
- environment is suitable
Need to monitor the JDK to understand bottlenecks
- Fork/Join pool is not well instrumented