8. Subjects Covered in this Talk
• Background – lambdas and streams
• Performance of our example
• Effect of parallelizing
• Splitting input data efficiently
• When to go parallel
• Parallel streams in the real world
10. What is a Lambda?
Predicate<Matcher> matches = new Predicate<Matcher>() {
    @Override
    public boolean test(Matcher matcher) {
        return matcher.find();
    }
};
16. What is a Lambda?
Predicate<Matcher> matches = matcher -> matcher.find();
A lambda is a function
from arguments to result
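A minimal runnable sketch of that predicate in use; the sample pattern and input lines are illustrative, borrowed from the log-parsing example later in the talk:

```java
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LambdaDemo {
    public static void main(String[] args) {
        // Short form of the anonymous Predicate<Matcher> above.
        Predicate<Matcher> matches = matcher -> matcher.find();

        Pattern p = Pattern.compile("Application time: (\\d+\\.\\d+)");
        Matcher hit = p.matcher("Application time: 0.5 seconds");
        Matcher miss = p.matcher("no timing here");

        System.out.println(matches.test(hit));  // true
        System.out.println(matches.test(miss)); // false
    }
}
```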
19. Old School Code
DoubleSummaryStatistics summary = new DoubleSummaryStatistics();
Pattern stoppedTimePattern =
    Pattern.compile("Application time: (\\d+\\.\\d+)");
String logRecord;
while ((logRecord = logFileReader.readLine()) != null) {
    Matcher matcher = stoppedTimePattern.matcher(logRecord);
    if (matcher.find()) {
        double value = Double.parseDouble(matcher.group(1));
        summary.accept(value);
    }
}
20. Old School Code
Let’s look at the features in this code:
21. Data Source: the while loop pulling lines from logFileReader
22. Map to Matcher: stoppedTimePattern.matcher(logRecord)
23. Filter: if (matcher.find())
24. Map to Double: Double.parseDouble(matcher.group(1))
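The four features just identified map directly onto a stream pipeline. A self-contained sketch of the stream rewrite; the talk's measured version reads the file with Files.lines, so the in-memory list and sample lines here are illustrative stand-ins:

```java
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StreamVersion {
    public static void main(String[] args) {
        Pattern stoppedTimePattern = Pattern.compile("Application time: (\\d+\\.\\d+)");
        List<String> logRecords = List.of(          // stands in for Files.lines(path)
                "Application time: 0.5 seconds",
                "some other GC line",
                "Application time: 1.25 seconds");

        DoubleSummaryStatistics summary = logRecords.stream()
                .map(stoppedTimePattern::matcher)   // map to Matcher
                .filter(Matcher::find)              // filter
                .map(m -> m.group(1))               // extract the captured time
                .mapToDouble(Double::parseDouble)   // map to double
                .summaryStatistics();               // terminal operation

        System.out.println(summary.getSum());       // 1.75
    }
}
```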
26. Java 8 Streams
• A sequence of values, “in motion”
• source and intermediate operations set the stream up lazily
• a terminal operation “pulls” values eagerly down the stream
collection.stream()
.intermediateOp
⋮
.intermediateOp
.terminalOp
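The lazy/eager split can be observed directly: an intermediate operation runs nothing until a terminal operation pulls values through the pipeline. A small sketch (the counter is just instrumentation for the demo):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class LazyDemo {
    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();

        // Intermediate op: sets the stage lazily, nothing executes yet.
        Stream<Integer> pipeline = List.of(1, 2, 3, 4).stream()
                .map(n -> { calls.incrementAndGet(); return n * n; });
        System.out.println("after setup: " + calls.get());    // 0

        // Terminal op: pulls values eagerly down the stream.
        long count = pipeline.filter(n -> n > 1).count();
        System.out.println("after terminal: " + calls.get()); // 4
        System.out.println("count: " + count);                // 3
    }
}
```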
27. Stream Sources
• New method Collection.stream()
• Many other sources:
• Arrays.stream(Object[])
• Stream.of(Object...)
• Stream.iterate(Object,UnaryOperator)
• Files.lines()
• BufferedReader.lines()
• Random.ints()
• JarFile.stream()
• …
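Two of the sources listed above, in a runnable sketch; the values are arbitrary:

```java
import java.util.stream.Stream;

public class SourcesDemo {
    public static void main(String[] args) {
        // Stream.of: a stream over fixed values
        System.out.println(Stream.of("a", "b", "c").count()); // 3

        // Stream.iterate: seed plus a UnaryOperator; limit() makes it finite
        long sum = Stream.iterate(1, n -> n + 1)
                .limit(5)
                .mapToLong(Integer::longValue)
                .sum();
        System.out.println(sum); // 15  (1+2+3+4+5)
    }
}
```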
37. Old School: 13.3 secs
Sequential: 13.8 secs
- Should be the same workload
- Stream code is cleaner, easier to read
How Does It Perform?
24M line file, MacBook Pro, Haswell i7, 4 cores, hyperthreaded, Java 9.0
38. Can We Do Better?
• We might be able to if the workload is parallelizable
• split stream into many segments
• process each segment
• combine results
• Requirements exactly match Fork/Join workflow
54. About Fork/Join
• Introduced in Java 7
• draws from a common pool of ForkJoinWorkerThread
• default pool size == HW cores – 1
• assumes workload will be CPU bound
• On its own, not an easy coding idiom
• parallel streams provide an abstraction layer
• Spliterator defines how to split stream
• framework code submits sub-tasks to the common Fork/Join pool
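The abstraction layer in action: parallel() asks the framework to split the source, process segments on common-pool workers, and combine the partial results. A minimal sketch (the range and sum are illustrative):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.stream.LongStream;

public class ParallelSketch {
    public static void main(String[] args) {
        // Default common-pool size is (available processors - 1), as noted above.
        System.out.println("parallelism: " + ForkJoinPool.commonPool().getParallelism());

        // The framework splits the range, sums segments on worker threads,
        // and combines the partial sums.
        long sum = LongStream.rangeClosed(1, 1_000_000).parallel().sum();
        System.out.println(sum); // 500000500000
    }
}
```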
55. Old School: 13.3 secs
Sequential: 13.8 secs
Parallel: 9.5 secs
- 1.45x faster
- but not 8x faster (????)
How Does That Perform?
24M lines, 2.8GHz 8-core i7, 16GB, OS X, Java 9.0
56. In Fact!!!!
• Different benchmarks yield a mixed bag of results
• some were better
• some were the same
• some were worse!
57. Open Questions
• Under what conditions are things better
• or worse?
• When should we parallelize
• and when is serial better?
Answer depends upon where the bottleneck is
59. Where is Our Bottleneck?
• I/O operations
• not a surprise, we’re reading from a file
• Java 9 uses FileChannelLinesSpliterator
• 2x better than Java 8’s implementation
76.0% 0 + 5941 sun.nio.ch.FileDispatcherImpl.pread0
60. Poorly Splitting Sources
• Some sources split worse than others
• LinkedList vs ArrayList
• Streaming I/O is problematic
• more threads == more pressure on contended resource
• thrashing and other ill effects
• Workload size doesn’t cover the overheads
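The ArrayList/LinkedList difference is visible in their spliterators. ArrayList splits its index range exactly in half in O(1); LinkedList has no random access, so its spliterator must copy nodes out into batches just to split at all, which is slower and tends to be lopsided. A sketch (the 8-element lists are arbitrary):

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Spliterator;

public class SplitDemo {
    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 8; i++) data.add(i);

        // ArrayList: trySplit halves the index range exactly.
        Spliterator<Integer> right = data.spliterator();
        Spliterator<Integer> left = right.trySplit();
        System.out.println("ArrayList:  " + left.estimateSize() + " / " + right.estimateSize()); // 4 / 4

        // LinkedList: trySplit copies a batch of nodes into an array;
        // the two halves are generally not balanced.
        Spliterator<Integer> rest = new LinkedList<>(data).spliterator();
        Spliterator<Integer> batch = rest.trySplit();
        long batchSize = (batch == null) ? 0 : batch.estimateSize();
        System.out.println("LinkedList: " + batchSize + " / " + rest.estimateSize());
    }
}
```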
68. LineSpliterator
[timeline figure: spliterator coverage vs. new spliterator coverage, splitting a MappedByteBuffer at the mid point]
Included in JDK9 as FileChannelLinesSpliterator
70. Old School: 9.4 secs
Sequential: 9.9 secs
Parallel: 2.7 secs
- 4.25x faster
- better but still not 8x faster
In-memory Comparison
24M lines, 2.8GHz 8-core i7, 16GB, OS X, JDK 9.0
71. Justifying the Overhead
CPNQ performance model:
C - number of submitters
P - number of CPUs
N - number of elements
Q - cost of the operation
cost of intermediate operations is N * Q
overhead of setting up F/J framework is ~100µs
72. Amortizing Setup Costs
• N*Q needs to be large
• Q can often only be estimated
• N may only be known at run time
• Rule of thumb, N > 10,000
• P is the number of processors
• P == number of cores for CPU bound
• P < number of cores otherwise
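Plugging numbers into the model shows where the N > 10,000 rule of thumb comes from. This is back-of-the-envelope arithmetic only: the ~100 µs setup overhead is the figure stated above, while the 10 ns per-element cost Q is an assumed value for illustration:

```java
public class BreakEven {
    public static void main(String[] args) {
        double setupOverheadMicros = 100.0; // F/J setup cost from the talk (~100 µs)
        double qMicros = 0.01;              // assumed per-element cost Q: 10 ns

        // Parallelism only pays off once total work N*Q dwarfs the setup cost;
        // the break-even point is where they are equal.
        double breakEvenN = setupOverheadMicros / qMicros;
        System.out.println("N*Q == overhead at N = " + (long) breakEvenN); // 10000
    }
}
```

With a cheaper Q the break-even N grows proportionally, which is why Q can only be estimated and N is the lever you actually control.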
73. Other Gotchas
• Frequent hand-offs place pressure on thread schedulers
• effect is magnified when a hypervisor is involved
• estimated 80,000 cycles to handoff data between threads
• you can do a lot of processing in 80,000 cycles
• Too many threads places pressure on thread schedulers
• responsible for other ill effects (TTSP)
• too few threads may leave hardware under-utilized
74. Simulated Server Environment
ExecutorService threadPool = Executors.newFixedThreadPool(10);
threadPool.execute(() -> {
    try {
        long timer = System.currentTimeMillis();
        double value = Files.lines(new File("gc.log").toPath()).parallel()
            .map(applicationStoppedTimePattern::matcher)
            .filter(Matcher::find)
            .map(matcher -> matcher.group(2))
            .mapToDouble(Double::parseDouble)
            .summaryStatistics().getSum();
    } catch (Exception ex) {}
});
75. Work Flow and Results
• First task to arrive will consume all ForkJoinWorkerThreads
• downstream tasks wait for a ForkJoinWorkerThread
• downstream tasks start intermixing with initial task
• Initial task collects dead time as it competes for threads
• all other tasks collect dead time as they either
• compete or wait for a ForkJoinWorkerThread
System is stressed beyond capacity
78. Intermediate Operation Bottleneck
• Bottleneck is in pattern matching
• but, streaming infrastructure isn’t far behind!
68.6% 1384 + 0 java.util.regex.Pattern$Curly.match
26.6% 521 + 15 java.util.stream.ReferencePipeline$3$1.accept
79. Tragedy of the Commons
Garrett Hardin, ecologist (1968):
Imagine the grazing of animals on a common ground. Each
flock owner gains if they add to their own flock. But
every animal added to the total degrades the commons a
small amount.
81. Tragedy of the Commons
You have a finite amount of hardware
– it might be in your best interest to grab it all
– but if everyone behaves the same way…
83. Simulated Server Environment
• Submit 10 tasks to Fork-Join (via Executor thread-pool)
• first result comes out in 32 seconds
• compared to 9.5 seconds for individually submitted task
• high system time reflects that the task is I/O bound
86. In-Memory Variation
• Preload log file
• Submit 10 tasks to Fork-Join (via Executor thread-pool)
• first result comes out in 23 seconds
• compared to 4.5 seconds for individually submitted task
• task is CPU bound
87. Conclusions
Sequential stream performance comparable to imperative code
Going parallel is worthwhile IF
- task is suitable
- expensive enough to amortize setup costs
- no inter-task communication needed
- data source is suitable
- environment is suitable
Need to monitor the JDK to understand bottlenecks
- Fork/Join pool is not well instrumented