Map(), flatMap() and reduce() simplify collections, concurrency and big data

Map(), flatMap() and reduce()
are your new best friends:
simpler collections,
concurrency, and big data
Chris Richardson
Author of POJOs in Action
Founder of the original CloudFoundry.com
@crichardson
chris@chrisrichardson.net
http://plainoldobjects.com

Presentation goal
How functional programming simplifies your code
Show that
map(), flatMap() and reduce()
are remarkably versatile functions

About Chris
Founder of a buzzword-compliant (stealthy, social, mobile, big data, machine learning, ...) startup
Consultant helping organizations improve how they architect and deploy applications using cloud, microservices, polyglot applications, NoSQL, ...

Agenda
Why functional programming?
Simplifying collection processing
Simplifying concurrency with Futures and Rx Observables
Tackling big data problems with functional programming

Functional programming is a programming paradigm
Functions are the building blocks of the application
Best done in a functional programming language

Functions as first class citizens
Assign functions to variables
Store functions in fields
Use and write higher-order functions:
Pass functions as arguments
Return functions as values
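A minimal Java 8 sketch of these points. The class name FirstClassFunctions and the helper twice() are illustrative and not from the talk: SQUARE is a function stored in a field, and twice() both takes a function as an argument and returns a new function.

```java
import java.util.function.Function;

class FirstClassFunctions {
    // A function assigned to a variable (here, stored in a static field).
    static final Function<Integer, Integer> SQUARE = x -> x * x;

    // Higher-order function: accepts a function and returns a new function
    // that applies it two times in a row.
    static <T> Function<T, T> twice(Function<T, T> f) {
        return x -> f.apply(f.apply(x));
    }

    public static void main(String[] args) {
        Function<Integer, Integer> fourthPower = twice(SQUARE);
        System.out.println(fourthPower.apply(3)); // prints 81
    }
}
```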

Avoids mutable state
Use:
Immutable data structures
Single assignment variables
Some functional languages such as Haskell don’t allow side-effects
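One way this can look in Java, as a sketch: an immutable value class whose fields are assigned exactly once, with "modification" expressed by returning a new object. ImmutablePoint and withX() are hypothetical names chosen for illustration.

```java
final class ImmutablePoint {
    // Single assignment: both fields are final and set once in the constructor.
    private final int x;
    private final int y;

    ImmutablePoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    int x() { return x; }
    int y() { return y; }

    // Instead of mutating this object, return a new one with the changed field.
    ImmutablePoint withX(int newX) {
        return new ImmutablePoint(newX, y);
    }
}
```

Because no instance ever changes after construction, instances can be shared freely between threads without locking.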

"the highest goal of programming language design to enable good ideas to be elegantly expressed"
http://en.wikipedia.org/wiki/Tony_Hoare

More expressive
More intuitive - declarative code matches problem definition
Functional code is usually much more composable
Immutable state:
Less error-prone
Easy parallelization and concurrency
But be pragmatic

An ancient idea that has recently
become popular

Mathematical foundation:
λ-calculus
Introduced by
Alonzo Church in the 1930s

Lisp = an early functional language
invented in 1958
http://en.wikipedia.org/wiki/Lisp_(programming_language)
[Timeline figure: 1940 to 2010]
Lisp pioneered: garbage collection, dynamic typing, self-hosting compiler, tree data structures
(defun factorial (n)
  (if (<= n 1)
      1
      (* n (factorial (- n 1)))))

My final year project in 1985:
Implementing SASL
sieve (p:xs) =
    p : sieve [x | x <- xs, rem x p > 0];
primes = sieve [2..]
[2..] is a list of integers starting with 2
rem x p > 0 filters out multiples of p

Mostly an Ivory Tower technology
Lisp was used for AI
FP languages: Miranda, ML,
Haskell, ...
“Side-effects kill kittens and puppies”

http://steve-yegge.blogspot.com/2010/12/haskell-researchers-announce-discovery.html

But today FP is mainstream
Clojure - a dialect of Lisp
Scala - a hybrid OO/functional language
F# - a hybrid OO/FP language for .NET
Java 8 has lambda expressions

Java 8 lambda expressions are functions
x -> x * x
x -> {
    for (int i = 2; i <= Math.sqrt(x); i = i + 1) {
        if (x % i == 0)
            return false;
    }
    return true;
};
(x, y) -> x * x + y * y
An instance of an anonymous inner class that
implements a functional interface (kinda)
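To make the "anonymous inner class (kinda)" comparison concrete, here is a small sketch that writes the same predicate both ways against the standard java.util.function.IntPredicate interface. The class and field names are illustrative only.

```java
import java.util.function.IntPredicate;

class LambdaVersusInnerClass {
    // Pre-Java 8 style: an anonymous inner class implementing the functional interface.
    static final IntPredicate IS_EVEN_INNER = new IntPredicate() {
        @Override
        public boolean test(int x) {
            return x % 2 == 0;
        }
    };

    // Java 8 style: the same behaviour written as a lambda expression.
    static final IntPredicate IS_EVEN_LAMBDA = x -> x % 2 == 0;

    public static void main(String[] args) {
        System.out.println(IS_EVEN_INNER.test(4));  // prints true
        System.out.println(IS_EVEN_LAMBDA.test(7)); // prints false
    }
}
```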

Lots of application code
=
collection processing:
Mapping, filtering, and reducing

Social network example
public class Person {
    enum Gender { MALE, FEMALE }
    private Name name;
    private LocalDate birthday;
    private Gender gender;
    private Hometown hometown;
    private Set<Friend> friends = new HashSet<Friend>();
    ....
public class Friend {
    private Person friend;
    private LocalDate becameFriends;
    ...
}
public class SocialNetwork {
    private Set<Person> people;
    ...

Typical iterative code - e.g. filtering
...
public Set<Person> lonelyPeople() {
    Set<Person> result = new HashSet<Person>();   // Declare result variable
    for (Person p : people) {                     // Iterate
        if (p.getFriends().isEmpty())
            result.add(p);                        // Modify result
    }
    return result;                                // Return result
}

Problems with this style of programming
Low level
Imperative (how to do it) NOT declarative (what to do)
Verbose
Mutable variables are potentially error-prone
Difficult to parallelize

Java 8 streams to the rescue
A sequence of elements
“Wrapper” around a collection (and other types: e.g. JarFile.stream(), Files.lines())
Streams can also be infinite
Provides a functional/lambda-based API for transforming, filtering and aggregating elements
Much simpler, cleaner and more declarative code
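As an illustration of the "streams can also be infinite" point, the sketch below builds an unbounded stream with Stream.iterate() and relies on laziness plus limit() to take only what is needed. The class and method names are made up for this example.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class InfiniteStreamExample {
    // Returns the first n even squares, drawn lazily from the infinite stream 1, 2, 3, ...
    static List<Integer> firstEvenSquares(int n) {
        return Stream.iterate(1, i -> i + 1)    // infinite: 1, 2, 3, ...
                .map(i -> i * i)                // transform
                .filter(sq -> sq % 2 == 0)      // filter
                .limit(n)                       // stop after n results
                .collect(Collectors.toList());  // aggregate
    }

    public static void main(String[] args) {
        System.out.println(firstEvenSquares(3)); // prints [4, 16, 36]
    }
}
```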

Using Java 8 streams - filtering
Before:
...
public Set<Person> peopleWithNoFriends() {
    Set<Person> result = new HashSet<Person>();
    for (Person p : people) {
        if (p.getFriends().isEmpty())
            result.add(p);
    }
    return result;
}
After:
...
public Set<Person> lonelyPeople() {
    return people.stream()
        .filter(p -> p.getFriends().isEmpty())   // predicate, written as a lambda expression
        .collect(Collectors.toSet());
}

The filter() function
s1 a b c d e ...
s2 a c d ...
s2 = s1.filter(f)
s2 contains the elements of s1 that satisfy predicate f

Using Java 8 streams - mapping
class Person ..
    private Set<Friend> friends = ...;
    public Set<Hometown> hometownsOfFriends() {
        return friends.stream()
            .map(f -> f.getPerson().getHometown())
            .collect(Collectors.toSet());
    }

The map() function
s1 a b c d e ...
s2 f(a) f(b) f(c) f(d) f(e) ...
s2 = s1.map(f)
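A self-contained sketch of the same picture, using strings instead of the Person classes: every element of the input is replaced by f(element), and the order and count of elements are preserved. MapExample and lengths() are illustrative names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class MapExample {
    // Applies String::length to each element: s2 = s1.map(f).
    static List<Integer> lengths(List<String> words) {
        return words.stream()
                .map(String::length)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(lengths(Arrays.asList("four", "score", "and"))); // prints [4, 5, 3]
    }
}
```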

Using Java 8 streams - friend of friends using flatMap
class Person ..
    public Set<Person> friendOfFriends() {
        return friends.stream()
            .flatMap(friend -> friend.getPerson().friends.stream())   // maps and flattens
            .map(Friend::getPerson)
            .filter(f -> f != this)
            .collect(Collectors.toSet());
    }

The flatMap() function
s1 a b ...
s2 f(a)0 f(a)1 f(b)0 f(b)1 f(b)2 ...
s2 = s1.flatMap(f)
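A runnable sketch of the diagram, with made-up names: here f turns one line into a stream of words, and flatMap() concatenates those per-line streams into a single flat stream.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class FlatMapExample {
    // Each line maps to several words; flatMap flattens them into one stream.
    static List<String> words(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [four, score, and, seven, years]
        System.out.println(words(Arrays.asList("four score", "and seven years")));
    }
}
```

With map() instead of flatMap() the result would be a stream of streams, one per line, rather than a single stream of words.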

Using Java 8 streams - reducing
...
public long averageNumberOfFriends() {
    return people.stream()
        .map(p -> p.getFriends().size())
        .reduce(0, (x, y) -> x + y)
        / people.size();
}
reduce(0, (x, y) -> x + y) is equivalent to:
int x = 0;
for (int y : inputStream)
    x = x + y;
return x;

The reduce() function
s1 a b c d e ...
x = s1.reduce(initial, f)
f(f(f(f(f(f(initial, a), b), c), d), e), ...)
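The sketch below puts reduce() next to the loop it replaces, using hypothetical names. Note that Java's Stream.reduce() expects the combining function to be associative, which addition is, so the left-to-right nesting shown above and any other grouping give the same result.

```java
import java.util.Arrays;
import java.util.List;

class ReduceExample {
    // Folds the list into one value, starting from the initial value 0.
    static int sumWithReduce(List<Integer> xs) {
        return xs.stream().reduce(0, (x, y) -> x + y);
    }

    // The equivalent imperative loop with a mutable accumulator.
    static int sumWithLoop(List<Integer> xs) {
        int acc = 0;
        for (int y : xs) {
            acc = acc + y;
        }
        return acc;
    }

    public static void main(String[] args) {
        List<Integer> xs = Arrays.asList(1, 2, 3, 4);
        System.out.println(sumWithReduce(xs)); // prints 10
        System.out.println(sumWithLoop(xs));   // prints 10
    }
}
```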

Adopting FP with Java 8 is
straightforward
Simply start using streams and lambdas
Eclipse can refactor anonymous inner classes to lambdas

Let’s imagine
that you are writing code to display the
products in a user’s wish list

The need for concurrency
Step #1
Web service request to get the user profile including wish list (list of product IDs)
Step #2
For each productId: web service request to get product info
But
Getting products sequentially => terrible response time
Need to fetch product info concurrently
Composing sequential + scatter/gather-style operations is very common

Futures are a great abstraction for
composing concurrent operations
http://en.wikipedia.org/wiki/Futures_and_promises

Composition with futures
[Diagram: a client on the main thread initiates asynchronous operations 1 and 2, which run on worker threads or as event-driven code. Each operation sets its outcome on a future (Future 1, Future 2), and the client gets the outcomes from those futures.]

But composition with basic futures is difficult
Java 7 future.get([timeout]):
Blocking API => client blocks thread
Difficult to compose multiple concurrent operations
Futures with callbacks:
e.g. Guava ListenableFutures, Spring 4 ListenableFuture
Attach callbacks to all futures and asynchronously consume outcomes
But callback-based code = messy code
See http://techblog.netflix.com/2013/02/rxjava-netflix-api.html
We need functional futures!

Functional futures - Scala, Java 8 CompletableFuture
def asyncPlus(x : Int, y : Int) : Future[Int] = ... x + y ...
// map() asynchronously transforms the future
val future2 = asyncPlus(4, 5).map{ _ * 3 }
assertEquals(27, Await.result(future2, 1 second))
def asyncSquare(x : Int) : Future[Int] = ... x * x ...
// flatMap() calls asyncSquare() with the eventual outcome of asyncPlus()
val f2 = asyncPlus(5, 8).flatMap { x => asyncSquare(x) }
assertEquals(169, Await.result(f2, 1 second))
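For comparison, a sketch of the same two examples with Java 8's CompletableFuture, where thenApply() plays the role of map() and thenCompose() plays the role of flatMap(). The method names asyncPlus and asyncSquare mirror the Scala above; using supplyAsync() to run them is an assumption made for this example.

```java
import java.util.concurrent.CompletableFuture;

class FunctionalFutures {
    static CompletableFuture<Integer> asyncPlus(int x, int y) {
        return CompletableFuture.supplyAsync(() -> x + y);
    }

    static CompletableFuture<Integer> asyncSquare(int x) {
        return CompletableFuture.supplyAsync(() -> x * x);
    }

    // thenApply ~ map: transforms the eventual outcome with a plain function.
    static CompletableFuture<Integer> plusThenTriple() {
        return asyncPlus(4, 5).thenApply(sum -> sum * 3);
    }

    // thenCompose ~ flatMap: feeds the eventual outcome into another async operation.
    static CompletableFuture<Integer> plusThenSquare() {
        return asyncPlus(5, 8).thenCompose(FunctionalFutures::asyncSquare);
    }

    public static void main(String[] args) {
        System.out.println(plusThenTriple().join()); // prints 27
        System.out.println(plusThenSquare().join()); // prints 169
    }
}
```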

Functions like map() are asynchronous
f2 = f1 map (someFn)
[Diagram: when f1 completes with Outcome1, someFn(Outcome1) is evaluated and its result becomes the outcome of f2]
Implemented using callbacks

Scala wish list service
class WishListService(...) {
  def getWishList(userId : Long) : Future[WishList] = {
    userService.getUserProfile(userId).                        // Future[UserProfile]
      map { userProfile => userProfile.wishListProductIds }.   // Future[List[Long]]
      flatMap { productIds =>
        val listOfProductFutures =
          productIds map productInfoService.getProductInfo     // List[Future[ProductInfo]]
        Future.sequence(listOfProductFutures)                  // Future[List[ProductInfo]]
      }.
      map { products => WishList(products) }                   // Future[WishList]
  }
}
Java 8 CompletableFutures let you write similar code
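A sketch of what the Java 8 version might look like. Everything here is a stand-in: the two service methods return canned data instead of making web service calls, product info is a plain String, and sequence() is a hand-written helper (CompletableFuture has no direct counterpart to Scala's Future.sequence, so it is built on allOf()).

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

class WishListSketch {
    // Hypothetical stand-in for the user profile service.
    static CompletableFuture<List<Long>> getWishListProductIds(long userId) {
        return CompletableFuture.supplyAsync(() -> Arrays.asList(1L, 2L, 3L));
    }

    // Hypothetical stand-in for the product info service.
    static CompletableFuture<String> getProductInfo(long productId) {
        return CompletableFuture.supplyAsync(() -> "product-" + productId);
    }

    // Turns a list of futures into a future of a list (gather step).
    static <T> CompletableFuture<List<T>> sequence(List<CompletableFuture<T>> futures) {
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
                .thenApply(ignored -> futures.stream()
                        .map(CompletableFuture::join)   // does not block: all are complete
                        .collect(Collectors.toList()));
    }

    static CompletableFuture<List<String>> getWishList(long userId) {
        return getWishListProductIds(userId).thenCompose(productIds -> {
            // Scatter: start one request per product id.
            List<CompletableFuture<String>> productFutures = productIds.stream()
                    .map(WishListSketch::getProductInfo)
                    .collect(Collectors.toList());
            return sequence(productFutures);
        });
    }

    public static void main(String[] args) {
        System.out.println(getWishList(42L).join()); // prints [product-1, product-2, product-3]
    }
}
```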

Your mouse is your database
Erik Meijer
http://queue.acm.org/detail.cfm?id=2169076

Introducing Reactive Extensions (Rx)
The Reactive Extensions (Rx) is a library for composing asynchronous and
event-based programs using observable sequences and LINQ-style query
operators. Using Rx, developers represent asynchronous data streams
with Observables, query asynchronous data streams using LINQ
operators, and .....
https://rx.codeplex.com/

About RxJava
Reactive Extensions (Rx) for the JVM
Original motivation for Netflix was to provide rich Futures
Implemented in Java
Adaptors for Scala, Groovy and Clojure
Embraced by Akka and Spring Reactor: http://www.reactive-streams.org/
https://github.com/Netflix/RxJava

RxJava core concepts
// An asynchronous stream of items
trait Observable[T] {
    def subscribe(observer : Observer[T]) : Subscription   // the Subscription is used to unsubscribe
    ...
}
// Notified by the Observable
trait Observer[T] {
    def onNext(value : T)
    def onCompleted()
    def onError(e : Throwable)
}
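To show how these pieces fit together, here is a toy version written in plain Java. It is not RxJava's API: the MiniRx interfaces only mirror the shape of the traits above, items() emits synchronously rather than asynchronously, and map() is implemented by wrapping the downstream observer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class MiniRx {
    interface Observer<T> {
        void onNext(T value);
        void onCompleted();
        void onError(Throwable e);
    }

    interface Subscription {
        void unsubscribe();
    }

    interface Observable<T> {
        Subscription subscribe(Observer<T> observer);
    }

    // Pushes every item to the observer, then signals completion.
    // Emission is synchronous here, so the Subscription has nothing left to cancel.
    static <T> Observable<T> items(List<T> values) {
        return observer -> {
            for (T value : values) {
                observer.onNext(value);
            }
            observer.onCompleted();
            return () -> { };
        };
    }

    // map(): subscribes to the source and forwards transformed values downstream.
    static <T, R> Observable<R> map(Observable<T> source, Function<T, R> f) {
        return observer -> source.subscribe(new Observer<T>() {
            @Override public void onNext(T value) { observer.onNext(f.apply(value)); }
            @Override public void onCompleted() { observer.onCompleted(); }
            @Override public void onError(Throwable e) { observer.onError(e); }
        });
    }

    // Records every notification the observer receives.
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        Observable<Integer> source = items(Arrays.asList(1, 2, 3));
        Observable<Integer> timesTen = map(source, x -> x * 10);
        timesTen.subscribe(new Observer<Integer>() {
            @Override public void onNext(Integer value) { log.add("next " + value); }
            @Override public void onCompleted() { log.add("completed"); }
            @Override public void onError(Throwable e) { log.add("error " + e); }
        });
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [next 10, next 20, next 30, completed]
    }
}
```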

Comparing Observable to...
Observer pattern - similar but adds
Observer.onCompleted()
Observer.onError()
Iterator pattern - mirror image
Push rather than pull
Futures - similar
Can be used as Futures
But Observables = a stream of multiple values
Collections and Streams - similar
Functional API supporting map(), flatMap(), ...
But Observables are asynchronous
Fun with observables
val every10Seconds = Observable.interval(10 seconds)
val oneItem = Observable.items(-1L)
val ticker = oneItem ++ every10Seconds
ticker emits:
-1 0 1 ...
t=0 t=10 t=20 ...
val subscription = ticker.subscribe { (value: Long) => println("value=" + value) }
...
subscription.unsubscribe()

Connecting observables to the outside world
def getTableStatus(tableName: String) : Observable[DynamoDbStatus] =
    Observable { subscriber: Subscriber[DynamoDbStatus] =>    // called once per subscriber
        // Asynchronously gets information about the DynamoDB table
        amazonDynamoDBAsyncClient.describeTableAsync(
            new DescribeTableRequest(tableName),
            new AsyncHandler[DescribeTableRequest, DescribeTableResult] {
                override def onSuccess(request: DescribeTableRequest, result: DescribeTableResult) = {
                    subscriber.onNext(DynamoDbStatus(result.getTable.getTableStatus))
                    subscriber.onCompleted()
                }
                override def onError(exception: Exception) = exception match {
                    case t: ResourceNotFoundException =>
                        subscriber.onNext(DynamoDbStatus("NOT_FOUND"))
                        subscriber.onCompleted()
                    case _ =>
                        subscriber.onError(exception)
                }
            })
    }

Transforming observables
val tableStatus : Observable[DynamoDbMessage] = ticker.flatMap { i =>
    logger.info("{}th describe table", i + 1)
    getTableStatus(name)
}
tableStatus emits:
Status1 Status2 Status3 ...
t=0 t=10 t=20 ...
+ Usual collection methods: map(), filter(), take(), drop(), ...

Calculating rolling average
class AverageTradePriceCalculator {
    def calculateAverages(trades: Observable[Trade]):
            Observable[AveragePrice] = {
        ...
    }
case class Trade(
    symbol : String,
    price : Double,
    quantity : Int
    ...
)
case class AveragePrice(
    symbol : String,
    price : Double,
    ...)

Calculating average prices
def calculateAverages(trades: Observable[Trade]): Observable[AveragePrice] = {
    trades.groupBy(_.symbol).
        map { case (symbol, tradesForSymbol) =>
            val openingEverySecond =
                Observable.items(-1L) ++ Observable.interval(1 seconds)
            def closingAfterSixSeconds(opening: Any) =
                Observable.interval(6 seconds).take(1)
            tradesForSymbol.window(openingEverySecond, closingAfterSixSeconds _).map {
                windowOfTradesForSymbol =>
                    windowOfTradesForSymbol.fold((0.0, 0, List[Double]())) { (soFar, trade) =>
                        val (sum, count, prices) = soFar
                        (sum + trade.price, count + trade.quantity, trade.price +: prices)
                    } map { case (sum, length, prices) =>
                        AveragePrice(symbol, sum / length, prices)
                    }
            }.flatten
        }.flatten
}
Create an Observable of per-symbol Observables

Let’s imagine that you want to count
word frequencies

Scala Word Count
val frequency : Map[String, Int] =
    Source.fromFile("gettysburgaddress.txt").getLines()
        .flatMap { _.split(" ") }.toList     // Map
        .groupBy(identity)
        .mapValues(_.length)                 // Reduce
frequency("THE") should be(11)
frequency("LIBERTY") should be(1)
But how to scale to a cluster of
machines?

Apache Hadoop
Open-source software for reliable, scalable, distributed computing
Hadoop Distributed File System (HDFS)
Efficiently stores very large amounts of data
Files are partitioned and replicated across multiple machines
Hadoop MapReduce
Batch processing system
Provides plumbing for writing distributed jobs
Handles failures
...

Overview of MapReduce
[Diagram: Input Data -> Mappers -> Shuffle -> Reducers -> Output Data. Each mapper reads (K,V) pairs and emits (K,V)* pairs. The shuffle groups values by key, so each reducer receives (K1,V, ....)*, (K2,V, ....)* or (K3,V, ....)* and emits (K,V) pairs.]
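The three phases can be imitated in memory to see how data flows between them. The sketch below is not Hadoop code: MiniMapReduce and its methods are hypothetical, everything runs in one process, and a TreeMap stands in for the shuffle that Hadoop performs across machines.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MiniMapReduce {
    // Map phase: one input line in, zero or more (word, 1) pairs out.
    static List<Map.Entry<String, Integer>> mapper(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Shuffle phase: group all emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: one key and all of its values in, one total out.
    static int reducer(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(mapper(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(pairs).forEach((key, values) -> result.put(key, reducer(key, values)));
        return result;
    }
}
```

For example, run(Arrays.asList("the cat the", "The dog")) yields {cat=1, dog=1, the=3}.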

MapReduce Word count - mapper
class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);   // emit (word, 1)
        }
    }
}
Input: Four score and seven years
Output: (“Four”, 1), (“score”, 1), (“and”, 1), (“seven”, 1), ...
http://wiki.apache.org/hadoop/WordCount

Hadoop then shuffles the key-value pairs...

MapReduce Word count - reducer
class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key,
            Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));   // emit (word, total)
    }
}
Input: (“the”, (1, 1, 1, 1, 1, 1, ...))
Output: (“the”, 11)
http://wiki.apache.org/hadoop/WordCount

About MapReduce
Very simple programming abstraction yet incredibly powerful
By chaining together multiple map/reduce jobs you can process very large amounts of data in interesting ways
e.g. Apache Mahout for machine learning
But
Mappers and Reducers = verbose code
Development is challenging, e.g. unit testing is difficult
It’s disk-based batch processing => slow

Scalding: Scala DSL for MapReduce
class WordCountJob(args : Args) extends Job(args) {
    TextLine( args("input") )
        .flatMap('line -> 'word) { line : String => tokenize(line) }
        .groupBy('word) { _.size }
        .write( Tsv( args("output") ) )
    def tokenize(text : String) : Array[String] = {
        text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "")
            .split("\\s+")
    }
}
https://github.com/twitter/scalding
Expressive and unit testable
Each row is a map of named fields

Apache Spark
Part of the Hadoop ecosystem
Key abstraction = Resilient Distributed Datasets (RDD)
Collection that is partitioned across cluster members
Operations are parallelized
Created from either a Scala collection or a Hadoop supported datasource - HDFS, S3 etc
Can be cached in-memory for super-fast performance
Can be replicated for fault-tolerance
REPL for executing ad hoc queries
http://spark.apache.org

Spark Word Count
val sc = new SparkContext(...)
sc.textFile("s3n://mybucket/...")
.flatMap { _.split(" ")}
.groupBy(identity)
.mapValues(_.length)
.toArray.toMap
}
}
Expressive, unit testable and very fast

Summary
Functional programming enables the elegant expression of good ideas in a wide variety of domains
map(), flatMap() and reduce() are remarkably versatile higher-order functions
Use FP and OOP together
Java 8 has taken a good first step towards supporting FP

Questions?
@crichardson chris@chrisrichardson.net
http://plainoldobjects.com
