
Map, Flatmap and Reduce are Your New Best Friends: Simpler Collections, Concurrency, and Big Data (#oscon)

Founder of Eventuate, Inc - a microservices startup
Jul 30, 2014


  1. Map(), flatMap() and reduce() are your new best friends: simpler collections, concurrency, and big data Chris Richardson Author of POJOs in Action Founder of the original CloudFoundry.com @crichardson chris@chrisrichardson.net http://plainoldobjects.com
  2. @crichardson Presentation goal How functional programming simplifies your code Show that map(), flatMap() and reduce() are remarkably versatile functions
  3. @crichardson About Chris
  4. @crichardson About Chris Founder of a buzzword compliant (stealthy, social, mobile, big data, machine learning, ...) startup Consultant helping organizations improve how they architect and deploy applications using cloud, micro services, polyglot applications, NoSQL, ...
  5. @crichardson Agenda Why functional programming? Simplifying collection processing Simplifying concurrency with Futures and Rx Observables Tackling big data problems with functional programming
  6. @crichardson Functional programming is a programming paradigm Functions are the building blocks of the application Best done in a functional programming language
  7. @crichardson Functions as first class citizens Assign functions to variables Store functions in fields Use and write higher-order functions: Pass functions as arguments Return functions as values
  8. @crichardson Avoids mutable state Use: Immutable data structures Single assignment variables Some functional languages such as Haskell don’t allow side-effects
  9. @crichardson Why functional programming? "the highest goal of programming-language design to enable good ideas to be elegantly expressed" http://en.wikipedia.org/wiki/Tony_Hoare
  10. @crichardson Why functional programming? More expressive More intuitive - declarative code matches problem definition Functional code is usually much more composable Immutable state: Less error-prone Easy parallelization and concurrency But be pragmatic
  11. @crichardson An ancient idea that has recently become popular
  12. @crichardson Mathematical foundation: λ-calculus Introduced by Alonzo Church in the 1930s
  13. @crichardson Lisp = an early functional language invented in 1958 http://en.wikipedia.org/wiki/Lisp_(programming_language) [Timeline, 1940 to 2010, annotated with: garbage collection, dynamic typing, self-hosting compiler, tree data structures] (defun factorial (n) (if (<= n 1) 1 (* n (factorial (- n 1)))))
  14. @crichardson My final year project in 1985: Implementing SASL sieve (p:xs) = p : sieve [x | x <- xs, rem x p > 0]; primes = sieve [2..] A list of integers starting with 2 Filter out multiples of p
  15. Mostly an Ivory Tower technology Lisp was used for AI FP languages: Miranda, ML, Haskell, ... “Side-effects kills kittens and puppies”
  16. @crichardson http://steve-yegge.blogspot.com/2010/12/haskell-researchers-announce-discovery.html
  17. @crichardson But today FP is mainstream Clojure - a dialect of Lisp Scala - a hybrid OO/functional language F# - a hybrid OO/FP language for .NET Java 8 has lambda expressions
  18. @crichardson Java 8 lambda expressions are functions Examples: (1) x -> x * x (2) x -> { for (int i = 2; i <= Math.sqrt(x); i = i + 1) { if (x % i == 0) return false; } return true; } (3) (x, y) -> x * x + y * y A lambda is an instance of an anonymous inner class that implements a functional interface (kinda)
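To make the lambda slide concrete, here is a minimal sketch that assigns the three lambdas to standard `java.util.function` interfaces and adds one higher-order function. The class name `LambdaExamples`, the constant names, and the helper `twice` are hypothetical. The primality test adds an `x >= 2` guard and uses `i * i <= x` as the loop bound; both are assumptions of this sketch so that 0, 1 and perfect squares are rejected.

```java
import java.util.function.BinaryOperator;
import java.util.function.IntPredicate;
import java.util.function.IntUnaryOperator;

class LambdaExamples {
    // x -> x * x, stored in a variable like any other value.
    static final IntUnaryOperator SQUARE = x -> x * x;

    // Trial-division primality test written as a block lambda.
    static final IntPredicate IS_PRIME = x -> {
        if (x < 2) return false;
        for (int i = 2; (long) i * i <= x; i++) {
            if (x % i == 0) return false;
        }
        return true;
    };

    // (x, y) -> x * x + y * y
    static final BinaryOperator<Integer> SUM_OF_SQUARES = (x, y) -> x * x + y * y;

    // A higher-order function: takes a function and returns a new function
    // that applies it twice.
    static IntUnaryOperator twice(IntUnaryOperator f) {
        return f.andThen(f);
    }

    public static void main(String[] args) {
        System.out.println(SQUARE.applyAsInt(7));         // 49
        System.out.println(IS_PRIME.test(9));             // false
        System.out.println(SUM_OF_SQUARES.apply(3, 4));   // 25
        System.out.println(twice(SQUARE).applyAsInt(3));  // 81
    }
}
```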
  19. @crichardson Agenda Why functional programming? Simplifying collection processing Simplifying concurrency with Futures and Rx Observables Tackling big data problems with functional programming
  20. @crichardson Lots of application code = collection processing: Mapping, filtering, and reducing
  21. @crichardson Social network example public class Person { enum Gender { MALE, FEMALE } private Name name; private LocalDate birthday; private Gender gender; private Hometown hometown; private Set<Friend> friends = new HashSet<Friend>(); .... public class Friend { private Person friend; private LocalDate becameFriends; ... } public class SocialNetwork { private Set<Person> people; ...
  22. @crichardson Typical iterative code - e.g. filtering public class SocialNetwork { private Set<Person> people; ... public Set<Person> lonelyPeople() { Set<Person> result = new HashSet<Person>(); for (Person p : people) { if (p.getFriends().isEmpty()) result.add(p); } return result; } Declare result variable Modify result Return result Iterate
  23. @crichardson Problems with this style of programming Low level Imperative (how to do it) NOT declarative (what to do) Verbose Mutable variables are potentially error prone Difficult to parallelize
  24. @crichardson Java 8 streams to the rescue A sequence of elements “Wrapper” around a collection (and other types: e.g. JarFile.stream(), Files.lines()) Streams can also be infinite Provides a functional/lambda-based API for transforming, filtering and aggregating elements Much simpler, cleaner and declarative code
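The point that streams can be infinite can be sketched as follows. `Stream.iterate` produces an unbounded sequence and `limit` makes the pipeline terminate. The class `InfiniteStreamExample` and its methods are hypothetical names chosen for this illustration.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

class InfiniteStreamExample {
    // True when x has no divisor between 2 and sqrt(x).
    static boolean isPrime(int x) {
        return x >= 2
                && IntStream.rangeClosed(2, (int) Math.sqrt(x)).noneMatch(i -> x % i == 0);
    }

    // 2, 3, 4, ... is conceptually infinite; elements are produced lazily
    // and limit(n) stops the pipeline after n primes have been found.
    static List<Integer> firstPrimes(int n) {
        return Stream.iterate(2, i -> i + 1)
                .filter(InfiniteStreamExample::isPrime)
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(firstPrimes(5)); // [2, 3, 5, 7, 11]
    }
}
```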
  25. @crichardson Using Java 8 streams - filtering Before: public class SocialNetwork { private Set<Person> people; ... public Set<Person> peopleWithNoFriends() { Set<Person> result = new HashSet<Person>(); for (Person p : people) { if (p.getFriends().isEmpty()) result.add(p); } return result; } After: public class SocialNetwork { private Set<Person> people; ... public Set<Person> lonelyPeople() { return people.stream() .filter(p -> p.getFriends().isEmpty()) .collect(Collectors.toSet()); } Annotation: the argument to filter() is a predicate lambda expression
  26. @crichardson The filter() function s1 a b c d e ... s2 a c d ... s2 = s1.filter(f) Elements that satisfy predicate f
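A self-contained version of the filtering example might look like the sketch below. The `Person` class here is a simplified, hypothetical stand-in for the one on the slides (public fields, friends held directly as `Person` objects), and `demo()` exists only to exercise the method.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

class FilterExample {
    // Simplified stand-in for the Person class in the social network example.
    static class Person {
        final String name;
        final Set<Person> friends = new HashSet<>();
        Person(String name) { this.name = name; }
    }

    // Keeps only the people for whom the predicate holds (no friends),
    // then projects each one to a name so the result is easy to inspect.
    static Set<String> lonelyPeople(Collection<Person> people) {
        return people.stream()
                .filter(p -> p.friends.isEmpty())
                .map(p -> p.name)
                .collect(Collectors.toSet());
    }

    static Set<String> demo() {
        Person ann = new Person("Ann");
        Person bob = new Person("Bob");
        Person cat = new Person("Cat");
        ann.friends.add(bob);
        bob.friends.add(ann);
        return lonelyPeople(Arrays.asList(ann, bob, cat));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [Cat]
    }
}
```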
  27. @crichardson Using Java 8 streams - mapping class Person .. private Set<Friend> friends = ...; public Set<Hometown> hometownsOfFriends() { return friends.stream() .map(f -> f.getPerson().getHometown()) .collect(Collectors.toSet()); }
  28. @crichardson The map() function s1 a b c d e ... s2 f(a) f(b) f(c) f(d) f(e) ... s2 = s1.map(f)
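The mapping step can be sketched in the same style. Again `Person` is a hypothetical, simplified stand-in with the hometown held as a plain `String`. The stream has one output element per input element, and collecting into a set removes duplicate hometowns.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Set;
import java.util.stream.Collectors;

class MapExample {
    // Simplified stand-in: a person with a name and a hometown.
    static class Person {
        final String name;
        final String hometown;
        Person(String name, String hometown) {
            this.name = name;
            this.hometown = hometown;
        }
    }

    // map() applies the function to every element: Person -> hometown.
    static Set<String> hometowns(Collection<Person> friends) {
        return friends.stream()
                .map(f -> f.hometown)
                .collect(Collectors.toSet());
    }

    static Set<String> demo() {
        return hometowns(Arrays.asList(
                new Person("Ann", "Oakland"),
                new Person("Bob", "Oakland"),
                new Person("Cat", "London")));
    }

    public static void main(String[] args) {
        // Two distinct hometowns; set iteration order is unspecified.
        System.out.println(demo());
    }
}
```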
  29. @crichardson Using Java 8 streams - friend of friends using flatMap class Person .. public Set<Person> friendOfFriends() { return friends.stream() .flatMap(friend -> friend.getPerson().friends.stream()) .map(Friend::getPerson) .filter(f -> f != this) .collect(Collectors.toSet()); } maps and flattens
  30. @crichardson The flatMap() function s1 a b ... s2 f(a)0 f(a)1 f(b)0 f(b)1 f(b)2 ... s2 = s1.flatMap(f)
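Here is a runnable sketch of the friends-of-friends idea. Each friend maps to a stream of that friend's friends, and `flatMap` concatenates those streams into one. The `Person` class, the `befriend` helper and the sample data are hypothetical. A `TreeSet` is used only so the output order is predictable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

class FlatMapExample {
    static class Person {
        final String name;
        final List<Person> friends = new ArrayList<>();
        Person(String name) { this.name = name; }

        // One stream per friend, flattened into a single stream of people.
        Set<String> friendsOfFriends() {
            return friends.stream()
                    .flatMap(friend -> friend.friends.stream())
                    .filter(p -> p != this)            // exclude myself
                    .map(p -> p.name)
                    .collect(Collectors.toCollection(TreeSet::new));
        }
    }

    // Friendship is symmetric in this sketch.
    static void befriend(Person a, Person b) {
        a.friends.add(b);
        b.friends.add(a);
    }

    static Set<String> demo() {
        Person ann = new Person("Ann");
        Person bob = new Person("Bob");
        Person cat = new Person("Cat");
        Person dan = new Person("Dan");
        Person eve = new Person("Eve");
        befriend(ann, bob);
        befriend(ann, cat);
        befriend(bob, dan);
        befriend(cat, eve);
        return ann.friendsOfFriends();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [Dan, Eve]
    }
}
```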
  31. @crichardson Using Java 8 streams - reducing public class SocialNetwork { private Set<Person> people; ... public long averageNumberOfFriends() { return people.stream() .map ( p -> p.getFriends().size() ) .reduce(0, (x, y) -> x + y) / people.size(); } The reduce() call is equivalent to: int x = 0; for (int y : inputStream) x = x + y; return x;
  32. @crichardson The reduce() function s1 a b c d e ... x = s1.reduce(initial, f) f(f(f(f(f(f(initial, a), b), c), d), e), ...)
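The fold described above can be checked against the equivalent loop with a small sketch. The class and method names are hypothetical. Note that dividing an `int` sum by an `int` size truncates in Java, so this sketch also shows `IntStream.average()` as one way to get a fractional mean.

```java
import java.util.Arrays;
import java.util.List;

class ReduceExample {
    // reduce(initial, f) computes f(f(f(initial, a), b), c) ...
    static int sum(List<Integer> xs) {
        return xs.stream().reduce(0, (x, y) -> x + y);
    }

    // The same computation written as an explicit loop.
    static int sumLoop(List<Integer> xs) {
        int x = 0;
        for (int y : xs) x = x + y;
        return x;
    }

    // A non-truncating average; returns 0.0 for an empty list.
    static double average(List<Integer> xs) {
        return xs.stream().mapToInt(Integer::intValue).average().orElse(0.0);
    }

    public static void main(String[] args) {
        List<Integer> friendCounts = Arrays.asList(3, 1, 1);
        System.out.println(sum(friendCounts));                        // 5
        System.out.println(sum(friendCounts) / friendCounts.size());  // 1 (integer division)
        System.out.println(average(friendCounts));                    // about 1.67
    }
}
```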
  33. @crichardson Adopting FP with Java 8 is straightforward Simply start using streams and lambdas Eclipse can refactor anonymous inner classes to lambdas
  34. @crichardson Agenda Why functional programming? Simplifying collection processing Simplifying concurrency with Futures and Rx Observables Tackling big data problems with functional programming
  35. @crichardson Let’s imagine that you are writing code to display the products in a user’s wish list
  36. @crichardson The need for concurrency Step #1: Web service request to get the user profile, including the wish list (list of product IDs) Step #2: For each productId, a web service request to get product info But getting products sequentially gives terrible response time, so we need to fetch productInfo concurrently Composing sequential + scatter/gather-style operations is very common
  37. @crichardson Futures are a great abstraction for composing concurrent operations http://en.wikipedia.org/wiki/Futures_and_promises
  38. @crichardson Composition with futures [Diagram: a client on the main thread initiates asynchronous operations 1 and 2, which run on a worker thread or in event-driven code; each operation sets the outcome of its future (Future 1, Future 2), and the client gets the outcomes from the futures]
  39. @crichardson But composition with basic futures is difficult Java 7 future.get([timeout]): Blocking API, so the client blocks a thread Difficult to compose multiple concurrent operations Futures with callbacks: e.g. Guava ListenableFutures, Spring 4 ListenableFuture Attach callbacks to all futures and asynchronously consume outcomes But callback-based code = messy code See http://techblog.netflix.com/2013/02/rxjava-netflix-api.html We need functional futures!
  40. @crichardson Functional futures - Scala, Java 8 CompletableFuture def asyncPlus(x : Int, y : Int) : Future[Int] = ... x + y ... val future2 = asyncPlus(4, 5).map{ _ * 3 } assertEquals(27, Await.result(future2, 1 second)) Asynchronously transforms future def asyncSquare(x : Int) : Future[Int] = ... x * x ... val f2 = asyncPlus(5, 8).flatMap { x => asyncSquare(x) } assertEquals(169, Await.result(f2, 1 second)) Calls asyncSquare() with the eventual outcome of asyncPlus()
  41. @crichardson Functions like map() are asynchronous f2 = f1 map (someFn) [Diagram: when f1 completes with Outcome1, f2 completes with someFn(outcome1)] Implemented using callbacks
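For readers on the JVM without Scala, the two Scala examples above can be approximated with Java 8's `CompletableFuture`, where `thenApply` plays the role of `map` and `thenCompose` plays the role of `flatMap`. This is a sketch, not a translation endorsed by the talk. `asyncPlus` and `asyncSquare` are implemented here with `supplyAsync` purely for illustration.

```java
import java.util.concurrent.CompletableFuture;

class FutureExample {
    // Stand-ins for asynchronous operations; each runs on the common pool.
    static CompletableFuture<Integer> asyncPlus(int x, int y) {
        return CompletableFuture.supplyAsync(() -> x + y);
    }

    static CompletableFuture<Integer> asyncSquare(int x) {
        return CompletableFuture.supplyAsync(() -> x * x);
    }

    public static void main(String[] args) {
        // map: transform the eventual outcome without blocking.
        CompletableFuture<Integer> tripled = asyncPlus(4, 5).thenApply(sum -> sum * 3);

        // flatMap: feed the eventual outcome into another async operation.
        CompletableFuture<Integer> squared =
                asyncPlus(5, 8).thenCompose(FutureExample::asyncSquare);

        // join() blocks; it is used here only to print the results.
        System.out.println(tripled.join()); // 27
        System.out.println(squared.join()); // 169
    }
}
```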
  42. @crichardson Scala wish list service class WishListService(...) { def getWishList(userId : Long) : Future[WishList] = { userService.getUserProfile(userId). map { userProfile => userProfile.wishListProductIds}. flatMap { productIds => val listOfProductFutures = productIds map productInfoService.getProductInfo; Future.sequence(listOfProductFutures) }. map { products => WishList(products) } } } Intermediate types, in order: Future[UserProfile], Future[List[Long]], List[Future[ProductInfo]], Future[List[ProductInfo]], Future[WishList] Java 8 CompletableFutures let you write similar code
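The slide notes that Java 8 CompletableFutures allow similar code. One possible shape is sketched below under stated assumptions. The two service methods are hypothetical stubs returning canned data. Product info is reduced to a `String`. The `sequence` helper is written by hand on top of `CompletableFuture.allOf`, because the JDK has no direct counterpart to Scala's `Future.sequence`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

class WishListExample {
    // Hypothetical stub for the user profile service (step 1).
    static CompletableFuture<List<Long>> getWishListProductIds(long userId) {
        return CompletableFuture.supplyAsync(() -> Arrays.asList(1L, 2L, 3L));
    }

    // Hypothetical stub for the product info service (step 2).
    static CompletableFuture<String> getProductInfo(long productId) {
        return CompletableFuture.supplyAsync(() -> "product-" + productId);
    }

    // List<Future<T>> -> Future<List<T>>: completes when every input has completed.
    static <T> CompletableFuture<List<T>> sequence(List<CompletableFuture<T>> futures) {
        return CompletableFuture
                .allOf(futures.toArray(new CompletableFuture<?>[0]))
                .thenApply(ignored -> futures.stream()
                        .map(CompletableFuture::join)   // already complete, does not block
                        .collect(Collectors.toList()));
    }

    // Sequential step, then scatter/gather over the product ids.
    static CompletableFuture<List<String>> getWishList(long userId) {
        return getWishListProductIds(userId).thenCompose(ids -> {
            List<CompletableFuture<String>> productFutures = ids.stream()
                    .map(WishListExample::getProductInfo)   // all requests start here
                    .collect(Collectors.toList());
            return sequence(productFutures);
        });
    }

    public static void main(String[] args) {
        System.out.println(getWishList(42L).join()); // [product-1, product-2, product-3]
    }
}
```

The result keeps the order of the product ids because `sequence` reads the futures in list order, regardless of which request finishes first.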
  43. @crichardson Your mouse is your database Erik Meijer http://queue.acm.org/detail.cfm?id=2169076
  44. @crichardson Introducing Reactive Extensions (Rx) The Reactive Extensions (Rx) is a library for composing asynchronous and event-based programs using observable sequences and LINQ-style query operators. Using Rx, developers represent asynchronous data streams with Observables, query asynchronous data streams using LINQ operators, and ..... https://rx.codeplex.com/
  45. @crichardson About RxJava Reactive Extensions (Rx) for the JVM Original motivation for Netflix was to provide rich Futures Implemented in Java Adaptors for Scala, Groovy and Clojure Embraced by Akka and Spring Reactor: http://www.reactive-streams.org/ https://github.com/Netflix/RxJava
  46. @crichardson RxJava core concepts trait Observable[T] { def subscribe(observer : Observer[T]) : Subscription ... } trait Observer[T] { def onNext(value : T) def onCompleted() def onError(e : Throwable) } Notifies An asynchronous stream of items Used to unsubscribe
  47. Comparing Observable to... Observer pattern - similar but adds Observer.onCompleted() and Observer.onError() Iterator pattern - mirror image: push rather than pull Futures - similar: can be used as Futures, but Observables = a stream of multiple values Collections and Streams - similar: functional API supporting map(), flatMap(), ... but Observables are asynchronous
  48. @crichardson Fun with observables val every10Seconds = Observable.interval(10 seconds) val oneItem = Observable.items(-1L) val ticker = oneItem ++ every10Seconds val subscription = ticker.subscribe { (value: Long) => println("value=" + value) } ... subscription.unsubscribe() [Timeline: ticker emits -1, 0, 1, ... at t=0, t=10, t=20, ...]
  49. @crichardson Connecting observables to the outside world def getTableStatus(tableName: String) : Observable[DynamoDbStatus]= Observable { subscriber: Subscriber[DynamoDbStatus] => amazonDynamoDBAsyncClient.describeTableAsync( new DescribeTableRequest(tableName), new AsyncHandler[DescribeTableRequest, DescribeTableResult] { override def onSuccess(request: DescribeTableRequest, result: DescribeTableResult) = { subscriber.onNext(DynamoDbStatus(result.getTable.getTableStatus)) subscriber.onCompleted() } override def onError(exception: Exception) = exception match { case t: ResourceNotFoundException => subscriber.onNext(DynamoDbStatus("NOT_FOUND")) subscriber.onCompleted() case _ => subscriber.onError(exception) } }) } Annotations: the Observable { ... } block is called once per subscriber; describeTableAsync asynchronously gets information about the DynamoDB table
  50. @crichardson Transforming observables val tableStatus : Observable[DynamoDbMessage] = ticker.flatMap { i => logger.info("{}th describe table", i + 1) getTableStatus(name) } Status1 Status2 Status3 ... t=0 t=10 t=20 ... + Usual collection methods: map(), filter(), take(), drop(), ...
  51. @crichardson Calculating rolling average class AverageTradePriceCalculator { def calculateAverages(trades: Observable[Trade]): Observable[AveragePrice] = { ... } case class Trade( symbol : String, price : Double, quantity : Int ... ) case class AveragePrice( symbol : String, price : Double, ...)
  52. @crichardson Calculating average prices def calculateAverages(trades: Observable[Trade]): Observable[AveragePrice] = { trades.groupBy(_.symbol). map { case (symbol, tradesForSymbol) => val openingEverySecond = Observable.items(-1L) ++ Observable.interval(1 seconds) def closingAfterSixSeconds(opening: Any) = Observable.interval(6 seconds).take(1) tradesForSymbol.window(openingEverySecond, closingAfterSixSeconds _).map { windowOfTradesForSymbol => windowOfTradesForSymbol.fold((0.0, 0, List[Double]())) { (soFar, trade) => val (sum, count, prices) = soFar (sum + trade.price, count + trade.quantity, trade.price +: prices) } map { case (sum, length, prices) => AveragePrice(symbol, sum / length, prices) } }.flatten }.flatten } Create an Observable of per-symbol Observables
  53. @crichardson Agenda Why functional programming? Simplifying collection processing Simplifying concurrency with Futures and Rx Observables Tackling big data problems with functional programming
  54. @crichardson Let’s imagine that you want to count word frequencies
  55. @crichardson Scala Word Count val frequency : Map[String, Int] = Source.fromFile("gettysburgaddress.txt").getLines() .flatMap { _.split(" ") }.toList .groupBy(identity) .mapValues(_.length) frequency("THE") should be(11) frequency("LIBERTY") should be(1) Annotations: Map (the flatMap step), Reduce (the groupBy and mapValues steps)
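The same word count can be sketched with Java 8 streams, keeping the flatMap-then-group shape of the Scala version. The input is passed in as a string rather than read from a file. Upper-casing the words is an assumption made here so that lookups such as "THE" match regardless of the case used in the text.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class WordCountExample {
    static Map<String, Long> frequency(String text) {
        return Arrays.stream(text.split("\\n"))                              // one element per line
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))   // "map": line -> words
                .filter(word -> !word.isEmpty())
                .map(String::toUpperCase)
                .collect(Collectors.groupingBy(                               // "reduce": count per word
                        Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = frequency("the quick fox\njumps over the lazy dog");
        System.out.println(counts.get("THE")); // 2
        System.out.println(counts.get("FOX")); // 1
    }
}
```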
  56. @crichardson But how to scale to a cluster of machines?
  57. @crichardson Apache Hadoop Open-source software for reliable, scalable, distributed computing Hadoop Distributed File System (HDFS) Efficiently stores very large amounts of data Files are partitioned and replicated across multiple machines Hadoop MapReduce Batch processing system Provides plumbing for writing distributed jobs Handles failures ...
  58. @crichardson Overview of MapReduce Input Data Mapper Mapper Mapper Reducer Reducer Reducer Output Data Shuffle (K,V) (K,V) (K,V) (K,V)* (K,V)* (K,V)* (K1,V, ....)* (K2,V, ....)* (K3,V, ....)* (K,V) (K,V) (K,V)
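The three phases in the overview can be imitated in a single process to show how the data changes shape. This is an in-memory sketch only, with no Hadoop APIs involved, and the names `MiniMapReduce`, `mapper`, `shuffle` and `reducer` are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class MiniMapReduce {
    // Map phase: one input line produces zero or more (word, 1) pairs.
    static List<Map.Entry<String, Integer>> mapper(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(token.toLowerCase(), 1));
            }
        }
        return out;
    }

    // Shuffle phase: gather all values that share a key, e.g. ("and", [1, 1]).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse the values for one key into a single result.
    static int reducer(String key, List<Integer> values) {
        return values.stream().reduce(0, Integer::sum);
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = lines.stream()
                .flatMap(line -> mapper(line).stream())
                .collect(Collectors.toList());
        Map<String, Integer> result = new TreeMap<>();
        shuffle(pairs).forEach((key, values) -> result.put(key, reducer(key, values)));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("Four score and seven", "and seven years")));
        // {and=2, four=1, score=1, seven=2, years=1}
    }
}
```

In a real cluster the mapper and reducer calls run on different machines and the shuffle moves data over the network; here everything runs in one thread so the data flow is easy to follow.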
  59. @crichardson MapReduce Word count - mapper class Map extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, Context context) { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, one); } } } (“Four”, 1), (“score”, 1), (“and”, 1), (“seven”, 1), ... Four score and seven years http://wiki.apache.org/hadoop/WordCount
  60. @crichardson Hadoop then shuffles the key-value pairs...
  61. @crichardson MapReduce Word count - reducer class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterable<IntWritable> values, Context context) { int sum = 0; for (IntWritable val : values) { sum += val.get(); } context.write(key, new IntWritable(sum)); } } (“the”, 11) (“the”, (1, 1, 1, 1, 1, 1, ...)) http://wiki.apache.org/hadoop/WordCount
  62. @crichardson About MapReduce Very simple programming abstraction yet incredibly powerful By chaining together multiple map/reduce jobs you can process very large amounts of data in interesting ways e.g. Apache Mahout for machine learning But Mappers and Reducers = verbose code Development is challenging, e.g. unit testing is difficult It’s disk-based batch processing, and therefore slow
  63. @crichardson Scalding: Scala DSL for MapReduce class WordCountJob(args : Args) extends Job(args) { TextLine( args("input") ) .flatMap('line -> 'word) { line : String => tokenize(line) } .groupBy('word) { _.size } .write( Tsv( args("output") ) ) def tokenize(text : String) : Array[String] = { text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "") .split("\\s+") } } https://github.com/twitter/scalding Expressive and unit testable Each row is a map of named fields
  64. @crichardson Apache Spark Part of the Hadoop ecosystem Key abstraction = Resilient Distributed Datasets (RDD) Collection that is partitioned across cluster members Operations are parallelized Created from either a Scala collection or a Hadoop supported datasource - HDFS, S3 etc Can be cached in-memory for super-fast performance Can be replicated for fault-tolerance REPL for executing ad hoc queries http://spark.apache.org
  65. @crichardson Spark Word Count val sc = new SparkContext(...) sc.textFile("s3n://mybucket/...") .flatMap { _.split(" ")} .groupBy(identity) .mapValues(_.length) .toArray.toMap Expressive, unit testable and very fast
  66. @crichardson Summary Functional programming enables the elegant expression of good ideas in a wide variety of domains map(), flatMap() and reduce() are remarkably versatile higher-order functions Use FP and OOP together Java 8 has taken a good first step towards supporting FP
  67. @crichardson Questions? @crichardson chris@chrisrichardson.net http://plainoldobjects.com