2. MONOID!
Actually it's a semigroup, monoid just sounds more interesting :)
A Little Teaser
Crunch: CombineFns are used to represent the associative operations...
PGroupedTable<K, V>::combineValues(CombineFn<K, V> combineFn,
  CombineFn<K, V> reduceFn)
Scalding: reduce with fn which must be associative and commutative
KeyedList[K, T]::reduce(fn: (T, T) => T)
Spark: Merge the values for each key using an associative reduce function
PairRDDFunctions[K, V]::reduceByKey(fn: (V, V) => V)
All of them work on both the mapper and the reducer side
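Why associativity matters here can be sketched in plain Scala collections, with no Crunch, Scalding, or Spark involved. The names below (`Semigroup`, `combinePartition`, `mergePartials`) are hypothetical and are not the API of any of those libraries; the sketch only assumes that each inner `Seq` stands in for one mapper's partition. Because the operation is associative, and commutative in this case, reducing each partition first and then merging the partial results gives the same answer as reducing everything in one place.

```scala
// Hypothetical type class: a type with an associative combine operation.
trait Semigroup[T] {
  def combine(a: T, b: T): T
}

object SemigroupSketch {
  // Integer addition is associative and commutative, so it qualifies.
  val intSum: Semigroup[Int] = new Semigroup[Int] {
    def combine(a: Int, b: Int): Int = a + b
  }

  // Reduce the values of each key within one partition ("mapper side").
  def combinePartition[K, V](part: Seq[(K, V)], sg: Semigroup[V]): Map[K, V] =
    part.groupBy(_._1).map { case (k, kvs) =>
      (k, kvs.map(_._2).reduce((a, b) => sg.combine(a, b)))
    }

  // Merge the partial results of all partitions ("reducer side").
  def mergePartials[K, V](partials: Seq[Map[K, V]], sg: Semigroup[V]): Map[K, V] =
    combinePartition(partials.flatMap(_.toSeq), sg)

  def main(args: Array[String]): Unit = {
    // Two partitions, as if the input had been split across two mappers.
    val partitions = Seq(
      Seq("a" -> 1, "b" -> 2, "a" -> 3),
      Seq("b" -> 4, "a" -> 5)
    )
    val twoStage = mergePartials(partitions.map(p => combinePartition(p, intSum)), intSum)
    val oneStage = combinePartition(partitions.flatten, intSum)
    println(twoStage == oneStage) // true
  }
}
```

A monoid would add an identity element (0 for addition) on top of this, which is what makes an empty partition harmless.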
3. MY STORY
Before
Mostly Python/C++ (and PHP...)
No Java experience at all
Started using Scala early 2013
Now
Discovery's* Java backend/Riemann guy
The Scalding/Spark/Storm guy
Contributor to Spark, chill, cascading.avro
*Spotify's machine learning and recommendation team
4. WHY THIS TALK?
Not a tutorial
Discovery's experience
Why FP matters
Why Scala matters
Common misconceptions
5. WHAT WE ALREADY USE
Kafka
Scalding
Spark / MLlib
Stratosphere
Storm / Riemann (Clojure)
6. WHAT WE WANT TO INVESTIGATE
Summingbird (Scala for Storm + Hadoop)
Spark Streaming
Shark / SparkSQL
GraphX (Spark)
BIDMach (ML with GPU)
7. DISCOVERY
Mid 2013: 100+ Python jobs
10+ hires since (half since new year)
Few with Java experience, none with Scala
As of May 2014: ~100 Scalding jobs & 90 tests
More uncommitted ad-hoc jobs
12+ committers, 4+ using Spark
11. WHY FUNCTIONAL
Higher order functions
Expressions, not statements
Focus on problem solving
Not solving programming problems
12. WHY FUNCTIONAL
Word count in Python
from collections import defaultdict
lyrics = ["We all live in Amerika", "Amerika ist wunderbar"]
wc = defaultdict(int)
for l in lyrics:
    for w in l.split():
        wc[w] += 1
Screen too small for the Java version
13. WHY FUNCTIONAL
Map and reduce are key concepts in FP
val lyrics = List("We all live in Amerika", "Amerika ist wunderbar")
lyrics.flatMap(_.split(" "))             // map
  .groupBy(identity)                     // shuffle
  .map { case (k, g) => (k, g.size) }    // reduce
(def lyrics ["We all live in Amerika" "Amerika ist wunderbar"])
(->> lyrics (mapcat #(clojure.string/split % #"\s"))
  (group-by identity)
  (map (fn [[k g]] [k (count g)])))
import Control.Arrow
import Data.List
let lyrics = ["We all live in Amerika", "Amerika ist wunderbar"]
map words >>> concat
  >>> sort >>> group
  >>> map (\x -> (head x, length x)) $ lyrics
14. WHY FUNCTIONAL
Linear equation in ALS matrix factorization
$$x_u = (Y^T Y + Y^T (C^u - I) Y)^{-1} Y^T C^u p(u)$$
vectors.map { case (id, vec) => (id, vec * vec.T) }   // YtY
  .map(_._2).reduce(_ + _)
ratings.keyBy(fixedKey).join(outerProducts)           // YtCuIY
  .map { case (_, (r, op)) => (solveKey(r), op * (r.rating * alpha)) }
  .reduceByKey(_ + _)
ratings.keyBy(fixedKey).join(vectors)                 // YtCupu
  .map { case (_, (r, vec)) =>
    val Cui = r.rating * alpha + 1
    val pui = if (Cui > 0.0) 1.0 else 0.0
    (solveKey(r), vec * (Cui * pui))
  }.reduceByKey(_ + _)
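The YtY step above relies on the fact that Y^T Y equals the sum of the outer products of the rows of Y, and that matrix addition is associative, so the sum can be expressed as a reduce. A minimal sketch of that one step on plain arrays follows; it is not the Spark job from the slide, and `Vec`, `Mat`, `outer`, `add`, and `yty` are hypothetical names introduced only for illustration.

```scala
object OuterProductSketch {
  type Vec = Array[Double]
  type Mat = Array[Array[Double]]

  // Outer product v * v^T of a single factor vector.
  def outer(v: Vec): Mat = v.map(a => v.map(b => a * b))

  // Element-wise matrix addition: associative, so it can be the reduce step.
  def add(x: Mat, y: Mat): Mat =
    x.zip(y).map { case (rx, ry) => rx.zip(ry).map { case (a, b) => a + b } }

  // Y^T Y as the sum of the outer products of the rows of Y (rows must be non-empty).
  def yty(rows: Seq[Vec]): Mat = rows.map(v => outer(v)).reduce((x, y) => add(x, y))

  def main(args: Array[String]): Unit = {
    // Y = [[1, 2], [3, 4]], so Y^T Y = [[10, 14], [14, 20]].
    val result = yty(Seq(Array(1.0, 2.0), Array(3.0, 4.0)))
    result.foreach(row => println(row.mkString(" ")))
  }
}
```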
15. WHY SCALA
JVM - libraries and tools
Pythonesque syntax
Static typing with inference
Transition from imperative to FP
16. WHY SCALA
Performance vs. agility
http://nicholassterling.wordpress.com/2012/11/16/scala-performance/
18. WHY SCALA
Higher order functions
List<Integer> list = Lists.newArrayList(1, 2, 3);
Lists.transform(list, new Function<Integer, Integer>() {
  @Override
  public Integer apply(Integer input) {
    return input + 1;
  }
});
val list = List(1, 2, 3)
list.map(_ + 1) // List(2, 3, 4)
And then imagine if you have to chain or nest functions
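For comparison, chaining and composing stay short on the Scala side. This is only an illustrative sketch; `increment`, `double`, `chained`, and `incrementThenDouble` are made-up names, not part of the talk.

```scala
object ChainingSketch {
  val increment: Int => Int = _ + 1
  val double: Int => Int = _ * 2

  // Chain the steps directly on the collection...
  def chained(xs: List[Int]): List[Int] =
    xs.map(increment).filter(_ % 2 == 0).map(double)

  // ...or compose the functions first and apply them in a single map.
  val incrementThenDouble: Int => Int = increment.andThen(double)

  def main(args: Array[String]): Unit = {
    println(chained(List(1, 2, 3)))                 // List(4, 8)
    println(List(1, 2, 3).map(incrementThenDouble)) // List(4, 6, 8)
  }
}
```

Each extra step in the anonymous-class style would need another `new Function<...>() { ... }` wrapper, which is the overhead the slide alludes to.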
20. WHY SCALA
Scalding field-based word count
TextLine(path)
  .flatMap('line -> 'word) { line: String => line.split("""\W+""") }
  .groupBy('word) { _.size }
Scalding type-safe word count
TextLine(path).read.toTypedPipe[String](Fields.ALL)
  .flatMap(_.split("""\W+"""))
  .groupBy(identity).size
Scrunch word count
read(from.textFile(file))
  .flatMap(_.split("""\W+"""))
  .count
21. WHY SCALA
Summingbird word count
source
  .flatMap { line: String => line.split("""\W+""").map((_, 1)) }
  .sumByKey(store)
Spark word count
sc.textFile(path)
  .flatMap(_.split("""\W+"""))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
Stratosphere word count
TextFile(textInput)
  .flatMap(_.split("""\W+"""))
  .map(word => (word, 1))
  .groupBy(_._1)
  .reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }
22. WHY SCALA
Many patterns also common in Java
Java 8 lambdas and streams
Guava, Crunch, etc.
Optional, Predicate
Collection transformations
ListenableFuture and transform
parallelDo, DoFn, MapFn, CombineFn
24. COMMON MISCONCEPTIONS
It's slow
No slower than Python
Depends on how pure the FP is
Trade off with productivity
Drop down to Java or native libraries
25. COMMON MISCONCEPTIONS
I don't want to learn a new language
How about flatMap, reduce, fold, etc.?
Unnecessary overhead interfacing with Python or Java
You've used monoids, monads, or higher order functions already
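As a rough illustration of that last point, the sketch below uses only the Scala standard library. `fold` takes an identity element and an associative operation, which together form a monoid, and `Option.flatMap` is a monadic bind. The helper names (`total`, `joined`, `parseInt`, `reciprocal`, `parseReciprocal`) are invented for this example.

```scala
object FamiliarSketch {
  // Monoid: an associative operation plus an identity element.
  // 0 is the identity for +, "" is the identity for string concatenation.
  def total(xs: List[Int]): Int = xs.fold(0)(_ + _)
  def joined(xs: List[String]): String = xs.fold("")(_ + _)

  // Monad: flatMap chains steps that may each produce no result.
  def parseInt(s: String): Option[Int] =
    try Some(s.trim.toInt) catch { case _: NumberFormatException => None }

  def reciprocal(n: Int): Option[Double] =
    if (n == 0) None else Some(1.0 / n)

  def parseReciprocal(s: String): Option[Double] =
    parseInt(s).flatMap(n => reciprocal(n))

  def main(args: Array[String]): Unit = {
    println(total(List(1, 2, 3)))     // 6
    println(joined(List("a", "b")))   // ab
    println(parseReciprocal("4"))     // Some(0.25)
    println(parseReciprocal("zero"))  // None
  }
}
```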