2. Achilles' Heel for Hadoop
Hadoop is *apparently* not fast enough for things like ML.
Each MapReduce job has to read its data from disk all over again.
{ MR1 => HDFS => MR2 => HDFS => MR3 }
MapReduce, let's admit, is a bit too complicated.
Then there is the problem of a giant codebase:
{Hadoop : 1.7 million LOC}
{Spark : 0.35 million LOC}
4. A brief History of Spark Timeline
UC Berkeley : The home of innovation.
2009 : Started as a simple class project.
The UCB folks wanted to build a cluster management system: Mesos.
They needed something to test on top of Mesos. Voila, Spark!
2010 : Open-sourced under the BSD license
Feb 2014 : Became Apache Top Level project
Nov 2014 : New world record in large-scale sorting
https://soundcloud.com/oreilly-radar/apache-sparks-journey-from-academia-to-industry
7. Spark Concepts
In-Memory Processing
{Processors : 64-bit ~~ up to 1 TB RAM}
{Fact : RAM will always be faster than disk}
{Idea : Compress the data, then do the processing}
{Remember : The data is distributed across various machines too}
Resilient Distributed Datasets
http://www.gridgain.com/in-memory-computing-in-plain-english/
Resilient : This is Sparta; we don't give up on data without a fight.
Distributed : A part of the data is everywhere.
Dataset : Meh!
8. A bit more on RDD
The basic unit of data in Spark.
RDDs are immutable.
// int a = 0;        (mutable)
// final int b = 0;  (immutable, like an RDD)
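The same immutability shows up with any immutable Python value — a plain-Python analogy, no Spark assumed: operations return a new object instead of changing the old one, which is exactly how transformations on an RDD behave.

```python
s = "chennai"
t = s.upper()   # returns a NEW string; s itself is untouched

print(s)        # chennai
print(t)        # CHENNAI
```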
There are two main categories of operations on an RDD:
a) Transformations => lazily evaluated
   => create a new RDD from the existing RDD
b) Actions => return values
   => write results to disk
E.g. : My mom asks me to buy grocery items. She can keep adding items to the list (transformations); nothing actually happens until I go to the shop (action).
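The lazy-vs-eager split above can be sketched without a cluster. This is a plain-Python analogy, not Spark: Python 3's built-in `map` is lazy like a transformation, and `list()` forces evaluation like an action. The `calls` list is just bookkeeping to make the laziness visible.

```python
calls = []

def square(x):
    calls.append(x)   # record every time real work happens
    return x ** 2

lazy = map(square, [1, 2, 3])   # "transformation": nothing computed yet
print(calls)                    # [] -- no work done so far

result = list(lazy)             # "action": forces the evaluation
print(result)                   # [1, 4, 9]
print(calls)                    # [1, 2, 3] -- the work happened only now
```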
9. Setting Up
Download "Prebuilt for Hadoop 2.4 and later",
or build from source with Maven or sbt.
./bin/pyspark
http://spark.apache.org/downloads.html
10. Talk is Cheap! Show me the code
The PySpark shell is a REPL.
Creating an RDD:
a) From data in memory
b) From a file
c) From another RDD
rdd = sc.parallelize("ChennaiPy")                  # from a string
nums = [1, 2, 3]
rdd_nums = sc.parallelize(nums)                    # from a list
rdd_shakespeare = sc.textFile("shakespeare.txt")   # from a file
11. Transformations
Less dramatic than this, but beautiful nevertheless.
Classic Example 1 : Map
a) The beauty in this case comes from lambda expressions.
nums = [1, 2, 3, 4, 5, 6]
rdd_nums = sc.parallelize(nums)          # creating our RDD
new_rdd = rdd_nums.map(lambda x: x**2)   # you've got squares
print new_rdd.collect()                  # finally, some action
12. Did you say 80 Operations?
http://nbviewer.ipython.org/github/jkthompson/pyspark-pictures/blob/master/pyspark-pictures.ipynb
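A few of those operations can be previewed with plain-Python stand-ins — no Spark assumed; PySpark's `filter`, `flatMap`, and `reduce` have the same names and shapes but run distributed across the cluster:

```python
from functools import reduce
from itertools import chain

nums = [1, 2, 3, 4, 5, 6]

# filter: keep only the elements where the lambda returns True
evens = list(filter(lambda x: x % 2 == 0, nums))            # [2, 4, 6]

# flatMap: map each element to a sequence, then flatten the results
lines = ["to be", "or not"]
words = list(chain.from_iterable(l.split() for l in lines))  # ['to', 'be', 'or', 'not']

# reduce: combine elements pairwise down to one value (an action in Spark)
total = reduce(lambda a, b: a + b, nums)                     # 21

print(evens, words, total)
```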