2. Achilles' Heel for Hadoop
Hadoop is *apparently* not fast enough for things like ML.
Each MapReduce job has to read its data from disk all over again.
{ MR1 => HDFS => MR2 => HDFS => MR3 }
MapReduce, let's admit, is a bit too complicated.
Then there is the problem of a giant codebase:
{Hadoop : 1.7 million LOC}
{Spark : 0.35 million LOC}
4. A brief History of Spark Timeline
UC Berkeley : The home of innovation.
2009 : Started as a simple class project.
The UCB folks wanted to build a cluster management system: Mesos.
They needed something to test on top of Mesos. Voila, Spark!
2010 : Open-sourced under the BSD license
Feb 2014 : Became Apache Top Level project
Nov 2014 : New world record in large-scale sorting
https://soundcloud.com/oreilly-radar/apache-sparks-journey-from-academia-to-industry
7. Spark Concepts
In-Memory Processing
{Processors : 64-bit ~~ up to 1 TB RAM}
{Fact : RAM will always be faster than disk}
{Idea : Compress the data, then do the processing}
{Remember : The data is distributed across various machines too}
Resilient Distributed Datasets
http://www.gridgain.com/in-memory-computing-in-plain-english/
Resilient : This is Sparta; we don't give up on data without a fight.
Distributed : A part of the data is everywhere.
Dataset : Meh!
8. A bit more on RDD
The basic unit of data in Spark.
RDDs are immutable.
// int a = 0;        (mutable)
// final int b = 0;  (immutable, like an RDD)
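The same immutability shows up with any immutable Python value — a plain-Python analogy, no Spark assumed: operations return a new object instead of changing the old one, which is exactly how transformations on an RDD behave.

```python
s = "chennai"
t = s.upper()   # returns a NEW string; s itself is untouched

print(s)        # chennai
print(t)        # CHENNAI
```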
There are two main categories of operations on an RDD:
a) Transformations => lazily evaluated
   => create a new RDD from the existing RDD
b) Actions => return values
   => write results to disk
E.g. : My mom asks me to buy grocery items. She can keep adding items to the list (transformations); nothing actually happens until I go to the shop (action).
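The lazy-vs-eager split above can be sketched without a cluster. This is a plain-Python analogy, not Spark: Python 3's built-in `map` is lazy like a transformation, and `list()` forces evaluation like an action. The `calls` list is just bookkeeping to make the laziness visible.

```python
calls = []

def square(x):
    calls.append(x)   # record every time real work happens
    return x ** 2

lazy = map(square, [1, 2, 3])   # "transformation": nothing computed yet
print(calls)                    # [] -- no work done so far

result = list(lazy)             # "action": forces the evaluation
print(result)                   # [1, 4, 9]
print(calls)                    # [1, 2, 3] -- the work happened only now
```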
9. Setting Up
Download "Prebuilt for Hadoop 2.4 and later",
or build from source with Maven or sbt.
./bin/pyspark
http://spark.apache.org/downloads.html
10. Talk is Cheap! Show me the code
The PySpark shell is a REPL.
Creating an RDD:
a) From data in memory
b) From a file
c) From another RDD
rdd = sc.parallelize("ChennaiPy")                  # from a string
nums = [1, 2, 3]
rdd_nums = sc.parallelize(nums)                    # from a list
rdd_shakespeare = sc.textFile("shakespeare.txt")   # from a file
11. Transformations
Less dramatic than this, but beautiful nevertheless.
Classic Example 1 : Map
a) The beauty in this case comes from lambda expressions.
nums = [1, 2, 3, 4, 5, 6]
rdd_nums = sc.parallelize(nums)          # creating our RDD
new_rdd = rdd_nums.map(lambda x: x**2)   # you've got squares
print new_rdd.collect()                  # finally, some action
12. Did you say 80 Operations?
http://nbviewer.ipython.org/github/jkthompson/pyspark-pictures/blob/master/pyspark-pictures.ipynb
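A few of those operations can be previewed with plain-Python stand-ins — no Spark assumed; PySpark's `filter`, `flatMap`, and `reduce` have the same names and shapes but run distributed across the cluster:

```python
from functools import reduce
from itertools import chain

nums = [1, 2, 3, 4, 5, 6]

# filter: keep only the elements where the lambda returns True
evens = list(filter(lambda x: x % 2 == 0, nums))            # [2, 4, 6]

# flatMap: map each element to a sequence, then flatten the results
lines = ["to be", "or not"]
words = list(chain.from_iterable(l.split() for l in lines))  # ['to', 'be', 'or', 'not']

# reduce: combine elements pairwise down to one value (an action in Spark)
total = reduce(lambda a, b: a + b, nums)                     # 21

print(evens, words, total)
```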