Apache Spark is an open-source Big Data analytics framework. It introduces the concept of RDDs (Resilient Distributed Datasets), which allow parallel operations on large datasets. This document discusses starting Spark, the parts of a Spark application, transformations and actions on RDDs, RDD creation in Scala and Python, and examples including word count. It also covers flatMap vs. map, custom methods, and assignments involving transformations on lists.
2. Starting Spark
Change to the Spark installation directory:
cd $SPARK_HOME
Start spark-shell by typing the command below:
./bin/spark-shell
Start pyspark by typing the command below:
./bin/pyspark
Start SparkR by typing the command below:
./bin/sparkR
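Once a shell starts, a SparkContext is already available as the variable sc (used throughout the examples below); a quick sanity check is to print the running Spark version:

sc.version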
3. Spark Application Details
Driver program: The program that runs the user's main function and executes various parallel operations on a cluster.
SparkConf: An object that contains configuration information about your application.
SparkContext: An object used to access the cluster.
Resilient Distributed Dataset (RDD): A collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
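As a minimal sketch of how these pieces fit together in a standalone application (in the interactive shells above, sc is created for you; the app name and master URL here are illustrative assumptions):

import org.apache.spark.{SparkConf, SparkContext}

object SparkNotesApp {
  def main(args: Array[String]): Unit = {
    // SparkConf: configuration for the application (example values)
    val conf = new SparkConf().setAppName("SparkNotesApp").setMaster("local[*]")
    // SparkContext: the entry point used to access the cluster
    val sc = new SparkContext(conf)
    // RDD: a partitioned collection that can be operated on in parallel
    val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
    println(rdd.count())
    sc.stop()
  }
}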
5. Create a file spark_notes.txt with the contents below
Apache Spark is an open source Big Data analytical framework.
RDD is the main abstraction in Apache Spark.
Apache Spark can also be called a unified engine.
Scala is a functional programming language.
Apache Spark is developed using the Scala programming language.
Let's start learning Apache Spark and become Data Scientists in the Big Data space.
6. RDD Creation (Scala)
1)
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
val multiply = rdd.map(x => x * x) // multiply each element by itself
multiply.collect() // returns Array(1, 4, 9, 16, 25)
2)
val textRdd = sc.textFile("/home/ubuntu/work/spark_notes.txt")
textRdd.first() // returns the first line of the file
8. Examples
val lines = sc.textFile("/home/ubuntu/work/spark_notes.txt")
lines.count() // number of lines in this RDD (6 for the file above)
val sparkLines = lines.filter(line => line.contains("Spark"))
sparkLines.count() // lines containing "Spark" (5)
val scalaLines = lines.filter(line => line.contains("Scala"))
scalaLines.count() // lines containing "Scala" (2)
9. Word Count Example
val lines = sc.textFile("/home/ubuntu/work/spark_notes.txt")
val flatMapWords = lines.flatMap(line => line.split(" ")) // one element per word
flatMapWords.collect()
val wordWithOneNumber = flatMapWords.map(word => (word, 1)) // pair each word with the count 1
val count = wordWithOneNumber.reduceByKey((x, y) => x + y) // sum the counts per word
count.collect()
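To see the most frequent words first, the resulting pair RDD can be sorted by its counts; this line is an extension of the example, not part of the original listing:

count.sortBy(pair => pair._2, ascending = false).take(5)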
10. flatMap() and map()
val lines = sc.parallelize(List("hello world", "hello spark"))
val wordsFlatMap = lines.flatMap(line => line.split(" "))
wordsFlatMap.collect() // Array(hello, world, hello, spark): the results are flattened
val wordsMap = lines.map(line => line.split(" "))
wordsMap.collect() // Array(Array(hello, world), Array(hello, spark)): one array per line
11. Custom Method
def sp(n: String): Array[String] = n.split(" ")
val rdd = sc.parallelize(List("Apache spark", "spark core", "spark ml"))
val words = rdd.flatMap(sp) // flattened into individual words
words.collect()
val wordArrays = rdd.map(sp) // one Array[String] per input string
wordArrays.collect()
13. Assignments
Let's take the list 1, 2, 3, 4, 5, 1, 2, 3, 1.
Write code for the problems below (a possible solution sketch follows the list):
1) Add each element to itself in the above list
2) Add one to each element in the list
3) Filter the 1s out of the above list
4) Find the top 10 words in a file
5) Take only the words that are more than 4 characters long from a file
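One possible set of solutions, offered as a sketch rather than the official answers; the file path reuses spark_notes.txt from earlier, and problem 4 assumes "top" means most frequent:

val nums = sc.parallelize(List(1, 2, 3, 4, 5, 1, 2, 3, 1))
// 1) Add each element to itself
nums.map(x => x + x).collect()
// 2) Add one to each element
nums.map(x => x + 1).collect()
// 3) Filter the 1s out of the list
nums.filter(x => x != 1).collect()
// 4) Top 10 words from the file, by frequency
val words = sc.textFile("/home/ubuntu/work/spark_notes.txt").flatMap(line => line.split(" "))
words.map(word => (word, 1)).reduceByKey((x, y) => x + y).sortBy(pair => pair._2, ascending = false).take(10)
// 5) Only words with more than 4 characters
words.filter(word => word.length > 4).collect()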