Apache Spark is an open-source Big Data analytics framework. It introduces the concept of RDDs (Resilient Distributed Datasets), which allow parallel operations on large datasets. This document discusses starting Spark, the parts of a Spark application, transformations and actions on RDDs, RDD creation in Scala and Python, and examples including word count. It also covers flatMap vs. map, custom methods, and assignments involving transformations on lists.
2. Starting Spark
Change to the Spark installation directory:
cd $SPARK_HOME
Start spark-shell by typing the command below:
./bin/spark-shell
Start pyspark by typing the command below:
./bin/pyspark
Start SparkR by typing the command below:
./bin/sparkR
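Once a shell starts, a SparkContext is already available as the variable sc (used throughout the examples below); a quick sanity check is to print the running Spark version:

sc.version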
3. Spark Application Details
Driver program: The program that runs the user's main function and executes various parallel operations on a cluster.
SparkConf: An object that contains configuration information about your application.
SparkContext: An object used to access the cluster.
Resilient Distributed Dataset (RDD): A collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
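As a minimal sketch of how these pieces fit together in a standalone application (in the interactive shells above, sc is created for you; the app name and master URL here are illustrative assumptions):

import org.apache.spark.{SparkConf, SparkContext}

object SparkNotesApp {
  def main(args: Array[String]): Unit = {
    // SparkConf: configuration for the application (example values)
    val conf = new SparkConf().setAppName("SparkNotesApp").setMaster("local[*]")
    // SparkContext: the entry point used to access the cluster
    val sc = new SparkContext(conf)
    // RDD: a partitioned collection that can be operated on in parallel
    val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
    println(rdd.count())
    sc.stop()
  }
}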
5. Create a file spark_notes.txt with the contents below
Apache Spark is an open source Big Data analytical framework.
RDD is the main abstraction in Apache Spark.
Apache Spark can also be called a unified engine.
Scala is a functional programming language.
Apache Spark is developed using the Scala programming language.
Let's start learning Apache Spark and become Data Scientists in the Big Data space.
6. RDD Creation (Scala)
1)
val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
val multiply = rdd.map(x => x * x) // multiply each element by itself
multiply.collect() // returns Array(1, 4, 9, 16, 25)
2)
val textRdd = sc.textFile("/home/ubuntu/work/spark_notes.txt")
textRdd.first() // returns the first line of the file
8. Examples
val lines = sc.textFile("/home/ubuntu/work/spark_notes.txt")
lines.count() // number of lines in this RDD (6 for the file above)
val sparkLines = lines.filter(line => line.contains("Spark"))
sparkLines.count() // lines containing "Spark" (5)
val scalaLines = lines.filter(line => line.contains("Scala"))
scalaLines.count() // lines containing "Scala" (2)
9. Word Count Example
val lines = sc.textFile("/home/ubuntu/work/spark_notes.txt")
val flatMapWords = lines.flatMap(line => line.split(" ")) // one element per word
flatMapWords.collect()
val wordWithOneNumber = flatMapWords.map(word => (word, 1)) // pair each word with the count 1
val count = wordWithOneNumber.reduceByKey((x, y) => x + y) // sum the counts per word
count.collect()
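To see the most frequent words first, the resulting pair RDD can be sorted by its counts; this line is an extension of the example, not part of the original listing:

count.sortBy(pair => pair._2, ascending = false).take(5)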
10. flatMap() and map()
val lines = sc.parallelize(List("hello world", "hello spark"))
val wordsFlatMap = lines.flatMap(line => line.split(" "))
wordsFlatMap.collect() // Array(hello, world, hello, spark): the results are flattened
val wordsMap = lines.map(line => line.split(" "))
wordsMap.collect() // Array(Array(hello, world), Array(hello, spark)): one array per line
11. Custom Method
def sp(n: String): Array[String] = n.split(" ")
val rdd = sc.parallelize(List("Apache spark", "spark core", "spark ml"))
val words = rdd.flatMap(sp) // flattened into individual words
words.collect()
val wordArrays = rdd.map(sp) // one Array[String] per input string
wordArrays.collect()
13. Assignments
Let's take the list 1, 2, 3, 4, 5, 1, 2, 3, 1.
Write code for the problems below (a possible solution sketch follows the list):
1) Add each element to itself in the above list
2) Add one to each element in the list
3) Filter the 1s out of the above list
4) Find the top 10 words in a file
5) Take only the words that are more than 4 characters long from a file
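One possible set of solutions, offered as a sketch rather than the official answers; the file path reuses spark_notes.txt from earlier, and problem 4 assumes "top" means most frequent:

val nums = sc.parallelize(List(1, 2, 3, 4, 5, 1, 2, 3, 1))
// 1) Add each element to itself
nums.map(x => x + x).collect()
// 2) Add one to each element
nums.map(x => x + 1).collect()
// 3) Filter the 1s out of the list
nums.filter(x => x != 1).collect()
// 4) Top 10 words from the file, by frequency
val words = sc.textFile("/home/ubuntu/work/spark_notes.txt").flatMap(line => line.split(" "))
words.map(word => (word, 1)).reduceByKey((x, y) => x + y).sortBy(pair => pair._2, ascending = false).take(10)
// 5) Only words with more than 4 characters
words.filter(word => word.length > 4).collect()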