© 2014 MapR Technologies 1© 2014 MapR Technologies
Introduction to Spark
Introduction
Introduction
• Marco Vasquez | Data Scientist | MapR Technologies
– Industry experience includes areas in research {bioinformatics, machine
learning, computer vision}, software engineering, and security
– I work in the professional services team. We work with MapR customers
to solve their big data business problems
• What is MapR Professional Services?
– Team of data scientists and engineers who work on solving complex
business problems. Some of the services offered:
– Use Case Discovery (data analysis, modeling, get insights from data)
– Solution Design (develop a solution around those insights)
– Strategy Recommendations (big data corporate initiatives)
Brief description
About this talk
1. Briefly review Data Science and how Spark helps me to do my
job
2. Introduction to Spark internals
3. Provide example use case: Machine Learning using public data
set (RITA)
• Questions from the audience are welcome. Several MapR team members
are present and can expand on the MapR platform in general, so let's
make this interactive
We will cover three topics
Spark and Data Science
Introduction to Data Science
• Many definitions, but I like “Insights from data that result in an
action that generates value”. It is not enough to just get insights.
• At the core of Data Science are
– Data pre-processing, building predictive models, and working with the
business to identify use cases
• Tools commonly used are R, Matlab, or C/C++
• What about Spark?
What is Data Science?
Spark can be useful in Data Science
• Spark allows for quick analysis and model development
• It provides access to the full data set, avoiding the need to
subsample as is the case with R
• Spark supports streaming, which can be used for building real-time
models on full data sets
• On the MapR platform, Spark can integrate with Hadoop to build a better
model that combines historical data and real-time data
• It can be used as the platform to build a real solution, unlike R or
Matlab, where another solution has to be used in production
Spark
Spark
• Spark is a distributed, in-memory computation framework
• It aims to provide a single platform that can be used for real-time
applications (streaming), complex analysis (machine learning),
interactive queries (shell), and batch processing (Hadoop integration)
• It has specialized modules that run on top of Spark
– SQL (Shark), Streaming, Graph Processing (GraphX), Machine
Learning (MLlib)
• Spark introduces an abstract common data format, the RDD, that is used
for efficient data sharing across parallel computations
• Spark supports the Map/Reduce programming model. Note: this is not the
same as Hadoop MR
What is Spark?
Spark Platform
Spark components and Hadoop integration
[Diagram: Shark (SQL), Spark Streaming, GraphX, and MLlib run on top of the Spark execution engine (RDD). Resource managers: Hadoop YARN, Mesos. Data: HDFS. Also shown: Mahout, Pig, Hive.]
Spark General Flow
[Diagram: Files → RDD → (transformations) → RDD’ → (action) → Value]
Spark
• Supports several ways to interact with Spark
– Spark Interactive Shell {Scala, Python}
– Programming in Java, Scala, and Python
• Works by applying transformations and actions to collections of
records called RDDs
• In-memory and fast
What are Spark's features?
Clean API
• Resilient Distributed Datasets
– Collections of objects spread across a cluster, stored in RAM or on disk
– Built through parallel transformations
– Automatically rebuilt on failure
• Operations
– Transformations (e.g. map, filter, groupBy)
– Actions (e.g. count, collect, save)
Write programs in terms of transformations on distributed datasets
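To make the split between transformations and actions concrete, here is a minimal single-process sketch. `MiniRDD` is a hypothetical toy class, not part of Spark: it only records transformations as lineage and runs them when an action is called, which is also the idea behind rebuilding lost data by replaying the lineage.

```python
class MiniRDD:
    """Toy, single-process stand-in for an RDD (hypothetical; not the Spark API)."""

    def __init__(self, source, lineage=()):
        self._source = list(source)     # input records, never modified
        self._lineage = tuple(lineage)  # recorded transformations, not yet run

    # Transformations: return a new MiniRDD and only record the step.
    def map(self, f):
        return MiniRDD(self._source, self._lineage + (("map", f),))

    def filter(self, pred):
        return MiniRDD(self._source, self._lineage + (("filter", pred),))

    # Replaying the lineage from the source is how lost results can be rebuilt.
    def _compute(self):
        records = self._source
        for kind, fn in self._lineage:
            if kind == "map":
                records = [fn(r) for r in records]
            else:
                records = [r for r in records if fn(r)]
        return records

    # Actions: trigger the computation and return a value to the caller.
    def collect(self):
        return self._compute()

    def count(self):
        return len(self._compute())


nums = MiniRDD([1, 2, 3, 4])
evens_squared = nums.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(evens_squared.collect())  # [4, 16]
print(nums.collect())           # [1, 2, 3, 4]; the parent is unchanged
```

Nothing is computed until `collect()` or `count()` is called, and each transformation leaves its parent untouched, which mirrors the read-only nature of RDDs.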
Expressive API
• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
• reduce
• count
• fold
• reduceByKey
• groupByKey
• cogroup
• cross
• zip
• sample
• take
• first
• partitionBy
• mapWith
• pipe
• save ...
User-Driven Roadmap
• Language support
– Improved Python support
– SparkR
– Java 8
– Integrated Schema and SQL
support in Spark’s APIs
• Better ML
– Sparse Data Support
– Model Evaluation Framework
– Performance Testing
Basic Transformations
> nums = sc.parallelize([1, 2, 3])
# Pass each element through a function
> squares = nums.map(lambda x: x*x) # => {1, 4, 9}
# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0) # => {4}
# Map each element to zero or more others
> nums.flatMap(lambda x: range(x))
> # => {0, 0, 1, 0, 1, 2}
# range(x) is the sequence of numbers 0, 1, …, x-1
Basic Actions
> nums = sc.parallelize([1, 2, 3])
# Retrieve RDD contents as a local collection
> nums.collect() # => [1, 2, 3]
# Return first K elements
> nums.take(2) # => [1, 2]
# Count number of elements
> nums.count() # => 3
# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y) # => 6
# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")
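Why does `reduce` ask for an associative function? The following pure-Python sketch (no Spark needed; `partitioned_reduce` is a hypothetical helper) reduces each partition on its own and then merges the partial results, roughly as a distributed reduce would. With addition the answer does not depend on how the data is split; with subtraction it does.

```python
from functools import reduce


def partitioned_reduce(partitions, op):
    """Reduce each partition locally, then reduce the partial results."""
    partials = [reduce(op, part) for part in partitions if part]
    return reduce(op, partials)


split_a = [[1, 2], [3, 4]]   # the same four numbers ...
split_b = [[1], [2, 3, 4]]   # ... partitioned differently

add = lambda x, y: x + y
sub = lambda x, y: x - y     # not associative

sum_a = partitioned_reduce(split_a, add)
sum_b = partitioned_reduce(split_b, add)
diff_a = partitioned_reduce(split_a, sub)
diff_b = partitioned_reduce(split_b, sub)

print(sum_a, sum_b)    # 10 10
print(diff_a, diff_b)  # 0 6
```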
Working with Key-Value Pairs
Spark’s “distributed reduce” transformations operate on
RDDs of key-value pairs
Python: pair = (a, b)
pair[0] # => a
pair[1] # => b
Scala: val pair = (a, b)
pair._1 // => a
pair._2 // => b
Java: Tuple2 pair = new Tuple2(a, b);
pair._1 // => a
pair._2 // => b
Some Key-Value Operations
> pets = sc.parallelize(
[("cat", 1), ("dog", 1), ("cat", 2)])
> pets.reduceByKey(lambda x, y: x + y)
# => {(cat, 3), (dog, 1)}
> pets.groupByKey() # => {(cat, [1, 2]), (dog, [1])}
> pets.sortByKey() # => {(cat, 1), (cat, 2), (dog, 1)}
reduceByKey also automatically implements
combiners on the map side
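A map-side combiner merges values per key inside each partition before anything is sent across the network. The sketch below imitates that in plain Python. The function names and the second partition are illustrative assumptions, and the counter only stands in for shuffled records.

```python
def combine_locally(partition, func):
    """Map-side combine: merge values per key within one partition."""
    combined = {}
    for key, value in partition:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined


def reduce_by_key(partitions, func):
    """Combine inside each partition first, then merge the partial results."""
    shuffled = 0
    merged = {}
    for partition in partitions:
        for key, value in combine_locally(partition, func).items():
            shuffled += 1  # one record per key per partition is "sent"
            merged[key] = func(merged[key], value) if key in merged else value
    return merged, shuffled


partitions = [
    [("cat", 1), ("dog", 1), ("cat", 2)],
    [("cat", 5), ("cat", 1)],  # hypothetical second partition
]
totals, shuffled = reduce_by_key(partitions, lambda x, y: x + y)
print(sorted(totals.items()))  # [('cat', 9), ('dog', 1)]
print(shuffled)                # 3 records moved instead of the original 5
```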
Example: Word Count
> lines = sc.textFile("hamlet.txt")
> counts = lines.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda x, y: x + y)
Lines: "to be or", "not to be"
After flatMap: "to", "be", "or", "not", "to", "be"
After map: (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1)
After reduceByKey: (be, 2), (not, 1), (or, 1), (to, 2)
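The same data flow can be traced locally without Spark. This is a plain-Python sketch of the three stages, using the two sample lines from the slide; it illustrates the logic only and is not distributed.

```python
lines = ["to be or", "not to be"]

# flatMap: each line yields zero or more words
words = [word for line in lines for word in line.split(" ")]

# map: pair every word with a count of 1
pairs = [(word, 1) for word in words]

# reduceByKey: sum the counts that share a key
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(words)                   # ['to', 'be', 'or', 'not', 'to', 'be']
print(sorted(counts.items()))  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```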
Other Key-Value Operations
> visits = sc.parallelize([ ("index.html", "1.2.3.4"),
("about.html", "3.4.5.6"),
("index.html", "1.3.3.1") ])
> pageNames = sc.parallelize([ ("index.html", "Home"),
("about.html", "About") ])
> visits.join(pageNames)
# ("index.html", ("1.2.3.4", "Home"))
# ("index.html", ("1.3.3.1", "Home"))
# ("about.html", ("3.4.5.6", "About"))
> visits.cogroup(pageNames)
# ("index.html", (["1.2.3.4", "1.3.3.1"], ["Home"]))
# ("about.html", (["3.4.5.6"], ["About"]))
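To see how `join` and `cogroup` differ, here is a small pure-Python sketch of their semantics on lists of pairs. The helper functions are hypothetical stand-ins, not the Spark implementations, and they ignore partitioning entirely.

```python
def join(left, right):
    """Inner join: one output pair for every matching (left, right) combination."""
    return [(k, (lv, rv)) for k, lv in left for rk, rv in right if k == rk]


def cogroup(left, right):
    """Group the values from both sides under each key."""
    keys = []
    for k, _ in left + right:
        if k not in keys:
            keys.append(k)
    return [
        (k, ([v for lk, v in left if lk == k], [v for rk, v in right if rk == k]))
        for k in keys
    ]


visits = [("index.html", "1.2.3.4"), ("about.html", "3.4.5.6"), ("index.html", "1.3.3.1")]
page_names = [("index.html", "Home"), ("about.html", "About")]

joined = join(visits, page_names)
grouped = cogroup(visits, page_names)
print(joined)
print(grouped)
```

`join` produces one record per matching pair, so "index.html" appears twice, while `cogroup` produces one record per key with the values from each side collected into lists.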
Spark Internals
Spark application
• Driver program
• Java program that creates a SparkContext
• Executors
• Worker processes that execute tasks and store data
Types of Applications
• Long lived/shared application
• Shark
• Spark Streaming
• Job Server (Ooyala)
• Short lived applications
• Standalone apps
• Shell sessions
May do multi-user scheduling within the allocation from the cluster manager
SparkContext
• Main entry point to Spark functionality
• Available in shell as variable sc
• In standalone programs, you’d make your own (see later for
details)
Resilient Distributed Datasets
• RDD is a read only, partitioned collection of records
• Since RDD is read only, mutable states are represented by many
RDDs
• Users can control persistence and partitioning
• Transformations or actions are applied to RDDs
What about RDD?
Creating RDDs
# Turn a Python collection into an RDD
> sc.parallelize([1, 2, 3])
# Load text file from local FS, HDFS, or S3
> sc.textFile("file.txt")
> sc.textFile("directory/*.txt")
> sc.textFile("hdfs://namenode:9000/path/file")
# Use existing Hadoop InputFormat (Java/Scala only)
> sc.hadoopFile(keyClass, valClass, inputFmt, conf)
Cluster manager
• Cluster manager grants executors to a Spark application
Driver program
• Driver program decides when to launch tasks on which executor
Needs full network
connectivity to workers
Spark Development
Spark Programming
• Use IntelliJ and install Scala plugin to build jar files
• Use SBT as the build tool. It is possible to integrate Scala with
Gradle, but difficult
• Write Scala code
• Package the code with ‘sbt package’ to generate the jar file
• Run code using ‘spark-submit’
• Use spark-shell for quick prototyping
Getting started with Spark and Scala
Spark Development Environment
• Install Scala plugin for IntelliJ
• Create a Scala project and use ‘with SBT build’ option
• Add following lines to build.sbt file to pull in Spark dependencies
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"
libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.0.0"
libraryDependencies += "org.scalatest" % "scalatest_2.10" % "2.1.0" % "test"
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
Using IntelliJ and Scala
Deploy and Run
• Run ‘sbt package’ to generate the jar file
• Submit to spark engine using the following:
– spark-submit --class com.ps.ml.RitaML --master local[4] rita_2.10-1.0.jar
Using sbt and spark-submit
Linear Regression using Spark
Linear Regression using Spark
• Fit a linear regression with the following predictors:
– actual elapsed time, air time, departure delay, distance, taxi in, taxi out
• Steps:
– Import data
– Data pre-processing
– Build model
Goal: Build a model that predicts flight arrival delays
RITA Data
Sample data
Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime UniqueCarrier FlightNum TailNum ActualElapsedTime CRSElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay
2008 1 3 4 2003 1955 2211 2225 WN 335 N712SW 128 150 116 -14 8 IAD TPA 810 4 8 0 0 NA NA NA NA NA
2008 1 3 4 754 735 1002 1000 WN 3231 N772SW 128 145 113 2 19 IAD TPA 810 5 10 0 0 NA NA NA NA NA
2008 1 3 4 628 620 804 750 WN 448 N428WN 96 90 76 14 8 IND BWI 515 3 17 0 0 NA NA NA NA NA
2008 1 3 4 926 930 1054 1100 WN 1746 N612SW 88 90 78 -6 -4 IND BWI 515 3 7 0 0 NA NA NA NA NA
2008 1 3 4 1829 1755 1959 1925 WN 3920 N464WN 90 90 77 34 34 IND BWI 515 3 10 0 0 2 0 0 0 32
2008 1 3 4 1940 1915 2121 2110 WN 378 N726SW 101 115 87 11 25 IND JAX 688 4 10 0 0 NA NA NA NA NA
2008 1 3 4 1937 1830 2037 1940 WN 509 N763SW 240 250 230 57 67 IND LAS 1591 3 7 0 0 10 0 0 0 47
2008 1 3 4 1039 1040 1132 1150 WN 535 N428WN 233 250 219 -18 -1 IND LAS 1591 7 7 0 0 NA NA NA NA NA
2008 1 3 4 617 615 652 650 WN 11 N689SW 95 95 70 2 2 IND MCI 451 6 19 0 0 NA NA NA NA NA
2008 1 3 4 1620 1620 1639 1655 WN 810 N648SW 79 95 70 -16 0 IND MCI 451 3 6 0 0 NA NA NA NA NA
2008 1 3 4 706 700 916 915 WN 100 N690SW 130 135 106 1 6 IND MCO 828 5 19 0 0 NA NA NA NA NA
2008 1 3 4 1644 1510 1845 1725 WN 1333 N334SW 121 135 107 80 94 IND MCO 828 6 8 0 0 8 0 0 0 72
2008 1 3 4 1426 1430 1426 1425 WN 829 N476WN 60 55 39 1 -4 IND MDW 162 9 12 0 0 NA NA NA NA NA
2008 1 3 4 715 715 720 710 WN 1016 N765SW 65 55 37 10 0 IND MDW 162 7 21 0 0 NA NA NA NA NA
2008 1 3 4 1702 1700 1651 1655 WN 1827 N420WN 49 55 35 -4 2 IND MDW 162 4 10 0 0 NA NA NA NA NA
2008 1 3 4 1029 1020 1021 1010 WN 2272 N263WN 52 50 37 11 9 IND MDW 162 6 9 0 0 NA NA NA NA NA
2008 1 3 4 1452 1425 1640 1625 WN 675 N286WN 228 240 213 15 27 IND PHX 1489 7 8 0 0 3 0 0 0 12
2008 1 3 4 754 745 940 955 WN 1144 N778SW 226 250 205 -15 9 IND PHX 1489 5 16 0 0 NA NA NA NA NA
2008 1 3 4 1323 1255 1526 1510 WN 4 N674AA 123 135 110 16 28 IND TPA 838 4 9 0 0 0 0 0 0 16
2008 1 3 4 1416 1325 1512 1435 WN 54 N643SW 56 70 49 37 51 ISP BWI 220 2 5 0 0 12 0 0 0 25
2008 1 3 4 706 705 807 810 WN 68 N497WN 61 65 51 -3 1 ISP BWI 220 3 7 0 0 NA NA NA NA NA
2008 1 3 4 1657 1625 1754 1735 WN 623 N724SW 57 70 47 19 32 ISP BWI 220 5 5 0 0 7 0 0 0 12
2008 1 3 4 1900 1840 1956 1950 WN 717 N786SW 56 70 49 6 20 ISP BWI 220 2 5 0 0 NA NA NA NA NA
2008 1 3 4 1039 1030 1133 1140 WN 1244 N714CB 54 70 47 -7 9 ISP BWI 220 2 5 0 0 NA NA NA NA NA
2008 1 3 4 801 800 902 910 WN 2101 N222WN 61 70 53 -8 1 ISP BWI 220 3 5 0 0 NA NA NA NA NA
2008 1 3 4 1520 1455 1619 1605 WN 2553 N394SW 59 70 50 14 25 ISP BWI 220 2 7 0 0 NA NA NA NA NA
2008 1 3 4 1422 1255 1657 1610 WN 188 N215WN 155 195 143 47 87 ISP FLL 1093 6 6 0 0 40 0 0 0 7
2008 1 3 4 1954 1925 2239 2235 WN 1754 N243WN 165 190 155 4 29 ISP FLL 1093 3 7 0 0 NA NA NA NA NA
2008 1 3 4 636 635 921 945 WN 2275 N454WN 165 190 147 -24 1 ISP FLL 1093 5 13 0 0 NA NA NA NA NA
2008 1 3 4 734 730 958 1020 WN 550 N712SW 324 350 314 -22 4 ISP LAS 2283 2 8 0 0 NA NA NA NA NA
RITA ML - Initialize
// Import Spark core and the machine learning library
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
// Regex helper class: enables patterns such as r"\d+" in match expressions
implicit class Regex(sc: StringContext) {
  def r = new util.matching.Regex(sc.parts.mkString, sc.parts.tail.map(_ => "x"): _*)
}
// Set up Spark context
val conf = new SparkConf().setAppName("RitaML")
val sc = new SparkContext(conf)
RITA ML - Data Processing
// Import file and convert to RDD
val rita08 = sc.textFile("maprfs:/user/ubuntu/input/2008.csv")
// Remove header from RDD: keep only rows whose first field is numeric
val rita08_nh = rita08.filter(x => x.split(',')(0) match {
  case r"\d+" => true
  case _ => false
})
// Assign name to field index
val actual_elapsed_time = 11
val airtime = 13
val arrdelay = 14
val depdelay = 15
val distance = 18
val taxiin = 19
val taxiout = 20
RITA ML - Data Processing
def isna(s: String): Boolean = s match {
  case "NA" => true
  case _ => false
}
// Get fields of interest and filter NAs
val rita08_nh_ftd = rita08_nh.map(x => x.split(',')).map(x =>
  (x(arrdelay), (x(actual_elapsed_time),
    x(airtime), x(depdelay), x(distance), x(taxiin),
    x(taxiout)))).filter(x => !isna(x._1) && !isna(x._2._1) && !isna(x._2._2)
    && !isna(x._2._3) && !isna(x._2._4) && !isna(x._2._5) && !isna(x._2._6))
// Convert strings to LabeledPoint: (response variable, Vector(predictors))
val rita08_training_data = rita08_nh_ftd.map(x =>
  LabeledPoint(x._1.toDouble, Vectors.dense(Array(x._2._1.toDouble,
    x._2._2.toDouble,
    x._2._3.toDouble, x._2._4.toDouble, x._2._5.toDouble,
    x._2._6.toDouble))))
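The pre-processing steps (drop the header, keep the fields of interest, skip rows containing NA, convert to numbers) can be checked on a few rows locally. The sketch below is plain Python and uses the same field indices as the slide. The comma-separated layout is assumed from the `split(',')` call, the first data row is the first sample row shown earlier, and the last row is a made-up cancelled flight used only to exercise the NA filter.

```python
ACTUAL_ELAPSED_TIME, AIRTIME, ARRDELAY, DEPDELAY = 11, 13, 14, 15
DISTANCE, TAXIIN, TAXIOUT = 18, 19, 20
PREDICTORS = [ACTUAL_ELAPSED_TIME, AIRTIME, DEPDELAY, DISTANCE, TAXIIN, TAXIOUT]


def to_labeled_point(line):
    """Return (label, features), or None for the header and rows with NA."""
    fields = line.split(",")
    if not fields[0].isdigit():  # header row: first field is "Year"
        return None
    wanted = [fields[ARRDELAY]] + [fields[i] for i in PREDICTORS]
    if "NA" in wanted:
        return None
    label, *features = (float(v) for v in wanted)
    return label, features


raw = [
    "Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,"
    "UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,"
    "ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,"
    "CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,"
    "SecurityDelay,LateAircraftDelay",
    "2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,"
    "IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA",
    # Hypothetical cancelled flight, not from the data set
    "2008,1,3,4,NA,1000,NA,1100,WN,1,N000XX,NA,60,NA,NA,NA,"
    "IND,MDW,162,NA,NA,1,A,0,NA,NA,NA,NA,NA",
]

points = [p for p in map(to_labeled_point, raw) if p is not None]
print(points)  # [(-14.0, [128.0, 116.0, 8.0, 810.0, 4.0, 8.0])]
```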
RITA ML – Train Model
val numIterations = 20
// Train LR model
val mymodel = LinearRegressionWithSGD.train(
rita08_training_data, numIterations)
// Pair each actual label with the model's prediction
val valuesAndPreds = rita08_training_data.map { point =>
  val prediction = mymodel.predict(point.features)
  (point.label, prediction)
}
// Get the Mean Squared Error
val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
println("training Mean Squared Error = " + MSE)
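For intuition about what training and the MSE measure, here is a small pure-Python sketch. It is not MLlib: it uses full-batch gradient descent rather than stochastic gradient descent, has no intercept term, and runs on hypothetical toy data generated from label = 2*x1 + 1*x2. The function names and the step size are assumptions chosen for the illustration.

```python
def predict(weights, features):
    """Linear model without intercept: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


def mse(weights, data):
    """Mean squared error between labels and predictions."""
    return sum((label - predict(weights, feats)) ** 2 for label, feats in data) / len(data)


def train(data, num_iterations, step_size=0.1):
    """Full-batch gradient descent on the squared error."""
    weights = [0.0] * len(data[0][1])
    for _ in range(num_iterations):
        grad = [0.0] * len(weights)
        for label, feats in data:
            err = predict(weights, feats) - label
            for j, x in enumerate(feats):
                grad[j] += 2 * err * x / len(data)
        weights = [w - step_size * g for w, g in zip(weights, grad)]
    return weights


# Hypothetical toy data (label, features); not RITA values
data = [(2.0, [1.0, 0.0]), (1.0, [0.0, 1.0]), (3.0, [1.0, 1.0]), (5.0, [2.0, 1.0])]

mse_5 = mse(train(data, 5), data)
mse_50 = mse(train(data, 50), data)
print(mse_5 > mse_50)  # True: more iterations give a lower training error here
```

On this toy problem the training MSE shrinks as the number of iterations grows. On real, unscaled data the step size and feature scaling also matter.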
RITA ML – Analysis
- Things to try:
- Remove predictors
- Add new predictors
- Increase number of iterations to improve gradient descent
- Run again to determine whether the MSE decreases
- Iterate this process until you have an acceptable MSE
(That is the strength of Spark: this can be done quickly)
Q&A
@mapr maprtech
yourname@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies

Converging your data landscape
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
 

Último

From the origin to the future of Open Source model and business
From the origin to the future of  Open Source model and businessFrom the origin to the future of  Open Source model and business
From the origin to the future of Open Source model and businessFrancesco Corti
 
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENTSIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENTxtailishbaloch
 
Planetek Italia Srl - Corporate Profile Brochure
Planetek Italia Srl - Corporate Profile BrochurePlanetek Italia Srl - Corporate Profile Brochure
Planetek Italia Srl - Corporate Profile BrochurePlanetek Italia Srl
 
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Alkin Tezuysal
 
2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdf2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdfThe Good Food Institute
 
LF Energy Webinar - Unveiling OpenEEMeter 4.0
LF Energy Webinar - Unveiling OpenEEMeter 4.0LF Energy Webinar - Unveiling OpenEEMeter 4.0
LF Energy Webinar - Unveiling OpenEEMeter 4.0DanBrown980551
 
.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptx.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptxHansamali Gamage
 
Graphene Quantum Dots-Based Composites for Biomedical Applications
Graphene Quantum Dots-Based Composites for  Biomedical ApplicationsGraphene Quantum Dots-Based Composites for  Biomedical Applications
Graphene Quantum Dots-Based Composites for Biomedical Applicationsnooralam814309
 
Where developers are challenged, what developers want and where DevEx is going
Where developers are challenged, what developers want and where DevEx is goingWhere developers are challenged, what developers want and where DevEx is going
Where developers are challenged, what developers want and where DevEx is goingFrancesco Corti
 
3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud Data3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud DataEric D. Schabell
 
Extra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdfExtra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdfInfopole1
 
UiPath Studio Web workshop Series - Day 3
UiPath Studio Web workshop Series - Day 3UiPath Studio Web workshop Series - Day 3
UiPath Studio Web workshop Series - Day 3DianaGray10
 
Automation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projectsAutomation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projectsDianaGray10
 
Stobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through TokenizationStobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through TokenizationStobox
 
UiPath Studio Web workshop series - Day 1
UiPath Studio Web workshop series  - Day 1UiPath Studio Web workshop series  - Day 1
UiPath Studio Web workshop series - Day 1DianaGray10
 
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc
 
Flow Control | Block Size | ST Min | First Frame
Flow Control | Block Size | ST Min | First FrameFlow Control | Block Size | ST Min | First Frame
Flow Control | Block Size | ST Min | First FrameKapil Thakar
 
AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024Brian Pichman
 
20140402 - Smart house demo kit
20140402 - Smart house demo kit20140402 - Smart house demo kit
20140402 - Smart house demo kitJamie (Taka) Wang
 

Último (20)

From the origin to the future of Open Source model and business
From the origin to the future of  Open Source model and businessFrom the origin to the future of  Open Source model and business
From the origin to the future of Open Source model and business
 
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENTSIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
 
Planetek Italia Srl - Corporate Profile Brochure
Planetek Italia Srl - Corporate Profile BrochurePlanetek Italia Srl - Corporate Profile Brochure
Planetek Italia Srl - Corporate Profile Brochure
 
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
Design and Modeling for MySQL SCALE 21X Pasadena, CA Mar 2024
 
2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdf2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdf
 
LF Energy Webinar - Unveiling OpenEEMeter 4.0
LF Energy Webinar - Unveiling OpenEEMeter 4.0LF Energy Webinar - Unveiling OpenEEMeter 4.0
LF Energy Webinar - Unveiling OpenEEMeter 4.0
 
SheDev 2024
SheDev 2024SheDev 2024
SheDev 2024
 
.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptx.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptx
 
Graphene Quantum Dots-Based Composites for Biomedical Applications
Graphene Quantum Dots-Based Composites for  Biomedical ApplicationsGraphene Quantum Dots-Based Composites for  Biomedical Applications
Graphene Quantum Dots-Based Composites for Biomedical Applications
 
Where developers are challenged, what developers want and where DevEx is going
Where developers are challenged, what developers want and where DevEx is goingWhere developers are challenged, what developers want and where DevEx is going
Where developers are challenged, what developers want and where DevEx is going
 
3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud Data3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud Data
 
Extra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdfExtra-120324-Visite-Entreprise-icare.pdf
Extra-120324-Visite-Entreprise-icare.pdf
 
UiPath Studio Web workshop Series - Day 3
UiPath Studio Web workshop Series - Day 3UiPath Studio Web workshop Series - Day 3
UiPath Studio Web workshop Series - Day 3
 
Automation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projectsAutomation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projects
 
Stobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through TokenizationStobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
 
UiPath Studio Web workshop series - Day 1
UiPath Studio Web workshop series  - Day 1UiPath Studio Web workshop series  - Day 1
UiPath Studio Web workshop series - Day 1
 
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
 
Flow Control | Block Size | ST Min | First Frame
Flow Control | Block Size | ST Min | First FrameFlow Control | Block Size | ST Min | First Frame
Flow Control | Block Size | ST Min | First Frame
 
AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024
 
20140402 - Smart house demo kit
20140402 - Smart house demo kit20140402 - Smart house demo kit
20140402 - Smart house demo kit
 

Intro to Apache Spark by Marco Vasquez

  • 1. Introduction to Spark (© 2014 MapR Technologies)
  • 2. Introduction
  • 3. Introduction: brief description
      • Marco Vasquez | Data Scientist | MapR Technologies
          – Industry experience spans research (bioinformatics, machine learning, computer vision), software engineering, and security
          – I work on the professional services team, where we help MapR customers solve their big data business problems
      • What is MapR Professional Services?
          – A team of data scientists and engineers who work on solving complex business problems. Some of the services offered:
          – Use Case Discovery (data analysis, modeling, getting insights from data)
          – Solution Design (developing a solution around those insights)
          – Strategy Recommendations (big data corporate initiatives)
  • 4. About this talk: we will cover three topics
      1. A brief review of Data Science and how Spark helps me do my job
      2. An introduction to Spark internals
      3. An example use case: machine learning on a public data set (RITA)
      • Questions from the audience are welcome, so let's make this interactive. Several MapR team members are present and can expand on the MapR platform in general.
  • 5. Spark and Data Science
  • 6. Introduction to Data Science: what is Data Science?
      • There are many definitions, but I like “insights from data that result in an action that generates value”. Getting insights alone is not enough.
      • At the core of Data Science are:
          – Data pre-processing, building predictive models, and working with the business to identify use cases
      • Commonly used tools are R, Matlab, or C/C++
      • What about Spark?
  • 7. Spark can be useful in Data Science
      • Spark allows for quick analysis and model development
      • It provides access to the full data set, avoiding the need to subsample, as is the case with R
      • Spark supports streaming, which can be used to build real-time models on full data sets
      • On the MapR platform, Spark can integrate with Hadoop to build better models that combine historical and real-time data
      • It can be used as the platform for building the real solution, unlike R or Matlab, where a separate solution has to be used in production
  • 8. Spark
  • 9. Spark: what is Spark?
      • Spark is a distributed, in-memory computational framework
      • It aims to provide a single platform for real-time applications (streaming), complex analysis (machine learning), interactive queries (shell), and batch processing (Hadoop integration)
      • Specialized modules run on top of Spark: SQL (Shark), Streaming, Graph Processing (GraphX), and Machine Learning (MLlib)
      • Spark introduces the RDD, an abstract common data format used for efficient data sharing across parallel computations
      • Spark supports a map/reduce programming model. Note that this is not the same as Hadoop MapReduce
  • 10. Spark Platform: Spark components and Hadoop integration [Diagram labels: Shark SQL, Spark Streaming, GraphX, MLlib, Spark, Data, HDFS, Hadoop YARN, Resource Manager, Execution Engine, RDD, Mesos, Mahout, Pig, Hive]
  • 11. Spark General Flow [Diagram: Files → RDD → Transformations → RDD’ → Action → Value]
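The general flow (a source becomes an RDD, transformations produce new RDDs, and an action produces a value) depends on transformations being lazy: no work happens until an action runs. The following is a minimal plain-Python sketch of that idea, not Spark code. The class name `LazyDataset` and the helper `square` are hypothetical and exist only for this illustration.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical; not a Spark class)."""

    def __init__(self, source, steps=()):
        self._source = source  # zero-argument callable that returns an iterable
        self._steps = steps    # tuple of functions mapping iterator -> iterator

    def map(self, fn):
        # Transformation: describe the work, return a new dataset, compute nothing.
        step = lambda it: (fn(x) for x in it)
        return LazyDataset(self._source, self._steps + (step,))

    def filter(self, pred):
        # Transformation: same idea, keeps only elements passing the predicate.
        step = lambda it: (x for x in it if pred(x))
        return LazyDataset(self._source, self._steps + (step,))

    def _run(self):
        # Re-read the source and apply every recorded step in order.
        it = iter(self._source())
        for step in self._steps:
            it = step(it)
        return it

    def collect(self):
        # Action: only now is the chain of transformations evaluated.
        return list(self._run())

    def count(self):
        # Action: evaluates the chain and counts the resulting elements.
        return sum(1 for _ in self._run())


calls = []  # records every element the map function actually touches


def square(x):
    calls.append(x)
    return x * x


nums = LazyDataset(lambda: [1, 2, 3])
even_squares = nums.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = list(calls)  # still empty: nothing has run yet
result = even_squares.collect()    # the action triggers the computation
print(calls_before_action, result)
```

Because each dataset only stores its source and its list of steps, it can be recomputed from scratch at any time, which is the same property that lets Spark rebuild a lost RDD partition from its lineage.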
  • 12. Spark: what are Spark's features?
      • Supports several ways to interact with Spark:
          – Spark interactive shell (Scala, Python)
          – Programming in Java, Scala, and Python
      • Works by applying transformations and actions to collections of records called RDDs
      • In-memory and fast
  • 13. Clean API: write programs in terms of transformations on distributed datasets
      • Resilient Distributed Datasets
          – Collections of objects spread across a cluster, stored in RAM or on disk
          – Built through parallel transformations
          – Automatically rebuilt on failure
      • Operations
          – Transformations (e.g. map, filter, groupBy)
          – Actions (e.g. count, collect, save)
  • 14. Expressive API: map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
  • 15. User-Driven Roadmap
      • Language support
          – Improved Python support
          – SparkR
          – Java 8
          – Integrated schema and SQL support in Spark’s APIs
      • Better ML
          – Sparse data support
          – Model evaluation framework
          – Performance testing
  • 16. Basic Transformations
      > nums = sc.parallelize([1, 2, 3])
      # Pass each element through a function
      > squares = nums.map(lambda x: x*x)   # => {1, 4, 9}
      # Keep elements passing a predicate
      > even = squares.filter(lambda x: x % 2 == 0)   # => {4}
      # Map each element to zero or more others
      # (range(x) is the sequence of numbers 0, 1, …, x-1)
      > nums.flatMap(lambda x: range(x))   # => {0, 0, 1, 0, 1, 2}
  • 17. Basic Actions
      > nums = sc.parallelize([1, 2, 3])
      # Retrieve RDD contents as a local collection
      > nums.collect()   # => [1, 2, 3]
      # Return first K elements
      > nums.take(2)   # => [1, 2]
      # Count number of elements
      > nums.count()   # => 3
      # Merge elements with an associative function
      > nums.reduce(lambda x, y: x + y)   # => 6
      # Write elements to a text file
      > nums.saveAsTextFile("hdfs://file.txt")
  • 18. Working with Key-Value Pairs
      Spark’s “distributed reduce” transformations operate on RDDs of key-value pairs
      Python:
          pair = (a, b)
          pair[0]   # => a
          pair[1]   # => b
      Scala:
          val pair = (a, b)
          pair._1   // => a
          pair._2   // => b
      Java:
          Tuple2 pair = new Tuple2(a, b);
          pair._1   // => a
          pair._2   // => b
  • 19. Some Key-Value Operations
      > pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])
      > pets.reduceByKey(lambda x, y: x + y)   # => {(cat, 3), (dog, 1)}
      > pets.groupByKey()   # => {(cat, [1, 2]), (dog, [1])}
      > pets.sortByKey()   # => {(cat, 1), (cat, 2), (dog, 1)}
      reduceByKey also automatically implements combiners on the map side
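To make the remark about map-side combiners concrete, here is a plain-Python sketch of what a reduce-by-key with local combining computes. It is not Spark code: the function names `combine_partition` and `reduce_by_key` are hypothetical, and the split of the `pets` records into two partitions is an assumption made for illustration.

```python
def combine_partition(records, fn):
    """Map side: merge values per key within ONE partition before any shuffle."""
    combined = {}
    for key, value in records:
        combined[key] = fn(combined[key], value) if key in combined else value
    return combined


def reduce_by_key(partitions, fn):
    """Reduce side: merge the already-combined results from every partition."""
    merged = {}
    for part in partitions:
        for key, value in combine_partition(part, fn).items():
            merged[key] = fn(merged[key], value) if key in merged else value
    return merged


# Assumed layout: the three pets records spread over two partitions.
partitions = [[("cat", 1), ("dog", 1)], [("cat", 2)]]
totals = reduce_by_key(partitions, lambda x, y: x + y)
print(sorted(totals.items()))  # [('cat', 3), ('dog', 1)]
```

The point of the combine step is that each partition sends at most one value per key across the network, which is why the merge function should be associative.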
  • 20. Example: Word Count
      > lines = sc.textFile("hamlet.txt")
      > counts = lines.flatMap(lambda line: line.split(" ")) \
                      .map(lambda word: (word, 1)) \
                      .reduceByKey(lambda x, y: x + y)
      Data flow:
          lines:        "to be or", "not to be"
          flatMap:      "to", "be", "or", "not", "to", "be"
          map:          (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1)
          reduceByKey:  (be, 2), (not, 1), (or, 1), (to, 2)
  • 21. Other Key-Value Operations
      > visits = sc.parallelize([("index.html", "1.2.3.4"),
                                 ("about.html", "3.4.5.6"),
                                 ("index.html", "1.3.3.1")])
      > pageNames = sc.parallelize([("index.html", "Home"),
                                    ("about.html", "About")])
      > visits.join(pageNames)
      # ("index.html", ("1.2.3.4", "Home"))
      # ("index.html", ("1.3.3.1", "Home"))
      # ("about.html", ("3.4.5.6", "About"))
      > visits.cogroup(pageNames)
      # ("index.html", (["1.2.3.4", "1.3.3.1"], ["Home"]))
      # ("about.html", (["3.4.5.6"], ["About"]))
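The difference between join and cogroup is easier to see when both are written out on local lists. The sketch below is plain Python rather than Spark, and the helper names `group_values`, `join`, and `cogroup` are hypothetical. It reuses the visits and page-name data from the slide.

```python
from collections import defaultdict


def group_values(pairs):
    """Collect all values that share a key, preserving input order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def join(left, right):
    """Inner join: one output pair per matching (left value, right value) combination."""
    right_by_key = group_values(right)
    return [(k, (lv, rv)) for k, lv in left for rv in right_by_key.get(k, [])]


def cogroup(left, right):
    """For every key in either input: (all left values, all right values)."""
    l, r = group_values(left), group_values(right)
    return {k: (l.get(k, []), r.get(k, [])) for k in sorted(set(l) | set(r))}


visits = [("index.html", "1.2.3.4"), ("about.html", "3.4.5.6"), ("index.html", "1.3.3.1")]
page_names = [("index.html", "Home"), ("about.html", "About")]

joined = join(visits, page_names)
grouped = cogroup(visits, page_names)
print(joined)
print(grouped)
```

A join emits one record per matching combination, while cogroup emits one record per key and keeps the two sides as separate lists, so a key that appears on only one side still shows up with an empty list on the other.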
  • 22. Spark Internals
  • 23. Spark application
      • Driver program
          – Java program that creates a SparkContext
      • Executors
          – Worker processes that execute tasks and store data
  • 24. Types of Applications
      • Long-lived/shared applications
          – Shark
          – Spark Streaming
          – Job Server (Ooyala)
      • Short-lived applications
          – Standalone apps
          – Shell sessions
      • Note: an application may do multi-user scheduling within its allocation from the cluster manager
  • 25. SparkContext
      • Main entry point to Spark functionality
      • Available in the shell as the variable sc
      • In standalone programs, you’d make your own (see later for details)
  • 26. Resilient Distributed Datasets: what about RDDs?
      • An RDD is a read-only, partitioned collection of records
      • Because an RDD is read-only, mutable state is represented by many RDDs
      • Users can control persistence and partitioning
      • Transformations or actions are applied to RDDs
  • 27. Creating RDDs
      # Turn a Python collection into an RDD
      > sc.parallelize([1, 2, 3])
      # Load text file from local FS, HDFS, or S3
      > sc.textFile("file.txt")
      > sc.textFile("directory/*.txt")
      > sc.textFile("hdfs://namenode:9000/path/file")
      # Use existing Hadoop InputFormat (Java/Scala only)
      > sc.hadoopFile(keyClass, valClass, inputFmt, conf)
  • 28. Cluster manager: the cluster manager grants executors to a Spark application
  • 29. Driver program: the driver program decides when to launch tasks on which executor. It needs full network connectivity to the workers
  • 30. Spark Development
  • 31. Spark Programming: getting started with Spark and Scala
      • Use IntelliJ and install the Scala plugin to build jar files
      • Use SBT as the build tool. Integrating Scala with Gradle is possible but difficult
      • Write Scala code
      • Build with ‘sbt package’ to generate the jar file
      • Run the code using ‘spark-submit’
      • Use spark-shell for quick prototyping
  • 32. Spark Development Environment: using IntelliJ and Scala
      • Install the Scala plugin for IntelliJ
      • Create a Scala project and use the ‘with SBT build’ option
      • Add the following lines to the build.sbt file to pull in the Spark dependencies:
          scalaVersion := "2.10.4"
          libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"
          libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.0.0"
          libraryDependencies += "org.scalatest" % "scalatest_2.10" % "2.1.0" % "test"
          resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
  • 33. Deploy and Run: using sbt and spark-submit
      • Run ‘sbt package’ to generate the jar file
      • Submit to the Spark engine using the following:
          spark-submit --class com.ps.ml.RitaML --master local[4] rita_2.10-1.0.jar
  • 34. Linear Regression using Spark
  • 35. Linear Regression using Spark
      Goal: build a model that predicts flight arrival delays
      • Use linear regression with the following predictors:
          – actual elapsed time, air time, departure delay, distance, taxi in, taxi out
      • Steps:
          – Import data
          – Data pre-processing
          – Build model
  • 36. RITA Data: sample data (shown comma-separated, one flight per line; the CancellationCode field is empty in every row of this sample)
      Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
      2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA
      2008,1,3,4,754,735,1002,1000,WN,3231,N772SW,128,145,113,2,19,IAD,TPA,810,5,10,0,,0,NA,NA,NA,NA,NA
      2008,1,3,4,628,620,804,750,WN,448,N428WN,96,90,76,14,8,IND,BWI,515,3,17,0,,0,NA,NA,NA,NA,NA
      2008,1,3,4,926,930,1054,1100,WN,1746,N612SW,88,90,78,-6,-4,IND,BWI,515,3,7,0,,0,NA,NA,NA,NA,NA
      2008,1,3,4,1829,1755,1959,1925,WN,3920,N464WN,90,90,77,34,34,IND,BWI,515,3,10,0,,0,2,0,0,0,32
      2008,1,3,4,1940,1915,2121,2110,WN,378,N726SW,101,115,87,11,25,IND,JAX,688,4,10,0,,0,NA,NA,NA,NA,NA
      2008,1,3,4,1937,1830,2037,1940,WN,509,N763SW,240,250,230,57,67,IND,LAS,1591,3,7,0,,0,10,0,0,0,47
      2008,1,3,4,1039,1040,1132,1150,WN,535,N428WN,233,250,219,-18,-1,IND,LAS,1591,7,7,0,,0,NA,NA,NA,NA,NA
      2008,1,3,4,617,615,652,650,WN,11,N689SW,95,95,70,2,2,IND,MCI,451,6,19,0,,0,NA,NA,NA,NA,NA
      2008,1,3,4,1620,1620,1639,1655,WN,810,N648SW,79,95,70,-16,0,IND,MCI,451,3,6,0,,0,NA,NA,NA,NA,NA
      2008,1,3,4,706,700,916,915,WN,100,N690SW,130,135,106,1,6,IND,MCO,828,5,19,0,,0,NA,NA,NA,NA,NA
      2008,1,3,4,1644,1510,1845,1725,WN,1333,N334SW,121,135,107,80,94,IND,MCO,828,6,8,0,,0,8,0,0,0,72
      2008,1,3,4,1426,1430,1426,1425,WN,829,N476WN,60,55,39,1,-4,IND,MDW,162,9,12,0,,0,NA,NA,NA,NA,NA
      2008,1,3,4,715,715,720,710,WN,1016,N765SW,65,55,37,10,0,IND,MDW,162,7,21,0,,0,NA,NA,NA,NA,NA
      2008,1,3,4,1702,1700,1651,1655,WN,1827,N420WN,49,55,35,-4,2,IND,MDW,162,4,10,0,,0,NA,NA,NA,NA,NA
      2008,1,3,4,1029,1020,1021,1010,WN,2272,N263WN,52,50,37,11,9,IND,MDW,162,6,9,0,,0,NA,NA,NA,NA,NA
      2008,1,3,4,1452,1425,1640,1625,WN,675,N286WN,228,240,213,15,27,IND,PHX,1489,7,8,0,,0,3,0,0,0,12
      2008,1,3,4,754,745,940,955,WN,1144,N778SW,226,250,205,-15,9,IND,PHX,1489,5,16,0,,0,NA,NA,NA,NA,NA
      2008,1,3,4,1323,1255,1526,1510,WN,4,N674AA,123,135,110,16,28,IND,TPA,838,4,9,0,,0,0,0,0,0,16
      2008,1,3,4,1416,1325,1512,1435,WN,54,N643SW,56,70,49,37,51,ISP,BWI,220,2,5,0,,0,12,0,0,0,25
      2008,1,3,4,706,705,807,810,WN,68,N497WN,61,65,51,-3,1,ISP,BWI,220,3,7,0,,0,NA,NA,NA,NA,NA
      2008,1,3,4,1657,1625,1754,1735,WN,623,N724SW,57,70,47,19,32,ISP,BWI,220,5,5,0,,0,7,0,0,0,12
      2008,1,3,4,1900,1840,1956,1950,WN,717,N786SW,56,70,49,6,20,ISP,BWI,220,2,5,0,,0,NA,NA,NA,NA,NA
      2008,1,3,4,1039,1030,1133,1140,WN,1244,N714CB,54,70,47,-7,9,ISP,BWI,220,2,5,0,,0,NA,NA,NA,NA,NA
      2008,1,3,4,801,800,902,910,WN,2101,N222WN,61,70,53,-8,1,ISP,BWI,220,3,5,0,,0,NA,NA,NA,NA,NA
      2008,1,3,4,1520,1455,1619,1605,WN,2553,N394SW,59,70,50,14,25,ISP,BWI,220,2,7,0,,0,NA,NA,NA,NA,NA
      2008,1,3,4,1422,1255,1657,1610,WN,188,N215WN,155,195,143,47,87,ISP,FLL,1093,6,6,0,,0,40,0,0,0,7
      2008,1,3,4,1954,1925,2239,2235,WN,1754,N243WN,165,190,155,4,29,ISP,FLL,1093,3,7,0,,0,NA,NA,NA,NA,NA
      2008,1,3,4,636,635,921,945,WN,2275,N454WN,165,190,147,-24,1,ISP,FLL,1093,5,13,0,,0,NA,NA,NA,NA,NA
      2008,1,3,4,734,730,958,1020,WN,550,N712SW,324,350,314,-22,4,ISP,LAS,2283,2,8,0,,0,NA,NA,NA,NA,NA
  • 37. RITA ML - Initialize
      // Import Spark core classes and the machine learning library
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.mllib.regression.LinearRegressionWithSGD
      import org.apache.spark.mllib.regression.LabeledPoint
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.classification.SVMWithSGD
      import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
      import org.apache.spark.mllib.recommendation.ALS
      import org.apache.spark.mllib.recommendation.Rating
      // Regex helper class: enables r"..." patterns in match expressions
      implicit class Regex(sc: StringContext) {
        def r = new util.matching.Regex(sc.parts.mkString, sc.parts.tail.map(_ => "x"): _*)
      }
      // Set up the Spark context
      val conf = new SparkConf().setAppName("RitaML")
      val sc = new SparkContext(conf)
  • 38. RITA ML - Data Processing
      // Import the file as an RDD of lines
      val rita08 = sc.textFile("maprfs:/user/ubuntu/input/2008.csv")
      // Remove the header from the RDD: keep only lines whose first field is all digits
      val rita08_nh = rita08.filter(x => x.split(',')(0) match {
        case r"\d+" => true
        case _ => false
      })
      // Assign names to field indices
      val actual_elapsed_time = 11
      val airtime = 13
      val arrdelay = 14
      val depdelay = 15
      val distance = 18
      val taxiin = 19
      val taxiout = 20
  • 39. RITA ML - Data Processing
      def isna(s: String): Boolean = {
        s match {
          case "NA" => true
          case _ => false
        }
      }
      // Get fields of interest and filter NAs
      val rita08_nh_ftd = rita08_nh
        .map(x => x.split(','))
        .map(x => (x(arrdelay), (x(actual_elapsed_time), x(airtime), x(depdelay), x(distance), x(taxiin), x(taxiout))))
        .filter(x => !isna(x._1) && !isna(x._2._1) && !isna(x._2._2) && !isna(x._2._3) && !isna(x._2._4) && !isna(x._2._5) && !isna(x._2._6))
      // Convert Strings to LabeledPoint: (response variable, Vector(predictors))
      val rita08_training_data = rita08_nh_ftd.map(x =>
        LabeledPoint(x._1.toDouble, Vectors.dense(Array(x._2._1.toDouble, x._2._2.toDouble, x._2._3.toDouble, x._2._4.toDouble, x._2._5.toDouble, x._2._6.toDouble))))
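The pre-processing steps on the two Data Processing slides (drop the header, select the fields of interest, drop rows containing NA, convert to numbers) can be checked on a few lines without a cluster. The sketch below is plain Python, not the Scala/MLlib code from the talk. The function name `to_labeled_point` is hypothetical, the field indices are the ones listed on the slide, and the last sample row is made up to show the NA filter; the other two rows are taken from the RITA sample.

```python
import re

ARR_DELAY = 14                          # response variable (ArrDelay)
PREDICTORS = [11, 13, 15, 18, 19, 20]   # ActualElapsedTime, AirTime, DepDelay,
                                        # Distance, TaxiIn, TaxiOut


def to_labeled_point(line):
    """Return (label, features) for a usable row, or None for header/NA rows."""
    fields = line.split(",")
    # Header check: data rows start with an all-digit year, the header does not.
    if not re.fullmatch(r"\d+", fields[0]):
        return None
    wanted = [fields[ARR_DELAY]] + [fields[i] for i in PREDICTORS]
    # Drop the row if any field of interest is missing.
    if "NA" in wanted:
        return None
    label, *features = (float(v) for v in wanted)
    return label, features


lines = [
    "Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,"
    "UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,"
    "ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,"
    "CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,"
    "SecurityDelay,LateAircraftDelay",
    "2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA",
    "2008,1,3,4,1829,1755,1959,1925,WN,3920,N464WN,90,90,77,34,34,IND,BWI,515,3,10,0,,0,2,0,0,0,32",
    # Hypothetical cancelled flight (not from the sample) to exercise the NA filter:
    "2008,1,3,4,NA,1955,NA,2225,WN,335,N712SW,NA,150,NA,NA,NA,IAD,TPA,810,NA,NA,1,A,0,NA,NA,NA,NA,NA",
]

points = [p for p in map(to_labeled_point, lines) if p is not None]
for label, features in points:
    print(label, features)
```

Only the delay columns at the end of each row may be NA in the two real sample rows, and they are not among the selected predictors, so both rows survive the filter.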
  • 40. RITA ML – Train Model
      val numIterations = 20
      // Train LR model
      val mymodel = LinearRegressionWithSGD.train(rita08_training_data, numIterations)
      // Pair each actual value with the model's prediction
      val valuesAndPreds = rita08_training_data.map { point =>
        val prediction = mymodel.predict(point.features)
        (point.label, prediction)
      }
      // Get the Mean Squared Error
      val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
      println("training Mean Squared Error = " + MSE)
  • 41. RITA ML – Analysis
      Things to try:
          – Remove predictors
          – Add new predictors
          – Increase the number of iterations to improve gradient descent
          – Run again to determine whether the MSE decreases
          – Iterate this process until you have an acceptable MSE (this is a strength of Spark: the loop can be repeated quickly)
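The advice to raise the iteration count and re-check the MSE can be tried on toy data. The sketch below is plain Python and is only an illustration of the idea: it uses full-batch gradient descent with a fixed step size on a one-predictor model without an intercept, which is a simplification and not how MLlib's `LinearRegressionWithSGD` is implemented. The function names `mse` and `train`, the step size, and the data points are all assumptions made for this example.

```python
def mse(values_and_preds):
    """Mean of squared differences between actual values and predictions."""
    pairs = list(values_and_preds)
    return sum((v - p) ** 2 for v, p in pairs) / len(pairs)


def train(points, num_iterations, step_size=0.01):
    """Full-batch gradient descent for y = w * x (one predictor, no intercept)."""
    w = 0.0
    n = len(points)
    for _ in range(num_iterations):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in points) / n
        w -= step_size * grad
    return w


# Toy data (made up for illustration): y is exactly 2 * x.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

for num_iterations in (5, 20, 100):
    w = train(points, num_iterations)
    error = mse((y, w * x) for x, y in points)
    print(num_iterations, round(w, 4), error)
```

On this data the error shrinks as the iteration count grows, which is the behaviour to look for when re-running the RITA model. On real data the step size matters as well: too large a step can make the error grow instead of shrink.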
  • 42. Q&A: engage with us! @mapr, maprtech, mapr-technologies, yourname@mapr.com