SlideShare ist ein Scribd-Unternehmen logo
1 von 30
Downloaden Sie, um offline zu lesen
APACHE SPARK OVERVIEW 
tech talk @ ferret 
Andrii Gakhov
• Apache Spark™ is a fast and general engine for 
large-scale data processing. 
• Lastest release: Spark 1.1.1 (Nov 26, 2014) 
• spark.apache.org 
• Originally developed in 2009 in UC Berkeley’s 
AMPLab, and open sourced in 2010. Now Spark is 
supported by Databricks.
APACHE SPARK 
Spark SQL MLlib GraphX Streaming 
standalone 
with local 
storage 
Apache Spark 
MESOS YARN 
EC2 
S3 HDFS 
node node node node
RDD 
• Spark’s primary conception is a Resilient 
Distributed Dataset (RDD) - abstraction of an 
immutable, distributed dataset. 
textFile = sc.textFile(“api.log") 
anotherFile = sc.textFile(“hdfs://var/log/api.log”) 
• Collections of objects that can be stored in memory 
or disk across the cluster 
• Parallel functional transformations (map, filter, …) 
• Automatically rebuild of failure
RDD 
• RDDs have actions, which retur n values, and 
transformations, which return pointers to new RDDs. 
• Actions: 
• reduce collect count countByKey take saveAsTextFile 
takeSample … 
• Transformations: 
• map filter flatMap distinct sample join union intersection 
reduceByKey groupByKey sortByKey … 
errors = logFile.filter(lambda line: line.startswith(“ERROR”)) 
print errors.count()
PERSISTANCE 
• You can control persistence of RDD across operations 
(MEMORY_ONLY, MEMORY_AND_DISK …) 
• When you persist an RDD in memory, each node stores 
any partitions of it that it computes in memory and 
reuses them in other actions on that dataset (or datasets 
derived from it) 
• This allows future actions to be much faster (often by 
more than 10x). 
errors.cache() 
endpoint_errors = errors.filter( 
lambda line: “/test/endpoint” in line) 
endpoint_errors.count()
HDFS 
iteration iteration iteration 
Hadoop MapReduce 
iteration iteration iteration 
MEMORY HDFS 
Apache Spark
INTERACTIVE DEMO 
STRATA+HADOOP WORD EXAMPLE 
http://www.datacrucis.com/research/twitter-analysis-for-strata-barcelona-2014-with-apache-spark-and-d3.html
SPARK SQL 
TRANSFORM RDD WITH SQL
SCHEMA RDD 
• Spark SQL allows relational queries expressed in SQL, 
HiveQL, or Scala to be executed using Spark. 
• At the core of this component is a new type of RDD - 
SchemaRDD. 
• SchemaRDDs are composed of Row objects, along with a 
schema that describes the data types of each column in the row. 
• A SchemaRDD is similar to a table in a traditional relational 
database. 
• A SchemaRDD can be created from an existing RDD, a Parquet 
file, a JSON dataset, or by running HiveQL against data stored in 
Apache Hive.
SCHEMA RDD 
• To work with SparkSQL you need SQLContext 
(or HiveContext) 
from spark.sql import SQLContext 
sqlCtx = SQLContext(sc) 
records = sc.textFile(“customers.csv”) 
customers = records.map(lambda line: line.split(“,”)) 
.map(lambda r: Row(name=r[0], age=int(r[1]))) 
customersTable = sqlCtx.inferSchema(customers) 
customersTable.registerAsTable(“customers”)
SCHEMA RDD 
User 
User 
User 
Name Age Phone 
Name Age Phone 
Name Age Phone 
RDD SchemaRDD 
• Transformations over RDD are just functional 
transformation on partitioned collections of objects 
• Transformation over the SchemaRDD are 
declarative transformations on par titioned 
collections of tuples
SPARK SQL 
• Schema RDD could be used as regular RDD at 
the same time. 
seniors = sqlCtx.sql(“”” 
SELECT from customers WHERE age >= 70”””) 
print seniors.count() 
print seniors.map(lambda r: “Name: “ + r.name).take(10)
MLLIB 
Distributed Machine Learning
MACHINE LEARNING LIBRARY 
• MLlib uses the linear algebra package Breeze, 
which depends on netlib-java, and jblas 
• MLlib in Python requires NumPy version 1.4+ 
• MLlib is under active development 
• Many API changes every release 
• Not all algorithms are fully functional
MACHINE LEARNING LIBRARY 
• Basic statistics 
• Classification and regression 
• linear models (SVMs, logistic regression, linear 
regression) 
• decision trees 
• naive Bayes 
• Collaborative filtering 
• alternating least squares (ALS) 
• Clustering 
• k-means
MACHINE LEARNING LIBRARY 
• Dimensionality reduction 
• singular value decomposition (SVD) 
• principal component analysis (PCA) 
• Feature extraction and transformation 
• Optimization 
• stochastic gradient descent 
• limited-memory BFGS (L-BFGS)
MACHINE LEARNING LIBRARY 
• LinearRegression with stochastic gradient descent (SGD) 
example on Spark: 
def parsePoint(line): 
values = [float(x) for x in line.replace(',', ' ').split(' ')] 
return LabeledPoint(values[0], values[1:]) 
parsedData = data.map(parsePoint) 
model = LinearRegressionWithSGD.train(parsedData) 
valuesAndPreds = parsedData.map( 
lambda p: (p.label, model.predict(p.features))) 
MSE = valuesAndPreds.map(lambda (v, p): (v - p)**2) 
.reduce(lambda x, y: x + y) / valuesAndPreds.count()
SPARK STREAMING 
Fault-tolerant stream processing
SPARK STREAMING 
• Spark Streaming enables scalable, high-throughput, 
fault-tolerant stream processing of live data streams 
• Spark Streaming provides a high-level abstraction 
called discretized stream or DStream, which 
represents a continuous stream of data 
• Internally, a DStream is represented as a sequence 
of RDDs.
SPARK STREAMING 
• Example of processing Twitter Stream with Spark 
Streaming: 
import org.apache.spark.streaming._ 
import org.apache.spark.streaming.twitter._ 
… 
val ssc = new StreamingContext(sc, Seconds(1)) 
val tweets = TwitterUtils.createStream(ssc, auth) 
val hashTags = tweets.flatMap(status=>getTags(status)) 
hashTags.saveAsHadoopFiles("hdfs://...")
SPARK STREAMING 
• Any operation applied on a DStream translates to 
operations on the underlying RDDs. 
RDD @ time1 RDD @ time2 RDD @ time3 RDD @ time4
SPARK STREAMING 
• Spark Streaming also provides windowed 
computations, which allow you to apply 
transformations over a sliding window of data
CONCLUSIONS
SPEED 
• Run programs up to 100x faster than Hadoop 
MapReduce in memory, or 10x faster on disk. 
Logistic regression 
in Hadoop and Spark 
• Spark has won the Daytona GraySort contest for 
2014 (sortbenchmark.org) with 4.27 TB/min 
(in 2013 Hadoop was the winner with 1.42 TB/min)
EASE OF USE 
• Supports out of the box: 
• Java 
• Scala 
• Python 
• You can use it interactively from the Scala and 
Python shells
GENERALITY 
• SQL with SparkSQL 
• Machine Learning with MLlib 
• Graphs computation with GraphX 
• Streaming processing with Spark Streaming
RUNS EVERYWHERE 
• Spark could be run on 
• Hadoop (YARN) 
• Mesos 
• standalone 
• in the cloud 
• Spark can read from 
• S3 
• HDFS 
• HBase 
• Cassandra 
• any Hadoop data source.
Thank you. 
• Credentials: 
• http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014 
• http://spark.apache.org 
• http://www.databricks.com 
• http://www.datacrucis.com/research/twitter-analysis-for-strata-barcelona- 
2014-with-apache-spark-and-d3.html

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Uwe Printz
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemBojan Babic
 
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigHivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigDataWorks Summit/Hadoop Summit
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processingprajods
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on HadoopCarol McDonald
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Uwe Printz
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component rebeccatho
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick viewRajesh Nadipalli
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Sparkrhatr
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystemsunera pathan
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyRohit Kulkarni
 
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...Richard Seymour
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkCloudera, Inc.
 

Was ist angesagt? (20)

Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
 
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigHivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
Spark vs Hadoop
Spark vs HadoopSpark vs Hadoop
Spark vs Hadoop
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on Hadoop
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
 
Tachyon and Apache Spark
Tachyon and Apache SparkTachyon and Apache Spark
Tachyon and Apache Spark
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
Working with the Scalding Type -Safe API
Working with the Scalding Type -Safe APIWorking with the Scalding Type -Safe API
Working with the Scalding Type -Safe API
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
 

Andere mochten auch

Spark - The beginnings
Spark -  The beginningsSpark -  The beginnings
Spark - The beginningsDaniel Leon
 
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of DatabricksBig Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of DatabricksData Con LA
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...Chris Fregly
 
Apache spark linkedin
Apache spark linkedinApache spark linkedin
Apache spark linkedinYukti Kaura
 
New directions for Apache Spark in 2015
New directions for Apache Spark in 2015New directions for Apache Spark in 2015
New directions for Apache Spark in 2015Databricks
 
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...Chris Fregly
 
Extreme-scale Ad-Tech using Spark and Databricks at MediaMath
Extreme-scale Ad-Tech using Spark and Databricks at MediaMathExtreme-scale Ad-Tech using Spark and Databricks at MediaMath
Extreme-scale Ad-Tech using Spark and Databricks at MediaMathSpark Summit
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Databricks
 
Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonDatabricks
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelMartin Zapletal
 
Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark:  Real time Advanced Analytics and Machine Learning with SparkSpark After Dark:  Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark: Real time Advanced Analytics and Machine Learning with SparkChris Fregly
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkDatabricks
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...Reynold Xin
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development Spark Summit
 
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksDataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksData Con LA
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 

Andere mochten auch (20)

Spark - The beginnings
Spark -  The beginningsSpark -  The beginnings
Spark - The beginnings
 
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of DatabricksBig Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache spark linkedin
Apache spark linkedinApache spark linkedin
Apache spark linkedin
 
New directions for Apache Spark in 2015
New directions for Apache Spark in 2015New directions for Apache Spark in 2015
New directions for Apache Spark in 2015
 
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
Madrid Spark Big Data Bluemix Meetup - Spark Versus Hadoop @ 100 TB Daytona G...
 
Extreme-scale Ad-Tech using Spark and Databricks at MediaMath
Extreme-scale Ad-Tech using Spark and Databricks at MediaMathExtreme-scale Ad-Tech using Spark and Databricks at MediaMath
Extreme-scale Ad-Tech using Spark and Databricks at MediaMath
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Spark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science LondonSpark Under the Hood - Meetup @ Data Science London
Spark Under the Hood - Meetup @ Data Science London
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming model
 
Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark:  Real time Advanced Analytics and Machine Learning with SparkSpark After Dark:  Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Apache Spark & Streaming
Apache Spark & StreamingApache Spark & Streaming
Apache Spark & Streaming
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development
 
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksDataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 

Ähnlich wie Apache Spark Overview @ ferret

Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the SurfaceJosi Aranda
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkVenkata Naga Ravi
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2Gal Marder
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
Spark Programming
Spark ProgrammingSpark Programming
Spark ProgrammingTaewook Eom
 
Why Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data EraWhy Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data EraHandaru Sakti
 
xPatterns on Spark, Tachyon and Mesos - Bucharest meetup
xPatterns on Spark, Tachyon and Mesos - Bucharest meetupxPatterns on Spark, Tachyon and Mesos - Bucharest meetup
xPatterns on Spark, Tachyon and Mesos - Bucharest meetupRadu Chilom
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDsDean Chen
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizationsGal Marder
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 

Ähnlich wie Apache Spark Overview @ ferret (20)

Spark core
Spark coreSpark core
Spark core
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Spark Programming
Spark ProgrammingSpark Programming
Spark Programming
 
Why Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data EraWhy Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data Era
 
xPatterns on Spark, Tachyon and Mesos - Bucharest meetup
xPatterns on Spark, Tachyon and Mesos - Bucharest meetupxPatterns on Spark, Tachyon and Mesos - Bucharest meetup
xPatterns on Spark, Tachyon and Mesos - Bucharest meetup
 
Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
Spark real world use cases and optimizations
Spark real world use cases and optimizationsSpark real world use cases and optimizations
Spark real world use cases and optimizations
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 

Mehr von Andrii Gakhov

Let's start GraphQL: structure, behavior, and architecture
Let's start GraphQL: structure, behavior, and architectureLet's start GraphQL: structure, behavior, and architecture
Let's start GraphQL: structure, behavior, and architectureAndrii Gakhov
 
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Andrii Gakhov
 
Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...Andrii Gakhov
 
Implementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and LuaImplementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and LuaAndrii Gakhov
 
Pecha Kucha: Ukrainian Food Traditions
Pecha Kucha: Ukrainian Food TraditionsPecha Kucha: Ukrainian Food Traditions
Pecha Kucha: Ukrainian Food TraditionsAndrii Gakhov
 
Probabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. SimilarityProbabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. SimilarityAndrii Gakhov
 
Probabilistic data structures. Part 3. Frequency
Probabilistic data structures. Part 3. FrequencyProbabilistic data structures. Part 3. Frequency
Probabilistic data structures. Part 3. FrequencyAndrii Gakhov
 
Probabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityProbabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityAndrii Gakhov
 
Вероятностные структуры данных
Вероятностные структуры данныхВероятностные структуры данных
Вероятностные структуры данныхAndrii Gakhov
 
Recurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryRecurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryAndrii Gakhov
 
Apache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected TalksApache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected TalksAndrii Gakhov
 
Swagger / Quick Start Guide
Swagger / Quick Start GuideSwagger / Quick Start Guide
Swagger / Quick Start GuideAndrii Gakhov
 
API Days Berlin highlights
API Days Berlin highlightsAPI Days Berlin highlights
API Days Berlin highlightsAndrii Gakhov
 
ELK - What's new and showcases
ELK - What's new and showcasesELK - What's new and showcases
ELK - What's new and showcasesAndrii Gakhov
 
Data Mining - lecture 8 - 2014
Data Mining - lecture 8 - 2014Data Mining - lecture 8 - 2014
Data Mining - lecture 8 - 2014Andrii Gakhov
 
Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014Andrii Gakhov
 
Data Mining - lecture 6 - 2014
Data Mining - lecture 6 - 2014Data Mining - lecture 6 - 2014
Data Mining - lecture 6 - 2014Andrii Gakhov
 
Data Mining - lecture 5 - 2014
Data Mining - lecture 5 - 2014Data Mining - lecture 5 - 2014
Data Mining - lecture 5 - 2014Andrii Gakhov
 
Data Mining - lecture 4 - 2014
Data Mining - lecture 4 - 2014Data Mining - lecture 4 - 2014
Data Mining - lecture 4 - 2014Andrii Gakhov
 

Mehr von Andrii Gakhov (20)

Let's start GraphQL: structure, behavior, and architecture
Let's start GraphQL: structure, behavior, and architectureLet's start GraphQL: structure, behavior, and architecture
Let's start GraphQL: structure, behavior, and architecture
 
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
 
Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...Too Much Data? - Just Sample, Just Hash, ...
Too Much Data? - Just Sample, Just Hash, ...
 
DNS Delegation
DNS DelegationDNS Delegation
DNS Delegation
 
Implementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and LuaImplementing a Fileserver with Nginx and Lua
Implementing a Fileserver with Nginx and Lua
 
Pecha Kucha: Ukrainian Food Traditions
Pecha Kucha: Ukrainian Food TraditionsPecha Kucha: Ukrainian Food Traditions
Pecha Kucha: Ukrainian Food Traditions
 
Probabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. SimilarityProbabilistic data structures. Part 4. Similarity
Probabilistic data structures. Part 4. Similarity
 
Probabilistic data structures. Part 3. Frequency
Probabilistic data structures. Part 3. FrequencyProbabilistic data structures. Part 3. Frequency
Probabilistic data structures. Part 3. Frequency
 
Probabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. CardinalityProbabilistic data structures. Part 2. Cardinality
Probabilistic data structures. Part 2. Cardinality
 
Вероятностные структуры данных
Вероятностные структуры данныхВероятностные структуры данных
Вероятностные структуры данных
 
Recurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryRecurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: Theory
 
Apache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected TalksApache Big Data Europe 2015: Selected Talks
Apache Big Data Europe 2015: Selected Talks
 
Swagger / Quick Start Guide
Swagger / Quick Start GuideSwagger / Quick Start Guide
Swagger / Quick Start Guide
 
API Days Berlin highlights
API Days Berlin highlightsAPI Days Berlin highlights
API Days Berlin highlights
 
ELK - What's new and showcases
ELK - What's new and showcasesELK - What's new and showcases
ELK - What's new and showcases
 
Data Mining - lecture 8 - 2014
Data Mining - lecture 8 - 2014Data Mining - lecture 8 - 2014
Data Mining - lecture 8 - 2014
 
Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014Data Mining - lecture 7 - 2014
Data Mining - lecture 7 - 2014
 
Data Mining - lecture 6 - 2014
Data Mining - lecture 6 - 2014Data Mining - lecture 6 - 2014
Data Mining - lecture 6 - 2014
 
Data Mining - lecture 5 - 2014
Data Mining - lecture 5 - 2014Data Mining - lecture 5 - 2014
Data Mining - lecture 5 - 2014
 
Data Mining - lecture 4 - 2014
Data Mining - lecture 4 - 2014Data Mining - lecture 4 - 2014
Data Mining - lecture 4 - 2014
 

Kürzlich hochgeladen

Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
Introduction to Decentralized Applications (dApps)
Introduction to Decentralized Applications (dApps)Introduction to Decentralized Applications (dApps)
Introduction to Decentralized Applications (dApps)Intelisync
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...aditisharan08
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 

Kürzlich hochgeladen (20)

Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Introduction to Decentralized Applications (dApps)
Introduction to Decentralized Applications (dApps)Introduction to Decentralized Applications (dApps)
Introduction to Decentralized Applications (dApps)
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 

Apache Spark Overview @ ferret

  • 1. APACHE SPARK OVERVIEW tech talk @ ferret Andrii Gakhov
  • 2. • Apache Spark™ is a fast and general engine for large-scale data processing. • Lastest release: Spark 1.1.1 (Nov 26, 2014) • spark.apache.org • Originally developed in 2009 in UC Berkeley’s AMPLab, and open sourced in 2010. Now Spark is supported by Databricks.
  • 3. APACHE SPARK Spark SQL MLlib GraphX Streaming standalone with local storage Apache Spark MESOS YARN EC2 S3 HDFS node node node node
  • 4. RDD • Spark’s primary conception is a Resilient Distributed Dataset (RDD) - abstraction of an immutable, distributed dataset. textFile = sc.textFile(“api.log") anotherFile = sc.textFile(“hdfs://var/log/api.log”) • Collections of objects that can be stored in memory or disk across the cluster • Parallel functional transformations (map, filter, …) • Automatically rebuild of failure
  • 5. RDD • RDDs have actions, which retur n values, and transformations, which return pointers to new RDDs. • Actions: • reduce collect count countByKey take saveAsTextFile takeSample … • Transformations: • map filter flatMap distinct sample join union intersection reduceByKey groupByKey sortByKey … errors = logFile.filter(lambda line: line.startswith(“ERROR”)) print errors.count()
  • 6. PERSISTANCE • You can control persistence of RDD across operations (MEMORY_ONLY, MEMORY_AND_DISK …) • When you persist an RDD in memory, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it) • This allows future actions to be much faster (often by more than 10x). errors.cache() endpoint_errors = errors.filter( lambda line: “/test/endpoint” in line) endpoint_errors.count()
  • 7. HDFS iteration iteration iteration Hadoop MapReduce iteration iteration iteration MEMORY HDFS Apache Spark
  • 8. INTERACTIVE DEMO STRATA+HADOOP WORD EXAMPLE http://www.datacrucis.com/research/twitter-analysis-for-strata-barcelona-2014-with-apache-spark-and-d3.html
  • 9. SPARK SQL TRANSFORM RDD WITH SQL
  • 10. SCHEMA RDD • Spark SQL allows relational queries expressed in SQL, HiveQL, or Scala to be executed using Spark. • At the core of this component is a new type of RDD - SchemaRDD. • SchemaRDDs are composed of Row objects, along with a schema that describes the data types of each column in the row. • A SchemaRDD is similar to a table in a traditional relational database. • A SchemaRDD can be created from an existing RDD, a Parquet file, a JSON dataset, or by running HiveQL against data stored in Apache Hive.
  • 11. SCHEMA RDD • To work with SparkSQL you need SQLContext (or HiveContext) from spark.sql import SQLContext sqlCtx = SQLContext(sc) records = sc.textFile(“customers.csv”) customers = records.map(lambda line: line.split(“,”)) .map(lambda r: Row(name=r[0], age=int(r[1]))) customersTable = sqlCtx.inferSchema(customers) customersTable.registerAsTable(“customers”)
  • 12. SCHEMA RDD User User User Name Age Phone Name Age Phone Name Age Phone RDD SchemaRDD • Transformations over RDD are just functional transformation on partitioned collections of objects • Transformation over the SchemaRDD are declarative transformations on par titioned collections of tuples
  • 13. SPARK SQL • Schema RDD could be used as regular RDD at the same time. seniors = sqlCtx.sql(“”” SELECT from customers WHERE age >= 70”””) print seniors.count() print seniors.map(lambda r: “Name: “ + r.name).take(10)
  • 15. MACHINE LEARNING LIBRARY • MLlib uses the linear algebra package Breeze, which depends on netlib-java, and jblas • MLlib in Python requires NumPy version 1.4+ • MLlib is under active development • Many API changes every release • Not all algorithms are fully functional
  • 16. MACHINE LEARNING LIBRARY • Basic statistics • Classification and regression • linear models (SVMs, logistic regression, linear regression) • decision trees • naive Bayes • Collaborative filtering • alternating least squares (ALS) • Clustering • k-means
  • 17. MACHINE LEARNING LIBRARY • Dimensionality reduction • singular value decomposition (SVD) • principal component analysis (PCA) • Feature extraction and transformation • Optimization • stochastic gradient descent • limited-memory BFGS (L-BFGS)
  • 18. MACHINE LEARNING LIBRARY • LinearRegression with stochastic gradient descent (SGD) example on Spark: def parsePoint(line): values = [float(x) for x in line.replace(',', ' ').split(' ')] return LabeledPoint(values[0], values[1:]) parsedData = data.map(parsePoint) model = LinearRegressionWithSGD.train(parsedData) valuesAndPreds = parsedData.map( lambda p: (p.label, model.predict(p.features))) MSE = valuesAndPreds.map(lambda (v, p): (v - p)**2) .reduce(lambda x, y: x + y) / valuesAndPreds.count()
  • 19. SPARK STREAMING Fault-tolerant stream processing
  • 20. SPARK STREAMING • Spark Streaming enables scalable, high-throughput, fault-tolerant stream processing of live data streams • Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data • Internally, a DStream is represented as a sequence of RDDs.
  • 21. SPARK STREAMING • Example of processing Twitter Stream with Spark Streaming: import org.apache.spark.streaming._ import org.apache.spark.streaming.twitter._ … val ssc = new StreamingContext(sc, Seconds(1)) val tweets = TwitterUtils.createStream(ssc, auth) val hashTags = tweets.flatMap(status=>getTags(status)) hashTags.saveAsHadoopFiles("hdfs://...")
  • 22. SPARK STREAMING • Any operation applied on a DStream translates to operations on the underlying RDDs. RDD @ time1 RDD @ time2 RDD @ time3 RDD @ time4
  • 23. SPARK STREAMING • Spark Streaming also provides windowed computations, which allow you to apply transformations over a sliding window of data
  • 25. SPEED • Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Logistic regression in Hadoop and Spark • Spark has won the Daytona GraySort contest for 2014 (sortbenchmark.org) with 4.27 TB/min (in 2013 Hadoop was the winner with 1.42 TB/min)
  • 26. EASE OF USE • Supports out of the box: • Java • Scala • Python • You can use it interactively from the Scala and Python shells
  • 27. GENERALITY • SQL with SparkSQL • Machine Learning with MLlib • Graphs computation with GraphX • Streaming processing with Spark Streaming
  • 28. RUNS EVERYWHERE • Spark could be run on • Hadoop (YARN) • Mesos • standalone • in the cloud • Spark can read from • S3 • HDFS • HBase • Cassandra • any Hadoop data source.
  • 29.
  • 30. Thank you. • Credentials: • http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014 • http://spark.apache.org • http://www.databricks.com • http://www.datacrucis.com/research/twitter-analysis-for-strata-barcelona- 2014-with-apache-spark-and-d3.html