SlideShare a Scribd company logo
1 of 37
Download to read offline
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Apache Spark : Fast and Easy Data
Processing
Sujee Maniyam
Founder / Principal
Elephant Scale LLC
sujee@elephantscale.com
http://elephantscale.com
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Outline
r  Background
r  Spark Architecture
r  Demo
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark
r  Fast & Expressive Cluster computing engine
r  Compatible with Hadoop
r  Came out of Berkeley AMP Lab
r  Now Apache project
r  Version 1.2 just released (Dec 2014)
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Timeline Hadoop & Spark
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Hypo-meter J
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark Job Trends
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Comparison With Hadoop
Hadoop Spark
Distributed Storage + Distributed
Compute
Distributed Compute Only
MapReduce framework Generalized computation
Usually data on disk (HDFS) On disk / in memory
Not ideal for iterative work Great at Iterative workloads
(machine learning ..etc)
Batch process - Upto 2x -10x faster for data on disk
- Upto 100x faster for data in memory
Compact code
Java, Python, Scala supported
Shell for ad-hoc exploration
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Hadoop + Yarn : Universal OS for
Cluster Computing
HDFS
YARN
Batch
(mapreduce)
Streaming
(storm, S4)
In-memory
(spark)
Storage
Cluster
Management
Applications
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark Vs Hadoop
r  Spark is ‘easier’ than Hadoop
r  ‘friendlier’ for data scientists / analysts
r  Interactive shell
r fast development cycles
r adhoc exploration
r  API supports multiple languages
r Java, Scala, Python
r  Great for small (Gigs) to medium (100s of Gigs)
data
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark Vs. Hadoop
Fast!
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Is Spark Replacing Hadoop?
r  Right now, Spark runs on Hadoop / YARN
r Complimentary
r  Can be seen as generic MapReduce
r  Spark is really great if data fits in memory (few
hundred gigs),
r  Spark can be used as compute platform with
various storage types (see next slide)
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark & Pluggable Storage
Spark
(compute engine)
HDFS Amazon S3 Cassandra ???
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Hadoop & Spark Future ???
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Outline
r  Background
r  à Spark Architecture
r  Demo
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark Eco-System
Spark Core
Spark
SQL
Spark
Streaming
ML lib
Schema / sql Real Time
Machine
Learning
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark Core
r  Distributed compute engine
r  Sort / shuffle algorithms
r  Handles node failures -> re-computes missing
pieces
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark SQL
r  Adhoc / interactive analytics
r  ETL workflow
r  Reporting
r  Natively understands
r JSON : {“name”: “mark”, “age”: 40}
r Parquet : on disk columnar format, heavily
compressed
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark SQL Quick Example
{"name":"mark", "age":50, "sex":"M"},
{"name":"mary", "age":45, "sex":"F"},
{"name":"brian", "age":15, "sex":"M"},
{"name":"nancy", "age":17, "sex":"F"}
// … setup …
val teenagers = sqlContext.sql("SELECT name
FROM people WHERE age >= 13 AND age <= 19")
è  [brian, nancy]
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark Streaming
r  Process data streams in real time, in micro-
batches
r  Low latency, high throughput (1000s events /
sec)
r  Stock ticks / sensor data (connected devices /
IoT – Internet of Things)
Streaming
Sources
Storage
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Machine Learning (ML Lib)
r  Machine learning at scale
r  Out of the box ML capabilities !
r  Java / Scala / Python language support
r  Lots of common algorithms are supported
r  Classification / Regressions
r  Linear models (linear R, logistic regression, SVM)
r  Decision trees
r  Collaborative filtering (recommendations)
r  K-Means clustering
r  More to come
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark Cluster Deployment
1)  Standalone
2)  Yarn
3)  Mesos
Client
Node
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark Cluster Architecture
r  Multiple ‘applications’ can run at the same time
r  Driver (or ‘main’) launches an application
r  Each application gets its own ‘executor’
r  Isolated (runs in different JVMs)
r  Also means data can not be shared across applications
r  Cluster Managers:
r  multiple cluster managers are supported
r  1) Standalone : simple to setup
r  2) YARN : on top of Hadoop
r  3) Mesos : General cluster manager (AMP lab)
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Spark Data Model : RDD
r  Resilient Distributed Dataset (RDD)
r  Can live in
r Memory (best case scenario)
r Or on disk (FS, HDFS, S3 …etc)
r  Each RDD is split into multiple partitions
r  Partitions may live on different nodes
r  Partitions can be computed in parallel on
different nodes
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Partitions Explained
1G file
64 M 64 M 64 M 64 M
Task Task Task Task
Result
parallelism
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
RDD : Loading
r  Use Spark context to load RDDs from disk /
external storage
val sc = new SparkContext(…)
val f = sc.textFile(“/data/input1.txt”) // single file
sc.textFile(“/data/”) // load all files under dir
sc.textFile(“/data/*.log”) // wild card matching
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
RDD Operations
r  Two kinds of operations on RDDs
r 1) Transformations
r Create a new RDD from existing ones (e.g. Map)
r 2) Actions
r E.g. Returns the results to clients (e.g. Reduce)
r  Transformations are lazy.. Actions force
transformations
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Transformations / Actions
RDD 1 RDD 2
Client
RDD 3
Transformation 1
(map)
Action1
(collect)
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
RDD Transformations
Transformation Description Example
filter Filters through each record
(aka grep)
f.filter( line =>
line.contains(“ERROR”))
union Merges two RDDs rdd1.union(rdd2)
…see docs …
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
RDD Actions
Action Description Example
count() Counts all records in an rdd f.count()
first() Extract the first record f.first ()
take(n) Take first N lines f.take(10)
collect() Gathers all records for RDD.
All data has to fit in memory of
ONE machine (don’t use for big
data sets)
f.collect()
…. See
documentation
..
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
RDD : Saving
r  saveAsTextFile () and saveAsSequenceFile()
f.saveAsTextFile(“/output/directory”) // a directory
r  Output usually is a directory
r RDDs will be saved as multiple files in the dir
r Each partition à one output file
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Caching of RDDs
r  RDDs can be loaded from disk and computed
r Hadoop mapreduce model
r  Also RDDs can be cached in memory
r  Subsequent operations are much faster
f.persist() // on disk or memory
f.cache() // memory only
r  In memory RDDs are
great for iterative workloads
r Machine learning algorithms
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Demo Time !
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Demo : Spark-shell
r  Invoke spark shell
r  Load a data set
r  Do basic operations (count / filter)
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Demo: RDD Caching
r  From Spark shell
r  Load an RDD
r  Demonstrate the difference between cached and
non-cached
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Demo : Map Reduce
r  Quick Word count
val input = sc.textFile(“…”)
val counts = input.flatMap(
line => line.split(“ “)).
map(word => (word, 1)).
reduceByKey(_+_)
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Thanks !
Sujee Maniyam
sujee@elephantscale.com
http://elephantscale.com
Expert consulting & training in Big Data
2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved.
Credits
r  http://spark.apache.org/
r  http://www.strategictechplanning.com
r  Tuningpp.com
r  Kidzworld.com

More Related Content

What's hot

Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaHelena Edelson
 
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Real-Time Anomaly Detection  with Spark MLlib, Akka and  CassandraReal-Time Anomaly Detection  with Spark MLlib, Akka and  Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and CassandraNatalino Busa
 
Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data prajods
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on HadoopMapR Technologies
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraPatrick McFadin
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEDataWorks Summit/Hadoop Summit
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...Holden Karau
 
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Databricks
 
Apache Lens at Hadoop meetup
Apache Lens at Hadoop meetupApache Lens at Hadoop meetup
Apache Lens at Hadoop meetupamarsri
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentationargonauts007
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleHelena Edelson
 
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Databricks
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache SparkMammoth Data
 
The hadoop 2.0 ecosystem and yarn
The hadoop 2.0 ecosystem and yarnThe hadoop 2.0 ecosystem and yarn
The hadoop 2.0 ecosystem and yarnMichael Joseph
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesDatabricks
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidTony Ng
 
Kafka spark cassandra webinar feb 16 2016
Kafka spark cassandra   webinar feb 16 2016 Kafka spark cassandra   webinar feb 16 2016
Kafka spark cassandra webinar feb 16 2016 Hiromitsu Komatsu
 

What's hot (20)

Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Real-Time Anomaly Detection  with Spark MLlib, Akka and  CassandraReal-Time Anomaly Detection  with Spark MLlib, Akka and  Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
 
Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data
 
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 Let Spark Fly: Advantages and Use Cases for Spark on Hadoop Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
Let Spark Fly: Advantages and Use Cases for Spark on Hadoop
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
 
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
 
Apache Lens at Hadoop meetup
Apache Lens at Hadoop meetupApache Lens at Hadoop meetup
Apache Lens at Hadoop meetup
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentation
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Rethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For ScaleRethinking Streaming Analytics For Scale
Rethinking Streaming Analytics For Scale
 
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
The hadoop 2.0 ecosystem and yarn
The hadoop 2.0 ecosystem and yarnThe hadoop 2.0 ecosystem and yarn
The hadoop 2.0 ecosystem and yarn
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
 
Kafka spark cassandra webinar feb 16 2016
Kafka spark cassandra   webinar feb 16 2016 Kafka spark cassandra   webinar feb 16 2016
Kafka spark cassandra webinar feb 16 2016
 

Viewers also liked

게임 서비스 품질 향상을 위한 데이터 분석 활용하기 - 김필중 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
게임 서비스 품질 향상을 위한 데이터 분석 활용하기 - 김필중 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming게임 서비스 품질 향상을 위한 데이터 분석 활용하기 - 김필중 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
게임 서비스 품질 향상을 위한 데이터 분석 활용하기 - 김필중 솔루션즈 아키텍트:: AWS Cloud Track 3 GamingAmazon Web Services Korea
 
빅 데이터 분석을 위한 AWS 활용 사례 - 최정욱 솔루션즈 아키텍트:: AWS Cloud Track 1 Intro
빅 데이터 분석을 위한 AWS 활용 사례 - 최정욱 솔루션즈 아키텍트:: AWS Cloud Track 1 Intro빅 데이터 분석을 위한 AWS 활용 사례 - 최정욱 솔루션즈 아키텍트:: AWS Cloud Track 1 Intro
빅 데이터 분석을 위한 AWS 활용 사례 - 최정욱 솔루션즈 아키텍트:: AWS Cloud Track 1 IntroAmazon Web Services Korea
 
Hadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationHadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationmattlieber
 
AWS 비용 최적화 기법 (윤석찬) - AWS 웨비나 시리즈 2015
AWS 비용 최적화 기법 (윤석찬) - AWS 웨비나 시리즈 2015AWS 비용 최적화 기법 (윤석찬) - AWS 웨비나 시리즈 2015
AWS 비용 최적화 기법 (윤석찬) - AWS 웨비나 시리즈 2015Amazon Web Services Korea
 
빅데이터를 위한 AWS 모범사례와 아키텍처 구축 패턴 :: 양승도 :: AWS Summit Seoul 2016
빅데이터를 위한 AWS 모범사례와 아키텍처 구축 패턴 :: 양승도 :: AWS Summit Seoul 2016빅데이터를 위한 AWS 모범사례와 아키텍처 구축 패턴 :: 양승도 :: AWS Summit Seoul 2016
빅데이터를 위한 AWS 모범사례와 아키텍처 구축 패턴 :: 양승도 :: AWS Summit Seoul 2016Amazon Web Services Korea
 
아마존웹서비스와 함께하는 클라우드 비용 최적화 전략 - 윤석찬 (AWS 코리아 테크에반젤리스트)
아마존웹서비스와 함께하는 클라우드 비용 최적화 전략 - 윤석찬 (AWS 코리아 테크에반젤리스트)아마존웹서비스와 함께하는 클라우드 비용 최적화 전략 - 윤석찬 (AWS 코리아 테크에반젤리스트)
아마존웹서비스와 함께하는 클라우드 비용 최적화 전략 - 윤석찬 (AWS 코리아 테크에반젤리스트)Amazon Web Services Korea
 
Creando su primera aplicación de Big Data en AWS
Creando su primera aplicación de Big Data en AWSCreando su primera aplicación de Big Data en AWS
Creando su primera aplicación de Big Data en AWSAmazon Web Services LATAM
 

Viewers also liked (8)

게임 서비스 품질 향상을 위한 데이터 분석 활용하기 - 김필중 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
게임 서비스 품질 향상을 위한 데이터 분석 활용하기 - 김필중 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming게임 서비스 품질 향상을 위한 데이터 분석 활용하기 - 김필중 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
게임 서비스 품질 향상을 위한 데이터 분석 활용하기 - 김필중 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
 
빅 데이터 분석을 위한 AWS 활용 사례 - 최정욱 솔루션즈 아키텍트:: AWS Cloud Track 1 Intro
빅 데이터 분석을 위한 AWS 활용 사례 - 최정욱 솔루션즈 아키텍트:: AWS Cloud Track 1 Intro빅 데이터 분석을 위한 AWS 활용 사례 - 최정욱 솔루션즈 아키텍트:: AWS Cloud Track 1 Intro
빅 데이터 분석을 위한 AWS 활용 사례 - 최정욱 솔루션즈 아키텍트:: AWS Cloud Track 1 Intro
 
Hadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluationHadoop AWS infrastructure cost evaluation
Hadoop AWS infrastructure cost evaluation
 
AWS 비용 최적화 기법 (윤석찬) - AWS 웨비나 시리즈 2015
AWS 비용 최적화 기법 (윤석찬) - AWS 웨비나 시리즈 2015AWS 비용 최적화 기법 (윤석찬) - AWS 웨비나 시리즈 2015
AWS 비용 최적화 기법 (윤석찬) - AWS 웨비나 시리즈 2015
 
빅데이터를 위한 AWS 모범사례와 아키텍처 구축 패턴 :: 양승도 :: AWS Summit Seoul 2016
빅데이터를 위한 AWS 모범사례와 아키텍처 구축 패턴 :: 양승도 :: AWS Summit Seoul 2016빅데이터를 위한 AWS 모범사례와 아키텍처 구축 패턴 :: 양승도 :: AWS Summit Seoul 2016
빅데이터를 위한 AWS 모범사례와 아키텍처 구축 패턴 :: 양승도 :: AWS Summit Seoul 2016
 
아마존웹서비스와 함께하는 클라우드 비용 최적화 전략 - 윤석찬 (AWS 코리아 테크에반젤리스트)
아마존웹서비스와 함께하는 클라우드 비용 최적화 전략 - 윤석찬 (AWS 코리아 테크에반젤리스트)아마존웹서비스와 함께하는 클라우드 비용 최적화 전략 - 윤석찬 (AWS 코리아 테크에반젤리스트)
아마존웹서비스와 함께하는 클라우드 비용 최적화 전략 - 윤석찬 (AWS 코리아 테크에반젤리스트)
 
Zookeeper 소개
Zookeeper 소개Zookeeper 소개
Zookeeper 소개
 
Creando su primera aplicación de Big Data en AWS
Creando su primera aplicación de Big Data en AWSCreando su primera aplicación de Big Data en AWS
Creando su primera aplicación de Big Data en AWS
 

Similar to Spark Intro @ analytics big data summit

Insight on "From Hadoop to Spark" by Mark Kerzner
Insight on "From Hadoop to Spark" by Mark KerznerInsight on "From Hadoop to Spark" by Mark Kerzner
Insight on "From Hadoop to Spark" by Mark KerznerSynerzip
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonVitthal Gogate
 
Spark forplainoldjavageeks svforum_20140724
Spark forplainoldjavageeks svforum_20140724Spark forplainoldjavageeks svforum_20140724
Spark forplainoldjavageeks svforum_20140724sdeeg
 
Apache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingApache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingAll Things Open
 
Survey of Spark for Data Pre-Processing and Analytics
Survey of Spark for Data Pre-Processing and AnalyticsSurvey of Spark for Data Pre-Processing and Analytics
Survey of Spark for Data Pre-Processing and AnalyticsYannick Pouliot
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdfMaheshPandit16
 
Building a Stock Prediction system with Machine Learning using Geode, SpringX...
Building a Stock Prediction system with Machine Learning using Geode, SpringX...Building a Stock Prediction system with Machine Learning using Geode, SpringX...
Building a Stock Prediction system with Machine Learning using Geode, SpringX...William Markito Oliveira
 
Analyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkAnalyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkNicola Ferraro
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MorePaco Nathan
 
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataHow To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataTaro L. Saito
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
spark interview questions & answers acadgild blogs
 spark interview questions & answers acadgild blogs spark interview questions & answers acadgild blogs
spark interview questions & answers acadgild blogsprateek kumar
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overviewMartin Zapletal
 
Apache spark with java 8
Apache spark with java 8Apache spark with java 8
Apache spark with java 8Janu Jahnavi
 
Apache spark with java 8
Apache spark with java 8Apache spark with java 8
Apache spark with java 8Janu Jahnavi
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark Hubert Fan Chiang
 

Similar to Spark Intro @ analytics big data summit (20)

Insight on "From Hadoop to Spark" by Mark Kerzner
Insight on "From Hadoop to Spark" by Mark KerznerInsight on "From Hadoop to Spark" by Mark Kerzner
Insight on "From Hadoop to Spark" by Mark Kerzner
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College London
 
Spark 101
Spark 101Spark 101
Spark 101
 
Spark forplainoldjavageeks svforum_20140724
Spark forplainoldjavageeks svforum_20140724Spark forplainoldjavageeks svforum_20140724
Spark forplainoldjavageeks svforum_20140724
 
Apache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingApache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster Computing
 
Survey of Spark for Data Pre-Processing and Analytics
Survey of Spark for Data Pre-Processing and AnalyticsSurvey of Spark for Data Pre-Processing and Analytics
Survey of Spark for Data Pre-Processing and Analytics
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
 
Building a Stock Prediction system with Machine Learning using Geode, SpringX...
Building a Stock Prediction system with Machine Learning using Geode, SpringX...Building a Stock Prediction system with Machine Learning using Geode, SpringX...
Building a Stock Prediction system with Machine Learning using Geode, SpringX...
 
Analyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache SparkAnalyzing Data at Scale with Apache Spark
Analyzing Data at Scale with Apache Spark
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure DataHow To Use Scala At Work - Airframe In Action at Arm Treasure Data
How To Use Scala At Work - Airframe In Action at Arm Treasure Data
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
spark interview questions & answers acadgild blogs
 spark interview questions & answers acadgild blogs spark interview questions & answers acadgild blogs
spark interview questions & answers acadgild blogs
 
Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
 
Apache spark with java 8
Apache spark with java 8Apache spark with java 8
Apache spark with java 8
 
Apache spark with java 8
Apache spark with java 8Apache spark with java 8
Apache spark with java 8
 
Introduction to Apache Spark
Introduction to Apache Spark Introduction to Apache Spark
Introduction to Apache Spark
 

More from Sujee Maniyam

Reference architecture for Internet of Things
Reference architecture for Internet of ThingsReference architecture for Internet of Things
Reference architecture for Internet of ThingsSujee Maniyam
 
Building secure NoSQL applications nosqlnow_conf_2014
Building secure NoSQL applications nosqlnow_conf_2014Building secure NoSQL applications nosqlnow_conf_2014
Building secure NoSQL applications nosqlnow_conf_2014Sujee Maniyam
 
Hadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA confHadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA confSujee Maniyam
 
Launching your career in Big Data
Launching your career in Big DataLaunching your career in Big Data
Launching your career in Big DataSujee Maniyam
 
Hadoop security landscape
Hadoop security landscapeHadoop security landscape
Hadoop security landscapeSujee Maniyam
 
Iphone client-server app with Rails backend (v3)
Iphone client-server app with Rails backend (v3)Iphone client-server app with Rails backend (v3)
Iphone client-server app with Rails backend (v3)Sujee Maniyam
 

More from Sujee Maniyam (6)

Reference architecture for Internet of Things
Reference architecture for Internet of ThingsReference architecture for Internet of Things
Reference architecture for Internet of Things
 
Building secure NoSQL applications nosqlnow_conf_2014
Building secure NoSQL applications nosqlnow_conf_2014Building secure NoSQL applications nosqlnow_conf_2014
Building secure NoSQL applications nosqlnow_conf_2014
 
Hadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA confHadoop2 new and noteworthy SNIA conf
Hadoop2 new and noteworthy SNIA conf
 
Launching your career in Big Data
Launching your career in Big DataLaunching your career in Big Data
Launching your career in Big Data
 
Hadoop security landscape
Hadoop security landscapeHadoop security landscape
Hadoop security landscape
 
Iphone client-server app with Rails backend (v3)
Iphone client-server app with Rails backend (v3)Iphone client-server app with Rails backend (v3)
Iphone client-server app with Rails backend (v3)
 

Recently uploaded

Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 

Recently uploaded (20)

Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 

Spark Intro @ analytics big data summit

  • 1. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Apache Spark : Fast and Easy Data Processing Sujee Maniyam Founder / Principal Elephant Scale LLC sujee@elephantscale.com http://elephantscale.com
  • 2. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Outline r  Background r  Spark Architecture r  Demo
  • 3. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark r  Fast & Expressive Cluster computing engine r  Compatible with Hadoop r  Came out of Berkeley AMP Lab r  Now Apache project r  Version 1.2 just released (Dec 2014)
  • 4. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Timeline Hadoop & Spark
  • 5. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Hypo-meter J
  • 6. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Job Trends
  • 7. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Comparison With Hadoop Hadoop Spark Distributed Storage + Distributed Compute Distributed Compute Only MapReduce framework Generalized computation Usually data on disk (HDFS) On disk / in memory Not ideal for iterative work Great at Iterative workloads (machine learning ..etc) Batch process - Upto 2x -10x faster for data on disk - Upto 100x faster for data in memory Compact code Java, Python, Scala supported Shell for ad-hoc exploration
  • 8. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Hadoop + Yarn : Universal OS for Cluster Computing HDFS YARN Batch (mapreduce) Streaming (storm, S4) In-memory (spark) Storage Cluster Management Applications
  • 9. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Vs Hadoop r  Spark is ‘easier’ than Hadoop r  ‘friendlier’ for data scientists / analysts r  Interactive shell r fast development cycles r adhoc exploration r  API supports multiple languages r Java, Scala, Python r  Great for small (Gigs) to medium (100s of Gigs) data
  • 10. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Vs. Hadoop Fast!
  • 11. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Is Spark Replacing Hadoop? r  Right now, Spark runs on Hadoop / YARN r Complimentary r  Can be seen as generic MapReduce r  Spark is really great if data fits in memory (few hundred gigs), r  Spark can be used as compute platform with various storage types (see next slide)
  • 12. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark & Pluggable Storage Spark (compute engine) HDFS Amazon S3 Cassandra ???
  • 13. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Hadoop & Spark Future ???
  • 14. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Outline r  Background r  à Spark Architecture r  Demo
  • 15. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Eco-System Spark Core Spark SQL Spark Streaming ML lib Schema / sql Real Time Machine Learning
  • 16. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Core r  Distributed compute engine r  Sort / shuffle algorithms r  Handles node failures -> re-computes missing pieces
  • 17. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark SQL r  Adhoc / interactive analytics r  ETL workflow r  Reporting r  Natively understands r JSON : {“name”: “mark”, “age”: 40} r Parquet : on disk columnar format, heavily compressed
  • 18. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark SQL Quick Example {"name":"mark", "age":50, "sex":"M"}, {"name":"mary", "age":45, "sex":"F"}, {"name":"brian", "age":15, "sex":"M"}, {"name":"nancy", "age":17, "sex":"F"} // … setup … val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19") è  [brian, nancy]
  • 19. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Streaming r  Process data streams in real time, in micro- batches r  Low latency, high throughput (1000s events / sec) r  Stock ticks / sensor data (connected devices / IoT – Internet of Things) Streaming Sources Storage
  • 20. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Machine Learning (ML Lib) r  Machine learning at scale r  Out of the box ML capabilities ! r  Java / Scala / Python language support r  Lots of common algorithms are supported r  Classification / Regressions r  Linear models (linear R, logistic regression, SVM) r  Decision trees r  Collaborative filtering (recommendations) r  K-Means clustering r  More to come
  • 21. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Cluster Deployment 1)  Standalone 2)  Yarn 3)  Mesos Client Node
  • 22. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Cluster Architecture r  Multiple ‘applications’ can run at the same time r  Driver (or ‘main’) launches an application r  Each application gets its own ‘executor’ r  Isolated (runs in different JVMs) r  Also means data can not be shared across applications r  Cluster Managers: r  multiple cluster managers are supported r  1) Standalone : simple to setup r  2) YARN : on top of Hadoop r  3) Mesos : General cluster manager (AMP lab)
  • 23. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Spark Data Model : RDD r  Resilient Distributed Dataset (RDD) r  Can live in r Memory (best case scenario) r Or on disk (FS, HDFS, S3 …etc) r  Each RDD is split into multiple partitions r  Partitions may live on different nodes r  Partitions can be computed in parallel on different nodes
  • 24. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Partitions Explained 1G file 64 M 64 M 64 M 64 M Task Task Task Task Result parallelism
  • 25. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. RDD : Loading r  Use Spark context to load RDDs from disk / external storage val sc = new SparkContext(…) val f = sc.textFile(“/data/input1.txt”) // single file sc.textFile(“/data/”) // load all files under dir sc.textFile(“/data/*.log”) // wild card matching
  • 26. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. RDD Operations r  Two kinds of operations on RDDs r 1) Transformations r Create a new RDD from existing ones (e.g. Map) r 2) Actions r E.g. Returns the results to clients (e.g. Reduce) r  Transformations are lazy.. Actions force transformations
  • 27. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Transformations / Actions RDD 1 RDD 2 Client RDD 3 Transformation 1 (map) Action1 (collect)
  • 28. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. RDD Transformations Transformation Description Example filter Filters through each record (aka grep) f.filter( line => line.contains(“ERROR”)) union Merges two RDDs rdd1.union(rdd2) …see docs …
  • 29. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. RDD Actions Action Description Example count() Counts all records in an rdd f.count() first() Extract the first record f.first () take(n) Take first N lines f.take(10) collect() Gathers all records for RDD. All data has to fit in memory of ONE machine (don’t use for big data sets) f.collect() …. See documentation ..
  • 30. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. RDD : Saving r  saveAsTextFile () and saveAsSequenceFile() f.saveAsTextFile(“/output/directory”) // a directory r  Output usually is a directory r RDDs will be saved as multiple files in the dir r Each partition à one output file
  • 31. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Caching of RDDs r  RDDs can be loaded from disk and computed r Hadoop mapreduce model r  Also RDDs can be cached in memory r  Subsequent operations are much faster f.persist() // on disk or memory f.cache() // memory only r  In memory RDDs are great for iterative workloads r Machine learning algorithms
  • 32. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Demo Time !
  • 33. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Demo : Spark-shell r  Invoke spark shell r  Load a data set r  Do basic operations (count / filter)
  • 34. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Demo: RDD Caching r  From Spark shell r  Load an RDD r  Demonstrate the difference between cached and non-cached
  • 35. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Demo : Map Reduce r  Quick Word count val input = sc.textFile(“…”) val counts = input.flatMap( line => line.split(“ “)). map(word => (word, 1)). reduceByKey(_+_)
  • 36. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Thanks ! Sujee Maniyam sujee@elephantscale.com http://elephantscale.com Expert consulting & training in Big Data
  • 37. 2015 SNIA Analytics and Big Data Summit. © Elephant Scale LLC. All Rights Reserved. Credits r  http://spark.apache.org/ r  http://www.strategictechplanning.com r  Tuningpp.com r  Kidzworld.com