Apache Spark 101 
What Spark is all about 
Shahaf Azriely 
Sr. Field Engineer Southern EMEA 
© Copyright 2013 Pivotal. All rights reserved. 1
Agenda 
 What is Spark 
 Spark Programming Model 
– RDDs, log mining, word count … 
 Related Projects 
– Shark, Spark SQL, Spark Streaming, GraphX, MLlib and more … 
 So what's next 
© Copyright 2013 Pivotal. All rights reserved. 2
What is Spark? 
© Copyright 2013 Pivotal. All rights reserved. 3
The Spark Challenge 
• Data size is growing 
• MapReduce greatly simplified big data analysis 
• But as soon as it got popular, users wanted more: 
- More complex, multi-stage applications (graph algorithms, machine learning) 
- More interactive ad-hoc queries 
- More real-time online processing 
• All of these apps require fast data sharing across parallel jobs 
Pivotal Confidential–Internal Use Only
Data Sharing in MapReduce 
Pivotal Confidential–Internal Use Only 
[Diagram: each iteration (iter. 1, iter. 2, …) reads its input from HDFS and writes its output back to HDFS; each ad-hoc query (query 1, 2, 3, …) re-reads the input from HDFS to produce its result.] 
Slow due to replication, serialization, and disk I/O
Data Sharing in Spark 
Pivotal Confidential–Internal Use Only 
[Diagram: after one-time processing of the input, iterations and queries share data through distributed memory instead of going back to disk.] 
10-100× faster than network and disk
Spark is 
 Fast, MapReduce-like engine 
– In-memory storage for fast iterative computation 
– Designed for low-latency (~100 ms) jobs 
 Compatible with Hadoop storage APIs 
– Read/write to any Hadoop-supported system, including Pivotal HD 
 Designed to work with data in memory 
 Programmatic or interactive 
 Written in Scala, with bindings for Python, Java, and Scala 
 Makes life easy and productive for data scientists 
Spark is one of the most actively developed open source projects, with over 465 contributors in 2014, making it the most active project in the Apache Software Foundation and among big data open source projects. 
© Copyright 2013 Pivotal. All rights reserved. 7
Short History 
 Spark was initially started by Matei Zaharia at UC Berkeley AMPLab in 2009. 
 Open-sourced in 2010. 
 On June 21, 2013 the project was donated to the Apache Software Foundation, and its founders created Databricks out of AMPLab. 
 On February 27, 2014 Spark became a top-level ASF project. 
 In November 2014, the engineering team at Databricks used Spark to set an amazing record in the Daytona GraySort, sorting 100 TB (1 trillion records) in 23 minutes (4.27 TB/min). 
 http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html 
© Copyright 2013 Pivotal. All rights reserved. 8
Spark Programming 
Model 
RDDs in Detail 
© Copyright 2013 Pivotal. All rights reserved. 9
Programming Model 
• Key idea: resilient distributed datasets (RDDs) 
- Resilient – if data in memory is lost, it can be recreated. 
- Distributed – stored in memory across the cluster. 
- Dataset – initial data can be created from a file or programmatically. 
• Parallel operations on RDDs 
- Reduce, collect, foreach, … 
• Interface 
- Clean, language-integrated API in Scala, Python, and Java 
- Can be used interactively (see the sketch below) 
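A minimal sketch of the model in the Scala shell (spark-shell provides the SparkContext as sc); the input path is illustrative, not from the deck.

// Dataset: initial data created from a file (path is an assumption for illustration)
val data = sc.textFile("hdfs:///data/events.log")
// Distributed: the RDD's partitions are kept in memory across the cluster
val words = data.flatMap(_.split(" ")).cache()
// Parallel operations: reduce, collect, foreach, ...
val totalChars = words.map(_.length).reduce(_ + _)   // action: aggregates on the cluster
val sample = words.take(5)                           // action: brings a few elements to the driver
println(s"total characters: $totalChars, sample: ${sample.mkString(", ")}")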
Pivotal Confidential–Internal Use Only
RDD Fault Tolerance 
RDDs maintain lineage information that can be used to reconstruct lost partitions 
val cachedMsgs = sc.textFile(...).map(_.split("\t")(2)) 
  .filter(_.contains("error")) 
  .cache() 
Lineage: HdfsRDD (path: hdfs://…) → MappedRDD (func: split(…)) → FilteredRDD (func: contains(…)) → CachedRDD 
© Copyright 2013 Pivotal. All rights reserved. 11
Demo: Intro & Log Mining 
Demo 1: create a basic RDD in Scala. 
Demo 2: log mining – load error messages from a log into memory, then interactively search for various patterns (base RDD → transformed RDD → action); a sketch of the demo follows below. 
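The demo code itself is not captured in this transcript; the following is a sketch of the classic log-mining example in the Scala shell, with an assumed log path and field layout.

// base RDD: load the log file (path is an assumption)
val lines = sc.textFile("hdfs:///logs/app.log")
// transformed RDD: keep only error lines, pull out the message field, cache for interactive reuse
val messages = lines.filter(_.contains("ERROR")).map(_.split("\t")(2)).cache()
// actions: interactively search for various patterns
println(messages.filter(_.contains("timeout")).count())
println(messages.filter(_.contains("memory")).count())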
© Copyright 2013 Pivotal. All rights reserved. 12
Transformation and Actions 
 Transformations 
– Map 
– filter 
– flatMap 
– sample 
– groupByKey 
– reduceByKey 
– union 
– join 
– sort 
 Actions 
– count 
– collect 
– reduce 
– lookup 
– save 
See http://spark.apache.org/docs/latest/programming-guide.html#basics; a short example combining a few of these follows below. 
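For illustration (not from the original deck), a few of the transformations and actions listed above in the Scala shell:

val nums = sc.parallelize(1 to 10)
val evens = nums.filter(_ % 2 == 0)       // transformation: lazily describes a new RDD
val pairs = evens.map(n => (n % 4, n))    // transformation
val grouped = pairs.groupByKey()          // transformation
println(evens.count())                    // action: triggers the computation
println(evens.reduce(_ + _))              // action
grouped.collect().foreach(println)        // action: results returned to the driver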
© Copyright 2013 Pivotal. All rights reserved. 13
More Demo: Word count & Joins 
Demo 3: word count in the Scala and Python shells. Demo 4: join two RDDs. A sketch of both follows below. 
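A sketch of both demos in the Scala shell; the file path and sample data are illustrative, not the original demo code.

// Demo 3: word count
val lines = sc.textFile("hdfs:///data/pages.txt")
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.take(10).foreach(println)

// Demo 4: join two pair RDDs on their keys
val ages = sc.parallelize(Seq(("alice", 29), ("bob", 35)))
val cities = sc.parallelize(Seq(("alice", "Tel Aviv"), ("bob", "London")))
ages.join(cities).collect().foreach(println)   // (alice,(29,Tel Aviv)), (bob,(35,London))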
© Copyright 2013 Pivotal. All rights reserved. 14
Example of Related 
Projects 
© Copyright 2013 Pivotal. All rights reserved. 15
Related Projects 
 Shark is dead, long live Spark SQL 
 Spark Streaming 
 GraphX 
 MLbase 
 Others 
© Copyright 2013 Pivotal. All rights reserved. 16
Shark is dead, but here is what it was 
 Hive on Spark 
– HiveQL, UDFs, etc. 
 Turn SQL into RDD 
– Part of the lineage 
 Based on Hive, but takes advantage of Spark for 
– Fast Scheduling 
– Queries are DAGs of jobs, not chained M/R 
– Fast broadcast variables 
© Apache Software Foundation 
© Copyright 2013 Pivotal. All rights reserved. 17
Spark SQL 
 Library in Spark core that treats RDDs as relations (SchemaRDD) 
 RDDs are stored in a columnar in-memory format 
 Dynamic query optimization 
 Lighter-weight version of Shark 
– No code from Hive 
 Import/export in different storage formats 
– Parquet; can learn the schema from an existing Hive warehouse 
© Copyright 2013 Pivotal. All rights reserved. 18
Spark SQL Code 
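The code shown on this slide is not captured in the transcript; the following is a hedged sketch of the Spark 1.x SQLContext API in the Scala shell, with an illustrative Person schema.

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // implicitly convert RDDs of case classes to SchemaRDDs

// Build an RDD of case classes and register it as a table
val people = sc.parallelize(Seq(Person("alice", 29), Person("bob", 35)))
people.registerTempTable("people")

// Run SQL over the RDD; the result is itself a SchemaRDD
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 30")
adults.collect().foreach(println)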
© Copyright 2013 Pivotal. All rights reserved. 19
Spark Streaming 
• Framework for large scale stream processing 
• Scales to 100s of nodes 
• Can achieve second scale latencies 
• Integrates with Spark’s batch and interactive processing 
• Provides a simple batch-like API for implementing complex algorithms 
• Can absorb live data streams from Kafka, Flume, ZeroMQ, etc. 
© Copyright 2013 Pivotal. All rights reserved. 20
Traditional streaming 
• Traditional streaming systems have an event-driven, record-at-a-time processing model 
– Each node has mutable state 
– For each record, update state & send new records 
• State is lost if node dies! 
• Making stateful stream processing fault-tolerant is challenging 
© Copyright 2013 Pivotal. All rights reserved. 21
Discretized Stream Processing 
Run a streaming computation as a series of very small, deterministic batch jobs. 
[Diagram: Spark Streaming receives the live data stream, chops it into batches of X seconds, and Spark returns the processed results in batches.] 
• Chop up the live stream into batches of X seconds 
• Spark treats each batch of data as RDDs and processes them using RDD operations 
• Finally, the processed results of the RDD operations are returned in batches (see the sketch below) 
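A minimal sketch of this model with the Spark Streaming 1.x API in the Scala shell, assuming a text stream on localhost:9999 (e.g. started with nc -lk 9999):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Batch interval of 1 second: the live stream is chopped into 1-second RDDs
val ssc = new StreamingContext(sc, Seconds(1))
val linesStream = ssc.socketTextStream("localhost", 9999)

// Each batch is processed with ordinary RDD-style operations
val wordCounts = linesStream.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
wordCounts.print()   // processed results are emitted batch by batch

ssc.start()
ssc.awaitTermination()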
© Copyright 2013 Pivotal. All rights reserved. 22
Discretized Stream Processing 
Run a streaming computation as a series of very small, deterministic batch jobs. 
• Batch sizes as low as ½ second, latency ~1 second 
• Potential for combining batch processing and stream processing in the same system 
© Copyright 2013 Pivotal. All rights reserved. 23
How Fast Can It Go? 
Can process 4 GB/s (42M records/s) of data on 100 nodes at sub-second latency 
Recovers from failures within 1 sec 
© Copyright 2013 Pivotal. All rights reserved. 24
Streaming: how does it work? 
© Copyright 2013 Pivotal. All rights reserved. 25
MLlib 
 MLlib is a Spark subproject providing machine learning primitives. 
 It ships with Spark as a standard component. 
 Many different Algorithms 
– Classification, regression, collaborative filtering, clustering, and decomposition: 
o Regression: generalized linear regression (GLM) 
o Collaborative filtering: alternating least squares (ALS) 
o Clustering: k-means (a sketch follows below) 
o Decomposition: singular value decomposition (SVD), principal component analysis (PCA) 
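As an illustration of the clustering API (not from the deck), a hedged k-means sketch in the Scala shell; the input path and format are assumptions (one point per line, space-separated numeric features).

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse the assumed input into MLlib vectors and cache them for the iterative algorithm
val raw = sc.textFile("hdfs:///data/points.txt")
val points = raw.map(line => Vectors.dense(line.split(' ').map(_.toDouble))).cache()

// Cluster into 3 groups with up to 20 iterations
val model = KMeans.train(points, 3, 20)
println(s"within-set sum of squared errors: ${model.computeCost(points)}")
model.clusterCenters.foreach(println)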
© Copyright 2013 Pivotal. All rights reserved. 26
Why MLlib 
 It is built on Apache Spark, a fast and general engine for large-scale data processing. 
 Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. 
 Write applications quickly in Java, Scala, or Python. 
 You can use any Hadoop data source (e.g. HDFS, HBase, or local files), making it easy to plug into 
Hadoop workflows. 
© Copyright 2013 Pivotal. All rights reserved. 27
Spark SQL + MLlib 
© Copyright 2013 Pivotal. All rights reserved. 28
GraphX 
 What are graphs? They are inherently recursive data structures: properties of vertices depend on the properties of their neighbors, which in turn depend on the properties of their neighbors. 
 GraphX unifies ETL, exploratory analysis, and iterative graph computation within a single system. 
 We can view the same data as both graphs and collections, transform and join graphs with RDDs. 
 For example, predicting things about people (e.g. political bias) 
– Look at posts, apply classifier, try to predict attribute 
– Look at context of social network to improve prediction 
© Copyright 2013 Pivotal. All rights reserved. 29
GraphX Demo 
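The demo itself is not captured in this transcript; the following is a hedged sketch of the GraphX API in the Scala shell, with a made-up follower graph ranked by PageRank and joined back to an RDD of user names (illustrating graphs as collections).

import org.apache.spark.graphx.{Edge, Graph}

// Vertices: (id, user name); edges: "follows" relationships (all data is illustrative)
val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))
val graph = Graph(users, follows)

// Rank vertices and join the scores back to the user names
val ranks = graph.pageRank(0.001).vertices
val ranked = users.join(ranks).map { case (_, (name, rank)) => (name, rank) }
ranked.collect().foreach(println)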
© Copyright 2013 Pivotal. All rights reserved. 30
Others 
 Mesos 
– Enables multiple frameworks to share the same cluster resources 
– Twitter is largest user: Over 6,000 servers 
 Tachyon 
– In-memory, fault-tolerant file system that exposes an HDFS-compatible API 
– Can be used as the file system for Spark 
 Catalyst 
– SQL Query Optimizer 
© Copyright 2013 Pivotal. All rights reserved. 31
So why is Spark 
important for Pivotal 
© Copyright 2013 Pivotal. All rights reserved. 32
So How Real is Spark? 
 Leveraging a modern MapReduce engine and techniques from databases, Spark supports both SQL and complex analytics efficiently. 
 There are many indicators that Spark is heading toward success 
– Solid technology 
– Good buzz 
– Community is getting bigger https://cwiki.apache.org/confluence/display/SPARK/Committers 
© Copyright 2013 Pivotal. All rights reserved. 33
Pivotal’s Positioning of Spark 
[Diagram: Map-Reduce (batch processing) → Spark (better/faster batch processing) → HAWQ (near-real time) → GemFire XD (real time)] 
• PHD is highly differentiated and the only platform that brings the benefits of closed-loop analytics to enable the business data lake 
• With Spark we extend that differentiation by allowing up to 100x faster batch processing 
© Copyright 2013 Pivotal. All rights reserved. 34
 Spark 1.0.0 on PHD: https://support.pivotal.io/hc/en-us/articles/203271897-Spark-on-Pivotal-Hadoop-2-0-Quick-Start-Guide 
 Databricks announcing Pivotal certification: https://databricks.com/blog/2014/05/23/pivotal-hadoop-integrates-the-full-apache-spark-stack.html 
 We attend Spark meetups 
 Join the SocialCast group! 
© Copyright 2013 Pivotal. All rights reserved. 35
Thank you 
Q&A 
© Copyright 2013 Pivotal. All rights reserved. 36


Editor's notes

  1. Each iteration is, for example, a MapReduce job
  2. You write a single program, similar to DryadLINQ. Distributed data sets with parallel operations on them are pretty standard; the new thing is that they can be reused across operations. Variables in the driver program can be used in parallel operations; accumulators are useful for sending information back, and cached variables are an optimization. Mention that cached variables are useful for some workloads that won't be shown here. Mention that it's all designed to be easy to distribute in a fault-tolerant fashion.