SlideShare ist ein Scribd-Unternehmen logo
1 von 25
Programming in Spark using PySpark
Mostafa Elzoghbi
Sr. Technical Evangelist – Microsoft
@MostafaElzoghbi
http://mostafa.rocks
Session Objectives & Takeaways
• Programming Spark
• Spark Program Structure
• Working with RDDs
• Transformations versus Actions
• Lambda, Shared Variables (Broadcast vs accumulators)
• Visualizing big data in Spark
• Spark in the cloud (Azure)
• Working with cluster types, notebooks, scaling.
Python Spark (pySpark)
• We are using the Python programming interface to Spark (pySpark)
• pySpark provides an easy-to-use programming abstraction and parallel
runtime:
“Here’s an operation, run it on all of the data”
• RDDs are the key concept
Apache Spark Driver and
Workers
• A Spark program is two programs:
• A driver program and a workers program
• Worker programs run on cluster nodes or in
local threads
• RDDs (Resilient Distributed Datasets) are
distributed
Spark Essentials: Master
• The master parameter for a SparkContext determines which type and size of
cluster to use
Spark Context
• A Spark program first creates a SparkContext object
» Tells Spark how and where to access a cluster
» pySpark shell and Databricks cloud automatically create the sc variable
» iPython and programs must use a constructor to create a new
SparkContext
• Use SparkContext to create RDDs
Resilient Distributed Datasets
• The primary abstraction in Spark
» Immutable once constructed
» Track lineage information to efficiently recompute lost data
» Enable operations on collection of elements in parallel
• You construct RDDs
» by parallelizing existing Python collections (lists)
» by transforming an existing RDDs
» from files in HDFS or any other storage system
RDDs
• Spark revolves around the concept of a resilient distributed dataset (RDD),
which is a fault-tolerant collection of elements that can be operated on in
parallel.
• Two types of operations: transformations and actions
• Transformations are lazy (not computed immediately)
• Transformed RDD is executed when action runs on it
• Persist (cache) RDDs in memory or disk
Creating an RDD
• Create RDDs from Python collections (lists)
• From HDFS, text files, Hypertable, Amazon S3, Apache Hbase, SequenceFiles,
any other Hadoop InputFormat, and directory or glob wildcard: /data/201404*
Working with RDDs
• Create an RDD from a data source: <list>
• Apply transformations to an RDD: map filter
• Apply actions to an RDD: collect count
Spark Transformations
• Create new datasets from an existing one
• Use lazy evaluation: results not computed right away –
• instead Spark remembers set of transformations applied to base dataset
» Spark optimizes the required calculations
» Spark recovers from failures and slow workers
• Think of this as a recipe for creating result
Python lambda Functions
• Small anonymous functions (not bound to a name)
lambda a, b: a+b
» returns the sum of its two arguments
• Can use lambda functions wherever function objects are required
• Restricted to a single expression
Spark Actions
• Cause Spark to execute recipe to transform source
• Mechanism for getting results out of Spark
Spark Program Lifecycle
1. Create RDDs from external data or parallelize a collection in your driver
program
2. Lazily transform them into new RDDs
3. cache() some RDDs for reuse -- IMPORTANT
4. Perform actions to execute parallel
5. Computation and produce results
pySpark Shared Variables
• Broadcast Variables
» Efficiently send large, read-only value to all workers
» Saved at workers for use in one or more Spark operations
» Like sending a large, read-only lookup table to all the nodes
At the driver: broadcastVar = sc.broadcast([1, 2, 3])
At a worker: broadcastVar.value
• Accumulators
» Aggregate values from workers back to driver
» Only driver can access value of accumulator
» For tasks, accumulators are write-only
» Use to count errors seen in RDD across workers
>>> accum = sc.accumulator(0)
>>> rdd = sc.parallelize([1, 2, 3, 4])
>>> def f(x):
>>> global accum
>>> accum += x
>>> rdd.foreach(f)
>>> accum.value
Value: 10
Visualizing Big Data in the browser
• Challenges:
• Manipulating large data can take long time
Memory: caching -> Scale clusters
CPU: Parallelism -> Scale clusters
• We have more data points than possible pixels
> Summarize: Aggregation, Pivoting (more data than pixels)
> Model (Clustering, Classification, D. Reduction, …etc)
> Sample: approximate (faster) and exact sampling
• Internal Tools: Matplotlib, GGPlot, D3, SVC, and more.
Spark Kernels and MAGIC keywords
• PySpark kernel supports set of %%MAGIC keywords
• It supports built-in IPython built-in magics, including %%sh.
• Auto visualization
• Magic keywords:
• %%SQL % Spark SQL
• %%lsmagic % List all supported magic keywords (Important)
• %env % Set environment variable
• %run % Execute python code
• %who % List all variables of global scope
• Run code from a different kernel in a notebook.
Spark in Azure
Hadoop clusters in Azure are packaged under “HDInsight” service
Spark in Azure
• Create clusters in few clicks
• Apache Spark comes only in Linux OS.
• Multiple HDP versions
• Comes with preloaded: SSH, Hive, Oozie, DLS, Vnets.
• Multiple Storage options:
• Azure Storage
• ADL store
• External metadata store in SQL server database for
Hive and Oozie.
• All notebooks are stored in the storage account
associated with Spark cluster
• Zeppelin notebook is available on certain Spark
versions but not all.
Programming Spark Apps in HDInsight
• Supports four kernels in Jupyter in HDInsight Spark clusters in Azure
DEMO
Spark Apps using Jupyter
References
• Spark Programming Guide
http://spark.apache.org/docs/latest/programming-guide.html
• edx.org: Free Apache Spark courses
• Visualizations for Databricks
https://docs.cloud.databricks.com/docs/latest/databricks_guide/01%20
Databricks%20Overview/15%20Visualizations/0%20Visualizations%20Ov
erview.html
• SPARKHub by Databricks
https://sparkhub.databricks.com/resources/
Thank you
• Check out my blog big data articles: http://mostafa.rocks
• Follow me on Twitter: @MostafaElzoghbi
• Want some help in building cloud solutions? Contact me to know more.

Weitere ähnliche Inhalte

Was ist angesagt?

Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slidesDat Tran
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...Edureka!
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Edureka!
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideWhizlabs
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationDatabricks
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Edureka!
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Spark Summit
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failingSandy Ryza
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 

Was ist angesagt? (20)

Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Spark
SparkSpark
Spark
 
PySpark in practice slides
PySpark in practice slidesPySpark in practice slides
PySpark in practice slides
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 

Andere mochten auch

Modern SQL in Open Source and Commercial Databases
Modern SQL in Open Source and Commercial DatabasesModern SQL in Open Source and Commercial Databases
Modern SQL in Open Source and Commercial DatabasesMarkus Winand
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauSpark Summit
 
Big data solutions in Azure
Big data solutions in AzureBig data solutions in Azure
Big data solutions in AzureMostafa
 
Introducing Power BI Embedded
Introducing Power BI EmbeddedIntroducing Power BI Embedded
Introducing Power BI EmbeddedMostafa
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupFrens Jan Rumph
 
Python and Bigdata - An Introduction to Spark (PySpark)
Python and Bigdata -  An Introduction to Spark (PySpark)Python and Bigdata -  An Introduction to Spark (PySpark)
Python and Bigdata - An Introduction to Spark (PySpark)hiteshnd
 
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiHadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiSlim Baltagi
 
Building predictive models in Azure Machine Learning
Building predictive models in Azure Machine LearningBuilding predictive models in Azure Machine Learning
Building predictive models in Azure Machine LearningMostafa
 
Build intelligent solutions using Azure
Build intelligent solutions using AzureBuild intelligent solutions using Azure
Build intelligent solutions using AzureMostafa
 
Azure Machine Learning
Azure Machine LearningAzure Machine Learning
Azure Machine LearningMostafa
 
Extending Product Outreach with Outlook Connectors
Extending Product Outreach with Outlook ConnectorsExtending Product Outreach with Outlook Connectors
Extending Product Outreach with Outlook ConnectorsMostafa
 
Machine Learning Classifiers
Machine Learning ClassifiersMachine Learning Classifiers
Machine Learning ClassifiersMostafa
 
Spark, Python and Parquet
Spark, Python and Parquet Spark, Python and Parquet
Spark, Python and Parquet odsc
 
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Zohar Elkayam
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriDemi Ben-Ari
 
Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013larsgeorge
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingPetr Zapletal
 
Architecting big data solutions in the cloud
Architecting big data solutions in the cloudArchitecting big data solutions in the cloud
Architecting big data solutions in the cloudMostafa
 
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Spark Summit
 

Andere mochten auch (20)

Modern SQL in Open Source and Commercial Databases
Modern SQL in Open Source and Commercial DatabasesModern SQL in Open Source and Commercial Databases
Modern SQL in Open Source and Commercial Databases
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
 
Big data solutions in Azure
Big data solutions in AzureBig data solutions in Azure
Big data solutions in Azure
 
Introducing Power BI Embedded
Introducing Power BI EmbeddedIntroducing Power BI Embedded
Introducing Power BI Embedded
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark Meetup
 
Python and Bigdata - An Introduction to Spark (PySpark)
Python and Bigdata -  An Introduction to Spark (PySpark)Python and Bigdata -  An Introduction to Spark (PySpark)
Python and Bigdata - An Introduction to Spark (PySpark)
 
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiHadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
 
Building predictive models in Azure Machine Learning
Building predictive models in Azure Machine LearningBuilding predictive models in Azure Machine Learning
Building predictive models in Azure Machine Learning
 
Build intelligent solutions using Azure
Build intelligent solutions using AzureBuild intelligent solutions using Azure
Build intelligent solutions using Azure
 
Azure Machine Learning
Azure Machine LearningAzure Machine Learning
Azure Machine Learning
 
Extending Product Outreach with Outlook Connectors
Extending Product Outreach with Outlook ConnectorsExtending Product Outreach with Outlook Connectors
Extending Product Outreach with Outlook Connectors
 
Machine Learning Classifiers
Machine Learning ClassifiersMachine Learning Classifiers
Machine Learning Classifiers
 
Spark, Python and Parquet
Spark, Python and Parquet Spark, Python and Parquet
Spark, Python and Parquet
 
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016
 
Apache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-AriApache Spark 101 - Demi Ben-Ari
Apache Spark 101 - Demi Ben-Ari
 
Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013Parquet - Data I/O - Philadelphia 2013
Parquet - Data I/O - Philadelphia 2013
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
 
Architecting big data solutions in the cloud
Architecting big data solutions in the cloudArchitecting big data solutions in the cloud
Architecting big data solutions in the cloud
 
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 

Ähnlich wie Programming in Spark using PySpark

Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark exampleShidrokhGoudarzi1
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxRahul Borate
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionRUHULAMINHAZARIKA
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Olalekan Fuad Elesin
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Alex Zeltov
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the SurfaceJosi Aranda
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferretAndrii Gakhov
 
From R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep GillFrom R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep GillDatabricks
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanSpark Summit
 

Ähnlich wie Programming in Spark using PySpark (20)

Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark example
 
Spark core
Spark coreSpark core
Spark core
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Apache spark
Apache sparkApache spark
Apache spark
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
Introduction to Big Data Analytics using Apache Spark and Zeppelin on HDInsig...
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
From R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep GillFrom R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep Gill
 
Apache spark
Apache sparkApache spark
Apache spark
 
Productionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan ChanProductionizing Spark and the REST Job Server- Evan Chan
Productionizing Spark and the REST Job Server- Evan Chan
 

Mehr von Mostafa

The role of intelligent sensors in the cloud public
The role of intelligent sensors in the cloud publicThe role of intelligent sensors in the cloud public
The role of intelligent sensors in the cloud publicMostafa
 
Skill up in machine learning using Azure ML
Skill up in machine learning using Azure MLSkill up in machine learning using Azure ML
Skill up in machine learning using Azure MLMostafa
 
Big data talking stories in Healthcare
Big data talking stories in Healthcare Big data talking stories in Healthcare
Big data talking stories in Healthcare Mostafa
 
Building Big data solutions in Azure
Building Big data solutions in AzureBuilding Big data solutions in Azure
Building Big data solutions in AzureMostafa
 
Patterns and Practices in Building Office Add-ins
Patterns and Practices in Building Office Add-insPatterns and Practices in Building Office Add-ins
Patterns and Practices in Building Office Add-insMostafa
 
Data science essentials in azure ml
Data science essentials in azure mlData science essentials in azure ml
Data science essentials in azure mlMostafa
 
Build Interactive Analytics using Power BI
Build Interactive Analytics using Power BIBuild Interactive Analytics using Power BI
Build Interactive Analytics using Power BIMostafa
 
TypeScript Jump Start
TypeScript Jump StartTypeScript Jump Start
TypeScript Jump StartMostafa
 
Big data solutions in azure
Big data solutions in azureBig data solutions in azure
Big data solutions in azureMostafa
 
Build intelligent solutions using ms azure
Build intelligent solutions using ms azureBuild intelligent solutions using ms azure
Build intelligent solutions using ms azureMostafa
 
Mistakes that kill startups
Mistakes that kill startupsMistakes that kill startups
Mistakes that kill startupsMostafa
 
PnP in building office add ins - public
PnP in building office add ins - publicPnP in building office add ins - public
PnP in building office add ins - publicMostafa
 
How to migrate Console Apps as a cloud service
How to migrate Console Apps as a cloud serviceHow to migrate Console Apps as a cloud service
How to migrate Console Apps as a cloud serviceMostafa
 
HBase introduction in azure
HBase introduction in azureHBase introduction in azure
HBase introduction in azureMostafa
 
Get your site microsoft edge ready
Get your site microsoft edge readyGet your site microsoft edge ready
Get your site microsoft edge readyMostafa
 
Developing cross platform mobile apps using Apache Cordova
Developing cross platform mobile apps using Apache CordovaDeveloping cross platform mobile apps using Apache Cordova
Developing cross platform mobile apps using Apache CordovaMostafa
 
Identity and o365 on Azure
Identity and o365 on AzureIdentity and o365 on Azure
Identity and o365 on AzureMostafa
 
Azure Data platform
Azure Data platformAzure Data platform
Azure Data platformMostafa
 
Building IoT solutions using Windows 10 IoT Core & Azure
Building IoT solutions using Windows 10 IoT Core & AzureBuilding IoT solutions using Windows 10 IoT Core & Azure
Building IoT solutions using Windows 10 IoT Core & AzureMostafa
 

Mehr von Mostafa (20)

The role of intelligent sensors in the cloud public
The role of intelligent sensors in the cloud publicThe role of intelligent sensors in the cloud public
The role of intelligent sensors in the cloud public
 
Skill up in machine learning using Azure ML
Skill up in machine learning using Azure MLSkill up in machine learning using Azure ML
Skill up in machine learning using Azure ML
 
Big data talking stories in Healthcare
Big data talking stories in Healthcare Big data talking stories in Healthcare
Big data talking stories in Healthcare
 
Building Big data solutions in Azure
Building Big data solutions in AzureBuilding Big data solutions in Azure
Building Big data solutions in Azure
 
Patterns and Practices in Building Office Add-ins
Patterns and Practices in Building Office Add-insPatterns and Practices in Building Office Add-ins
Patterns and Practices in Building Office Add-ins
 
Data science essentials in azure ml
Data science essentials in azure mlData science essentials in azure ml
Data science essentials in azure ml
 
Build Interactive Analytics using Power BI
Build Interactive Analytics using Power BIBuild Interactive Analytics using Power BI
Build Interactive Analytics using Power BI
 
TypeScript Jump Start
TypeScript Jump StartTypeScript Jump Start
TypeScript Jump Start
 
Big data solutions in azure
Big data solutions in azureBig data solutions in azure
Big data solutions in azure
 
Build intelligent solutions using ms azure
Build intelligent solutions using ms azureBuild intelligent solutions using ms azure
Build intelligent solutions using ms azure
 
Mistakes that kill startups
Mistakes that kill startupsMistakes that kill startups
Mistakes that kill startups
 
PnP in building office add ins - public
PnP in building office add ins - publicPnP in building office add ins - public
PnP in building office add ins - public
 
How to migrate Console Apps as a cloud service
How to migrate Console Apps as a cloud serviceHow to migrate Console Apps as a cloud service
How to migrate Console Apps as a cloud service
 
HBase introduction in azure
HBase introduction in azureHBase introduction in azure
HBase introduction in azure
 
eRecall
eRecalleRecall
eRecall
 
Get your site microsoft edge ready
Get your site microsoft edge readyGet your site microsoft edge ready
Get your site microsoft edge ready
 
Developing cross platform mobile apps using Apache Cordova
Developing cross platform mobile apps using Apache CordovaDeveloping cross platform mobile apps using Apache Cordova
Developing cross platform mobile apps using Apache Cordova
 
Identity and o365 on Azure
Identity and o365 on AzureIdentity and o365 on Azure
Identity and o365 on Azure
 
Azure Data platform
Azure Data platformAzure Data platform
Azure Data platform
 
Building IoT solutions using Windows 10 IoT Core & Azure
Building IoT solutions using Windows 10 IoT Core & AzureBuilding IoT solutions using Windows 10 IoT Core & Azure
Building IoT solutions using Windows 10 IoT Core & Azure
 

Kürzlich hochgeladen

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 

Kürzlich hochgeladen (20)

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 

Programming in Spark using PySpark

  • 1. Programming in Spark using PySpark Mostafa Elzoghbi Sr. Technical Evangelist – Microsoft @MostafaElzoghbi http://mostafa.rocks
  • 2. Session Objectives & Takeaways • Programming Spark • Spark Program Structure • Working with RDDs • Transformations versus Actions • Lambda, Shared Variables (Broadcast vs accumulators) • Visualizing big data in Spark • Spark in the cloud (Azure) • Working with cluster types, notebooks, scaling.
  • 3. Python Spark (pySpark) • We are using the Python programming interface to Spark (pySpark) • pySpark provides an easy-to-use programming abstraction and parallel runtime: “Here’s an operation, run it on all of the data” • RDDs are the key concept
  • 4. Apache Spark Driver and Workers • A Spark program is two programs: • A driver program and a workers program • Worker programs run on cluster nodes or in local threads • RDDs (Resilient Distributed Datasets) are distributed
  • 5. Spark Essentials: Master • The master parameter for a SparkContext determines which type and size of cluster to use
  • 6. Spark Context • A Spark program first creates a SparkContext object » Tells Spark how and where to access a cluster » pySpark shell and Databricks cloud automatically create the sc variable » iPython and programs must use a constructor to create a new SparkContext • Use SparkContext to create RDDs
  • 7. Resilient Distributed Datasets • The primary abstraction in Spark » Immutable once constructed » Track lineage information to efficiently recompute lost data » Enable operations on collection of elements in parallel • You construct RDDs » by parallelizing existing Python collections (lists) » by transforming an existing RDDs » from files in HDFS or any other storage system
  • 8. RDDs • Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. • Two types of operations: transformations and actions • Transformations are lazy (not computed immediately) • Transformed RDD is executed when action runs on it • Persist (cache) RDDs in memory or disk
  • 9.
  • 10. Creating an RDD • Create RDDs from Python collections (lists) • From HDFS, text files, Hypertable, Amazon S3, Apache Hbase, SequenceFiles, any other Hadoop InputFormat, and directory or glob wildcard: /data/201404*
  • 11. Working with RDDs • Create an RDD from a data source: <list> • Apply transformations to an RDD: map filter • Apply actions to an RDD: collect count
  • 12. Spark Transformations • Create new datasets from an existing one • Use lazy evaluation: results not computed right away – • instead Spark remembers set of transformations applied to base dataset » Spark optimizes the required calculations » Spark recovers from failures and slow workers • Think of this as a recipe for creating result
  • 13. Python lambda Functions • Small anonymous functions (not bound to a name) lambda a, b: a+b » returns the sum of its two arguments • Can use lambda functions wherever function objects are required • Restricted to a single expression
  • 14. Spark Actions • Cause Spark to execute recipe to transform source • Mechanism for getting results out of Spark
  • 15. Spark Program Lifecycle 1. Create RDDs from external data or parallelize a collection in your driver program 2. Lazily transform them into new RDDs 3. cache() some RDDs for reuse -- IMPORTANT 4. Perform actions to execute parallel 5. Computation and produce results
  • 16. pySpark Shared Variables • Broadcast Variables » Efficiently send large, read-only value to all workers » Saved at workers for use in one or more Spark operations » Like sending a large, read-only lookup table to all the nodes At the driver: broadcastVar = sc.broadcast([1, 2, 3]) At a worker: broadcastVar.value
  • 17. • Accumulators » Aggregate values from workers back to driver » Only driver can access value of accumulator » For tasks, accumulators are write-only » Use to count errors seen in RDD across workers >>> accum = sc.accumulator(0) >>> rdd = sc.parallelize([1, 2, 3, 4]) >>> def f(x): >>> global accum >>> accum += x >>> rdd.foreach(f) >>> accum.value Value: 10
  • 18. Visualizing Big Data in the browser • Challenges: • Manipulating large data can take long time Memory: caching -> Scale clusters CPU: Parallelism -> Scale clusters • We have more data points than possible pixels > Summarize: Aggregation, Pivoting (more data than pixels) > Model (Clustering, Classification, D. Reduction, …etc) > Sample: approximate (faster) and exact sampling • Internal Tools: Matplotlib, GGPlot, D3, SVC, and more.
  • 19. Spark Kernels and MAGIC keywords • PySpark kernel supports set of %%MAGIC keywords • It supports built-in IPython built-in magics, including %%sh. • Auto visualization • Magic keywords: • %%SQL % Spark SQL • %%lsmagic % List all supported magic keywords (Important) • %env % Set environment variable • %run % Execute python code • %who % List all variables of global scope • Run code from a different kernel in a notebook.
  • 20. Spark in Azure Hadoop clusters in Azure are packaged under “HDInsight” service
  • 21. Spark in Azure • Create clusters in few clicks • Apache Spark comes only in Linux OS. • Multiple HDP versions • Comes with preloaded: SSH, Hive, Oozie, DLS, Vnets. • Multiple Storage options: • Azure Storage • ADL store • External metadata store in SQL server database for Hive and Oozie. • All notebooks are stored in the storage account associated with Spark cluster • Zeppelin notebook is available on certain Spark versions but not all.
  • 22. Programming Spark Apps in HDInsight • Supports four kernels in Jupyter in HDInsight Spark clusters in Azure
  • 24. References • Spark Programming Guide http://spark.apache.org/docs/latest/programming-guide.html • edx.org: Free Apache Spark courses • Visualizations for Databricks https://docs.cloud.databricks.com/docs/latest/databricks_guide/01%20 Databricks%20Overview/15%20Visualizations/0%20Visualizations%20Ov erview.html • SPARKHub by Databricks https://sparkhub.databricks.com/resources/
  • 25. Thank you • Check out my blog big data articles: http://mostafa.rocks • Follow me on Twitter: @MostafaElzoghbi • Want some help in building cloud solutions? Contact me to know more.

Hinweis der Redaktion

  1. Ref.: https://azure.microsoft.com/en-us/services/hdinsight/apache-spark/ Apache Spark leverages a common execution model for doing multiple tasks like ETL, batch queries, interactive queries, real-time streaming, machine learning, and graph processing on data stored in Azure Data Lake Store. This allows you to use Spark for Azure HDInsight to solve big data challenges in near real-time like fraud detection, click stream analysis, financial alerts, telemetry from connected sensors and devices (Internet of Things, IoT), social analytics, always-on ETL pipelines, and network monitoring.
  2. A) Main concepts to cover for Data Science: Regression Classification -- FOCUS Clustering Recommendation B) Building programmable components in Azure ML experiments C) Working with Azure ML studio
  3. Source: https://courses.edx.org/c4x/BerkeleyX/CS100.1x/asset/Week2Lec4.pdf Spark standalone running on two nodes with two workers: A client process submit an app to the master. The master instructs one of its workers to launch a driver. The worker spawns a driver JVM. The master instructs both works to launch executors for the app. The workers spawn executor JVMs. The driver and executors communicate independent of the cluster’s processes.
  4. Running Spark: Standalone cluster: Spark standalone comes out of the box. Comes with it is own web UI (monitor and run apps/jobs) Contains of master and worker (also called slave) Mesos and Yarn are also supported in Spark. Yarn is the only cluster manager on which spark can access HDFS secured with Kerberos. Yarn is the new generation of Hadoop’s MapReduce execution engine and can run MapReduce, Spark and other types of programs.
  5. For that reason, cache is said to 'break the lineage' as it creates a checkpoint that can be reused for further processing. Rule of thumb: Use cache when the lineage of your RDD branches out or when an RDD is used multiple times like in a loop.
  6. Keep read-only variable cached on workers » Ship to each worker only once instead of with each task • Example: efficiently give every worker a large dataset • Usually distributed using efficient broadcast algorithms
  7. Extensively used in statistics Spark offers native support for: • Approximate and exact sampling • Approximate and exact stratified sampling Approximate sampling is faster and is good enough in most cases
  8. 1) Jupyter notebooks kernels with Apache Spark clusters in HDInsight https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-jupyter-notebook-kernels 2) Ipython built in magics https://ipython.org/ipython-doc/3/interactive/magics.html#cell-magics Source for tipcs and magic keywords: https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/
  9. Url: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-jupyter-spark-sql
  10. Url: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-jupyter-spark-sql
  11. Spark 2.0 announcements: https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html