Apache Spark leverages a common execution model across tasks such as ETL, batch queries, interactive queries, real-time streaming, machine learning, and graph processing on data stored in Azure Data Lake Store. This allows you to use Spark for Azure HDInsight to solve big data challenges in near real time, such as fraud detection, clickstream analysis, financial alerts, telemetry from connected sensors and devices (Internet of Things, IoT), social analytics, always-on ETL pipelines, and network monitoring.
A) Main concepts to cover for Data Science:
Classification -- FOCUS
B) Building programmable components in Azure ML experiments
C) Working with Azure ML studio
Spark standalone running on two nodes with two workers:
A client process submits an app to the master.
The master instructs one of its workers to launch a driver.
The worker spawns a driver JVM.
The master instructs both workers to launch executors for the app.
The workers spawn executor JVMs.
The driver and executors communicate independently of the cluster’s processes.
Standalone cluster: Spark standalone comes out of the box and ships with its own web UI (to monitor and run apps/jobs).
It consists of a master and workers (also called slaves).
Mesos and YARN are also supported in Spark.
YARN is the only cluster manager on which Spark can access HDFS secured with Kerberos.
YARN is the new generation of Hadoop’s MapReduce execution engine and can run MapReduce, Spark, and other types of programs.
For that reason, cache is said to 'break the lineage': cached partitions act as a reusable checkpoint for further processing (strictly, only checkpointing truncates the lineage; cached partitions can still be recomputed if evicted).
Rule of thumb: Use cache when the lineage of your RDD branches out or when an RDD is used multiple times like in a loop.
Broadcast variables keep a read-only variable cached on workers:
• Shipped to each worker only once instead of with each task
• Example: efficiently give every worker a large dataset
• Usually distributed using efficient broadcast algorithms
Extensively used in statistics
Spark offers native support for:
• Approximate and exact sampling
• Approximate and exact stratified sampling
Approximate sampling is faster and is good enough in most cases.
1) Jupyter notebook kernels for Apache Spark clusters in HDInsight
2) IPython built-in magics
Source for tips and magic keywords:
Spark 2.0 announcements: