SlideShare ist ein Scribd-Unternehmen logo
1 von 79
Build COMPETENCY
across your TEAM
Apache Spark
On HDInsight
Chandrashekhar Deshpande
Module 1
Introduction to Apache Spark
Apache Spark Overview
• An Engine to process big data in faster(than MR), easy and extremely scalable way
• An Open Source, parallel, in-memory processing, cluster computing framework
• Solution for loading, processing and end to end analyzing large scale data
• Iterative and Interactive : Scala, Java, Python, R and with Command line interface
• Stream Processing (Real time streams and DStreams)
• Unifies Big Data with Batch processing, Streaming and Machine Learning
• Appreciated and Widely used by: Amazon, eBay, Yahoo
• Can very well go with Apache Kafka, ZeroMQ, Cassandra etc.
• Powerful platform to implement Lambda and Kappa Architecture
Spark Evolution
• Recent release of Spark is 2.3
• We will work on 2.1.1
Spark - Benefits
5
Performance
Using in-memory computing, Spark is
considerably faster than Hadoop (100x in
some tests).
Can be used for batch and real-time data
processing.
Developer Productivity
Easy-to-use APIs for processing large
datasets.
Includes 100+ operators for transforming.
Ecosystem
Spark has built-in support for many data
sources such as HDFS, RDBMS, S3, Apache
Hive, Cassandra and MongoDB.
Runs on top the Apache YARN resource
manager.
Unified Engine
Integrated framework includes higher-level
libraries for interactive SQL queries,
processing streaming data, machine learning
and graph processing.
A single application can combine all types of
processing
Spark is fast
6
Spark is the current (2014) Sort Benchmark winner.
3x faster than 2013 winner (Hadoop).
tinyurl.com/spark-sort
… especially for iterative applications
7Logistic regression on a 100-node cluster with 100 GB of data
Logistic Regression
140
120
100
80
40
20
0
60
Hadoop
Spark 0.9
Tuples of MR Vs. RDDs of Spark
Tuples
in Map
Reduce
RDDs in
Spark
Spark in-memory
• Spark does all intermediate steps in-memory…
• Faster in execution with fewer Secondary Storage r/w.
• Memory extensive
• The memory objects are called as Resilient Distributed Datasets.
• RDD’s are partitioned memory objects existing in multiple worker machines along with their
replicas.
A unified Framework
10
An unified, open source, parallel, data processing framework for Big Data Analytics
Spark Core Engine
Spark SQL
Interactive
Queries
Spark
Streaming
Stream processing
Spark MLlib
Machine
Learning
GraphX
Graph
Computation
Yarn Mesos
Standalone
Scheduler
Advantages of Unified Platform
11
Spark Streaming
Machine
Learning
Spark SQL
Spark in Lambda Architecture
Spark Framework
Prog. Inter
Library
Engine
Platforms
Storage
Scala
SQL
Spark Core
YARN
Local
Java
ML Lib
MESOS
HDFS
Python
Graph X
Scheduler
HBase
R
Streaming
Standalone
RDBMS NoSQL AWS S3
Azure Blob Storage / Data Lake Store
Spark Modes
• Batch mode: A scheduled program executed through scheduler in periodic
manner to process data.
• Interactive mode: Execute spark commands through Spark Interactive command
interface.
• The Shell provides default Spark Context and works as a Driver Program. The Spark context
runs tasks on Cluster.
• Stream mode: Process stream data in real time fashion.
Spark Scalability
Single Cluster, Stand-alone,
Single Box
• All components (Driver,
Executors) run within same
JVM.
• Partitions data for multiple
core.
• Runs as Single Threaded mode.
Managed Clustered
• Can scale from 2 to 1000
nodes.
• Can use different cluster
managers like- YARN, MESOS
etc.
• Partitions data for all nodes.
Area of applicability
• Data Integration and ETL
• Interactive Analytics
• High performance batch and micro-batch computations
• Advanced and complex analytics
• Machine learning
• Real time stream processing including IoT
• Example:
• Market trends and patterns
• Predicting sales
• Credit card frauds detection
• Network intrusion detection
• Advertisement targeting
• Customer’s 360 analysis
Spark – Use cases
17
Use case Description Users
Data Integration and
ETL
Cleansing and combining data from
diverse sources
Palantir: Data analytics platform
Interactive analytics
Gain insight from massive data sets tin
ad hoc investigations or regularly
planned dashboards.
Goldman Sachs: Analytics platform
Huawei: Query platform in the
telecom sector.
High performance
batch computation
Run complex algorithms against large
scale data
Novartis: Genomic Research
MyFitnessPal: Process food data
Machine Learning
Predict outcomes to make decisions
based on input data
Alibaba: Marketplace Analysis
Spotify: Music Recommendation
Real-time stream
processing
Capturing and processing data
continuously with low latency and high
reliability
Netflix: Recommendation Engine
British Gas: Connected Homes
Module 2
Sparks Architecture and Basics
Spark Architecture
• Spark Master: Manages number of applications. In
HDInsight, it also manages resources at cluster level.
• Spark Driver: Per application to manage workflow of an
application.
• Spark Context: Created by driver and keeps track and
metadata of RDDs. Gives API to exercise various
features of Spark.
• Worker Node: Read and write data from and to
HDFS/Storage.
Spark Execution components
• Driver Program
• It’s a main initiating program from where Spark operations are defined. It is executed in
Master Node.
• It controls and co-ordinates all operations.
• It defines RDDs.
• Each driver program execution is a ‘Job’.
Spark Execution components
• Spark Context
• Provides access to Spark functionalities.
• Represents connection to the computing cluster.
• Builds, partitions and distributes RDDs to clusters.
• Works with Cluster Manager
• Splits job as parallel task and executes them on worker nodes
• Collects and accumulates results and present them to the driver program
Resilient Distributed Datasets
• Operations with spark are mostly with RDDs. We create, transform, analyze and store RDDs in
Spark operations.
• RDDs are fast in access as stored in memory.
• Partitioned and Distributed as divided in parts and each part exist in a cluster.
• The data sets are formed of Strings, rows, objects, collection.
• They are immutable.
• To change, apply transformation and create new RDD.
• They can be cached and persisted.
• Actions produce summarized results.
Demo 1
TestSpark010_FirstProgram.java
Aim: This program basically demonstrates...
1. Configuring Spark context.
2. Using Resilient Distributed Datasets.
3. Introduction to Mappings, Filtering, Actions etc.
4. Creating String RDD from CSV text file.
5. Applying simple Map to convert strings to Upper case
6. Browser monitoring of Spark Context.
Spark architecture
Master
Node
Worker
Nodes
Module 3
Working with RDD’s and Paired RDDs in Spark
Loading data.
• RDD Loading sources…
• Text files
• JSON files
• Sequence files of HDFS
• Parallelize on collection
• Java Collection
• Python Lists
• R data frames
• RDBMS/NoSQL
• Use direct API
• Bring it first as Collection (DAO Classes) and then create RDD.
• Very large data sets
• Create HDFS out side spark and then create RDDs using Apache Sqoop.
• Spark aligns its own partitioning techniques to the partitioning of Hadoop.
Storing data
• Storing of RDD in variety of data sinks.
• Text files
• JSON
• Sequence Files
• Collection
• RDBMS/NoSQL
• For persistence
• Spark Utilities
• Language specific support
Spark power lies in processing data in distribute manner.
Though Spark provides API for Loading and sinking of data,
here where its real power does not lay. Out side capabilities
are recommended for simplicity and performance.
Lazy Evaluation
• Spark will not load or transform data unless action is encountered.
• Step 1: Load a file content into RDD.
• Step 2: Apply filtration
• Step 3: Count for number of records (Now Step 1 to 3 are executed).
• The above statement is true even for interactive mode.
• The lazy evaluation helps spark to optimize operations and manage resources in
better way.
• Makes trouble shooting difficult: Any problem in loading is detected while
executing Action.
Transformations
• Recall: RDDs are immutable.
• Transformation: Operation on RDD to create a new RDD.
• Examples: Maps, Flat map, Filter etc.
• Operate on one element at a time.
• Evaluates lazily.
• Distributed across multiple nodes and executed by Executor within a cluster
on local RDD independently.
• Creates its own subset of resultant RDD.
Transformations: Maps
JavaRDD<String> mutateRDD1 = autoAllData.map(function);
• Simulate Map Reduce of Hadoop.
• Element level computation or transformation.
• The result RDD may have same number of elements as source RDD.
• Result type may be different.
• In java it allows Lambda expressions or anonymous class.
• Scala/Python: Inline function, function reference allowed.
• Use cases:
• Data standardization of may be names
• Data type conversion i.e. from String to Custom object.
• Data Computation like tax calculations.
• Adding new attributes like calculating Grade.
• Data checking, cleansing etc.
Transformations: Filters
JavaRDD<String> mutateRDD1 = autoAllData.filter(function);
• From RDD, it selects element which passes given criterion.
• Result in RDD of smaller size than original RDD as some records getting eliminated.
• The filter() takes a function which returns Boolean value.
• In java it allows Lambda expressions (Predicate) or anonymous class.
• Scala/Python: Inline function, function reference allowed.
Actions
• Acts on entire RDD to reduce to a precise and consolidated result.
• Max/Min, Summerization
• Spark lazily evaluates all processing on encountering an Action.
• Simple Actions
• The collect() operation: Converts RDD into collection.
• The count() operation: Count number of elements of RDD.
• The first() operation: Returns first record as a string.
• The take(n) operation: Returns first ‘n’ elements as a List of Strings.
Module 4
Apache Spark on Azure HDInsight
Apache Spark on HDInsight
Azure Storage
Azure Data Lake Store
Hive and HBase
Azure Data Factory
Event Hub
S
P
A
R
K
Apache Kafka
Apache Flum
Orchestration
Spark Support on HDInsight
35
Feature Description
SLA 99.9% uptime
Ease of creating a
cluster
Possible using Azure Portal, Azure Powershell, Azure Insight .Net SDK.
Ease of use For interactive data processing and visualization, Jupyter and Zeppeline notebooks are provided.
REST APIs
Spark Cluster in HDInsight include Livy, a REST API based Spark Job Server to remotely submit and
monitor jobs.
Azure DataLake
The Azure DataLake Store can be used as primary storage (HDInsight 3.5 onwards) or an additional
storage.
Integration with
Azure services
Provides connectors for Azure Event Hub, Kafka.
R Server Can setup R Server and run R computations.
Concurrent
Queries
Supports concurrent queries. This enables multiple queries from one user or multiple queries from
various users and applications to share the same cluster resources.
SSD cache Cache of data either in memory or in SSD for better performance.
BI tools
integration
Connectors available for Power BI and Tableau for data analytics.
Machine Learning
Libraries
Preloaded 200 Anaconda libraries for Machine Learning, data analysis and visualization.
Scalability The Cloude’s prominent feature
Azure Storage for HDInsight
Azure Data Lake Store
37
• Enterprise-wide hyper-scale repository for big data analytic workloads.
• Can capture data of any size, type, and ingestion speed in one single place
for operational and exploratory analytics.
• Hadoop accesses Data Lake Store through Web-HDFS REST API.
• Is tuned for performance for data analytics scenarios.
• Supports all enterprise-grade capabilities—security, manageability,
scalability, reliability, and availability.
• Stores variety of data in native format without transformation. Can handle
structured, semi-structured, and unstructured data.
Azure Data Lake Store
Azure Storage Vs. Azure Data Lake
Azure Data Lake Store Azure Blob Storage
General purpose scalable object store Hyper-scale repository for big data analytics workload.
Use cases: Batch, interactive, streaming analytics and
machine learning data such as log files, IoT data, click
streams, large datasets
Any type of text or binary data, such as application back
end, backup data, media storage for streaming and
general purpose data.
Folders containing data in files Containers containing data in blobs.
Hierarchical file system Object store with flat namespace
Authentication: Azure AD identity Account Access Keys, Shared signature Access key
Optimized performance for parallel analytical workload Not optimized for analytics workload
Size Limit: No limit on file size or number of size. Specific limits as mentioned in document.
Geo-Redundancy: Locally redundant. Locally redundant, Globally redundant, Read Access
Globally Redundant.
Azure Data Factory
40
• A cloud data integration service, to compose data storage,
movement, and processing services into automated data
pipelines.
• Can handle ETL and complex hybrid ETL.
• Allows you to create data-driven workflows in the cloud for
orchestrating and automating data movement and data
transformation.
Azure Data Factory
Azure Event Hub
42
• A highly scalable ingestion system
• Can ingest millions of events per second, enabling an application to process
and analyze the massive amounts of data produced by your connected
devices and applications.
• Works as Event Ingestor which works as intermediate between even
publisher and event consumer.
• Decouples production of Even streams from consumption mechanism.
• Enables behavior tracking in mobile apps, traffic information from web
farms, in-game event capture in console games, or telemetry collected from
industrial machines, connected vehicles, or other devices.
Event Hub
Demo 3.1 : Spark cluster on HDInsight
44
Creating a HDInsight Spark Cluster
45
HDInsight Spark Dashboard
46
HDInsight Spark Resource Manager
47
The Resource Manager
enables you to control the
number of cores and amount
of memory allocated to Spark
cluster components and
notebooks.
Increasing the resources
allocated to the Thrift Server
can potentially improve the
performance with BI Tools
Resizing a HDInsight Spark Cluster
48
Notebooks: Jupyter and Zeppelin
49
HDInsight Spark: Jupyter Notebooks
50
A Jupyter notebook showing a Spark program in Python 2
HDInsight Spark: Zeppelin Notebooks
51
Zeppelin notebook must be connected to a
Spark cluster to run
A notebook ‘paragraph’ can be executed
by clicking this icon
Like Jupytr, Zeppelin enables interactive
charts and graphs to be easily included in
a notebook.
You can control the visualization using the
“Settings” drop-down menu
A number of charts and graphs are already
built into the Zeppelin notebook
The Visualization tools: PowerBI
Module 5
Apache Spark SQL and Data Frames
Overview
Spark SQL
• It’s a library built on Spark to support SQL Like operations.
• Facilitate eliminating RDDs from API for simplicity.
• Traditional RDBMS developers can easily transit to Big Data.
• Works with Structured Data that has a Schema.
• Seamlessly mix SQL Queries with Spark Programs.
• Supports JDBC.
• Mix with RDBMS and NoSQL.
Data Frames
Spark Session
• Like SparkContext for RDDs.
• Gives Data Frames and Temp Tables.
Data Frames
• RDDs are for Spark Core while Data Frames are for Spark SQL.
• Built upon RDDs
• It’s a distributed collection of data organized as Rows and Columns.
• Has a schema with column names and column types.
• Interoperability with…
• Collections, CSV, Data Bases, Hive/NoSQL tables, JSON, RDD etc.
Operations on Data Frames
• The filter: Its like ‘where’ clause.
• The join: Its like joins in SQL.
• The groupby: For grouping to get consolidation.
• The agg: Compute aggregation like sum, average.
• Allows mapping and reducing.
• Operations nesting allowed.
Spark SQL
• Spark SQL :
• Not intended for interactive/exploratory analysis.
• Spark SQL reuses the Hive frontend and meta-store.
• Gives full compatibility with existing Hive data, queries, and UDFs.
• Spark SQL includes a cost-based optimizer, columnar storage and code generation to make
queries fast.
• It scales to thousands of nodes and multi hour queries using the Spark engine. Performance
is its biggest advantage.
• Provides full mid-query fault tolerance.
Module 6
Apache Spark Streaming
What is Streaming?
Standing Queries
Query
Logic
Sources Targets
`
Devices, Sensors
Web servers
Pagers &
Monitoring devices
KPI Dashboards
Input
Adapters
Output
AdaptersStreaming Engine
Query
Logic
Query
Logic
Application at Runtime
ATM Security
Why Spark Streaming?
• One of the real powers of Spark
• Typically analytics is performed on data at rest:
• Databases, Flat files. The historical data, survey data etc.
• The real time analytics is performed on data the moment generated
• Complex Event Processing, Fraud detections, click stream processing etc.
• What Spark Stream can do?
• Look at data the moment it is arrived from source.
• Transform, summarize, analyze
• Perform machine learning
• Prediction in real time
Spark Streaming and MicroBatch Processing
Spark Streaming
Credit card fraud detection with high scalability and parallelism.
Spam filtering
Network intrusion detection
Real time social media analytics
Click Stream analytics
Stock market analysis
Advertise analytics
Spark Streaming architecture
Master Node
Driver Program
Spark Context
Stream
Context
Cluster
Manager
Worker Node
Executor
Long
Task Receiver
Input Source
Worker Node
Worker Node
Executor
Tas
k
Cache
Spark Streaming architecture
A master node is with driver program with Spark Context.
Create a streaming context from Spark Context.
One of the worker node is assigned a long task of listening a source.
The receiver keeps receiving data from input source.
The receiver propagate data to worker nodes.
The normal tasks in worker node act upon data.
The DStream
• Discretized stream
• Created from Stream context
• The micro batch window is set up for Dstream (Normally in secords)
• The micro batch window is a small time slice (around 3 sec) in which
generated real time data is accumulated as batch and wrapped in RDD called
Dstream.
• The Dstream allows all RDD operations.
• A common data can be shared across Global Variables
The Dstream windowing functions
• They are for computing across multiple Dstreams.
• All RDD functions are applicable on data accumulated from last X
batches.
• Ex: Accumulate last 3 batches together, Average of something of last 5
batches.
Windowing in Spark
68
1 2 3 4 5 6 7 8 9 10 11 12
First 8 sec Window Second 8 sec Window
Windowing Sample Code
69
Callback OperationWindow Length Sliding IntervalWindow Operation
“Exactly once”, “At Least Once” Guarantees
70
Streaming with Azure Event Hub
71
Azure Event Hub HDInsight Spark Streaming Power BI
Module 7
Apache Spark : Analytics and Machine Learning
Types of Analytics
Descriptive Analytics :
• Defining problem
statement. What exactly
happened.
Exploratory Data
Analytics:
• Why something is
happening.
Inferential Analytics:
• Understand population
from the sample. Take
sample and extrapolate to
whole population.
Predictive Analytics:
• Forecast what will
happen.
Causal Analytics:
• Variables are related.
Understanding effect of
change in one variable to
another variable.
Deep Analytics:
• Analytics uses multi-
source data sources,
combining some or all
above analytics.
Data Analytics is needed everywhere –
Recommendation
engines
Smart meter
monitoringEquipment monitoringAdvertising analysisLife sciences research
Fraud
detection
Healthcare outcomes
Weather forecasting
for business planningOil & Gas exploration
Social network analysis
Churn
analysis
Traffic flow
optimization
IT infrastructure &
Web App optimization
Legal
discovery and
document archivingIntelligence Gathering
Location-based
tracking & services
Pricing Analysis
Personalized Insurance
Machine Learning in Sparks
• Makes ML easy
• Standard and common interface for different ML algorithms.
• Contains algorithms and utilities.
• It has two machine learning libraries.
• The spark.mllib: Original API built on RDDs. May be deprecated soon.
• The spark.ml: New higher level API built on Data Frames.
• The Machine Learning Algorithms of Spark uses these data types…
• Local Vector
• Labeled Point
• Every data to be submitted to ML must be converted to these data types.
Other algorithms supported
• Decision Tree
• Dimensionality Reduction
• Random Forest
• Linear Regression
• Naïve Bayes Classification
• K-Means Clustering
• Recommendation Engines
• …. And many more
Q & A
Contact: chandrashekhardeshpande@synergetics-india.com,
maheshshinde@synergetics-india.com
References
• Online reference:
• http://spark.apache.org/docs/latest/index.html
• http://spark.apache.org/docs/latest/programming-
guide.html
• http://spark.apache.org/docs/latest/api/java/index.html
Thank You

Weitere ähnliche Inhalte

Was ist angesagt?

Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoopinside-BigData.com
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapePaco Nathan
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkHome
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftChester Chen
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processingprajods
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Chris Fregly
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark Aakashdata
 
Apache Spark at Viadeo
Apache Spark at ViadeoApache Spark at Viadeo
Apache Spark at ViadeoCepoi Eugen
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemBojan Babic
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideWhizlabs
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkSamy Dindane
 
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark Summit
 
Future of data visualization
Future of data visualizationFuture of data visualization
Future of data visualizationhadoopsphere
 

Was ist angesagt? (20)

Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoop
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at LyftSF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
Data Science
Data ScienceData Science
Data Science
 
Apache Spark at Viadeo
Apache Spark at ViadeoApache Spark at Viadeo
Apache Spark at Viadeo
 
Introduction to Apache Spark Ecosystem
Introduction to Apache Spark EcosystemIntroduction to Apache Spark Ecosystem
Introduction to Apache Spark Ecosystem
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
 
Future of data visualization
Future of data visualizationFuture of data visualization
Future of data visualization
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 

Ähnlich wie Apache Spark on HDinsight Training

Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the SurfaceJosi Aranda
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark Mostafa
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionRUHULAMINHAZARIKA
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxRahul Borate
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark exampleShidrokhGoudarzi1
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overviewKaran Alang
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014mahchiev
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkJames Chen
 

Ähnlich wie Apache Spark on HDinsight Training (20)

Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Apache spark
Apache sparkApache spark
Apache spark
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark example
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
 
Spark core
Spark coreSpark core
Spark core
 
Spark
SparkSpark
Spark
 
Apache spark
Apache sparkApache spark
Apache spark
 
Spark Workshop
Spark WorkshopSpark Workshop
Spark Workshop
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 

Mehr von Synergetics Learning and Cloud Consulting

Introduction to Containers & Diving a little deeper into the benefits of Con...
 Introduction to Containers & Diving a little deeper into the benefits of Con... Introduction to Containers & Diving a little deeper into the benefits of Con...
Introduction to Containers & Diving a little deeper into the benefits of Con...Synergetics Learning and Cloud Consulting
 

Mehr von Synergetics Learning and Cloud Consulting (20)

Introduction to Containers & Diving a little deeper into the benefits of Con...
 Introduction to Containers & Diving a little deeper into the benefits of Con... Introduction to Containers & Diving a little deeper into the benefits of Con...
Introduction to Containers & Diving a little deeper into the benefits of Con...
 
Monitor Cloud Resources using Alerts & Insights
Monitor Cloud Resources using Alerts & InsightsMonitor Cloud Resources using Alerts & Insights
Monitor Cloud Resources using Alerts & Insights
 
Implementing governance in the cloud era
Implementing governance in the cloud eraImplementing governance in the cloud era
Implementing governance in the cloud era
 
Past, Present and Future of DevOps Infrastructure
Past, Present and Future of DevOps InfrastructurePast, Present and Future of DevOps Infrastructure
Past, Present and Future of DevOps Infrastructure
 
The social employee
The social employeeThe social employee
The social employee
 
Microsoft Azure New Certification Training roadmap
Microsoft Azure New Certification Training roadmapMicrosoft Azure New Certification Training roadmap
Microsoft Azure New Certification Training roadmap
 
Synergetics Microsoft engagement work
Synergetics Microsoft engagement workSynergetics Microsoft engagement work
Synergetics Microsoft engagement work
 
Deep architectural competency for deploying azure solutions
Deep architectural competency for deploying azure solutionsDeep architectural competency for deploying azure solutions
Deep architectural competency for deploying azure solutions
 
Pre sales engineer
Pre sales engineerPre sales engineer
Pre sales engineer
 
Synergetics Re-skilling pitch deck
Synergetics Re-skilling pitch deckSynergetics Re-skilling pitch deck
Synergetics Re-skilling pitch deck
 
Synergetics On boarding pitch deck
Synergetics On boarding pitch deckSynergetics On boarding pitch deck
Synergetics On boarding pitch deck
 
Dev ops using Jenkins
Dev ops using JenkinsDev ops using Jenkins
Dev ops using Jenkins
 
Thank you global azure boot camp 2018, mumbai
Thank you global azure boot camp 2018, mumbaiThank you global azure boot camp 2018, mumbai
Thank you global azure boot camp 2018, mumbai
 
Synergetics Digital Transformation Note
Synergetics Digital Transformation NoteSynergetics Digital Transformation Note
Synergetics Digital Transformation Note
 
Synergetics digital transformation
Synergetics digital transformationSynergetics digital transformation
Synergetics digital transformation
 
Synergetics India Corporate Presentation
Synergetics India Corporate PresentationSynergetics India Corporate Presentation
Synergetics India Corporate Presentation
 
Synergetics Consulting project details
Synergetics Consulting  project detailsSynergetics Consulting  project details
Synergetics Consulting project details
 
Core synergetics presentation 2015-16
Core synergetics presentation 2015-16Core synergetics presentation 2015-16
Core synergetics presentation 2015-16
 
Asap session 2
Asap session 2Asap session 2
Asap session 2
 
Asap session 1
Asap session 1Asap session 1
Asap session 1
 

Kürzlich hochgeladen

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 

Kürzlich hochgeladen (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 

Apache Spark on HDinsight Training

  • 1. Build COMPETENCY across your TEAM Apache Spark On HDInsight Chandrashekhar Deshpande
  • 3. Apache Spark Overview • An Engine to process big data in faster(than MR), easy and extremely scalable way • An Open Source, parallel, in-memory processing, cluster computing framework • Solution for loading, processing and end to end analyzing large scale data • Iterative and Interactive : Scala, Java, Python, R and with Command line interface • Stream Processing (Real time streams and DStreams) • Unifies Big Data with Batch processing, Streaming and Machine Learning • Appreciated and Widely used by: Amazon, eBay, Yahoo • Can very well go with Apache Kafka, ZeroMQ, Cassandra etc. • Powerful platform to implement Lambda and Kappa Architecture
  • 4. Spark Evolution • Recent release of Spark is 2.3 • We will work on 2.1.1
  • 5. Spark - Benefits 5 Performance Using in-memory computing, Spark is considerably faster than Hadoop (100x in some tests). Can be used for batch and real-time data processing. Developer Productivity Easy-to-use APIs for processing large datasets. Includes 100+ operators for transforming. Ecosystem Spark has built-in support for many data sources such as HDFS, RDBMS, S3, Apache Hive, Cassandra and MongoDB. Runs on top the Apache YARN resource manager. Unified Engine Integrated framework includes higher-level libraries for interactive SQL queries, processing streaming data, machine learning and graph processing. A single application can combine all types of processing
  • 6. Spark is fast 6 Spark is the current (2014) Sort Benchmark winner. 3x faster than 2013 winner (Hadoop). tinyurl.com/spark-sort
  • 7. … especially for iterative applications 7Logistic regression on a 100-node cluster with 100 GB of data Logistic Regression 140 120 100 80 40 20 0 60 Hadoop Spark 0.9
  • 8. Tuples of MR Vs. RDDs of Spark Tuples in Map Reduce RDDs in Spark
  • 9. Spark in-memory • Spark does all intermediate steps in-memory… • Faster in execution with fewer Secondary Storage r/w. • Memory extensive • The memory objects are called as Resilient Distributed Datasets. • RDD’s are partitioned memory objects existing in multiple worker machines along with their replicas.
  • 10. A unified Framework 10 An unified, open source, parallel, data processing framework for Big Data Analytics Spark Core Engine Spark SQL Interactive Queries Spark Streaming Stream processing Spark MLlib Machine Learning GraphX Graph Computation Yarn Mesos Standalone Scheduler
  • 11. Advantages of Unified Platform 11 Spark Streaming Machine Learning Spark SQL
  • 12. Spark in Lambda Architecture
  • 13. Spark Framework Prog. Inter Library Engine Platforms Storage Scala SQL Spark Core YARN Local Java ML Lib MESOS HDFS Python Graph X Scheduler HBase R Streaming Standalone RDBMS NoSQL AWS S3 Azure Blob Storage / Data Lake Store
  • 14. Spark Modes • Batch mode: A scheduled program executed through scheduler in periodic manner to process data. • Interactive mode: Execute spark commands through Spark Interactive command interface. • The Shell provides default Spark Context and works as a Driver Program. The Spark context runs tasks on Cluster. • Stream mode: Process stream data in real time fashion.
  • 15. Spark Scalability Single Cluster, Stand-alone, Single Box • All components (Driver, Executors) run within same JVM. • Partitions data for multiple core. • Runs as Single Threaded mode. Managed Clustered • Can scale from 2 to 1000 nodes. • Can use different cluster managers like- YARN, MESOS etc. • Partitions data for all nodes.
  • 16. Area of applicability • Data Integration and ETL • Interactive Analytics • High performance batch and micro-batch computations • Advanced and complex analytics • Machine learning • Real time stream processing including IoT • Example: • Market trends and patterns • Predicting sales • Credit card frauds detection • Network intrusion detection • Advertisement targeting • Customer’s 360 analysis
  • 17. Spark – Use cases 17 Use case Description Users Data Integration and ETL Cleansing and combining data from diverse sources Palantir: Data analytics platform Interactive analytics Gain insight from massive data sets tin ad hoc investigations or regularly planned dashboards. Goldman Sachs: Analytics platform Huawei: Query platform in the telecom sector. High performance batch computation Run complex algorithms against large scale data Novartis: Genomic Research MyFitnessPal: Process food data Machine Learning Predict outcomes to make decisions based on input data Alibaba: Marketplace Analysis Spotify: Music Recommendation Real-time stream processing Capturing and processing data continuously with low latency and high reliability Netflix: Recommendation Engine British Gas: Connected Homes
  • 19. Spark Architecture • Spark Master: Manages number of applications. In HDInsight, it also manages resources at cluster level. • Spark Driver: Per application to manage workflow of an application. • Spark Context: Created by driver and keeps track and metadata of RDDs. Gives API to exercise various features of Spark. • Worker Node: Read and write data from and to HDFS/Storage.
  • 20. Spark Execution components • Driver Program • It’s a main initiating program from where Spark operations are defined. It is executed in Master Node. • It controls and co-ordinates all operations. • It defines RDDs. • Each driver program execution is a ‘Job’.
  • 21. Spark Execution components • Spark Context • Provides access to Spark functionalities. • Represents connection to the computing cluster. • Builds, partitions and distributes RDDs to clusters. • Works with Cluster Manager • Splits job as parallel task and executes them on worker nodes • Collects and accumulates results and present them to the driver program
  • 22. Resilient Distributed Datasets • Operations with spark are mostly with RDDs. We create, transform, analyze and store RDDs in Spark operations. • RDDs are fast in access as stored in memory. • Partitioned and Distributed as divided in parts and each part exist in a cluster. • The data sets are formed of Strings, rows, objects, collection. • They are immutable. • To change, apply transformation and create new RDD. • They can be cached and persisted. • Actions produce summarized results.
  • 23. Demo 1 TestSpark010_FirstProgram.java Aim: This program basically demonstrates... 1. Configuring Spark context. 2. Using Resilient Distributed Datasets. 3. Introduction to Mappings, Filtering, Actions etc. 4. Creating String RDD from CSV text file. 5. Applying simple Map to convert strings to Upper case 6. Browser monitoring of Spark Context.
  • 25. Module 3 Working with RDD’s and Paired RDDs in Spark
  • 26. Loading data. • RDD Loading sources… • Text files • JSON files • Sequence files of HDFS • Parallelize on collection • Java Collection • Python Lists • R data frames • RDBMS/NoSQL • Use direct API • Bring it first as Collection (DAO Classes) and then create RDD. • Very large data sets • Create HDFS out side spark and then create RDDs using Apache Sqoop. • Spark aligns its own partitioning techniques to the partitioning of Hadoop.
  • 27. Storing data • Storing of RDD in variety of data sinks. • Text files • JSON • Sequence Files • Collection • RDBMS/NoSQL • For persistence • Spark Utilities • Language specific support Spark power lies in processing data in distribute manner. Though Spark provides API for Loading and sinking of data, here where its real power does not lay. Out side capabilities are recommended for simplicity and performance.
  • 28. Lazy Evaluation • Spark will not load or transform data unless action is encountered. • Step 1: Load a file content into RDD. • Step 2: Apply filtration • Step 3: Count for number of records (Now Step 1 to 3 are executed). • The above statement is true even for interactive mode. • The lazy evaluation helps spark to optimize operations and manage resources in better way. • Makes trouble shooting difficult: Any problem in loading is detected while executing Action.
  • 29. Transformations • Recall: RDDs are immutable. • Transformation: Operation on RDD to create a new RDD. • Examples: Maps, Flat map, Filter etc. • Operate on one element at a time. • Evaluates lazily. • Distributed across multiple nodes and executed by Executor within a cluster on local RDD independently. • Creates its own subset of resultant RDD.
  • 30. Transformations: Maps JavaRDD<String> mutateRDD1 = autoAllData.map(function); • Simulate Map Reduce of Hadoop. • Element level computation or transformation. • The result RDD may have same number of elements as source RDD. • Result type may be different. • In java it allows Lambda expressions or anonymous class. • Scala/Python: Inline function, function reference allowed. • Use cases: • Data standardization of may be names • Data type conversion i.e. from String to Custom object. • Data Computation like tax calculations. • Adding new attributes like calculating Grade. • Data checking, cleansing etc.
  • 31. Transformations: Filters JavaRDD<String> mutateRDD1 = autoAllData.filter(function); • From RDD, it selects element which passes given criterion. • Result in RDD of smaller size than original RDD as some records getting eliminated. • The filter() takes a function which returns Boolean value. • In java it allows Lambda expressions (Predicate) or anonymous class. • Scala/Python: Inline function, function reference allowed.
  • 32. Actions • Acts on entire RDD to reduce to a precise and consolidated result. • Max/Min, Summerization • Spark lazily evaluates all processing on encountering an Action. • Simple Actions • The collect() operation: Converts RDD into collection. • The count() operation: Count number of elements of RDD. • The first() operation: Returns first record as a string. • The take(n) operation: Returns first ‘n’ elements as a List of Strings.
  • 33. Module 4 Apache Spark on Azure HDInsight
  • 34. Apache Spark on HDInsight Azure Storage Azure Data Lake Store Hive and HBase Azure Data Factory Event Hub S P A R K Apache Kafka Apache Flum Orchestration
  • 35. Spark Support on HDInsight 35 Feature Description SLA 99.9% uptime Ease of creating a cluster Possible using Azure Portal, Azure Powershell, Azure Insight .Net SDK. Ease of use For interactive data processing and visualization, Jupyter and Zeppeline notebooks are provided. REST APIs Spark Cluster in HDInsight include Livy, a REST API based Spark Job Server to remotely submit and monitor jobs. Azure DataLake The Azure DataLake Store can be used as primary storage (HDInsight 3.5 onwards) or an additional storage. Integration with Azure services Provides connectors for Azure Event Hub, Kafka. R Server Can setup R Server and run R computations. Concurrent Queries Supports concurrent queries. This enables multiple queries from one user or multiple queries from various users and applications to share the same cluster resources. SSD cache Cache of data either in memory or in SSD for better performance. BI tools integration Connectors available for Power BI and Tableau for data analytics. Machine Learning Libraries Preloaded 200 Anaconda libraries for Machine Learning, data analysis and visualization. Scalability The Cloude’s prominent feature
  • 36. Azure Storage for HDInsight
  • 37. Azure Data Lake Store 37 • Enterprise-wide hyper-scale repository for big data analytic workloads. • Can capture data of any size, type, and ingestion speed in one single place for operational and exploratory analytics. • Hadoop accesses Data Lake Store through Web-HDFS REST API. • Is tuned for performance for data analytics scenarios. • Supports all enterprise-grade capabilities—security, manageability, scalability, reliability, and availability. • Stores variety of data in native format without transformation. Can handle structured, semi-structured, and unstructured data.
  • 39. Azure Storage Vs. Azure Data Lake Azure Data Lake Store Azure Blob Storage General purpose scalable object store Hyper-scale repository for big data analytics workload. Use cases: Batch, interactive, streaming analytics and machine learning data such as log files, IoT data, click streams, large datasets Any type of text or binary data, such as application back end, backup data, media storage for streaming and general purpose data. Folders containing data in files Containers containing data in blobs. Hierarchical file system Object store with flat namespace Authentication: Azure AD identity Account Access Keys, Shared signature Access key Optimized performance for parallel analytical workload Not optimized for analytics workload Size Limit: No limit on file size or number of size. Specific limits as mentioned in document. Geo-Redundancy: Locally redundant. Locally redundant, Globally redundant, Read Access Globally Redundant.
  • 40. Azure Data Factory 40 • A cloud data integration service, to compose data storage, movement, and processing services into automated data pipelines. • Can handle ETL and complex hybrid ETL. • Allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation.
  • 42. Azure Event Hub 42 • A highly scalable ingestion system • Can ingest millions of events per second, enabling an application to process and analyze the massive amounts of data produced by your connected devices and applications. • Works as Event Ingestor which works as intermediate between even publisher and event consumer. • Decouples production of Even streams from consumption mechanism. • Enables behavior tracking in mobile apps, traffic information from web farms, in-game event capture in console games, or telemetry collected from industrial machines, connected vehicles, or other devices.
  • 44. Demo 3.1 : Spark cluster on HDInsight 44
  • 45. Creating a HDInsight Spark Cluster 45
  • 47. HDInsight Spark Resource Manager 47 The Resource Manager enables you to control the number of cores and amount of memory allocated to Spark cluster components and notebooks. Increasing the resources allocated to the Thrift Server can potentially improve the performance with BI Tools
  • 48. Resizing a HDInsight Spark Cluster 48
  • 49. Notebooks: Jupyter and Zeppelin 49
  • 50. HDInsight Spark: Jupyter Notebooks 50 A Jupyter notebook showing a Spark program in Python 2
  • 51. HDInsight Spark: Zeppelin Notebooks 51 Zeppelin notebook must be connected to a Spark cluster to run A notebook ‘paragraph’ can be executed by clicking this icon Like Jupytr, Zeppelin enables interactive charts and graphs to be easily included in a notebook. You can control the visualization using the “Settings” drop-down menu A number of charts and graphs are already built into the Zeppelin notebook
  • 53. Module 5 Apache Spark SQL and Data Frames
  • 54. Overview Spark SQL • It’s a library built on Spark to support SQL Like operations. • Facilitate eliminating RDDs from API for simplicity. • Traditional RDBMS developers can easily transit to Big Data. • Works with Structured Data that has a Schema. • Seamlessly mix SQL Queries with Spark Programs. • Supports JDBC. • Mix with RDBMS and NoSQL.
  • 55. Data Frames Spark Session • Like SparkContext for RDDs. • Gives Data Frames and Temp Tables. Data Frames • RDDs are for Spark Core while Data Frames are for Spark SQL. • Built upon RDDs • It’s a distributed collection of data organized as Rows and Columns. • Has a schema with column names and column types. • Interoperability with… • Collections, CSV, Data Bases, Hive/NoSQL tables, JSON, RDD etc.
  • 56. Operations on Data Frames • The filter: Its like ‘where’ clause. • The join: Its like joins in SQL. • The groupby: For grouping to get consolidation. • The agg: Compute aggregation like sum, average. • Allows mapping and reducing. • Operations nesting allowed.
  • 57. Spark SQL • Spark SQL : • Not intended for interactive/exploratory analysis. • Spark SQL reuses the Hive frontend and meta-store. • Gives full compatibility with existing Hive data, queries, and UDFs. • Spark SQL includes a cost-based optimizer, columnar storage and code generation to make queries fast. • It scales to thousands of nodes and multi hour queries using the Spark engine. Performance is its biggest advantage. • Provides full mid-query fault tolerance.
  • 59. What is Streaming? Standing Queries Query Logic Sources Targets ` Devices, Sensors Web servers Pagers & Monitoring devices KPI Dashboards Input Adapters Output AdaptersStreaming Engine Query Logic Query Logic Application at Runtime
  • 61. Why Spark Streaming? • One of the real powers of Spark • Typically analytics is performed on data at rest: • Databases, Flat files. The historical data, survey data etc. • The real time analytics is performed on data the moment generated • Complex Event Processing, Fraud detections, click stream processing etc. • What Spark Stream can do? • Look at data the moment it is arrived from source. • Transform, summarize, analyze • Perform machine learning • Prediction in real time
  • 62. Spark Streaming and MicroBatch Processing
  • 63. Spark Streaming Credit card fraud detection with high scalability and parallelism. Spam filtering Network intrusion detection Real time social media analytics Click Stream analytics Stock market analysis Advertise analytics
  • 64. Spark Streaming architecture Master Node Driver Program Spark Context Stream Context Cluster Manager Worker Node Executor Long Task Receiver Input Source Worker Node Worker Node Executor Tas k Cache
  • 65. Spark Streaming architecture A master node is with driver program with Spark Context. Create a streaming context from Spark Context. One of the worker node is assigned a long task of listening a source. The receiver keeps receiving data from input source. The receiver propagate data to worker nodes. The normal tasks in worker node act upon data.
  • 66. The DStream • Discretized stream • Created from Stream context • The micro batch window is set up for Dstream (Normally in secords) • The micro batch window is a small time slice (around 3 sec) in which generated real time data is accumulated as batch and wrapped in RDD called Dstream. • The Dstream allows all RDD operations. • A common data can be shared across Global Variables
  • 67. The Dstream windowing functions • They are for computing across multiple Dstreams. • All RDD functions are applicable on data accumulated from last X batches. • Ex: Accumulate last 3 batches together, Average of something of last 5 batches.
  • 68. Windowing in Spark 68 1 2 3 4 5 6 7 8 9 10 11 12 First 8 sec Window Second 8 sec Window
  • 69. Windowing Sample Code 69 Callback OperationWindow Length Sliding IntervalWindow Operation
  • 70. “Exactly once”, “At Least Once” Guarantees 70
  • 71. Streaming with Azure Event Hub 71 Azure Event Hub HDInsight Spark Streaming Power BI
  • 72. Module 7 Apache Spark : Analytics and Machine Learning
  • 73. Types of Analytics Descriptive Analytics : • Defining problem statement. What exactly happened. Exploratory Data Analytics: • Why something is happening. Inferential Analytics: • Understand population from the sample. Take sample and extrapolate to whole population. Predictive Analytics: • Forecast what will happen. Causal Analytics: • Variables are related. Understanding effect of change in one variable to another variable. Deep Analytics: • Analytics uses multi- source data sources, combining some or all above analytics.
  • 74. Data Analytics is needed everywhere – Recommendation engines Smart meter monitoringEquipment monitoringAdvertising analysisLife sciences research Fraud detection Healthcare outcomes Weather forecasting for business planningOil & Gas exploration Social network analysis Churn analysis Traffic flow optimization IT infrastructure & Web App optimization Legal discovery and document archivingIntelligence Gathering Location-based tracking & services Pricing Analysis Personalized Insurance
  • 75. Machine Learning in Sparks • Makes ML easy • Standard and common interface for different ML algorithms. • Contains algorithms and utilities. • It has two machine learning libraries. • The spark.mllib: Original API built on RDDs. May be deprecated soon. • The spark.ml: New higher level API built on Data Frames. • The Machine Learning Algorithms of Spark uses these data types… • Local Vector • Labeled Point • Every data to be submitted to ML must be converted to these data types.
  • 76. Other algorithms supported • Decision Tree • Dimensionality Reduction • Random Forest • Linear Regression • Naïve Bayes Classification • K-Means Clustering • Recommendation Engines • …. And many more
  • 77. Q & A Contact: chandrashekhardeshpande@synergetics-india.com, maheshshinde@synergetics-india.com
  • 78. References • Online reference: • http://spark.apache.org/docs/latest/index.html • http://spark.apache.org/docs/latest/programming- guide.html • http://spark.apache.org/docs/latest/api/java/index.html

Hinweis der Redaktion

  1. 3 Slides - 5-10 Minute Discussion – Good to understand customers cloud strategy and see if S+S and Leveraging Existing Investments is important to them. Find out what cloud vendors and solutions they are evaluating. For more details on the cloud definitions see wikipedia http://en.wikipedia.org/wiki/Cloud_computing
  2. points = spark.textFile(...).map(parsePoint).cache() w = numpy.random.ranf(size = D) # current separating plane for i in range(ITERATIONS):     gradient = points.map(         lambda p: (1 / (1 + exp(-p.y*(w.dot(p.x)))) - 1) * p.y * p.x     ).reduce(lambda a, b: a + b)     w -= gradient print "Final separating plane: %s" % w
  3. The property graph is a directed multigraph which can have multiple edges in parallel. Every edge and vertex has user defined properties associated with it. The parallel edges allow multiple relationships between the same vertices. Calculating shortest path between two airports. Calculating cheapest travel between two stations etc.
  4. 3 Slides - 5-10 Minute Discussion – Good to understand customers cloud strategy and see if S+S and Leveraging Existing Investments is important to them. Find out what cloud vendors and solutions they are evaluating. For more details on the cloud definitions see wikipedia http://en.wikipedia.org/wiki/Cloud_computing
  5. Master node: Driver Program: A program you write to initiate process. Spark Context: A gate way to all spark functionalities. Worker node: They work as per instruction from Master node. It executes Executor Programs. It is controlled by Cluster manager. The Cluster Manager may be YARN, MESOS or Spark Scheduler. How RDDs are created? The Spark Context reads the records from Data Source and hands they over to Cluster Manager. Cluster Manager partitioned and distribute them to different worker nodes. How transformation is applied to RDDs? The spark context delegate transformation (Job) through cluster manager to the Executors. They are now called as Tasks and are executed in Executors. Executors may create new RDDs as outcome of execution of Transformation. All these RDDs are collected in Master Spark Context.
  6. 3 Slides - 5-10 Minute Discussion – Good to understand customers cloud strategy and see if S+S and Leveraging Existing Investments is important to them. Find out what cloud vendors and solutions they are evaluating. For more details on the cloud definitions see wikipedia http://en.wikipedia.org/wiki/Cloud_computing
  7. 3 Slides - 5-10 Minute Discussion – Good to understand customers cloud strategy and see if S+S and Leveraging Existing Investments is important to them. Find out what cloud vendors and solutions they are evaluating. For more details on the cloud definitions see wikipedia http://en.wikipedia.org/wiki/Cloud_computing
  8. 3 Slides - 5-10 Minute Discussion – Good to understand customers cloud strategy and see if S+S and Leveraging Existing Investments is important to them. Find out what cloud vendors and solutions they are evaluating. For more details on the cloud definitions see wikipedia http://en.wikipedia.org/wiki/Cloud_computing
  9. 3 Slides - 5-10 Minute Discussion – Good to understand customers cloud strategy and see if S+S and Leveraging Existing Investments is important to them. Find out what cloud vendors and solutions they are evaluating. For more details on the cloud definitions see wikipedia http://en.wikipedia.org/wiki/Cloud_computing
  10. Credit card fraud detection: Every time credit card is swiped, a system must within fraction of seconds to analyze for abnormality situation to prevent credit card access from frauds by blocking. Spam filtering: Many mails may hit the email box in unit time. It is essential to apply some parameters to know is it a spam mail. Social media analytics: The data arising through social media need to be analyzed in real time to quickly arrive to some urgent conclusion. Network intrusion detection: To prevent hacking into system by quickly analyzing web logs and system logs. Click stream analysis: When internet user clicks through or browse through web pages, through analytics system need to give some recommendations Stock Market analysis: Very frequent fluctuations in stock market leads to analysis to anticipate and draw some conclusion. Advertise analysis: When search string is given, system needs to quickly show different advertises related to the search string.
  11. 3 Slides - 5-10 Minute Discussion – Good to understand customers cloud strategy and see if S+S and Leveraging Existing Investments is important to them. Find out what cloud vendors and solutions they are evaluating. For more details on the cloud definitions see wikipedia http://en.wikipedia.org/wiki/Cloud_computing