Apache Spark
Lightning Fast Cluster Computing
Eric Mizell – Director, Solution Engineering
Page2 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
What is Apache Spark?
Apache Open Source Project
Distributed Compute Engine
for fast and expressive data processing
Designed for Iterative, In-Memory
computations and interactive data mining
Expressive Multi-Language APIs
for Java, Scala, Python, and R
Powerful Abstractions
Enable data workers to rapidly iterate over
data for:
• ETL, Machine Learning, SQL, Stream Processing,
and Graph Processing
[Figure: the Spark stack. Spark SQL, Spark Streaming, MLlib, and GraphX run on the Spark Core Engine, which exposes Scala, Java, and Python APIs.]
Why Spark?
Elegant Developer APIs
• Data Frames/SQL, Machine Learning, Graph algorithms and streaming
• Scala, Python, Java and R
• Single environment for pre-processing and Machine Learning
In-memory computation model
• Effective for iterative computations and machine learning
Machine Learning On Hadoop
• Implementation of distributed ML-algorithms
• Pipeline API (Spark ML)
Runs on Hadoop YARN, on Mesos, or standalone
Interactions with Spark
Command Line
• Scala shell – Scala/Java (./bin/spark-shell)
• Python shell (./bin/pyspark)
Notebooks
• Apache Zeppelin Notebook
• Jupyter/IPython Notebook
• IRuby Notebook
ODBC/JDBC (Spark SQL only via Thrift)
• Simba driver
• DataDirect driver
Introducing Apache Zeppelin: Web-based Notebook for
interactive analytics
Features
Ad-hoc experimentation
Deeply integrated with Spark + Hadoop
Supports multiple language backends
Incubating at Apache
Use Case
Data exploration and discovery
Visualization
Interactive snippet-at-a-time experience
“Modern Data Science Studio”
Fundamental Abstraction: Resilient Distributed Datasets (RDDs)
Work with distributed collections as
primitives
RDD Properties
• Immutable collections of objects spread across
a cluster
• Built through parallel transformations (map,
filter, etc.)
• Automatically rebuilt on failure
• Controllable persistence (e.g. caching in RAM)
Multiple Languages
broad developer, partner and customer
engagement
[Figure: logical vs. physical view of an RDD. The developer writes driver code against one logical RDD (sc = new SparkContext; rDD = sc.textFile("hdfs://…"); rDD.filter(…); rDD.cache; rDD.count; rDD.map; …). Physically, the RDD is split into Partitions 1-3, spread across three worker nodes.]
RDDs are collections of objects distributed across a cluster,
cached in RAM or on disk. They are built through parallel
transformations, automatically rebuilt on failure and immutable
(each transformation creates a new RDD).
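The properties above (immutability, lazy parallel transformations, rebuild on failure, controllable caching) can be sketched in a few lines of plain Python. This is a single-process toy, not Spark's API: the class `ToyRDD` and its `evict` method are hypothetical names invented for illustration. Each transformation returns a new object that remembers how to recompute itself, so dropping the cached data (standing in for a lost partition) is harmless.

```python
class ToyRDD:
    """Single-process stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, compute, lineage):
        self._compute = compute    # recipe that can rebuild the data at any time
        self.lineage = lineage     # readable record of how this dataset was derived
        self._persist = False
        self._cached = None

    @classmethod
    def from_list(cls, data):
        snapshot = tuple(data)     # immutable copy of the source data
        return cls(lambda: iter(snapshot), "source")

    def _iter(self):
        if self._cached is not None:
            return iter(self._cached)
        items = self._compute()    # recompute from the parent(s)
        if self._persist:
            self._cached = list(items)
            return iter(self._cached)
        return items

    def map(self, f):
        # Lazy: nothing runs until an action such as collect() is called.
        return ToyRDD(lambda: (f(x) for x in self._iter()), self.lineage + " -> map")

    def filter(self, keep):
        return ToyRDD(lambda: (x for x in self._iter() if keep(x)), self.lineage + " -> filter")

    def cache(self):
        self._persist = True       # data is materialized on the next action
        return self

    def evict(self):
        self._cached = None        # simulate losing the in-memory copy

    def collect(self):
        return list(self._iter())  # action: triggers the computation


base = ToyRDD.from_list([1, 2, 3, 4, 5, 6])
squares = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()

first = squares.collect()    # computed and cached
squares.evict()              # pretend the cached copy was lost
rebuilt = squares.collect()  # transparently recomputed from the lineage
```

The design point is that the recipe, not the data, is the durable part: `base` is never modified, and `squares.lineage` records the chain of transformations needed to rebuild it.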
What can developers do with RDDs?
RDD Operations
Transformations
• e.g. map, filter, groupBy, join
• Lazy operations to build RDDs from other
RDDs
Actions
• e.g. count, collect, save
• Return a result or write it to storage
Other primitives
• Accumulator
• Broadcast Variables
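The two "other primitives" can be pictured with a plain-Python sketch, assuming a simplified model in which each partition is processed by one task. The function `run_task`, the sample country codes, and the lookup table are all invented for illustration and are not Spark calls. A broadcast variable behaves like a read-only lookup shared with every task; an accumulator behaves like a counter whose per-task contributions are merged on the driver.

```python
from types import MappingProxyType

# Read-only lookup shared with every task (plays the role of a broadcast variable).
country_names = MappingProxyType(
    {"us": "United States", "de": "Germany", "fr": "France"}
)

# Hypothetical input, already split into three partitions.
partitions = [["us", "de", "xx"], ["fr", "us"], ["??", "de"]]


def run_task(partition, lookup):
    """Process one partition; return resolved names and a local unknown-code count."""
    resolved, unknown = [], 0
    for code in partition:
        if code in lookup:
            resolved.append(lookup[code])
        else:
            unknown += 1
    return resolved, unknown


resolved_all = []
unknown_total = 0  # plays the role of an accumulator, merged on the driver
for part in partitions:
    resolved, unknown = run_task(part, country_names)
    resolved_all.extend(resolved)
    unknown_total += unknown
```

Tasks only read the shared lookup and only add to the counter, which is what makes both primitives safe to use across many workers.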
Example: Mining Console Logs
Load error messages from a log into memory, then
interactively search for patterns
lines = sc.textFile("hdfs://...")                        # Base RDD (sc is the SparkContext)
errors = lines.filter(lambda s: s.startswith("ERROR"))   # Transformed RDD
messages = errors.map(lambda s: s.split('\t')[2])        # Transformed RDD: third tab-separated field
messages.cache()                                         # keep messages in memory once computed
messages.filter(lambda s: "foo" in s).count()            # Action
messages.filter(lambda s: "bar" in s).count()            # Action
. . .
Result: full-text search of Wikipedia in <1 sec
(vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec
(vs 170 sec for on-disk data)
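Why caching helps here can be shown without a cluster. The sketch below is plain Python, not Spark: the five log lines are made-up sample data and `read_log` is a hypothetical stand-in for reading from HDFS. It counts how often the source is scanned when each query recomputes the error messages versus when they are computed once and kept in memory.

```python
# Made-up sample log: level, date, and message separated by tabs.
LOG = [
    "INFO\t2015-10-19\tservice started",
    "ERROR\t2015-10-19\tfoo failed to connect",
    "ERROR\t2015-10-19\tbar timed out",
    "WARN\t2015-10-19\tdisk nearly full",
    "ERROR\t2015-10-19\tfoo retry exhausted",
]

stats = {"scans": 0}


def read_log():
    """Stand-in for an expensive read from storage; counts every full scan."""
    stats["scans"] += 1
    return iter(LOG)


def error_messages():
    """Filter ERROR lines and keep the third tab-separated field."""
    return [line.split("\t")[2] for line in read_log() if line.startswith("ERROR")]


# Without caching: every query rescans the source.
uncached_foo = sum("foo" in m for m in error_messages())
uncached_bar = sum("bar" in m for m in error_messages())
scans_uncached = stats["scans"]

# With caching: scan once, then query the in-memory list repeatedly.
stats["scans"] = 0
messages = error_messages()
cached_foo = sum("foo" in m for m in messages)
cached_bar = sum("bar" in m for m in messages)
scans_cached = stats["scans"]
```

The answers are identical in both modes; only the number of passes over the source changes, and that gap grows with every additional interactive query.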
RDD
Demo
SQL
SQL Access and Data Frames
Spark SQL
Table Structure
integrated to work with tables and rows
Hive Queries via Spark
the Spark SQL context can connect to Hive and
run Hive queries
Bindings
to Python, Scala, Java, and R
Data Frames
a new abstraction that simplifies and speeds up SQL
processing
[Figure: Spark SQL layers on the Spark Core Engine: Data Frame DSL, Spark SQL, Data Frame API, and Data Source API.]
What are Data Frames?
Data Frames represent data in RDDs as a Table
RDD is a low level abstraction
–Think of RDD as bytecode and DataFrame as the
Java Program
Data Frame Properties
–Data Frames attach schema to RDDs
–The schema allows aggressive query
optimizations
–Brings the power of SQL to RDDs!
dept  name      age
Bio   H Smith   48
CS    A Turing  54
Bio   B Jones   43
Phys  E Witten  61
[Figure: Data Frames give a relational view of tuples over storage: columnar storage (ORCFile, Parquet), unstructured data (JSON, CSV, Text, Avro, Weblog), and custom sources.]
Data Frames are intuitive
Find average age by department?
dept  name      age
Bio   H Smith   48
CS    A Turing  54
Bio   B Jones   43
Phys  E Witten  61
[Figure: side-by-side code, an RDD example and the equivalent Data Frame example.]
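The slide's two code samples did not survive as text, so what follows is only a plain-Python stand-in for the contrast it draws, using the four rows of the table. The helpers `reduce_by_key` and `group_by_avg` are hypothetical and are not Spark functions. The first version mirrors the low-level RDD style (build key/value pairs, combine sums and counts by key, then divide); the second mirrors the Data Frame style (name the grouping column and the aggregate).

```python
from collections import defaultdict
from statistics import mean

rows = [
    ("Bio", "H Smith", 48),
    ("CS", "A Turing", 54),
    ("Bio", "B Jones", 43),
    ("Phys", "E Witten", 61),
]


# RDD style: spell out how the aggregation is computed.
def reduce_by_key(pairs, combine):
    """Merge all values that share a key using the given combine function."""
    merged = {}
    for key, value in pairs:
        merged[key] = combine(merged[key], value) if key in merged else value
    return merged


pairs = [(dept, (age, 1)) for dept, _name, age in rows]
totals = reduce_by_key(pairs, lambda a, b: (a[0] + b[0], a[1] + b[1]))
avg_rdd_style = {dept: total / count for dept, (total, count) in totals.items()}


# Data Frame style: rows carry column names, so the query states what is wanted.
def group_by_avg(records, key, value):
    """Average the `value` column within each distinct `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    return {k: mean(v) for k, v in groups.items()}


records = [dict(zip(("dept", "name", "age"), row)) for row in rows]
avg_df_style = group_by_avg(records, key="dept", value="age")
```

Both versions return the same averages. The second is shorter at the call site because the schema (column names) carries the information that the first version encodes by hand in tuple positions.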
DataFrame
Demo
MLlib
Machine Learning Library
What is Machine Learning?
Machine learning is the study of
algorithms that learn concepts from
data.
A key aspect of learning is
generalization: how well a learning
algorithm is able to predict on unseen
examples.
Machine Learning Primitives
Unsupervised Learning
Clustering (K-means)
Recommendation
Collaborative Filtering
- alternating least squares
Dimensionality Reduction
- Principal component analysis (PCA) and singular
value decomposition (SVD)
Supervised Learning
Classification
- Naïve Bayes, Decision Tree, Random Forest,
Gradient Boosted Trees
Regression
- linear, logistic and Support Vector Machines
(SVMs)
ML Workflows are complex
[Figure: Sponsored Search Advertising pipeline. Components shown: Ad Server, HDFS, Log Parsing and Cleaning, Feature Extraction (Q-Q and Q-A similarity, Ad category mapping, Query category mapping, Poly Exp (Q-A)), Features, train/test, Linear Solver, Model, Metrics.]
Pipeline Challenges:
-> specify pipeline
-> inspect and debug
-> tune hyperparameters
-> productionize
ML Pipeline makes ML workflows easier
Transformer
Transforms one dataset into another
Estimator
Fits model to data
Pipeline
Sequence of stages, consisting of estimators
or transformers
Parameters
Trait for components that take parameters
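How the pieces named above fit together can be sketched in plain Python. This is a toy model under stated assumptions, not the Spark ML API: the classes `Scaler`, `ScalerModel`, `Square`, `Pipeline`, and `PipelineModel` are written here for illustration and operate on lists of numbers rather than Data Frames. An estimator's `fit` returns a transformer; a pipeline fits its stages in order, passing each stage's output on to the next.

```python
from statistics import mean, pstdev


class ScalerModel:
    """Transformer: rescales values using statistics learned during fit."""

    def __init__(self, mu, sigma):
        self.mu, self.sigma = mu, sigma

    def transform(self, xs):
        return [(x - self.mu) / self.sigma for x in xs]


class Scaler:
    """Estimator: learns mean and standard deviation, then returns a transformer."""

    def fit(self, xs):
        return ScalerModel(mean(xs), pstdev(xs) or 1.0)


class Square:
    """Transformer with nothing to learn."""

    def transform(self, xs):
        return [x * x for x in xs]


class PipelineModel:
    """Fitted pipeline: a chain of transformers applied in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data


class Pipeline:
    """Sequence of estimators and transformers, fitted stage by stage."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # estimator: replace it with its fitted model
                stage = stage.fit(data)
            data = stage.transform(data)   # feed the transformed data to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


training = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # mean 5.0, standard deviation 2.0
model = Pipeline([Scaler(), Square()]).fit(training)
predictions = model.transform([9.0, 5.0, 1.0])
```

Because the fitted pipeline is itself a transformer, the same object that was trained can be applied unchanged to new data, which addresses the "specify" and "productionize" challenges in one abstraction.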
Streaming
Real Time Stream Processing
Spark Streaming
• Spark Streaming is an extension of the core Spark API that supports scalable,
high-throughput, fault-tolerant streaming applications.
• Data can be ingested from many sources, such as Kafka, Flume, Twitter, ZeroMQ, or
TCP sockets.
• Data is processed using the now-familiar API: map, filter, reduce, join, and window.
• Processed data can be pushed out to databases, filesystems, or live dashboards.
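The window operation in the list above can be illustrated with a plain-Python sketch. It assumes the stream has already been cut into small batches of text lines, which is how Spark Streaming handles input, but the function `windowed_counts` and the sample batches are invented for illustration and are not part of any Spark API. For each incoming batch it reports word counts over the most recent `window_length` batches.

```python
from collections import Counter, deque


def windowed_counts(batches, window_length):
    """Yield word counts over a sliding window of the latest batches."""
    window = deque(maxlen=window_length)   # the oldest batch drops out automatically
    for batch in batches:
        # map + reduce within the batch: split lines into words and count them
        window.append(Counter(word for line in batch for word in line.split()))
        total = Counter()
        for counts in window:              # reduce across the window
            total.update(counts)
        yield dict(total)


# Hypothetical stream: three batches of text lines.
batches = [
    ["spark streaming"],
    ["spark sql", "spark"],
    ["kafka"],
]
results = list(windowed_counts(batches, window_length=2))
```

With a window of two batches, the third result no longer includes "streaming", because the first batch has slid out of the window.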
GraphX
Graph Processing
Spark GraphX: Graph API on Spark
Seamlessly work with graphs and collections
Growing library of graph algorithms
• SVD++, Connected Components, Triangle
Count, …
Iterative Graph Computations using
Pregel
Implements Valiant’s Bulk Synchronous
Parallel (BSP) model for distributing graph
algorithms.
Use Case
Social Media: Suggest new connections based
on existing relationships
Networking: Best routing through a given
network
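The Pregel-style iteration mentioned above can be sketched as a single-process Python loop. This is an illustration of the bulk synchronous parallel idea, not GraphX code: the function `connected_components` is written here for the example, and the G-H edge is an assumption added to the A-F graph so that there is a second component to find. In each superstep, vertices whose label changed send it to their neighbors; after all messages are delivered, every vertex keeps the smallest label it has seen; the loop stops when no vertex changes.

```python
def connected_components(edges):
    """Label each vertex with the smallest vertex id in its component."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    label = {v: v for v in neighbors}   # every vertex starts as its own component
    active = set(neighbors)             # vertices that still have something to send
    supersteps = 0

    while active:
        # Message phase: active vertices send their current label to each neighbor.
        inbox = {}
        for v in active:
            for n in neighbors[v]:
                inbox[n] = min(inbox.get(n, label[v]), label[v])

        # Barrier, then compute phase: adopt a smaller label if one arrived.
        active = set()
        for v, smallest in inbox.items():
            if smallest < label[v]:
                label[v] = smallest
                active.add(v)
        supersteps += 1

    return label, supersteps


edges = [
    ("A", "B"), ("A", "C"), ("C", "D"), ("B", "C"),
    ("A", "E"), ("A", "F"), ("E", "F"), ("E", "D"),
    ("G", "H"),   # assumed extra edge, not in the original graph
]
labels, steps = connected_components(edges)
```

Labels are only updated after the whole message phase has finished, which is the synchronization barrier that gives the model its name.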
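Distributed Graphs as Tables (RDDs)
[Figure: a property graph with vertices A-F and edges A-B, A-C, C-D, B-C, A-E, A-F, E-F, E-D is stored as three RDD-backed tables, each split across two partitions (Part. 1 and Part. 2): a vertex table, an edge table, and a routing table that records which edge partitions (1, 2) each vertex appears in. Edges are partitioned with a 2D vertex-cut heuristic.]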
How to Get Started with Spark
Try Spark Today
Download the Hortonworks Sandbox
http://hortonworks.com/products/hortonworks-sandbox/
Go to the Apache Spark Website
http://spark.apache.org/
Learn Spark
Build a Proof of Concept
Test New Functionality
Thank You!
Eric Mizell - Director, Solutions Engineering
emizell@hortonworks.com

Configuration Security as a Game of Pursuit InterceptConfiguration Security as a Game of Pursuit Intercept
Configuration Security as a Game of Pursuit InterceptAll Things Open
 
Scaling an Open Source Sponsorship Program
Scaling an Open Source Sponsorship ProgramScaling an Open Source Sponsorship Program
Scaling an Open Source Sponsorship ProgramAll Things Open
 
Build Developer Experience Teams for Open Source
Build Developer Experience Teams for Open SourceBuild Developer Experience Teams for Open Source
Build Developer Experience Teams for Open SourceAll Things Open
 
Deploying Models at Scale with Apache Beam
Deploying Models at Scale with Apache BeamDeploying Models at Scale with Apache Beam
Deploying Models at Scale with Apache BeamAll Things Open
 
Sudo – Giving access while staying in control
Sudo – Giving access while staying in controlSudo – Giving access while staying in control
Sudo – Giving access while staying in controlAll Things Open
 
Fortifying the Future: Tackling Security Challenges in AI/ML Applications
Fortifying the Future: Tackling Security Challenges in AI/ML ApplicationsFortifying the Future: Tackling Security Challenges in AI/ML Applications
Fortifying the Future: Tackling Security Challenges in AI/ML ApplicationsAll Things Open
 
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...All Things Open
 

More from All Things Open (20)

Building Reliability - The Realities of Observability
Building Reliability - The Realities of ObservabilityBuilding Reliability - The Realities of Observability
Building Reliability - The Realities of Observability
 
Modern Database Best Practices
Modern Database Best PracticesModern Database Best Practices
Modern Database Best Practices
 
Open Source and Public Policy
Open Source and Public PolicyOpen Source and Public Policy
Open Source and Public Policy
 
Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...
Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...
Weaving Microservices into a Unified GraphQL Schema with graph-quilt - Ashpak...
 
The State of Passwordless Auth on the Web - Phil Nash
The State of Passwordless Auth on the Web - Phil NashThe State of Passwordless Auth on the Web - Phil Nash
The State of Passwordless Auth on the Web - Phil Nash
 
Total ReDoS: The dangers of regex in JavaScript
Total ReDoS: The dangers of regex in JavaScriptTotal ReDoS: The dangers of regex in JavaScript
Total ReDoS: The dangers of regex in JavaScript
 
What Does Real World Mass Adoption of Decentralized Tech Look Like?
What Does Real World Mass Adoption of Decentralized Tech Look Like?What Does Real World Mass Adoption of Decentralized Tech Look Like?
What Does Real World Mass Adoption of Decentralized Tech Look Like?
 
How to Write & Deploy a Smart Contract
How to Write & Deploy a Smart ContractHow to Write & Deploy a Smart Contract
How to Write & Deploy a Smart Contract
 
Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow
 Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow
Spinning Your Drones with Cadence Workflows, Apache Kafka and TensorFlow
 
DEI Challenges and Success
DEI Challenges and SuccessDEI Challenges and Success
DEI Challenges and Success
 
Scaling Web Applications with Background
Scaling Web Applications with BackgroundScaling Web Applications with Background
Scaling Web Applications with Background
 
Supercharging tutorials with WebAssembly
Supercharging tutorials with WebAssemblySupercharging tutorials with WebAssembly
Supercharging tutorials with WebAssembly
 
Using SQL to Find Needles in Haystacks
Using SQL to Find Needles in HaystacksUsing SQL to Find Needles in Haystacks
Using SQL to Find Needles in Haystacks
 
Configuration Security as a Game of Pursuit Intercept
Configuration Security as a Game of Pursuit InterceptConfiguration Security as a Game of Pursuit Intercept
Configuration Security as a Game of Pursuit Intercept
 
Scaling an Open Source Sponsorship Program
Scaling an Open Source Sponsorship ProgramScaling an Open Source Sponsorship Program
Scaling an Open Source Sponsorship Program
 
Build Developer Experience Teams for Open Source
Build Developer Experience Teams for Open SourceBuild Developer Experience Teams for Open Source
Build Developer Experience Teams for Open Source
 
Deploying Models at Scale with Apache Beam
Deploying Models at Scale with Apache BeamDeploying Models at Scale with Apache Beam
Deploying Models at Scale with Apache Beam
 
Sudo – Giving access while staying in control
Sudo – Giving access while staying in controlSudo – Giving access while staying in control
Sudo – Giving access while staying in control
 
Fortifying the Future: Tackling Security Challenges in AI/ML Applications
Fortifying the Future: Tackling Security Challenges in AI/ML ApplicationsFortifying the Future: Tackling Security Challenges in AI/ML Applications
Fortifying the Future: Tackling Security Challenges in AI/ML Applications
 
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...
Securing Cloud Resources Deployed with Control Planes on Kubernetes using Gov...
 

Recently uploaded

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 

Recently uploaded (20)

E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 

Apache Spark Fast Cluster Computing

  • 1. Apache Spark: Lightning Fast Cluster Computing. Eric Mizell – Director, Solution Engineering
  • 2. © Hortonworks Inc. 2011 – 2015. All Rights Reserved. What is Apache Spark?
      • An Apache open source project
      • Distributed compute engine for fast and expressive data processing
      • Designed for iterative, in-memory computations and interactive data mining
      • Expressive multi-language APIs for Java, Scala, Python, and R
      • Powerful abstractions that enable data workers to rapidly iterate over data for ETL, machine learning, SQL, stream processing, and graph processing
      • Diagram labels: Scala, Java, Python APIs; Spark Core Engine; Spark SQL; Spark Streaming; MLlib; GraphX
  • 3. Why Spark?
      • Elegant developer APIs: Data Frames/SQL, machine learning, graph algorithms, and streaming; Scala, Python, Java, and R; a single environment for pre-processing and machine learning
      • In-memory computation model: effective for iterative computations and machine learning
      • Machine learning on Hadoop: implementation of distributed ML algorithms; Pipeline API (Spark ML)
      • Runs on Hadoop (YARN), Mesos, or standalone
  • 4. Interactions with Spark
      • Command line: Scala shell for Scala/Java (./bin/spark-shell); Python shell (./bin/pyspark)
      • Notebooks: Apache Zeppelin Notebook; Jupyter/IPython Notebook; IRuby Notebook
      • ODBC/JDBC (Spark SQL only, via Thrift): Simba driver; DataDirect driver
  • 5. Introducing Apache Zeppelin: a web-based notebook for interactive analytics
      • Features: ad-hoc experimentation; deeply integrated with Spark + Hadoop; supports multiple language backends; incubating at Apache
      • Use cases: data exploration and discovery; visualization; interactive snippet-at-a-time experience; “Modern Data Science Studio”
  • 6. Fundamental Abstraction: Resilient Distributed Datasets (RDDs). Work with distributed collections as primitives.
      • RDD properties: immutable collections of objects spread across a cluster; built through parallel transformations (map, filter, etc.); automatically rebuilt on failure; controllable persistence (e.g. caching in RAM)
      • Multiple languages: broad developer, partner and customer engagement
      • Diagram: the developer writes driver code (sc = new SparkContext; rdd = sc.textFile("hdfs://…"); rdd.filter(…); rdd.cache; rdd.count; rdd.map …). The logical RDD in the Spark driver corresponds physically to RDD partitions 1, 2, and 3 on three worker nodes.
      • RDDs are collections of objects distributed across a cluster, cached in RAM or on disk. They are built through parallel transformations, automatically rebuilt on failure, and immutable (each transformation creates a new RDD).
  • 7. What can developers do with RDDs?
      • Transformations (e.g. map, filter, groupBy, join): lazy operations that build RDDs from other RDDs
      • Actions (e.g. count, collect, save): return a result or write it to storage
      • Other primitives: accumulators and broadcast variables
  • 8. Example: Mining Console Logs. Load error messages from a log into memory, then interactively search for patterns.
        lines = spark.textFile("hdfs://...")                      # base RDD
        errors = lines.filter(lambda s: s.startswith("ERROR"))    # transformed RDD
        messages = errors.map(lambda s: s.split('\t')[2])
        messages.cache()
        messages.filter(lambda s: "foo" in s).count()             # action
        messages.filter(lambda s: "bar" in s).count()
        . . .
      • Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
      • Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
  • 9. RDD Demo
  • 10. SQL: SQL Access and Data Frames. Stack diagram labels: YARN; HDFS; Scala, Java, Python APIs; Spark Core Engine; Spark SQL; Spark Streaming; MLlib; GraphX
  • 11. Spark SQL
      • Table structure: integrated to work with tables and rows
      • Hive queries via Spark: the Spark SQL context can connect to Hive and query it
      • Bindings to Python, Scala, Java, and R
      • Data Frames: a new abstraction that simplifies and speeds up SQL processing
      • Diagram labels: Spark Core Engine; Spark SQL; Data Frame DSL; Data Frame API; Data Source API; YARN; HDFS
  • 12. What are Data Frames? Data Frames represent data in RDDs as a table.
      • An RDD is a low-level abstraction: think of an RDD as bytecode and a DataFrame as the Java program
      • Data Frame properties: Data Frames attach a schema to RDDs; this allows users to perform aggressive query optimizations; it brings the power of SQL to RDDs
      • Example table:
            dept   name       age
            Bio    H Smith    48
            CS     A Turing   54
            Bio    B Jones    43
            Phys   E Witten   61
      • Diagram labels: Storage; Tuple; Relational View; Columnar Storage (ORCFile, Parquet); Unstructured Data (JSON, CSV, Text, Avro, Custom, Weblog)
  • 13. Data Frames are intuitive. Question: find the average age by department, using the dept/name/age table from slide 12. The slide sets an RDD example beside the equivalent Data Frame example (the example code is not captured in this transcript).
  • 14. DataFrame Demo
  • 15. MLlib: Machine Learning Library
  • 16. What is Machine Learning? Machine learning is the study of algorithms that learn concepts from data. A key aspect of learning is generalization: how well a learning algorithm is able to predict on unseen examples.
  • 17. Machine Learning Primitives
      • Unsupervised learning: clustering (K-means); recommendation via collaborative filtering (alternating least squares); dimensionality reduction (principal component analysis (PCA) and singular value decomposition (SVD))
      • Supervised learning: classification (Naïve Bayes, Decision Tree, Random Forest, Gradient Boosted Trees); regression (linear, logistic, and Support Vector Machines (SVMs))
  • 18. ML Workflows are complex. Example: Sponsored Search Advertising Pipeline.
      • Diagram labels: Ad Server; HDFS; Log Parsing, Cleaning; Ad category mapping; Query category mapping; Feature Extraction; Q-Q, Q-A similarity; Poly Exp (Q-A); Features; Linear Solver; Model; train; test; Metrics
      • Challenges: specify pipeline; inspect and debug; tune hyperparameters; productionize
  • 19. ML Pipeline makes ML workflows easier
      • Transformer: transforms one dataset into another
      • Estimator: fits a model to data
      • Pipeline: a sequence of stages, consisting of estimators or transformers
      • Parameters: trait for components that take parameters
  • 20. Streaming: Real Time Stream Processing
  • 21. Spark Streaming
      • Spark Streaming is an extension of the core Spark API that supports scalable, high-throughput, and fault-tolerant streaming applications.
      • Data can be ingested from many data sources such as Kafka, Flume, Twitter, ZeroMQ, or TCP sockets.
      • Data is processed using the now-familiar API: map, filter, reduce, join, and window.
      • Processed data can be stored in databases, filesystems, or live dashboards.
  • 22. GraphX: Graph Processing
  • 23. Spark GraphX: Graph API on Spark
      • Seamlessly work with graphs and collections
      • Growing library of graph algorithms: SVD++, Connected Components, Triangle Count, …
      • Iterative graph computations using Pregel: implements Valiant’s Bulk Synchronous Parallel (BSP) model for distributing graph algorithms
      • Use cases: social media (suggest new connections based on existing relationships); networking (best routing through a given network)
  • 24. Distributed Graphs as Tables (RDDs). Figure: a property graph with vertices A to F is split by a 2D vertex-cut heuristic into two partitions (Part. 1, Part. 2) and stored as a vertex table (RDD), an edge table (RDD), and a routing table (RDD). Edge table rows: A B, A C, C D, B C, A E, A F, E F, E D.
  • 25. How to Get Started with Spark
  • 26. Try Spark Today
      • Download the Hortonworks Sandbox: http://hortonworks.com/products/hortonworks-sandbox/
      • Go to the Apache Spark website: http://spark.apache.org/
      • Learn Spark; build a proof of concept; test new functionality
  • 27. © Hortonworks Inc. 2013. Thank You! Eric Mizell - Director, Solutions Engineering, emizell@hortonworks.com
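
Slide 13 asks for the average age by department, but the transcript does not capture the slide's code. As a rough illustration only, the sketch below uses plain Python (standard library, not the Spark API) to mirror the map, reduce-by-key, map-values pattern that an RDD-style solution would typically follow. The data is the table from the slides; the names `rows` and `average_age_by_dept` are hypothetical and chosen for this example.

```python
from collections import defaultdict

# Table from slides 12 and 13: (dept, name, age)
rows = [
    ("Bio", "H Smith", 48),
    ("CS", "A Turing", 54),
    ("Bio", "B Jones", 43),
    ("Phys", "E Witten", 61),
]


def average_age_by_dept(records):
    """Return {dept: mean age}, mimicking map -> reduce-by-key -> map-values."""
    # "map": emit (dept, (age, 1)) pairs
    pairs = ((dept, (age, 1)) for dept, _name, age in records)

    # "reduce by key": sum the ages and the counts per department
    totals = defaultdict(lambda: (0, 0))
    for dept, (age, count) in pairs:
        total_age, total_count = totals[dept]
        totals[dept] = (total_age + age, total_count + count)

    # "map values": divide each age sum by its count
    return {dept: age_sum / n for dept, (age_sum, n) in totals.items()}


averages = average_age_by_dept(rows)
print(averages)
```

The bookkeeping of sums and counts is the part a Data Frame hides: in PySpark the same question is typically a single `df.groupBy("dept").avg("age")`, which is the contrast the slide draws.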

Editor's Notes

  1. NEED SPEAKER NOTES
  2. NEED SPEAKER NOTES
  3. NEED SPEAKER NOTES
  4. TALK TRACK Ad-hoc experimentation: Spark, Hive, Shell, Flink, Tajo, Ignite, Lens, etc. Deeply integrated with Spark + Hadoop: can be managed via Ambari Stacks. Supports multiple language backends: pluggable “Interpreters”. Incubating at Apache: 100% open source and open community. [NEXT SLIDE]
  5. TALK TRACK Ensuring Spark is well integrated with YARN, Ambari, and Ranger enables enterprises to deploy Spark apps with confidence, and since HDP is available across Windows, Linux, on-premises and cloud deployment environments, it is that much easier for enterprises to adopt. [NEXT SLIDE] http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pd
  6. TALK TRACK Ensuring Spark is well integrated with YARN, Ambari, and Ranger enables enterprises to deploy Spark apps with confidence, and since HDP is available across Windows, Linux, on-premises and cloud deployment environments, it is that much easier for enterprises to adopt. [NEXT SLIDE] http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pd
  7. Key idea: add “variables” to the “functions” in functional programming
  8. NEED SPEAKER NOTES
  9. NEED SPEAKER NOTES
  10. NEED SPEAKER NOTES
  11. Spark DataFrames represent tabular Data
  12. NEED SPEAKER NOTES
  13. NEED SPEAKER NOTES
  14. NEED SPEAKER NOTES
  15. TALK TRACK Ensuring Spark is well integrated with YARN, Ambari, and Ranger enables enterprises to deploy Spark apps with confidence, and since HDP is available across Windows, Linux, on-premises and cloud deployment environments, it is that much easier for enterprises to adopt. [NEXT SLIDE]
  16. TALK TRACK [NEXT SLIDE]
  17. NEED SPEAKER NOTES
  18. NEED SPEAKER NOTES
  19. TALK TRACK Ensuring Spark is well integrated with YARN, Ambari, and Ranger enables enterprises to deploy Spark apps with confidence, and since HDP is available across Windows, Linux, on-premises and cloud deployment environments, it is that much easier for enterprises to adopt. [NEXT SLIDE] [RESOURCES] A vertex is an entity that can carry a bag of data (generally small). An edge connects vertices and can also own a bag of data. https://amplab.cs.berkeley.edu/wp-content/uploads/2013/05/grades-graphx_with_fonts.pdf
  20. Takeaways: change the order of the interoperability slide; flesh out the no lock-in slide to talk about “proprietary open source”.