Introduction to Spark (Intern Event Presentation)

Databricks

An introduction to Apache Spark from its creator, Matei Zaharia, for the intern event hosted by Databricks.

Introduction to Spark
Matei Zaharia
Databricks Intern Event, August 2015

What is Apache Spark?
Fast and general computing engine for clusters
Makes it easy and fast to process large datasets
• APIs in Java, Scala, Python, R
• Libraries for SQL, streaming, machine learning, …
• 100x faster than Hadoop MapReduce for some apps

About Databricks
Founded by creators of Spark in 2013
Offers a hosted cloud service built on Spark
• Interactive workspace with notebooks, dashboards, jobs

Community Growth
Most active open source project in big data
[Figure: contributors per month to Spark, 2010 to 2015; vertical axis from 0 to 160 contributors]

Spark Programming Model
Write programs in terms of transformations on
distributed datasets
Resilient Distributed Datasets (RDDs)
• Collections of objects stored in memory or on disk across a cluster
• Built via parallel transformations (map, filter, …)
• Automatically rebuilt on failure
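
The three bullets above can be illustrated with a toy, single-machine sketch in plain Python. This is not Spark's implementation or API: `ToyRDD`, `from_list`, and `lose_cache` are hypothetical names made up for this example. Each object only remembers how to compute itself from its parent (its lineage), so transformations are lazy and a lost cached copy can be recomputed.

```python
class ToyRDD:
    """Toy stand-in for an RDD: it stores a recipe (lineage), not the data."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function returning an iterator
        self._use_cache = False
        self._cached = None        # filled on first evaluation after cache()

    @staticmethod
    def from_list(items):
        return ToyRDD(lambda: iter(items))

    def map(self, fn):
        # Lazy: nothing runs until an action (collect/count) is called.
        return ToyRDD(lambda: (fn(x) for x in self._iterate()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._iterate() if pred(x)))

    def cache(self):
        self._use_cache = True
        return self

    def lose_cache(self):
        """Simulate a failure that drops the in-memory copy."""
        self._cached = None

    def _iterate(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = list(self._compute())  # rebuild from lineage
            return iter(self._cached)
        return self._compute()

    def collect(self):
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())


calls = []  # records every time the map function actually runs


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD.from_list([1, 2, 3, 4, 5, 6])
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(square).cache()
calls_before_action = len(calls)      # 0: transformations are lazy

first = evens_squared.collect()       # runs the pipeline and fills the cache
n = evens_squared.count()             # served from the cache
calls_after_cached_count = len(calls)

evens_squared.lose_cache()            # pretend the cached copy was lost
rebuilt = evens_squared.collect()     # recomputed from the recorded lineage
```

The real system partitions the data across workers and recomputes only the lost partitions; the sketch keeps everything in one process to show the idea of lineage-based recovery.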

Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns

    lines = spark.textFile("hdfs://...")                      # Base RDD
    errors = lines.filter(lambda s: s.startswith("ERROR"))    # Transformed RDD
    messages = errors.map(lambda s: s.split('\t')[2])
    messages.cache()

    messages.filter(lambda s: "MySQL" in s).count()           # Action
    messages.filter(lambda s: "Redis" in s).count()
    ...

[Diagram: the driver sends tasks to three workers and receives results; each worker reads one block of the file (Block 1 to 3) and keeps its part of messages in memory (Cache 1 to 3).]

Result: full-text search of Wikipedia in 0.5 sec (vs 20 s for on-disk data)
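
To make each step of the pipeline concrete, here is the same logic run on a small local Python list, with no Spark involved. The log lines are made up, and the tab-separated layout (level, timestamp, message) is an assumption; the slide only shows that the third tab-separated field is kept.

```python
# Made-up, tab-separated log lines: level, timestamp, message (assumed layout).
sample_log = [
    "ERROR\t2015-08-01 10:00:01\tMySQL connection refused",
    "INFO\t2015-08-01 10:00:02\tRequest served",
    "ERROR\t2015-08-01 10:00:03\tRedis timeout",
    "ERROR\t2015-08-01 10:00:04\tMySQL query too slow",
]

# filter: keep only the lines that start with "ERROR"
errors = [s for s in sample_log if s.startswith("ERROR")]

# map: keep the third tab-separated field, the message text
messages = [s.split("\t")[2] for s in errors]

# Repeated searches reuse the already-extracted messages, which is the role
# messages.cache() plays on a cluster.
mysql_count = sum(1 for s in messages if "MySQL" in s)
redis_count = sum(1 for s in messages if "Redis" in s)

print(mysql_count, redis_count)  # prints "2 1"
```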

Example: Logistic Regression
Iterative algorithm used in machine learning
[Figure: running time (s) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark; vertical axis from 0 to 4000 s]
Hadoop: 110 s / iteration
Spark: first iteration 80 s, further iterations 1 s
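
A quick worked calculation shows what the per-iteration figures imply for the iteration counts on the chart. It assumes the per-iteration costs quoted on the slide stay constant, which is a simplification; the function names are made up for this sketch.

```python
HADOOP_S_PER_ITER = 110   # from the slide
SPARK_FIRST_ITER_S = 80   # from the slide: first iteration loads the data
SPARK_LATER_ITER_S = 1    # from the slide: later iterations reuse in-memory data


def hadoop_total(iterations):
    """Total seconds if every iteration costs the same."""
    return HADOOP_S_PER_ITER * iterations


def spark_total(iterations):
    """Total seconds with one expensive first iteration, then cheap ones."""
    return SPARK_FIRST_ITER_S + SPARK_LATER_ITER_S * (iterations - 1)


for n in (1, 5, 10, 20, 30):
    print(n, hadoop_total(n), spark_total(n))
```

Under this model, 30 iterations take 3300 s at 110 s each, versus 80 + 29 = 109 s when only the first iteration is expensive.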

On-Disk Performance
Time to sort 100 TB
2013 Record: Hadoop, 2100 machines, 72 minutes
2014 Record: Spark, 207 machines, 23 minutes
Source: Daytona GraySort benchmark, sortbenchmark.org
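
The two records can be compared with simple arithmetic. The sketch below uses only the numbers on the slide; machine-minutes is a crude measure chosen for this example, and it ignores any hardware differences between the two clusters.

```python
hadoop_machines, hadoop_minutes = 2100, 72   # 2013 record, from the slide
spark_machines, spark_minutes = 207, 23      # 2014 record, from the slide

speedup = hadoop_minutes / spark_minutes          # wall-clock ratio
machine_ratio = hadoop_machines / spark_machines  # cluster-size ratio

# Machine-minutes: machines multiplied by minutes, a rough proxy for resources used.
hadoop_machine_minutes = hadoop_machines * hadoop_minutes
spark_machine_minutes = spark_machines * spark_minutes
resource_ratio = hadoop_machine_minutes / spark_machine_minutes

print(f"{speedup:.2f}x faster on {machine_ratio:.2f}x fewer machines")
print(f"{resource_ratio:.1f}x fewer machine-minutes")
```

That works out to roughly 3x faster on roughly 10x fewer machines.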

Higher-Level Libraries
[Diagram: four libraries layered on top of Spark]
• Spark Streaming: real-time
• Spark SQL: structured data
• MLlib: machine learning
• GraphX: graph

Higher-Level Libraries

    # Load data using SQL
    points = ctx.sql("select latitude, longitude from tweets")

    # Train a machine learning model
    model = KMeans.train(points, 10)

    # Apply it to a stream
    sc.twitterStream(...) \
      .map(lambda t: (model.predict(t.location), 1)) \
      .reduceByWindow("5s", lambda a, b: a + b)
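
What the streaming step computes can be sketched in plain Python without Spark: map each event to a cluster id, then count events per cluster within a time window. Everything here is an assumption for illustration: the cluster centers, the events, the helper names `predict` and `counts_per_window`, and the use of non-overlapping 5-second windows as a simplification of the windowing on the slide.

```python
import math
from collections import Counter

# Hypothetical cluster centers (latitude, longitude); a trained model would supply these.
centers = [(37.77, -122.42), (40.71, -74.01), (51.51, -0.13)]


def predict(location):
    """Index of the nearest center by straight-line distance in (lat, lon) space."""
    return min(range(len(centers)), key=lambda i: math.dist(location, centers[i]))


# Made-up stream of (timestamp in seconds, location) events.
events = [
    (0.5, (37.80, -122.40)),
    (1.2, (40.70, -74.00)),
    (3.9, (37.70, -122.50)),
    (6.1, (51.50, -0.10)),
    (7.4, (40.80, -73.90)),
]


def counts_per_window(stream, window_s=5):
    """Count events per cluster in each non-overlapping window of window_s seconds."""
    windows = {}
    for ts, location in stream:
        bucket = int(ts // window_s)
        windows.setdefault(bucket, Counter())[predict(location)] += 1
    return windows


result = counts_per_window(events)
for bucket, counts in sorted(result.items()):
    print(bucket, dict(counts))
```

The point of the slide is that the SQL query, the trained model, and the stream all live in one program; the sketch only mirrors the last step.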

Spark Community
Over 1000 production users, clusters up to 8000 nodes
Many talks online at spark-summit.org

Ongoing Work
Speeding up Spark through code generation and
binary processing (Project Tungsten)
R interface to Spark (SparkR)
Real-time machine learning library
Frontend and backend work in Databricks
(visualization, collaboration, auto-scaling, …)

Demo

Thank you. We're hiring!

Editor's notes

Add “variables” to the “functions” in functional programming
100 GB of data on 50 m1.xlarge EC2 machines
Alibaba, Tencent. At Berkeley, we have been working on a solution since 2009. This solution consists of a software stack for data analytics called the Berkeley Data Analytics Stack, and its centerpiece is Spark. Spark has seen significant adoption: hundreds of companies use it, around sixteen of which have contributed code back. In addition, Spark has been deployed on clusters that exceed 1,000 nodes.