SlideShare a Scribd company logo
1 of 46
CONFIDENTIAL - RESTRICTED
Introduction to Spark
Scala SB Meetup
December 18th 2014
Maxime Dumas
Systems Engineer, Cloudera
Thirty Seconds About Max
• Systems Engineer
• aka Sales Engineer
• SoCal, AZ, NV
• former coder of PHP
• teaches meditation + yoga
• from Montreal, Canada
2
What Does Cloudera Do?
• product
• distribution of Hadoop components, Apache licensed
• enterprise tooling
• support
• training
• services (aka consulting)
• community
3
4
Quick and dirty, for context.
The Apache Hadoop Ecosystem
©2014 Cloudera, Inc. All rights
reserved.
• Scalability
• Simply scales just by adding nodes
• Local processing to avoid network bottlenecks
• Efficiency
• Cost efficiency (<$1k/TB) on commodity hardware
• Unified storage, metadata, security (no duplication or
synchronization)
• Flexibility
• All kinds of data (blobs, documents, records, etc)
• In all forms (structured, semi-structured, unstructured)
• Store anything then later analyze what you need
Why Hadoop?
Why “Ecosystem?”
• In the beginning, just Hadoop
• HDFS
• MapReduce
• Today, dozens of interrelated components
• I/O
• Processing
• Specialty Applications
• Configuration
• Workflow
6
HDFS
• Distributed, highly fault-tolerant filesystem
• Optimized for large streaming access to data
• Based on Google File System
• http://research.google.com/archive/gfs.html
7
Lots of Commodity Machines
8
Image:Yahoo! Hadoop cluster [ OSCON ’07 ]
MapReduce (MR)
• Programming paradigm
• Batch oriented, not realtime
• Works well with distributed computing
• Lots of Java, but other languages supported
• Based on Google’s paper
• http://research.google.com/archive/mapreduce.html
9
Apache Hive
• Abstraction of Hadoop’s Java API
• HiveQL “compiles” down to MR
• a “SQL-like” language
• Eases analysis using MapReduce
10
Apache Hive Metastore
• Maps HDFS files to DB-like resources
• Databases
• Tables
• Column/field names, data types
• Roles/users
• InputFormat/OutputFormat
11
CDH: the App Store for Hadoop
12
Integration
Storage
Resource Management
Metadata
NoSQL
DBMS
…
Analytic
MPP
DBMS
Search
Engine
In-
Memory
Batch
Processing
System
Management
Data
Management
Support
Security
Machine
Learning
MapReduce
13
Introduction to Apache Spark
Credits:
• Ben White
• Todd Lipcon
• Ted Malaska
• Jairam Ranganathan
• Jayant Shekhar
• Sandy Ryza
Can we improve on MR?
• Problems with MR:
• Very low-level: requires a lot of code to do simple
things
• Very constrained: everything must be described as
“map” and “reduce”. Powerful but sometimes
difficult to think in these terms.
14
Can we improve on MR?
• Two approaches to improve on MapReduce:
1. Special purpose systems to solve one problem domain
well.
• Giraph / Graphlab (graph processing)
• Storm (stream processing)
• Impala (real-time SQL)
2. Generalize the capabilities of MapReduce to
provide a richer foundation to solve problems.
• Tez, MPI, Hama/Pregel (BSP), Dryad (arbitrary DAGs)
Both are viable strategies depending on the problem!
15
What is Apache Spark?
Spark is a general purpose computational framework
Retains the advantages of MapReduce:
• Linear scalability
• Fault-tolerance
• Data Locality based computations
…but offers so much more:
• Leverages distributed memory for better performance
• Supports iterative algorithms that are not feasible in MR
• Improved developer experience
• Full Directed Graph expressions for data parallel computations
• Comes with libraries for machine learning, graph analysis, etc.
16
What is Apache Spark?
Run programs up to 100x faster than Hadoop
MapReduce in memory, or 10x faster on disk.
One of the largest open source projects in big data:
• 170+ developers contributing
• 30+ companies contributing
• 400+ discussions per month on the mailing list
17
Popular project
18
Getting started with Spark
• Java API
• Interactive shells:
• Scala (spark-shell)
• Python (pyspark)
19
Execution modes
20
Execution modes
• Standalone Mode
• Dedicated master and worker daemons
• YARN Client Mode
• Launches a YARN application with the
driver program running locally
• YARN Cluster Mode
• Launches a YARN application with the
driver program running in the YARN
ApplicationMaster
21
Dynamic resource
management
between Spark,
MR, Impala…
Dedicated Spark
runtime with static
resource limits
Spark Concepts
22
RDD – Resilient Distributed Dataset
• Collections of objects partitioned across a cluster
• Stored in RAM or on Disk
• You can control persistence and partitioning
• Created by:
• Distributing local collection objects
• Transformation of data in storage
• Transformation of RDDs
• Automatically rebuilt on failure (resilient)
• Contains lineage to compute from storage
• Lazy materialization
23
RDD transformations
24
Operations on RDDs
Transformations lazily transform a RDD
to a new RDD
• map
• flatMap
• filter
• sample
• join
• sort
• reduceByKey
• …
Actions run computation to return a
value
• collect
• reduce(func)
• foreach(func)
• count
• first, take(n)
• saveAs
• …
25
Fault Tolerance
• RDDs contain lineage.
• Lineage – source location and list of transformations
• Lost partitions can be re-computed from source data
26
msgs = textFile.filter(lambda s: s.startsWith(“ERROR”))
.map(lambda s: s.split(“t”)[2])
HDFS File Filtered RDD Mapped RDD
filter
(func = startsWith(…))
map
(func = split(...))
27
Examples
Word Count in MapReduce
28
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
Word Count in Spark
sc.textFile(“words”)
.flatMap(line => line.split(" "))
.map(word=>(word,1))
.reduceByKey(_+_).collect()
29
Logistic Regression
• Read two sets of points
• Looks for a plane W that separates them
• Perform gradient descent:
• Start with random W
• On each iteration, sum a function of W over the data
• Move W in a direction that improves it
30
Intuition
31
Logistic Regression
32
Logistic Regression Performance
33
34
Spark and Hadoop:
a Framework within a Framework
35
36
Integration
Storage
Resource Management
Metadata
HBase …Impala Solr Spark
Map
Reduce
System
Management
Data
Management
Support
Security
Spark Streaming
• Takes the concept of RDDs and extends it to DStreams
• Fault-tolerant like RDDs
• Transformable like RDDs
• Adds new “rolling window” operations
• Rolling averages, etc.
• But keeps everything else!
• Regular Spark code works in Spark Streaming
• Can still access HDFS data, etc.
• Example use cases:
• “On-the-fly” ETL as data is ingested into Hadoop/HDFS.
• Detecting anomalous behavior and triggering alerts.
• Continuous reporting of summary metrics for incoming data.
37
Micro-batching for on the fly ETL
38
What about SQL?
39
http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
Fault Recovery Recap
• RDDs store dependency graph
• Because RDDs are deterministic:
Missing RDDs are rebuilt in parallel on other nodes
• Stateful RDDs can have infinite lineage
• Periodic checkpoints to disk clears lineage
• Faster recovery times
• Better handling of stragglers vs row-by-row streaming
40
Why Spark?
• Flexible like MapReduce
• High performance
• Machine learning,
iterative algorithms
• Interactive data
explorations
• Concise, easy API for
developer productivity
41
42
Demo Time!
• Log file Analysis
• Machine Learning
• Spark Streaming
What’s Next?
• Download Hadoop!
• CDH available at www.cloudera.com
• Try it online: Cloudera Live
• Cloudera provides pre-loaded VMs
• http://tiny.cloudera.com/quickstartvm
43
44
Preferably related to the talk… or not.
Questions?
45
Thank You!
Maxime Dumas
mdumas@cloudera.com
We’re hiring.
46

More Related Content

What's hot

Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25thSneha Challa
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internalDavid Lauzon
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Xuan-Chao Huang
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBaseCarol McDonald
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkDuyhai Doan
 
Spark overview
Spark overviewSpark overview
Spark overviewLisa Hua
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkAndy Petrella
 
Spark and Spark Streaming
Spark and Spark StreamingSpark and Spark Streaming
Spark and Spark Streaming宇 傅
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconYiwei Ma
 
Ten tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache SparkTen tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache SparkWill Du
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveSachin Aggarwal
 
Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Hortonworks
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraVictor Coustenoble
 
Fedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data ProcessingFedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data ProcessingPeter Haase
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...Databricks
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 

What's hot (20)

Apache spark sneha challa- google pittsburgh-aug 25th
Apache spark  sneha challa- google pittsburgh-aug 25thApache spark  sneha challa- google pittsburgh-aug 25th
Apache spark sneha challa- google pittsburgh-aug 25th
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBase
 
Map reduce vs spark
Map reduce vs sparkMap reduce vs spark
Map reduce vs spark
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Spark overview
Spark overviewSpark overview
Spark overview
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 
Spark and Spark Streaming
Spark and Spark StreamingSpark and Spark Streaming
Spark and Spark Streaming
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qcon
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Ten tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache SparkTen tools for ten big data areas 03_Apache Spark
Ten tools for ten big data areas 03_Apache Spark
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache Cassandra
 
Fedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data ProcessingFedbench - A Benchmark Suite for Federated Semantic Data Processing
Fedbench - A Benchmark Suite for Federated Semantic Data Processing
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 

Similar to Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014

Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014cdmaxime
 
Why Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data EraWhy Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data EraHandaru Sakti
 
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"IT Event
 
20140614 introduction to spark-ben white
20140614 introduction to spark-ben white20140614 introduction to spark-ben white
20140614 introduction to spark-ben whiteData Con LA
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Steve Min
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptbhargavi804095
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2Gal Marder
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2Fabio Fumarola
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkVince Gonzalez
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Review of Calculation Paradigm and its Components
Review of Calculation Paradigm and its ComponentsReview of Calculation Paradigm and its Components
Review of Calculation Paradigm and its ComponentsNamuk Park
 
Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Datio Big Data
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the SurfaceJosi Aranda
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingRamaninder Singh Jhajj
 

Similar to Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014 (20)

Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014Introduction to Spark - Phoenix Meetup 08-19-2014
Introduction to Spark - Phoenix Meetup 08-19-2014
 
Why Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data EraWhy Functional Programming Is Important in Big Data Era
Why Functional Programming Is Important in Big Data Era
 
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
 
20140614 introduction to spark-ben white
20140614 introduction to spark-ben white20140614 introduction to spark-ben white
20140614 introduction to spark-ben white
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
 
Dive into spark2
Dive into spark2Dive into spark2
Dive into spark2
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - Spark
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Review of Calculation Paradigm and its Components
Review of Calculation Paradigm and its ComponentsReview of Calculation Paradigm and its Components
Review of Calculation Paradigm and its Components
 
Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)Apache Spark II (SparkSQL)
Apache Spark II (SparkSQL)
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Spark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data ProcessingSpark: The State of the Art Engine for Big Data Processing
Spark: The State of the Art Engine for Big Data Processing
 

Recently uploaded

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)Samir Dash
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard37
 

Recently uploaded (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 

Apache Spark - Santa Barbara Scala Meetup Dec 18th 2014

  • 1. CONFIDENTIAL - RESTRICTED Introduction to Spark Scala SB Meetup December 18th 2014 Maxime Dumas Systems Engineer, Cloudera
  • 2. Thirty Seconds About Max • Systems Engineer • aka Sales Engineer • SoCal, AZ, NV • former coder of PHP • teaches meditation + yoga • from Montreal, Canada 2
  • 3. What Does Cloudera Do? • product • distribution of Hadoop components, Apache licensed • enterprise tooling • support • training • services (aka consulting) • community 3
  • 4. 4 Quick and dirty, for context. The Apache Hadoop Ecosystem
  • 5. ©2014 Cloudera, Inc. All rights reserved. • Scalability • Simply scales just by adding nodes • Local processing to avoid network bottlenecks • Efficiency • Cost efficiency (<$1k/TB) on commodity hardware • Unified storage, metadata, security (no duplication or synchronization) • Flexibility • All kinds of data (blobs, documents, records, etc) • In all forms (structured, semi-structured, unstructured) • Store anything then later analyze what you need Why Hadoop?
  • 6. Why “Ecosystem?” • In the beginning, just Hadoop • HDFS • MapReduce • Today, dozens of interrelated components • I/O • Processing • Specialty Applications • Configuration • Workflow 6
  • 7. HDFS • Distributed, highly fault-tolerant filesystem • Optimized for large streaming access to data • Based on Google File System • http://research.google.com/archive/gfs.html 7
  • 8. Lots of Commodity Machines 8 Image:Yahoo! Hadoop cluster [ OSCON ’07 ]
  • 9. MapReduce (MR) • Programming paradigm • Batch oriented, not realtime • Works well with distributed computing • Lots of Java, but other languages supported • Based on Google’s paper • http://research.google.com/archive/mapreduce.html 9
  • 10. Apache Hive • Abstraction of Hadoop’s Java API • HiveQL “compiles” down to MR • a “SQL-like” language • Eases analysis using MapReduce 10
  • 11. Apache Hive Metastore • Maps HDFS files to DB-like resources • Databases • Tables • Column/field names, data types • Roles/users • InputFormat/OutputFormat 11
  • 12. CDH: the App Store for Hadoop 12 Integration Storage Resource Management Metadata NoSQL DBMS … Analytic MPP DBMS Search Engine In- Memory Batch Processing System Management Data Management Support Security Machine Learning MapReduce
  • 13. 13 Introduction to Apache Spark Credits: • Ben White • Todd Lipcon • Ted Malaska • Jairam Ranganathan • Jayant Shekhar • Sandy Ryza
  • 14. Can we improve on MR? • Problems with MR: • Very low-level: requires a lot of code to do simple things • Very constrained: everything must be described as “map” and “reduce”. Powerful but sometimes difficult to think in these terms. 14
  • 15. Can we improve on MR? • Two approaches to improve on MapReduce: 1. Special purpose systems to solve one problem domain well. • Giraph / Graphlab (graph processing) • Storm (stream processing) • Impala (real-time SQL) 2. Generalize the capabilities of MapReduce to provide a richer foundation to solve problems. • Tez, MPI, Hama/Pregel (BSP), Dryad (arbitrary DAGs) Both are viable strategies depending on the problem! 15
  • 16. What is Apache Spark? Spark is a general purpose computational framework Retains the advantages of MapReduce: • Linear scalability • Fault-tolerance • Data Locality based computations …but offers so much more: • Leverages distributed memory for better performance • Supports iterative algorithms that are not feasible in MR • Improved developer experience • Full Directed Graph expressions for data parallel computations • Comes with libraries for machine learning, graph analysis, etc. 16
  • 17. What is Apache Spark? Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. One of the largest open source projects in big data: • 170+ developers contributing • 30+ companies contributing • 400+ discussions per month on the mailing list 17
  • 19. Getting started with Spark • Java API • Interactive shells: • Scala (spark-shell) • Python (pyspark) 19
  • 21. Execution modes • Standalone Mode • Dedicated master and worker daemons • YARN Client Mode • Launches a YARN application with the driver program running locally • YARN Cluster Mode • Launches a YARN application with the driver program running in the YARN ApplicationMaster 21 Dynamic resource management between Spark, MR, Impala… Dedicated Spark runtime with static resource limits
  • 23. RDD – Resilient Distributed Dataset • Collections of objects partitioned across a cluster • Stored in RAM or on Disk • You can control persistence and partitioning • Created by: • Distributing local collection objects • Transformation of data in storage • Transformation of RDDs • Automatically rebuilt on failure (resilient) • Contains lineage to compute from storage • Lazy materialization 23
  • 25. Operations on RDDs Transformations lazily transform a RDD to a new RDD • map • flatMap • filter • sample • join • sort • reduceByKey • … Actions run computation to return a value • collect • reduce(func) • foreach(func) • count • first, take(n) • saveAs • … 25
  • 26. Fault Tolerance • RDDs contain lineage. • Lineage – source location and list of transformations • Lost partitions can be re-computed from source data 26 msgs = textFile.filter(lambda s: s.startsWith(“ERROR”)) .map(lambda s: s.split(“t”)[2]) HDFS File Filtered RDD Mapped RDD filter (func = startsWith(…)) map (func = split(...))
  • 28. Word Count in MapReduce 28 package org.myorg; import java.io.IOException; import java.util.*; import org.apache.hadoop.fs.Path; import org.apache.hadoop.conf.*; import org.apache.hadoop.io.*; import org.apache.hadoop.mapreduce.*; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; public class WordCount { public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, one); } } } public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } context.write(key, new IntWritable(sum)); } } public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = new Job(conf, "wordcount"); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.waitForCompletion(true); } }
  • 29. Word Count in Spark sc.textFile(“words”) .flatMap(line => line.split(" ")) .map(word=>(word,1)) .reduceByKey(_+_).collect() 29
  • 30. Logistic Regression • Read two sets of points • Looks for a plane W that separates them • Perform gradient descent: • Start with random W • On each iteration, sum a function of W over the data • Move W in a direction that improves it 30
  • 34. 34 Spark and Hadoop: a Framework within a Framework
  • 35. 35
  • 36. 36 Integration Storage Resource Management Metadata HBase …Impala Solr Spark Map Reduce System Management Data Management Support Security
  • 37. Spark Streaming • Takes the concept of RDDs and extends it to DStreams • Fault-tolerant like RDDs • Transformable like RDDs • Adds new “rolling window” operations • Rolling averages, etc. • But keeps everything else! • Regular Spark code works in Spark Streaming • Can still access HDFS data, etc. • Example use cases: • “On-the-fly” ETL as data is ingested into Hadoop/HDFS. • Detecting anomalous behavior and triggering alerts. • Continuous reporting of summary metrics for incoming data. 37
  • 38. Micro-batching for on the fly ETL 38
  • 40. Fault Recovery Recap • RDDs store dependency graph • Because RDDs are deterministic: Missing RDDs are rebuilt in parallel on other nodes • Stateful RDDs can have infinite lineage • Periodic checkpoints to disk clears lineage • Faster recovery times • Better handling of stragglers vs row-by-row streaming 40
  • 41. Why Spark? • Flexible like MapReduce • High performance • Machine learning, iterative algorithms • Interactive data explorations • Concise, easy API for developer productivity 41
  • 42. 42 Demo Time! • Log file Analysis • Machine Learning • Spark Streaming
  • 43. What’s Next? • Download Hadoop! • CDH available at www.cloudera.com • Try it online: Cloudera Live • Cloudera provides pre-loaded VMs • http://tiny.cloudera.com/quickstartvm 43
  • 44. 44 Preferably related to the talk… or not. Questions?
  • 46. 46

Editor's Notes

  1. Similar to the Red Hat model. Hadoop elephant logo licensed for public use via Apache license: Apache Software Foundation, http://www.apache.org/foundation/marks/
  2. We’re going to breeze through these really quick, just to show how Search plugs in later…
  3. Lose a server, no problem. Lose a rack, no problem.