This document provides an overview of Apache Spark along with a hands-on workshop. It begins with a brief history of how Spark evolved from Hadoop to address MapReduce's poor performance on iterative workloads by keeping data in memory. Key Spark concepts are explained, including RDDs, transformations, actions, and Spark's execution model, and the newer Spark SQL, DataFrame, and Dataset APIs are introduced. The workshop pairs this overview with a hands-on exercise: ranking Colorado counties by gender ratio using US census data and both the RDD and DataFrame APIs.
3. Andy Grove
Co-Founder & Chief Architect
Co-Founder @ Orbware Technologies (acquired 2000)
Inventor of Firestorm/DAO
andy@agildata.com
• Providers of dbShards
• Relational Database Scaling
• Big Data Consulting
• Data Strategy
• Data Architecture Reviews
• Big Data Training
• Solution Implementation
• Distributed over 6 states!
• Headquartered in Broomfield, CO
www.agildata.com
Dan Lynn
CEO
Co-Founder @ FullContact
15 years building software
Techstars 2011
dan@agildata.com
4. AGENDA
• Part 1 - Overview of Spark
  • Motivation, APIs, Ecosystem, Simple Example
• Part 2 - Hands On
  • Work through a real data problem
6. A BRIEF HISTORY LESSON
• First there was Hadoop
• Goal: Process petabytes of constantly-growing data
• "Move the processing to the data"
• But MapReduce was difficult to program
• So they made Pig, Hive, Cascading, etc.
7. A BRIEF HISTORY LESSON
• MapReduce was also very reliable
• But it performed poorly on iterative tasks like machine learning
• So in 2009, UC Berkeley started on a new approach: keeping data in memory as much as possible
8. A BRIEF HISTORY LESSON
• They called it "Spark"
• After broad community acceptance, it became an Apache project in 2013
• Since then, it has gained mainstream acceptance
• "Potentially the Most Significant Open Source Project of the Next Decade" - IBM, June 15, 2015
9. A BRIEF HISTORY LESSON
• Huge ecosystem
• Machine learning: MLlib, Mahout
• Graph processing: GraphX
• Read from / write to anything that Hadoop can
• Tons of community contributions: spark-packages.org
• Zeppelin: interactive notebooks in the style of IPython/Jupyter
11. CONCEPTS - RDD
RDD aka "Resilient Distributed Dataset"
your_data            <- an RDD
f(your_data)         <- also an RDD
g(f(your_data))      <- so is this
12. RDD - SECRET INTERNALS!!!11

/**
 * Tells the Spark framework *where* the data is.
 */
protected Partition[] getPartitions();

/**
 * Iterates through the data for a given partition.
 */
Iterator<T> compute(Partition split, TaskContext context);
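The two methods above can be sketched with a toy, in-memory "RDD" in plain Java. `Partition`, `TaskContext`, and `ToyRDD` here are hypothetical stand-ins for illustration only, not Spark's actual classes (the real RDD is Scala and adds lineage, fault tolerance, and a cluster scheduler):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins for Spark's Partition/TaskContext.
interface Partition { int index(); }
class TaskContext { }

// A toy "RDD" over an in-memory list, split into fixed-size partitions.
class ToyRDD {
    private final List<Integer> data;
    private final int partitionSize;

    ToyRDD(List<Integer> data, int partitionSize) {
        this.data = data;
        this.partitionSize = partitionSize;
    }

    // "Where the data is": one Partition per chunk of the list.
    Partition[] getPartitions() {
        int n = (data.size() + partitionSize - 1) / partitionSize;
        Partition[] parts = new Partition[n];
        for (int i = 0; i < n; i++) {
            final int idx = i;
            parts[i] = () -> idx;
        }
        return parts;
    }

    // "Iterate the data for a given partition."
    Iterator<Integer> compute(Partition split, TaskContext context) {
        int from = split.index() * partitionSize;
        int to = Math.min(from + partitionSize, data.size());
        return data.subList(from, to).iterator();
    }
}
```

Computing over the whole dataset then just means calling `compute` for each partition returned by `getPartitions` — essentially what Spark's scheduler does, except that each partition may live on a different machine.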
13. RDD - PUBLIC API
Two Options:
• Transformations
  • Make new RDDs by applying transformation functions.
• Actions
  • Write to HDFS, write to databases, yield an answer, etc.
14. RDD - PUBLIC API
• Transformations
  • .map(func) .filter(func) .flatMap(func) .sample(…)
• Actions
  • .collect() .saveAsTextFile(path) .reduce(func) .take(n)
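The transformation/action split has a close analogue in plain Java 8: `Stream`'s intermediate operations are lazy, and only a terminal operation triggers execution. This sketch (not Spark code) uses a counter to show that nothing runs until the "action":

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazyDemo {
    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);

        // "Transformations": building the pipeline runs nothing yet.
        Stream<Integer> pipeline = data.stream()
                .map(x -> { calls.incrementAndGet(); return x * 10; })
                .filter(x -> x >= 30);
        System.out.println("calls after transformations: " + calls.get()); // 0

        // "Action": the terminal operation pulls data through the pipeline.
        List<Integer> result = pipeline.collect(Collectors.toList());
        System.out.println("calls after action: " + calls.get());          // 5
        System.out.println(result);                                        // [30, 40, 50]
    }
}
```

Spark RDDs behave the same way at the API level: chaining `.map` and `.filter` only records the computation, and `.collect()` or `.saveAsTextFile()` makes the cluster actually run it.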
19. SPARK SQL / DATAFRAME API
• New in Spark 1.3; the core engine behind Spark SQL
• If RDDs are transformations that apply to JVM objects…
  • The schema (i.e. the class) is carried along with each datum
  • Serialization pain. GC pain.
• …then DataFrames are transformations that apply to data
  • The schema is defined once for the entire set
  • Data is transmitted independently of its schema; JVM data access incurs much less GC overhead
• DataFrames have more optimized execution logic, i.e. a query planner (the Catalyst optimizer)
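The schema-per-datum vs. schema-per-set distinction can be illustrated without Spark at all. In this hypothetical Java sketch, the "RDD-style" layout stores each record as a full JVM object (the class *is* the schema, carried with every datum), while the "DataFrame-style" layout declares the structure once and stores values column-wise in primitive arrays, with no per-record object headers:

```java
import java.util.Arrays;

class SchemaDemo {
    // RDD-style: every record is a JVM object; serializing it means
    // shipping class/structure information along with the values.
    static class Person {
        final String name;
        final int age;
        Person(String name, int age) { this.name = name; this.age = age; }
    }

    public static void main(String[] args) {
        Person[] rows = {
            new Person("ada", 36), new Person("bob", 41), new Person("cam", 29)
        };

        // DataFrame-style: schema declared once ("name: string, age: int"),
        // data held column-wise -- one primitive array, no per-row objects.
        String[] nameCol = {"ada", "bob", "cam"};
        int[] ageCol = {36, 41, 29};

        // The same query works either way: average age.
        double rowAvg = Arrays.stream(rows).mapToInt(p -> p.age).average().orElse(0);
        double colAvg = Arrays.stream(ageCol).average().orElse(0);
        System.out.println(rowAvg + " == " + colAvg);
    }
}
```

The column-wise form is what lets an engine skip object allocation (less GC) and lets a query planner reason about the data, since the structure is known up front rather than buried in each object.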
20. DATASET API
• New in Spark 1.6
• Addresses specific deficiencies in DataFrames
  • DataFrames lack compile-time type-checking
• Datasets look like RDDs, but perform like DataFrames
21. SPARK API CHOICES
          | Java                   | Scala
RDD       |                        |
DataFrame | sketchy…               |
Spark SQL |                        |
Dataset   | exciting, but very new | exciting, but very new
24. PART 2: HANDS ON
• The problem: Rank Colorado counties by gender ratio
• The data: US census data from 2010
• The approach:
  • RDD API (in both Java 8 and Scala)
  • DataFrame API / Spark SQL
  • Dataset API
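As a preview of the computation itself, here is the gender-ratio ranking done with plain Java 8 streams on a tiny, made-up sample — the county names are real, but the counts are invented and the actual census file's format will differ. The workshop implements the same logic with the Spark RDD and DataFrame APIs:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class GenderRatio {
    public static void main(String[] args) {
        // Hypothetical mini-sample of census rows: county, males, females.
        List<String[]> rows = Arrays.asList(
            new String[]{"Denver", "300", "310"},
            new String[]{"Boulder", "150", "140"},
            new String[]{"Larimer", "160", "165"});

        // Rank counties by male:female ratio, descending.
        Map<String, Double> ranked = rows.stream()
            .collect(Collectors.toMap(
                r -> r[0],
                r -> Double.parseDouble(r[1]) / Double.parseDouble(r[2])))
            .entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                     (a, b) -> a, LinkedHashMap::new));

        ranked.forEach((county, ratio) ->
            System.out.printf("%s: %.3f%n", county, ratio));
    }
}
```

The Spark versions follow the same shape: map each row to a (county, ratio) pair, then sort by the ratio; only the API for expressing the map and the sort changes between RDDs, DataFrames, and Datasets.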