The document summarizes a meetup on Apache Spark hosted by Data Science London. It introduces the speakers — Sameer Farooqui, Doug Bateman, and Jon Bates — and their backgrounds in data science and Spark training. The agenda includes a power plant predictive-modeling demo using Spark and a survey of approaches to parallelizing machine learning algorithms in Spark: model parallelism, divide and conquer, and data parallelism. It also provides overviews of Spark's machine learning library MLlib and common algorithms. The goal is for attendees to learn about Spark's unified engine and how to apply different machine learning techniques at scale.
2. Who are we?
Sameer Farooqui
• Trainer @ Databricks
• 150+ trainings on Hadoop, C*, HBase, Couchbase, NoSQL, etc.
Doug Bateman
• Dir of Training @ NewCircle
• Spark Trainer for Databricks
• 800+ trainings on Java, Python, Android, Hibernate, Spring, etc.
Jon Bates
• Data Scientist
• Consultant for Databricks
• EdX assistant instructor on Scalable ML w/ Spark
3. Agenda: Talks
Sameer Farooqui (15 mins):
• Intro & Spark Overview
Doug Bateman (25 mins):
• Power Plant Demo
• ETL + Linear Regression
Jon Bates (25 mins):
• Iris Flower Demo
• Model Parallel w/ scikit-learn
4. Agenda: Q & A (30 mins)
Stephane Rion
• Senior Data Scientist @ Big Data Partnership + Spark Trainer for DB
• R, Scikit-Learn, Spark, Mahout, HBase, Hive, Pig
• Based in London
Lars Francke
• Consulting Architect for Cloudera
• Cluster setup, Security/Kerberos, Hive, Impala, HBase, Spark
• Based in Germany
5. Who are you?
1) I have used Spark hands on before…
2) I have more than 1 year hands on experience with ML…
22. Spark Data Model
[Figure: ten items (item-1 through item-10) distributed as RDD partitions across three executors]
more partitions = more parallelism
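The partitioning idea on this slide can be sketched in plain Python (no Spark needed): a dataset is split into chunks, and each chunk can be processed by a different executor in parallel. The `partition` helper below is a hypothetical stand-in for how Spark distributes an RDD's elements.

```python
# Sketch: how a dataset is split into partitions, mimicking an RDD.
# More partitions -> more units of work that can run in parallel.

def partition(items, num_partitions):
    """Split items into roughly equal contiguous chunks."""
    size, rem = divmod(len(items), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < rem else 0)
        parts.append(items[start:end])
        start = end
    return parts

items = [f"item-{i}" for i in range(1, 11)]
parts = partition(items, 3)
print(parts)  # three chunks of sizes 4, 3, 3
```

In real Spark you would control this with `sc.parallelize(items, numSlices)` or `rdd.repartition(n)`; the sketch only shows the shape of the idea.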
24. Use Case: predict power output given a set of readings from various sensors in a gas-fired power generation plant
Schema Definition:
AT = Atmospheric Temperature in °C
V = Exhaust Vacuum Speed
AP = Atmospheric Pressure
RH = Relative Humidity
PE = Power Output (value we are trying to predict)
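The demo itself runs this regression in Spark; as a minimal stand-in, here is a NumPy sketch of the same task — fit a linear model that predicts PE from AT, V, AP, RH. The data below is synthetic (the coefficients and noise level are invented for illustration), not the plant readings from the demo.

```python
import numpy as np

# Synthetic stand-in for the sensor readings: columns AT, V, AP, RH.
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 4))
true_w = np.array([-1.8, -0.3, 0.1, -0.2])          # invented coefficients
y = X @ true_w + 450 + rng.normal(scale=0.5, size=n)  # PE, invented baseline

# Fit by least squares -- the same problem MLlib's LinearRegression
# solves, just on one machine instead of a cluster.
A = np.hstack([X, np.ones((n, 1))])  # append intercept column
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(w)  # recovered weights for AT, V, AP, RH, plus the intercept
```

In the Spark version, the ETL step would load the readings into a DataFrame with the schema above before fitting the model.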
28. Different ways to parallelize ML
• Model Parallelism
• Divide & Conquer
• Data Parallelism
29. Model Parallelism
• Model stored across workers
• Communicate data to all workers
• Examples:
• Grid search
• Cross validation
• Ensemble
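The grid-search example can be sketched in a few lines: every worker receives the full dataset and trains one hyperparameter candidate independently. The toy "model" and grid below are hypothetical; the point is only the communication pattern (data goes to all workers, each returns one trained/scored model).

```python
# Model parallelism sketch: each worker gets the FULL dataset and
# evaluates one candidate from the hyperparameter grid.
from concurrent.futures import ThreadPoolExecutor

data = [(x, 3 * x + 1) for x in range(20)]  # toy (x, y) pairs, y = 3x + 1

def train_and_score(slope):
    # Hypothetical one-parameter model y = slope * x + 1, scored by MSE.
    mse = sum((slope * x + 1 - y) ** 2 for x, y in data) / len(data)
    return slope, mse

grid = [1.0, 2.0, 3.0, 4.0]
with ThreadPoolExecutor() as pool:
    results = list(pool.map(train_and_score, grid))

best_slope, best_mse = min(results, key=lambda r: r[1])
print(best_slope, best_mse)
```

scikit-learn's `GridSearchCV(..., n_jobs=-1)` applies the same pattern on one machine; the demo in this deck distributes it with Spark.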
30. Divide & Conquer
• Minimizes communication
• Leads to approximate solutions
31. Data Parallelism
• Data stored across workers
• Communicate model to all workers
• Examples:
• MLlib linear models
• Matrix outer products
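This is the pattern behind MLlib's linear models, and it can be sketched without Spark: the data is split into partitions, the (small) weight vector is shared with every "worker", each worker computes a partial gradient on its partition, and the partials are summed. The loop below is an illustrative single-machine simulation, not MLlib's actual implementation.

```python
# Data parallelism sketch: partition the data, broadcast the model,
# sum partial gradients (a map-reduce over partitions).
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true

def partial_gradient(X_part, y_part, w):
    # Squared-loss gradient computed on ONE partition only.
    return X_part.T @ (X_part @ w - y_part)

partitions = np.array_split(np.arange(n), 4)  # 4 simulated workers
w = np.zeros(3)
for _ in range(200):
    # "map": each worker gets w and its partition; "reduce": sum partials.
    grad = sum(partial_gradient(X[idx], y[idx], w) for idx in partitions)
    w -= 0.1 * grad / n  # gradient step

print(w)  # converges toward w_true
```

Only the d-dimensional weight vector crosses the network each iteration, which is why this scales when n is huge but d is modest.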
32. Scalability Rules
1st Rule of thumb
Computation & Storage should be linear (in n, d )
2nd Rule of thumb
Perform parallel and in-memory computation
3rd Rule of thumb
Minimize Network Communication
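A back-of-envelope illustration of the first rule, assuming n training points and d features: one gradient-descent pass over the data costs on the order of n·d operations, whereas solving the normal equations directly costs on the order of n·d² + d³, which stops being linear in d.

```python
# Rule-of-thumb flop counts (orders of magnitude, constants ignored).
def gd_flops(n, d):
    # One gradient pass over the data: O(n * d).
    return n * d

def closed_form_flops(n, d):
    # Forming X^T X plus solving the d x d system: O(n * d^2 + d^3).
    return n * d**2 + d**3

n = 10**6
for d in (10, 1_000, 100_000):
    print(d, gd_flops(n, d), closed_form_flops(n, d))
```

At d = 1,000 the closed-form cost is already roughly a thousand times the per-pass gradient cost, which is why the iterative, linear-in-(n, d) approach is preferred at scale.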
33. Agenda: Q & A (30 mins)
Stephane Rion, Lars Francke, Sameer Farooqui, Doug Bateman, Jon Bates