Lions and tigers and Spark, oh my! It's hard enough keeping up with the explosion in data, but just keeping track of the tools is a challenge. What is Big Data? How do I become a data scientist? How can I leverage the cloud? (What is the cloud?). These are all tough questions for anyone to answer, let alone the business analyst who does not have a strong programming and technology background. Put your mind at ease - we are here to help.
This talk will introduce the open source processing engine, Spark, highlighting not only its awesome power but also how it fits within the larger data landscape. You will learn why Spark was developed to crunch through Big Data, what MapReduce is, and why Spark can beat the pants off of it in terms of performance and ease of use. You will learn how non-programmers can get started with Spark (warning: there is no escaping code, but you can do it, we promise), where you can find great tutorials on Spark, and how you can use Spark in the cloud with IBM.
At the end of this talk you will be able to firmly place Spark in the Big Data ecosystem and articulate to your colleagues how data processing platforms have evolved to handle large amounts of data. You will know how to get started using Spark and be comfortable enough with Spark syntax to write a few lines of code like the boss you are. This talk will be fast and furious, but fun. Fasten your seatbelts and get ready to learn about Spark.
2. WHY ARE WE HERE?
Business analysts use data to inform business decisions.
Spark is one of many tools that can help you do that.
3. SO LET'S DIVE RIGHT IN
val input = sc.textFile("file:///test.csv")
input.collect().foreach(println)
This code just loads a text file and prints each line to the screen
4. BIG CAVEAT
We will be coding
No, there is no other way
Yes, it will be hard
But you can do it
5. HERE'S HOW I KNOW...
Excel formulas are super hard
=VLOOKUP(B2,'Raw Data'!$B$1:$D$2,3,FALSE)
=SUMPRODUCT((A1:A10="Ford")*(B1:B10="June")*(C1:C10))
If you learned how to write VLOOKUPs, you can learn to code
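To make that comparison concrete, here is a rough plain-Python analogue of those two formulas. This is only a sketch: the sample data, keys, and values below are made up for illustration, not taken from the talk.

```python
# VLOOKUP(B2, range, 3, FALSE) is essentially an exact-match lookup
# that returns the third column of the matched row.
raw_data = {
    "B2-key": ("col2", "col3-value"),  # key -> (column 2, column 3)
}

def vlookup(key, table, col):
    # col 2 of the range is index 0, col 3 is index 1, and so on
    return table[key][col - 2]

print(vlookup("B2-key", raw_data, 3))  # prints: col3-value

# SUMPRODUCT((A="Ford")*(B="June")*(C)) is essentially a filtered sum.
rows = [("Ford", "June", 3), ("Ford", "May", 2), ("Toyota", "June", 5)]
total = sum(c for a, b, c in rows if a == "Ford" and b == "June")
print(total)  # prints: 3
```

If the Excel versions made sense to you, the Python versions are the same ideas with different punctuation.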
6. DISTINCTION: WE ARE NOT ENGINEERS
We are not building production applications
We just want to answer questions with data rather than with speculation
7. WE MAY SHARE TOOLS WITH ENGINEERS, BUT OUR PROCESS IS DIFFERENT
Principally, we emphasize interactive analysis
This means we want the flexibility to change the questions we ask as we work
8. AND THE ABILITY TO STOP OUR ANALYSIS AT ANY POINT
We are not doing analysis for the sake of doing analysis
Good may be the enemy of great, but better is the enemy of done
10. OUR ANALYTIC PROCESS
Don't measure, just cut
Google is your best friend
You don't have to know how to do anything
You just have to be able to find out
11. WHAT IS SPARK?
Spark is an open-source processing framework designed for cluster computing
12. WHY IS IT POPULAR?
Super fast...
Plays well with Hadoop
Native APIs for analyst-friendly languages like Python and R
13. WAIT...I'VE HEARD THIS BEFORE
Sounds like the original promise of Hadoop...
How is Spark different?
14. FAST REVIEW OF HADOOP
Google was indexing the web every day
They wrote some custom software to store and process those documents (web pages)
The open source version of that software is called Hadoop
15. HADOOP CONSISTS OF TWO MAIN PIECES
The Hadoop Distributed File System: HDFS
And a processing framework called MapReduce
HDFS enabled fault-tolerant storage on commodity servers at scale
And MapReduce allowed you to process what you stored in parallel
16. THIS IS A BIG DEAL...
Companies storing ever increasing amounts of data could:
Do so much cheaper
With more flexibility
17. HADOOP CAME WITH A COST
Parallel processing, but not necessarily fast (batch processing)
Difficult to program
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount {
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    // ...and so on for dozens more lines
18. NOT INTERACTIVE
Writing MapReduce jobs in Java is an inefficient way for business analysts to process data in parallel
We get the parallel processing speed, but the development time is long (or the time spent asking a dev to write it...)
19. BUT WHAT ABOUT PIG..?
Pig is a sort of scripting language for Hadoop with friendly syntax that lets you read from any data source
A = load './input.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
store D into './wordcount';
While it works well, it's another language to learn and it is only used in Hadoop
20. BUT WHAT ABOUT SQL-ON-HADOOP?
A few options: Hive, Impala, Big SQL
If you have these options, use them
But they all involve substantial ETL and (maybe) additional hardware
In D.C. we know what that means: you get it on next year's contract
21. WHAT IS ETL? AND WHY WOULD WE NEED IT?
Because unlike in most Hadoop tutorials, the data analysts need to access is not sitting in flat files
For analytics, it is very likely you'll want data from your Hadoop application's database
But what is your Hadoop application's database?
But what is your Hadoop application's database?
22. HBASE - THE HADOOP DATABASE
One big freakin' table
No joins - row keys are everything
Great for applications, terrible for analysts
23. WHY AM I TALKING ABOUT HBASE DURING A SPARK PRESENTATION?
Because I want you to know that your data will not be in the format you want
ETL (Extract, Transform, Load) is a real process that engineers will have to spend time on to get your data into a SQL-friendly environment
This will not be an application feature, but an analytics one (so don't be surprised if this gets skipped)
24. MY RAMBLING POINT IS THAT YOU WILL HAVE MESSY DATA
Neither Hadoop, Spark, Tableau, nor anything else will solve that
You still have to rely on the tools you use for data wrangling
Like Python and R
25. TOOL COMPARISON
Tool       Powerful?  Friendly?
Excel      No         Hell yes
Python/R   Meh...     Yes
Hadoop     Yes        Hell no
Spark      Hell yes   Just right
26. IDEAL SCENARIO
I can write the same Python scripts that I use to process data on my local machine
27. SPARK IS OUR BEST ANSWER
You can write Python, and iterative computations are processed in memory, so they are easier to write and much faster than MapReduce
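To see why in-memory processing matters for this kind of work, here is a toy plain-Python sketch (not Spark code; the data and update rule are invented for illustration). The point is that iterative algorithms make many passes over the same dataset: Spark can keep that dataset cached in RAM between passes, while MapReduce re-reads it from disk on every pass.

```python
# Toy illustration: an iterative estimate that needs many passes
# over the same data. In Spark, `data` would be a cached dataset.
data = [2.0, 4.0, 6.0, 8.0]

estimate = 0.0
for _ in range(20):  # each loop body is one full pass over the data
    estimate += 0.5 * (sum(data) / len(data) - estimate)

print(round(estimate, 3))  # converges to the mean: prints 5.0
```

With 20 passes, the cost of re-reading the input from disk each time dominates in a MapReduce-style pipeline; keeping the data in memory is what makes this interactive.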
28. HOW YOU CAN GET STARTED
Big Data University
Spark on Bluemix