This document provides an overview of developing analytical applications using Hadoop. It discusses how Hadoop allows storing and processing large amounts of data across clusters in a reliable and cost-effective manner. It also covers several frameworks built on top of Hadoop, including Apache Hive, Spark, and GraphLab, that make it easier to develop analytical applications. The document advocates for structuring data in the way that makes sense for the problem, for interactive inputs as well as interactive outputs, and for simpler interfaces that yield more sophisticated answers.
1. Building Analytical Applications on Hadoop
Josh Wills | Director of Data Science
November 2012
25. Big Data Economics
• No individual record is particularly valuable
• Having every record is incredibly valuable
• Web index
• Recommendation systems
• Sensor data
• Market basket analysis
• Online advertising
27. The Hadoop Distributed File System
• Based on the Google File System
• Data stored in large files
• Large block size: 64MB to 256MB per block
• Blocks are replicated to multiple nodes in the cluster (see the config sketch below)
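A minimal sketch of the two hdfs-site.xml settings behind these bullets. The values are illustrative, and the block-size property was spelled dfs.block.size in the Hadoop 1.x releases current in 2012 (it was later renamed dfs.blocksize):

<configuration>
  <!-- Block size: 128MB per block, specified in bytes -->
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
  </property>
  <!-- Each block is replicated to three nodes in the cluster -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>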
28. Simple, Reliable Processing: MapReduce
• Map Stage
  • Embarrassingly parallel
• Shuffle Stage: Large-scale distributed sort
• Reduce Stage
  • Process all of the values that have the same key in a single step
• Process the data where it is stored
• Write once and you’re done (see the word-count sketch below).
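For instance, the classic word count maps directly onto these stages. A minimal sketch using Hadoop Streaming, with two Python scripts standing in for the map and reduce stages (the file names are placeholders):

#!/usr/bin/env python
# mapper.py: the map stage. Emit (word, 1) for every word on stdin;
# each mapper works on its own input block, so this is embarrassingly parallel.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word)

#!/usr/bin/env python
# reducer.py: the reduce stage. The shuffle has already sorted the pairs
# by key, so all counts for a word arrive together and sum in one pass.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, n = line.strip().split("\t")
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, 0
    count += int(n)
if current_word is not None:
    print("%s\t%d" % (current_word, count))

Hadoop runs copies of mapper.py on the nodes where the input blocks are stored, sorts the intermediate pairs by key in the shuffle, and streams each word’s counts to a single run of reducer.py.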
31. The Best Way to Get Started: Apache Hive
• Apache Hive
  • Data Warehouse System on top of Hadoop
  • SQL-based query language: SELECT, INSERT, CREATE TABLE
  • Includes some MapReduce-specific extensions (example query below)
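As a small illustration of that query language, a hypothetical HiveQL session (the table and column names are made up for this sketch; Hive compiles each query into one or more MapReduce jobs):

-- Apply a schema to tab-delimited files already sitting in HDFS
CREATE TABLE pageviews (url STRING, views INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Familiar SQL, executed as MapReduce under the hood
SELECT url, SUM(views) AS total_views
FROM pageviews
GROUP BY url
ORDER BY total_views DESC
LIMIT 10;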
42. A Couple of Themes
1. Structure the data in the way that makes sense for the problem.
2. Interactive inputs, not just interactive outputs.
3. Simpler interfaces that yield more sophisticated answers.
46. It’s Frameworks All The Way Down: Spark
• Developed at Berkeley’s AMP Lab
• Defines operations on distributed in-memory collections
• Written in Scala
• Supports reading from and writing to HDFS (sketch below)
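A minimal word-count sketch using Spark’s Python API, which appeared shortly after this deck (the Scala API is analogous; the master URL and paths are placeholders):

from pyspark import SparkContext

# "local" runs on one machine; point it at a cluster master in production
sc = SparkContext("local", "WordCount")

# An RDD: a distributed in-memory collection, read from HDFS
lines = sc.textFile("hdfs:///user/demo/input")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
# Write the results back to HDFS
counts.saveAsTextFile("hdfs:///user/demo/output")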
47. IFATWD: GraphLab
• Developed at CMU
• Lower-level primitives (but higher than MPI)
• Map/Reduce => Update/Sync
• Flexible, allows for asynchronous computations (schematic sketch below)
• Reads from HDFS
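To make the contrast with MapReduce concrete, here is a schematic sketch in Python of the asynchronous update-function model; this is not GraphLab’s actual API (GraphLab is C++), and every name below is illustrative:

# Schematic sketch of asynchronous vertex updates (NOT GraphLab's API).
# Instead of map -> barrier -> reduce, each vertex update runs as soon as
# it is scheduled, and reschedules only the neighbors it may have affected.
from collections import deque

def pagerank_update(v, out_edges, rank):
    """Recompute one vertex's rank from its in-neighbors; report if it moved."""
    old = rank[v]
    rank[v] = 0.15 + 0.85 * sum(rank[u] / len(out_edges[u])
                                for u in out_edges if v in out_edges[u])
    return abs(rank[v] - old) > 1e-6

def async_engine(out_edges, rank, update):
    pending = deque(out_edges)           # initial schedule: every vertex
    while pending:
        v = pending.popleft()
        if update(v, out_edges, rank):   # changed? wake its out-neighbors
            pending.extend(u for u in out_edges[v] if u not in pending)

out_edges = {"a": {"b"}, "b": {"a", "c"}, "c": {"a"}}
rank = {v: 1.0 for v in out_edges}
async_engine(out_edges, rank, pagerank_update)
print(rank)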
They are applications that allow users to work with and make decisions from data.
It seems like there should be a UX equivalent of Clippy (maybe a tiny picture of Edward Tufte) that pops up whenever someone decides to use a 3D pie chart.
http://square.github.com/crossfilter/
http://elections.nytimes.com/2012/results/president (Click on “Shift from 2008”)
Click on a state to zoom in
Frameworks != analytical applications, for our purposes today. It’s not an analytical application until you put some data in it.
A few different models were developed for predicting the 2012 presidential election; let’s consider a few of them.
http://isnatesilverawitch.com/ Everyone predicted the election correctly. The RCP model got every state but Florida, PEC said it was a tossup, and 538 got every single state right.
Markos Moulitsas over at the Daily Kos did even better than Nate at predicting the share of the vote within the swing states. Don’t think that math can always out-perform an expert armed with good data. http://news.cnet.com/8301-13578_3-57546778-38/among-the-top-election-quants-nate-silver-reigns-supreme/
Index fund == simple average. Hedge fund == 538. Warren Buffett == expert with good data.
Classical data economics: if the cost of storing a byte is greater than the value I can extract from it, then I throw it away or store it on tape.
We use metaphors that help us understand new technology in terms of the old. Translate desktop tools and metaphors onto Hadoop, even when we’re working with specialized data types: http://blog.cloudera.com/blog/2012/01/seismic-data-science-hadoop-use-case/
It’s a data warehousing metaphor, not an actual data warehouse. Schema on read vs. schema on write, for example. Non-interactive for the most part. Think of ELT, not interactive queries.
We borrow these abstractions because they make it easy to get started, but they don’t necessarily conform to the user’s expectations of how Hadoop will work. If you think of Hadoop as a really big database, or as a spreadsheet that goes on forever and ever, then you have failed to understand Hadoop.
Impala is about fulfilling those abstractions, especially for interactive queries of relational-style data on Hadoop.
But we can also go beyond the abstractions and study how Hadoop can be effective for new kinds of analytic applications.
Step 1: Study real problems. Especially real problems where non-sophisticated users (e.g., people who don’t even know SQL) need to do sophisticated analysis on large quantities of information.
I realized earlier this year that other people do not use Hive the way that I use Hive, and so we created the data science course to take people through the problem of building an analytical application from start to finish on Hadoop. http://blog.cloudera.com/blog/2012/10/data-science-training/