This document provides an overview of developing analytical applications using Hadoop. It discusses how Hadoop allows storing and processing large amounts of data across clusters in a reliable and cost-effective manner. It also covers several frameworks built on top of Hadoop, including Apache Hive, Spark, and GraphLab, that make it easier to develop analytical applications. The document advocates for structuring data in the way that makes sense for the problem, for interactive inputs as well as interactive outputs, and for simpler interfaces that yield more sophisticated answers.
1. Building Analytical Applications on Hadoop
Josh Wills | Director of Data Science
November 2012
25. Big Data Economics
• No individual record is particularly valuable
• Having every record is incredibly valuable
• Web index
• Recommendation systems
• Sensor data
• Market basket analysis
• Online advertising
27. The Hadoop Distributed File System
• Based on the Google File System
• Data stored in large files
• Large block size: 64MB to 256MB per block
• Blocks are replicated to multiple nodes in the cluster (see the config sketch below)
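A minimal sketch of the two hdfs-site.xml settings behind these bullets. The values are illustrative, and the block-size property was spelled dfs.block.size in the Hadoop 1.x releases current in 2012 (it was later renamed dfs.blocksize):

<configuration>
  <!-- Block size: 128MB per block, specified in bytes -->
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
  </property>
  <!-- Each block is replicated to three nodes in the cluster -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>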
28. Simple, Reliable Processing: MapReduce
• Map Stage
  • Embarrassingly parallel
• Shuffle Stage: Large-scale distributed sort
• Reduce Stage
  • Process all of the values that have the same key in a single step
• Process the data where it is stored
• Write once and you’re done (see the word-count sketch below).
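For instance, the classic word count maps directly onto these stages. A minimal sketch using Hadoop Streaming, with two Python scripts standing in for the map and reduce stages (the file names are placeholders):

#!/usr/bin/env python
# mapper.py: the map stage. Emit (word, 1) for every word on stdin;
# each mapper works on its own input block, so this is embarrassingly parallel.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word)

#!/usr/bin/env python
# reducer.py: the reduce stage. The shuffle has already sorted the pairs
# by key, so all counts for a word arrive together and sum in one pass.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, n = line.strip().split("\t")
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, 0
    count += int(n)
if current_word is not None:
    print("%s\t%d" % (current_word, count))

Hadoop runs copies of mapper.py on the nodes where the input blocks are stored, sorts the intermediate pairs by key in the shuffle, and streams each word’s counts to a single run of reducer.py.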
31. The Best Way to Get Started: Apache Hive
• Apache Hive
  • Data Warehouse System on top of Hadoop
  • SQL-based query language: SELECT, INSERT, CREATE TABLE
  • Includes some MapReduce-specific extensions (example query below)
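As a small illustration of that query language, a hypothetical HiveQL session (the table and column names are made up for this sketch; Hive compiles each query into one or more MapReduce jobs):

-- Apply a schema to tab-delimited files already sitting in HDFS
CREATE TABLE pageviews (url STRING, views INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Familiar SQL, executed as MapReduce under the hood
SELECT url, SUM(views) AS total_views
FROM pageviews
GROUP BY url
ORDER BY total_views DESC
LIMIT 10;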
42. A Couple of Themes
1. Structure the data in the way that makes sense for the problem.
2. Interactive inputs, not just interactive outputs.
3. Simpler interfaces that yield more sophisticated answers.
46. It’s Frameworks All The Way Down: Spark
• Developed at Berkeley’s AMP Lab
• Defines operations on distributed in-memory collections
• Written in Scala
• Supports reading from and writing to HDFS (sketch below)
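A minimal word-count sketch using Spark’s Python API, which appeared shortly after this deck (the Scala API is analogous; the master URL and paths are placeholders):

from pyspark import SparkContext

# "local" runs on one machine; point it at a cluster master in production
sc = SparkContext("local", "WordCount")

# An RDD: a distributed in-memory collection, read from HDFS
lines = sc.textFile("hdfs:///user/demo/input")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
# Write the results back to HDFS
counts.saveAsTextFile("hdfs:///user/demo/output")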
47. IFATWD: GraphLab
• Developed at CMU
• Lower-level primitives (but higher than MPI)
• Map/Reduce => Update/Sync
• Flexible, allows for asynchronous computations (schematic sketch below)
• Reads from HDFS
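To make the contrast with MapReduce concrete, here is a schematic sketch in Python of the asynchronous update-function model; this is not GraphLab’s actual API (GraphLab is C++), and every name below is illustrative:

# Schematic sketch of asynchronous vertex updates (NOT GraphLab's API).
# Instead of map -> barrier -> reduce, each vertex update runs as soon as
# it is scheduled, and reschedules only the neighbors it may have affected.
from collections import deque

def pagerank_update(v, out_edges, rank):
    """Recompute one vertex's rank from its in-neighbors; report if it moved."""
    old = rank[v]
    rank[v] = 0.15 + 0.85 * sum(rank[u] / len(out_edges[u])
                                for u in out_edges if v in out_edges[u])
    return abs(rank[v] - old) > 1e-6

def async_engine(out_edges, rank, update):
    pending = deque(out_edges)           # initial schedule: every vertex
    while pending:
        v = pending.popleft()
        if update(v, out_edges, rank):   # changed? wake its out-neighbors
            pending.extend(u for u in out_edges[v] if u not in pending)

out_edges = {"a": {"b"}, "b": {"a", "c"}, "c": {"a"}}
rank = {v: 1.0 for v in out_edges}
async_engine(out_edges, rank, pagerank_update)
print(rank)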
They are applications that allow users to work with and make decisions from data.
It seems like there should be a UX equivalent of Clippy (maybe a tiny picture of Edward Tufte) that pops up whenever someone decides to use a 3D pie chart.
http://square.github.com/crossfilter/
http://elections.nytimes.com/2012/results/president (Click on “Shift from 2008”)
Click on a state to zoom in
Frameworks != analytical applications, for our purposes today. It’s not an analytical application until you put some data in it.
A few different models were developed for predicting the 2012 presidential election; let’s consider a few of them.
http://isnatesilverawitch.com/ Everyone predicted the election correctly. The RCP model got every state but Florida, PEC said it was a tossup, and 538 got every single state right.
Markos Moulitsas over at the Daily Kos did even better than Nate at predicting the share of the vote within the swing states. Don’t think that math can always out-perform an expert armed with good data. http://news.cnet.com/8301-13578_3-57546778-38/among-the-top-election-quants-nate-silver-reigns-supreme/
Index fund == simple average. Hedge fund == 538. Warren Buffett == expert with good data.
Classical data economics: if the cost of storing a byte is greater than the value I can extract from it, then I throw it away or store it on tape.
We use metaphors that help us understand new technology in terms of the old. Translate desktop tools and metaphors onto Hadoop, even when we’re working with specialized data types: http://blog.cloudera.com/blog/2012/01/seismic-data-science-hadoop-use-case/
It’s a data warehousing metaphor, not an actual data warehouse. Schema on read vs. schema on write, for example. Non-interactive for the most part. Think of ELT, not interactive queries.
We borrow these abstractions because they make it easy to get started, but they don’t necessarily conform to the user’s expectations of how Hadoop will work. If you think of Hadoop as a really big database, or as a spreadsheet that goes on forever and ever, then you have failed to understand Hadoop.
Impala is about fulfilling those abstractions, especially for interactive queries of relational-style data on Hadoop.
But we can also go beyond the abstractions and study how Hadoop can be effective for new kinds of analytic applications.
Step 1: Study real problems. Especially real problems where non-sophisticated users (e.g., people who don’t even know SQL) need to do sophisticated analysis on large quantities of information.
I realized earlier this year that other people do not use Hive the way that I use Hive, and so we created the data science course to take people through the problem of building an analytical application from start to finish on Hadoop. http://blog.cloudera.com/blog/2012/10/data-science-training/