Agile Data Science
Agile Analytics Applications with Hadoop

January 2014
About Me…Bearding.
• Bearding is my #1 natural talent.
• I’m going to beat this guy.
• Seriously.
• Salty Sea Beard
• Fortified with Pacific Ocean Minerals

2
Agile Data Science: The Book
A philosophy.
Not the only way,
but it’s a really good way!
Code: ‘AUTHD’ – 50% off

3
We Go Fast, But Don’t Worry!
• Download the slides - click the links - read examples!
• If it’s not on the blog (Hortonworks, Data Syndrome), it’s in
the book!
• Order now: http://shop.oreilly.com/product/0636920025054.do

4
Agile Application
Development: Check
• LAMP stack mature
• Post-Rails frameworks to choose from
• Enable rapid feedback and agility

+ NoSQL

5
Data Warehousing

6
Scientific Computing / HPC
‘Smart Kid’ Only: MPI, Globus, etc. Until Hadoop

Tubes and Mercury (Old School)

Cores and Spindles (New School)

UNIVAC and Deep Blue both fill a warehouse. We’re back!

7
Data Science?
Application
Development

Data Warehousing

Scientific Computing / HPC

8
Data Center as Computer
Warehouse Scale Computers and Applications

“A key challenge for architects of WSCs is to smooth out these discrepancies in a cost efficient
manner.” Click here for a paper on operating a ‘data center as computer.’
9
Hadoop to the Rescue!
• Easy to use (Pig, Hive, Cascading)
• CHEAP: 1% the cost of SAN/NAS
• A department can afford its own Hadoop cluster!
• Dump all your data in one place: Hadoop DFS
• Silos come CRASHING DOWN!
• JOIN like crazy!
• ETL like whoa!
• An army of mappers and reducers at your command
• OMGWTFBBQ ITS SO GREAT! I FEEL AWESOME!

10
NOW
WHAT?
11
Analytics Apps: It takes a Team
• Broad skill-set
• Nobody has them all
• Inherently collaborative

12
Data Science Team
• 3-4 team members with broad, diverse skill-sets that overlap
• Transactional overhead dominates at 5+ people
• Expert researchers: lend 25-50% of their time to teams
• Creative workers. Like a studio, not an assembly line
• Total freedom... with goals and deliverables.
• Work environment matters most

13
How To Get Insight Into Product
• Back-end has gotten THICKER
• Generating $$$ insight can take 10-100x app dev
• Timeline disjoint: analytics vs agile app-dev/design
• How do you ship insights efficiently?
• Can you collaborate on research vs developer timeline?

14
The Wrong Way - Part One
“We made a great design.
Your job is to predict the future for it.”

15
The Wrong Way - Part Two
“What is taking you so long
to reliably predict the future?”

16
The Wrong Way - Part Three
“The users don’t understand
what 86% true means.”

17
The Wrong Way - Part Four
GHJIAEHGIEhjagigehganb!!!!!RJ(@J?!!

18
The Wrong Way - Conclusion
Inevitable Conclusion

Plane

Mountain

19
Reminds me of... the waterfall
model

:(
20
Chief Problem
You can’t design insight in analytics applications.
You discover it.
You discover by exploring.

21
-> Strategy
So make an app for exploring your data.
Iterate and publish intermediate results.
Which becomes a palette for what you ship.

22
Data Design
• Not the 1st query that = insight, it’s the 15th, or 150th
• Capturing “Ah ha!” moments
• Slow to do those in batch…
• Faster, better context in an interactive web application.
• Pre-designed charts wind up terrible. So bad.
• Easy to invest man-years in wrong statistical models
• Semantics of presenting predictions are complex
• Opportunity lies at intersection of data & design

23
How Do We Get Back to Agile?

24
Statement of Principles
(Then Tricks With Code)

25
Setup An Environment Where:
• Insights repeatedly produced
• Iterative work shared with entire team
• Interactive from day Zero
• Data model is consistent end-to-end
• Minimal impedance between layers
• Scope and depth of insights grow
• Insights form the palette for what you ship
• Until the application pays for itself and more

26
Snowballing Audience

27
Value Document > Relation

Most data is dirty. Most data is semi-structured or unstructured. Rejoice!
28
Value Document > Relation

Note: Hive/ArrayQL/NewSQL’s support of documents/array types blur this distinction.
29
Relational Data = Legacy Format
• Why JOIN? Storage is fundamentally cheap!
• Duplicate that JOIN data in one big record type!
• ETL once to document format on import, NOT every job
• Not zero JOINs, but far fewer JOINs

• Semi-structured documents preserve data’s actual structure
• Column compressed document formats beat JOINs!

30
Value Imperative > Declarative
• We don’t know what we want to SELECT.
• Data is dirty - check each step, clean iteratively.
• 85% of data scientist’s time spent munging. ETL.
• Imperative is optimized for our process.
• Process = iterative, snowballing insight
• Efficiency matters - self-optimize

31
Value Dataflow > SELECT

32
Ex. Dataflow: ETL +
Email Sent Count

(I can’t read this either. Get a big version here.)
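A minimal sketch of the counting half of that dataflow, assuming the emails documents produced by the ETL step shown later (paths and field names as on slide 39):

/* Load the ETL'd email documents */
emails = load '/enron/emails.avro' using AvroStorage();

/* Count emails sent per from address */
sent_counts = foreach (group emails by from.address) generate
    group as from_address,
    COUNT_STAR(emails) as total_sent;

store sent_counts into '/enron/sent_counts.avro' using AvroStorage();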
33
Value Pig > Hive (for app-dev)
• Pigs eat ANYTHING
• Pig is optimized for refining data, as opposed to consuming it
• Pig is imperative, iterative
• Pig is dataflows, and SQLish (but not SQL)
• Code modularization/re-use: Pig Macros
• ILLUSTRATE speeds dev time (even UDFs)
• Easy UDFs in Java, JRuby, Jython, Javascript
• Pig Streaming = use any tool, period.
• Easily prepare our data as it will appear in our app.
• If you prefer Hive, use Hive.

Actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive.
See: HCatalog for Pig/Hive integration.
34
Localhost vs Petabyte Scale:
Same Tools
• Simplicity is essential to scalability: use the highest-level tools we can
• Prepare a good sample - tricky with joins, easy with documents
• Local mode: pig -l /tmp -x local -v -w
• Frequent use of ILLUSTRATE
• 1st: Iterate, debug & publish locally
• 2nd: Run on cluster, publish to team/customer
• Consider skipping Object-Relational-Mapping (ORM)
• We do not trust ‘databases,’ only HDFS @ n=3
• Everything we serve in our app is re-creatable via Hadoop.

35
Data-Value Pyramid

Climb it. Do not skip steps. See here.
36
0/1) Display Atomic Records
On The Web

37
0.0) Document - Serialize Events
• Protobuf
• Thrift
• JSON
• Avro - I use Avro because the schema is onboard.

38
0.1) Documents Via Relation ETL
enron_messages = load '/enron/enron_messages.tsv' as (
message_id:chararray,
sql_date:chararray,
from_address:chararray,
from_name:chararray,
subject:chararray,
body:chararray);
enron_recipients = load '/enron/enron_recipients.tsv' as ( message_id:chararray, reciptype:chararray, address:chararray,
name:chararray);
split enron_recipients into tos IF reciptype=='to', ccs IF reciptype=='cc', bccs IF reciptype=='bcc';
headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10;
with_headers = join headers by group, enron_messages by message_id parallel 10;
emails = foreach with_headers generate enron_messages::message_id as message_id,
CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date,
TOTUPLE(enron_messages::from_address, enron_messages::from_name) as from:tuple(address:chararray,
name:chararray),

enron_messages::subject as subject,
enron_messages::body as body,
headers::tos.(address, name) as tos,
headers::ccs.(address, name) as ccs,
headers::bccs.(address, name) as bccs;

store emails into '/enron/emails.avro' using AvroStorage();

Example here.
39
0.2) Serialize Events From
Streams

class GmailSlurper(object):
  ...
  def init_imap(self, username, password):
    self.username = username
    self.password = password
    try:
      self.imap.shutdown()
    except:
      pass
    self.imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)
    self.imap.login(username, password)
    self.imap.is_readonly = True
  ...
  def write(self, record):
    self.avro_writer.append(record)
  ...
  def slurp(self):
    if(self.imap and self.imap_folder):
      for email_id in self.id_list:
        (status, email_hash, charset) = self.fetch_email(email_id)
        if(status == 'OK' and charset and 'thread_id' in email_hash and 'froms' in email_hash):
          print email_id, charset, email_hash['thread_id']
          self.write(email_hash)

Scrape your own gmail in Python and Ruby.
40
0.3) ETL Logs
log_data = LOAD 'access_log'
USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader
AS (remoteAddr,
remoteLogname,
user,
time,
method,
uri,
proto,
bytes);

41
1) Plumb Atomic Events->Browser

(Example stack that enables high productivity)
42
1.1) Cat Avro Serialized Events
me$ cat_avro ~/Data/enron.avro
{

u'bccs': [],
u'body': u'scamming people, blah blah',
u'ccs': [], u'date': u'2000-08-28T01:50:00.000Z',
u'from': {u'address': u'bob.dobbs@enron.com', u'name': None},
u'message_id': u'<1731.10095812390082.JavaMail.evans@thyme>',
u'subject': u'Re: Enron trade for frop futures',
u'tos': [
{u'address': u'connie@enron.com', u'name': None}
]

}

Get cat_avro in python, ruby
43
1.2) Load Events in Pig
me$ pig -l /tmp -x local -v -w
grunt> enron_emails = LOAD '/enron/emails.avro' USING AvroStorage();
grunt> describe enron_emails
emails: {
message_id: chararray,
datetime: chararray,
from:tuple(address:chararray,name:chararray),
subject: chararray,
body: chararray,
tos: {to: (address: chararray,name: chararray)},
ccs: {cc: (address: chararray,name: chararray)},
bccs: {bcc: (address: chararray,name: chararray)}
}
 

44
1.3) ILLUSTRATE Events in Pig
grunt> illustrate enron_emails
--------------------------------------------------------------------------
| emails                                                                  |
--------------------------------------------------------------------------
| message_id:chararray                                  | <1731.10095812390082.JavaMail.evans@thyme> |
| datetime:chararray                                    | 2001-01-09T06:38:00.000Z                   |
| from:tuple(address:chararray,name:chararray)          | (bob.dobbs@enron.com, J.R. Bob Dobbs)      |
| subject:chararray                                     | Re: Enron trade for frop futures           |
| body:chararray                                        | scamming people, blah blah                 |
| tos:bag{to:tuple(address:chararray,name:chararray)}   | {(connie@enron.com,)}                      |
| ccs:bag{cc:tuple(address:chararray,name:chararray)}   | {}                                         |
| bccs:bag{bcc:tuple(address:chararray,name:chararray)} | {}                                         |
--------------------------------------------------------------------------

Upgrade to Pig 0.10+
45
1.4) Publish Events to a ‘Database’
From Avro to MongoDB in one command:
pig -l /tmp -x local -v -w -param avros=enron.avro \
  -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig

Which does this:
/* MongoDB libraries and configuration */
register /me/mongo-hadoop/mongo-2.7.3.jar
register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar
/* Set speculative execution off to avoid chance of duplicate records in Mongo */
set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false
define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */
/* By default, lets have 5 reducers */
set default_parallel 5
avros = load '$avros' using AvroStorage();
store avros into '$mongourl' using MongoStorage();

Full instructions here.
46
1.5) Check Events in ‘Database’
$ mongo enron
MongoDB shell version: 2.0.2
connecting to: enron
show collections
emails
system.indexes
>db.emails.findOne({message_id: "<1731.10095812390082.JavaMail.evans@thyme>"})
{
"_id" : ObjectId("502b4ae703643a6a49c8d180"),
"message_id" : "<1731.10095812390082.JavaMail.evans@thyme>",
"date" : "2001-01-09T06:38:00.000Z",
"from" : { "address" : "bob.dobbs@enron.com", "name" : "J.R. Bob Dobbs" },
"subject" : Re: Enron trade for frop futures,
"body" : "Scamming more people...",
"tos" : [ { "address" : "connie@enron", "name" : null } ],
"ccs" : [ ],
"bccs" : [ ]
}

47
1.6) Publish Events on the Web
require 'rubygems'
require 'sinatra'
require 'mongo'
require 'json'
connection = Mongo::Connection.new
database = connection['agile_data']
collection = database['emails']
get '/email/:message_id' do |message_id|
data = collection.find_one({:message_id => message_id})
JSON.generate(data)
end

48
1.6) Publish events on the web

49
One-Liner to Transition Stack

50
What’s the Point?
• A designer can work against real data.
• An application developer can work against real data.
• A product manager can think in terms of real data.
• Entire team is grounded in reality!
• You’ll see how ugly your data really is.
• You’ll see how much work you have yet to do.
• Ship early and often!
• Feels agile, don’t it? Keep it up!

51
1.7) Wrap Events with Bootstrap
<link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">
</head>
<body>
<div class="container" style="margin-top: 100px;">
<table class="table table-striped table-bordered table-condensed">
<thead>
{% for key in data['keys'] %}
<th>{{ key }}</th>
{% endfor %}
</thead>
<tbody>
<tr>
{% for value in data['values'] %}
<td>{{ value }}</td>
{% endfor %}
</tr>
</tbody>
</table>
</div>
</body>

Complete example here with code here.
52
1.7) Wrap Events with Bootstrap

53
Refine. Add Links
Between Documents.

Not the Mona Lisa, but coming along... See: here
54
1.8) List Links to Sorted Events
Use Pig, serve/cache a bag/array of email documents:
pig -l /tmp -x local -v -w
emails_per_user = foreach (group emails by from.address) {
sorted = order emails by date;
last_1000 = limit sorted 1000;
generate group as from_address, last_1000 as emails;
};
store emails_per_user into '$mongourl' using MongoStorage();

Use your ‘database’, if it can sort.
mongo enron
> db.emails.ensureIndex({message_id: 1})
> db.emails.find().sort({date: -1}).limit(10).pretty()
{
{
"_id" : ObjectId("4f7a5da2414e4dd0645d1176"),
"message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>",
"from" : [
...

56
1.8) List Links
to Sorted Documents

57
1.9) Make It Searchable
If you have a list, search is easy with
ElasticSearch and Wonderdog...
/* Load ElasticSearch integration */
register '/me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar';
register '/me/elasticsearch-0.18.6/lib/*';
define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();
emails = load '/me/tmp/emails' using AvroStorage();
store emails into 'es://email/email?json=false&size=1000' using ElasticSearch('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins');

Test it with curl:
curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1'

ElasticSearch has no security features. Take note. Isolate.
58
2) Create Simple Charts

59
2) Create Simple Tables and
Charts

60
2) Create Simple Charts
• Start with an HTML table on general principle.
• Then use nvd3.js - reusable charts for d3.js
• Aggregating by properties & displaying them is the first step in entity resolution
• Start extracting entities. Ex: people, places, topics, time series
• Group documents by entities, rank and count.
• Publish top N, time series, etc.
• Fill a page with charts.
• Add a chart to your event page.

61
2.1) Top N (of Anything) in Pig
pig -l /tmp -x local -v -w
top_things = foreach (group things by key) {
sorted = order things by arbitrary_rank desc;
top_10_things = limit sorted 10;
generate group as key, top_10_things as top_10_things;
};
store top_things into '$mongourl' using MongoStorage();

Remember, this is the same structure the browser gets as json.
This would make a good Pig Macro.
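A hedged sketch of what such a macro might look like - the macro name and parameters are invented for illustration:

/* Hypothetical top_n macro: top $n records per $key, ordered by $rank descending */
DEFINE top_n(things, key, rank, n) RETURNS top_things {
    $top_things = foreach (group $things by $key) {
        sorted = order $things by $rank desc;
        top = limit sorted $n;
        generate group as key, top as top_records;
    };
};

/* Usage - same result as the script above, in one line */
top_things = top_n(things, key, arbitrary_rank, 10);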
62
2.2) Time Series (of Anything) in
Pig
pig -l /tmp -x local -v -w
/* Group by our key and date rounded to the month, get a total */
things_by_month = foreach (group things by (key, ISOToMonth(datetime)))
generate flatten(group) as (key, month),
COUNT_STAR(things) as total;
/* Sort our totals per key by month to get a time series */
things_timeseries = foreach (group things_by_month by key) {
timeseries = order things_by_month by month;
generate group as key, timeseries as timeseries;
};
store things_timeseries into '$mongourl' using MongoStorage();

Yet another good Pig Macro.
63
Data Processing in Our Stack
A new feature in our application might begin at any layer…
GREAT!

I’m creative!
I know Pig!

I’m creative too!
I <3 Javascript!

omghi2u!
where r my legs?
send halp

Any team member can add new features, no problemo!
64
Data Processing in Our Stack
... but we shift the data-processing towards batch, as we are able.

See real example here.
Ex: Overall total emails calculated in each layer
65
3) Exploring with Reports

66
3) Exploring with Reports

67
3.0) From Charts to Reports
• Extract entities from properties we aggregated by in charts (Step 2)
• Each entity gets its own type of web page
• Each unique entity gets its own web page
• Link to entities as they appear in atomic event documents (Step 1)
• Link the most related entities together, within and between types.
• More visualizations!
• Parameterize results via forms.

68
3.1) Looks Like This:

69
3.2) Cultivate Common Keyspaces

70
3.3) Get People Clicking. Learn.
• Explore this web of generated pages, charts and links!
• Everyone on the team gets to know your data.
• Keep trying out different charts, metrics, entities, links.
• See what’s interesting.
• Figure out what data needs cleaning and clean it.
• Start thinking about predictions & recommendations.

‘People’ could be just your team, if data is sensitive.
71
4) Predictions and
Recommendations

72
4.0) Preparation
• We’ve already extracted entities, their properties and relationships
• Our charts show where our signal is rich
• We’ve cleaned our data to make it presentable
• The entire team has an intuitive understanding of the data
• They got that understanding by exploring the data
• We are all on the same page!

73
4.2) Think in Different
Perspectives
• Networks
• Time Series / Distributions
• Natural Language Processing
• Conditional Probabilities / Bayesian Inference
• Check out Chapter 2 of the book

74
4.3) Networks

75
4.3.1) Weighted Email
Networks in Pig
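
The Pig from this slide did not survive the export; here is a minimal sketch of one way to build a weighted sender-to-recipient edge list, assuming the emails documents from earlier:

emails = load '/enron/emails.avro' using AvroStorage();

/* One row per (from, to) pair */
pairs = foreach emails generate from.address as from_address, flatten(tos.address) as to_address;

/* Edge weight = number of emails from -> to */
edges = foreach (group pairs by (from_address, to_address)) generate
    flatten(group) as (from_address, to_address),
    COUNT_STAR(pairs) as weight;

store edges into '/enron/email_network.avro' using AvroStorage();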

76
4.3.2) Networks Viz with Gephi

77
4.3.3) Gephi = Easy

78
4.3.4) Social Network Analysis

79
4.4) Time Series & Distributions

80
4.4.1) Smooth Sparse Data

See here.

81
4.4.2) Regress to Find Trends
JRuby Linear Regression UDF

Pig to use the UDF

Trend Line in your Application
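
A hedged sketch of the shape of that script - the JRuby file and its regress() function are hypothetical, though register-using-jruby is standard Pig 0.10+ syntax:

/* Hypothetical JRuby UDF that fits a line over a bag of (x, y) points */
register 'udfs/linear_regression.rb' using jruby as lr;

emails = load '/enron/emails.avro' using AvroStorage();

/* Monthly totals per sender, as in the time series example earlier */
by_month = foreach (group emails by (from.address, ISOToMonth(date)))
    generate flatten(group) as (address, month), COUNT_STAR(emails) as total;

/* Fit a trend line over each sender's monthly totals */
trends = foreach (group by_month by address) generate
    group as address,
    lr.regress(by_month.(month, total)) as trend;

store trends into '$mongourl' using MongoStorage();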

82
4.5.1) Natural Language
Processing

Example with code here and macro here.
83
4.5.2) NLP: Extract Topics!

84
4.5.3) NLP for All: Extract Topics!
• TF-IDF in Pig - 2 lines of code with Pig Macros:
• http://hortonworks.com/blog/pig-macro-for-tf-idf-makes-topic-summarization-2-lines-of-pig/
• LDA with Pig and the Lucene Tokenizer:
• http://thedatachef.blogspot.be/2012/03/topic-discovery-with-apache-pig-and.html

85
4.6) Probability & Bayesian
Inference

86
4.6.1) Gmail Suggested Recipients

87
4.6.1) Reproducing it with Pig

88
4.6.2) Step 1: COUNT (From -> To)

89
4.6.2) Step 2: COUNT
(From, To, Cc)/Total

P(cc | to) = Probability of cc’ing someone, given that you’ve to’d someone
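
A minimal Pig sketch of the two counts and the division, assuming a relation sent with one record per (from, to, cc) combination (relation and field names are illustrative):

/* Step 1: count each (from, to) pair */
from_to = foreach (group sent by (from, to)) generate
    flatten(group) as (from, to),
    COUNT_STAR(sent) as to_total;

/* Step 2: count each (from, to, cc) triple */
from_to_cc = foreach (group sent by (from, to, cc)) generate
    flatten(group) as (from, to, cc),
    COUNT_STAR(sent) as cc_total;

/* P(cc | to) = count(from, to, cc) / count(from, to) */
joined = join from_to_cc by (from, to), from_to by (from, to);
p_cc_given_to = foreach joined generate
    from_to_cc::from as from,
    from_to_cc::to as to,
    from_to_cc::cc as cc,
    (double)cc_total / (double)to_total as probability;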

90
4.6.3) Wait - Stop Here! It Works!

They match…
91
4.4) Add Predictions to Reports

92
5) Enable New Actions

93
Why Doesn’t Kate Reply
to My Emails?
• What time is best to catch her?
• Are they too long?
• Are they meant to be replied to (original content)?
• Are they nice? (sentiment analysis)
• Do I reply to her emails (reciprocity)?
• Do I cc the wrong people (my mom)?

94
Thank You!
• Questions & Answers

• Follow: @rjurney
• Read the Blog: datasyndrome.com

97
