Agile Analytics Applications on HDP
Russell Jurney (@rjurney) - Hadoop Evangelist @Hortonworks

Formerly Viz, Data Science at Ning, LinkedIn

HBase Dashboards, Career Explorer, InMaps




About me... Bearding.

• I’m going to beat this guy


• Seriously


• Bearding is my #1 natural talent


• Salty Sea Beard


• Fortified with Pacific Ocean Minerals
Agile Data - The Book (July, 2013)


                              Read on Safari Rough Cuts


                                 Early Release Here


                                    Code Here


We go fast... but don’t worry!
• Examples for EVERYTHING on the Hortonworks blog:
  http://hortonworks.com/blog/authors/russell_jurney

• Download the slides - click the links - read examples!

• If it's not on the blog, it's in the book!

• Order now: http://shop.oreilly.com/product/0636920025054.do

• Read the book Friday on Safari Rough Cuts



HDP Sandbox - Talk Lessons Coming!




Agile Application Development: Check
• LAMP stack mature
• Post-Rails frameworks to choose from
• Enable rapid feedback and agility




                                   + NoSQL



Data Warehousing




Scientific Computing / HPC

  • ‘Smart kid’ only: MPI, Globus, etc. until Hadoop




Tubes and Mercury (old school)      Cores and Spindles (new school)

         UNIVAC and Deep Blue both fill a warehouse. We’re back...

Data Science?

(Venn diagram: Data Science sits at the intersection of Application Development, Data Warehousing, and Scientific Computing / HPC)
Data Center as Computer
     • Warehouse Scale Computers and applications




“A key challenge for architects of WSCs is to smooth out these discrepancies in a cost efficient manner.”
Click here for a paper on operating a ‘data center as computer.’

Tez – Faster MapReduce!




Hadoop to the Rescue!




Hadoop to the Rescue!

• Easy to use! (Pig, Hive, Cascading)

• CHEAP: 1% the cost of SAN/NAS

• A department can afford its own Hadoop cluster!

• Dump all your data in one place: Hadoop DFS

• Silos come CRASHING DOWN!

• JOIN like crazy!

• ETL like whoah!

• An army of mappers and reducers at your command

• OMGWTFBBQ ITS SO GREAT! I FEEL AWESOME!
NOW WHAT?




Analytics Apps: It takes a Team
• Broad skill-set
• Nobody has them all
• Inherently collaborative




Data Science Team
• 3-4 team members with broad, diverse skill-sets that overlap

• Transactional overhead dominates at 5+ people

• Expert researchers: lend 25-50% of their time to teams

• Creative workers. Run like a studio, not an assembly line

• Total freedom... with goals and deliverables.

• Work environment matters most

How to get insight into product?

• Back-end has gotten t-h-i-c-k-e-r



• Generating $$$ insight can take 10-100x app dev



• Timeline disjoint: analytics vs agile app-dev/design



• How do you ship insights efficiently?



• How do you collaborate on research vs developer timeline?
The Wrong Way - Part One




“We made a great design. Your job is to predict the future for it.”




The Wrong Way - Part Two




“What's taking you so long to reliably predict the future?”




The Wrong Way - Part Three




  “The users don’t understand what 86% true means.”




The Wrong Way - Part Four




 GHJIAEHGIEhjagigehganbanbigaebjnain!!!!!RJ(@J?!!




The Wrong Way - Inevitable Conclusion




(Image: plane, meet mountain)


Reminds me of... the waterfall model




:(
Chief Problem


You can’t design insight in analytics applications.


                               You discover it.


                       You discover by exploring.


-> Strategy


   So make an app for exploring your data.


  Iterate and publish intermediate results.


 Which becomes a palette for what you ship.


Data Design

• It's not the 1st query that = insight, it's the 15th, or the 150th

• Capturing “Ah ha!” moments

• Slow to do those in batch...

• Faster, better context in an interactive web application.

• Pre-designed charts wind up terrible. So bad.

• Easy to invest man-years in the wrong statistical models

• Semantics of presenting predictions are complex, delicate

• Opportunity lies at intersection of data & design
How do we get back to Agile?




Statement of Principles




                              (then tricks, with code)




Setup an environment where...

• Insights repeatedly produced

• Iterative work shared with entire team

• Interactive from day 0

• Data model is consistent end-to-end

• Minimal impedance between layers

• Scope and depth of insights grow

• Insights form the palette for what you ship

• Until the application pays for itself and more
Value document > relation




Most data is dirty. Most data is semi-structured or un-structured. Rejoice!
Value document > relation




Note: Hive/ArrayQL/NewSQL's support of document/array types blurs this distinction.
Relational Data = Legacy Format
• Why JOIN? Storage is fundamentally cheap!

• Duplicate that JOIN data in one big record type!

• ETL once to document format on import, NOT every job

• Not zero JOINs, but far fewer JOINs

• Semi-structured documents preserve data’s actual structure

• Column compressed document formats beat JOINs! (paper coming)
Value imperative > declarative

• We don’t know what we want to SELECT.

• Data is dirty - check each step, clean iteratively.

• 85% of a data scientist's time is spent munging. See: ETL.

• Imperative is optimized for our process.

• Process = iterative, snowballing insight

• Efficiency matters, self optimize




Value dataflow > SELECT




Ex. dataflow: ETL + email sent count




(I can’t read this either. Get a big version here.)
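The gist of that dataflow, as a tiny local sketch in Python (hand-rolled, not the actual Pig; record shape borrowed from the Avro emails in section 0.1):

from collections import Counter

def sent_counts(emails):
    """ETL + count: extract each sender address, then total emails sent per address."""
    counts = Counter()
    for email in emails:
        sender = (email.get('from') or {}).get('address')
        if sender:                       # dirty data: skip records with no sender
            counts[sender.lower()] += 1  # normalize case before counting
    return counts

emails = [
    {'from': {'address': 'bob.dobbs@enron.com', 'name': 'J.R. Bob Dobbs'}},
    {'from': {'address': 'BOB.DOBBS@enron.com'}},
    {'from': None},
]
print(sent_counts(emails))  # Counter({'bob.dobbs@enron.com': 2})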
Value Pig > Hive (for app-dev)
• Pigs eat ANYTHING
• Pig is optimized for refining data, as opposed to consuming it
• Pig is imperative, iterative
• Pig is dataflows, and SQLish (but not SQL)
• Code modularization/re-use: Pig Macros
• ILLUSTRATE speeds dev time (even UDFs)
• Easy UDFs in Java, JRuby, Jython, Javascript
• Pig Streaming = use any tool, period.
• Easily prepare our data as it will appear in our app.
• If you prefer Hive, use Hive.
But actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive...
                 See: HCatalog for Pig/Hive integration, and this post.

Localhost vs Petabyte scale: same tools
• Simplicity essential to scalability: use the highest-level tools we can
• Prepare a good sample - tricky with joins, easy with documents
• Local mode: pig -l /tmp -x local -v -w
• Frequent use of ILLUSTRATE
• 1st: Iterate, debug & publish locally
• 2nd: Run on cluster, publish to team/customer
• Consider skipping Object-Relational-Mapping (ORM)
• We do not trust ‘databases,’ only HDFS @ n=3.
• Everything we serve in our app is re-creatable via Hadoop.

Data-Value Pyramid




               Climb it. Do not skip steps. See here.

0/1) Display atomic records on the web




0.0) Document-serialize events

• Protobuf

• Thrift

• JSON

• Avro - I use Avro because the schema is onboard.
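A minimal sketch of “the schema is onboard”, assuming the Python avro package (avro.schema.parse is spelled Parse in some versions; the Email schema here is a cut-down stand-in):

import json
import avro.schema
from avro.datafile import DataFileWriter, DataFileReader
from avro.io import DatumWriter, DatumReader

# Cut-down stand-in schema; the real one carries from/tos/ccs/bccs structures.
schema = avro.schema.parse(json.dumps({
    "type": "record", "name": "Email",
    "fields": [
        {"name": "message_id", "type": "string"},
        {"name": "subject", "type": "string"},
    ],
}))

# Write: the schema is embedded in the file header.
writer = DataFileWriter(open("emails.avro", "wb"), DatumWriter(), schema)
writer.append({"message_id": "<1731.10095812390082.JavaMail.evans@thyme>",
               "subject": "Re: Enron trade for frop futures"})
writer.close()

# Read: no schema argument needed -- it's onboard.
reader = DataFileReader(open("emails.avro", "rb"), DatumReader())
for record in reader:
    print(record)
reader.close()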




0.1) Documents via Relation ETL

enron_messages = load '/enron/enron_messages.tsv' as (
     message_id:chararray,
     sql_date:chararray,
     from_address:chararray,
     from_name:chararray,
     subject:chararray,
     body:chararray
);


enron_recipients = load '/enron/enron_recipients.tsv' as ( message_id:chararray, reciptype:chararray, address:chararray, name:chararray);


split enron_recipients into tos IF reciptype=='to', ccs IF reciptype=='cc', bccs IF reciptype=='bcc';


headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10;
with_headers = join headers by group, enron_messages by message_id parallel 10;
emails = foreach with_headers generate enron_messages::message_id as message_id,
                                      CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date,
                                      TOTUPLE(enron_messages::from_address, enron_messages::from_name) as from:tuple(address:chararray, name:chararray),
                                      enron_messages::subject as subject,
                                      enron_messages::body as body,
                                      headers::tos.(address, name) as tos,
                                      headers::ccs.(address, name) as ccs,
                                      headers::bccs.(address, name) as bccs;


store emails into '/enron/emails.avro' using AvroStorage();

Example here.
0.2) Serialize events from streams
class GmailSlurper(object):
  ...
  def init_imap(self, username, password):
    self.username = username
    self.password = password
    try:
      self.imap.shutdown()  # close any previous connection before reconnecting
    except:
      pass
    self.imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)
    self.imap.login(username, password)
    self.imap.is_readonly = True
  ...
  def write(self, record):
    self.avro_writer.append(record)
  ...
  def slurp(self):
    if(self.imap and self.imap_folder):
      for email_id in self.id_list:
        (status, email_hash, charset) = self.fetch_email(email_id)
        if(status == 'OK' and charset and 'thread_id' in email_hash and 'froms' in email_hash):
          print email_id, charset, email_hash['thread_id']
          self.write(email_hash)




Scrape your own gmail in Python and Ruby.
0.3) ETL Logs



log_data = LOAD 'access_log'
   USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader
   AS (remoteAddr,
       remoteLogname,
       user,
       time,
       method,
       uri,
       proto,
       bytes);




1) Plumb atomic events -> browser




     (Example stack that enables high productivity)

Lots of Stack Options with Examples

• Pig with Voldemort, Ruby, Sinatra: example

• Pig with ElasticSearch: example

• Pig with MongoDB, Node.js: example

• Pig with Cassandra, Python Streaming, Flask: example

• Pig with HBase, JRuby, Sinatra: example

• Pig with Hive via HCatalog: example (trivial on HDP)

• Up next: Accumulo, Redis, MySQL, etc.



1.1) cat our Avro serialized events

me$ cat_avro ~/Data/enron.avro

{
  u'bccs': [], u'body': u'scamming people, blah blah', u'ccs': [],
  u'date': u'…28T01:50:00.000Z',
  u'from': {u'address': u'bob.dobbs@enron.com', u'name': u'J.R. Bob Dobbs'},
  u'message_id': u'<1731.10095812390082.JavaMail.evans@thyme>',
  u'subject': u'Re: Enron trade for frop futures',
  u'tos': [
    {u'address': u'connie@enron.com', u'name': None}
  ]
}




Get cat_avro in python, ruby
1.2) Load our events in Pig

me$ pig -l /tmp -x local -v -w
grunt> enron_emails = LOAD '/enron/emails.avro' USING AvroStorage();
grunt> describe enron_emails
enron_emails: {
  message_id: chararray,
  datetime: chararray,
  from: tuple(address: chararray, name: chararray),
  subject: chararray,
  body: chararray,
  tos: {to: (address: chararray,name: chararray)},
  ccs: {cc: (address: chararray,name: chararray)},
  bccs: {bcc: (address: chararray,name: chararray)}
}




1.3) ILLUSTRATE our events in Pig
grunt> illustrate enron_emails

--------------------------------------------------------------------------------------------------------
| emails                                                 |                                               |
| message_id:chararray                                   | <1731.10095812390082.JavaMail.evans@thyme>    |
| datetime:chararray                                     | 2001-01-09T06:38:00.000Z                      |
| from:tuple(address:chararray,name:chararray)           | (bob.dobbs@enron.com, J.R. Bob Dobbs)         |
| subject:chararray                                      | Re: Enron trade for frop futures              |
| body:chararray                                         | scamming people, blah blah                    |
| tos:bag{to:tuple(address:chararray,name:chararray)}    | {(connie@enron.com,)}                         |
| ccs:bag{cc:tuple(address:chararray,name:chararray)}    | {}                                            |
| bccs:bag{bcc:tuple(address:chararray,name:chararray)}  | {}                                            |
--------------------------------------------------------------------------------------------------------
                                                   Upgrade to Pig 0.10+
1.4) Publish our events to a ‘database’
From Avro to MongoDB in one command:
pig -l /tmp -x local -v -w -param avros=enron.avro \
   -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig


Which does this:



/* MongoDB libraries and configuration */
register /me/mongo-hadoop/mongo-2.7.3.jar
register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar

/* Set speculative execution off to avoid chance of duplicate records in MongoDB */
set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false

define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */

/* By default, lets have 5 reducers */
set default_parallel 5

avros = load '$avros' using AvroStorage();
store avros into '$mongourl' using MongoStorage();




Full instructions here.
1.5) Check events in our ‘database’




$ mongo enron
MongoDB shell version: 2.0.2
connecting to: enron
> show collections
emails
system.indexes
> db.emails.findOne({message_id: "<1731.10095812390082.JavaMail.evans@thyme>"})
{
    "_id" : ObjectId("502b4ae703643a6a49c8d180"),
    "message_id" : "<1731.10095812390082.JavaMail.evans@thyme>",
    "date" : "2001-01-09T06:38:00.000Z",
    "from" : { "address" : "bob.dobbs@enron.com", "name" : "J.R. Bob Dobbs" },
    "subject" : "Re: Enron trade for frop futures",
    "body" : "Scamming more people...",
    "tos" : [ { "address" : "connie@enron", "name" : null } ],
    "ccs" : [ ],
    "bccs" : [ ]
}




1.6) Publish events on the web



require 'rubygems'
require 'sinatra'
require 'mongo'
require 'json'
connection = Mongo::Connection.new
database = connection['agile_data']
collection = database['emails']
get '/email/:message_id' do |message_id|
  data = collection.find_one({:message_id => message_id})
  JSON.generate(data)
end
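Same endpoint, sketched in the Python corner of the stack, assuming Flask and pymongo (database/collection names mirror the Ruby version above; not the book's code):

from flask import Flask
from pymongo import MongoClient
from bson import json_util

app = Flask(__name__)
collection = MongoClient()['agile_data']['emails']

@app.route('/email/<path:message_id>')
def email(message_id):
    # Same lookup as the Sinatra version: one document by message_id
    data = collection.find_one({'message_id': message_id})
    return json_util.dumps(data)  # json_util handles ObjectId, dates, etc.

if __name__ == '__main__':
    app.run(debug=True)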




1.6) Publish events on the web




What's the point?

• A designer can work against real data.

• An application developer can work against real data.

• A product manager can think in terms of real data.

• Entire team is grounded in reality!

• You’ll see how ugly your data really is.

• You’ll see how much work you have yet to do.

• Ship early and often!

• Feels agile, don’t it? Keep it up!

1.7) Wrap events with Bootstrap
<link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">
</head>
<body>
<div class="container" style="margin-top: 100px;">
  <table class="table table-striped table-bordered table-condensed">
    <thead>
    {% for key in data['keys'] %}
      <th>{{ key }}</th>
    {% endfor %}
    </thead>
    <tbody>
      <tr>
      {% for value in data['values'] %}
          <td>{{ value }}</td>
      {% endfor %}
      </tr>
    </tbody>
  </table>
</div>
</body>

Complete example here with code here.

1.7) Wrap events with Bootstrap




Refine. Add links between documents.




                       Not the Mona Lisa, but coming along... See: here
1.8) List links to sorted events
Use Pig, serve/cache a bag/array of email documents:
pig -l /tmp -x local -v -w


emails_per_user = foreach (group emails by from.address) {
  sorted = order emails by date;
  last_1000 = limit sorted 1000;
  generate group as from_address, last_1000 as emails;
};


store emails_per_user into '$mongourl' using MongoStorage();

Use your ‘database’, if it can sort.
mongo enron
> db.emails.ensureIndex({message_id: 1})
> db.emails.find().sort({date: -1}).limit(10).pretty()
  {
        {
            "_id" : ObjectId("4f7a5da2414e4dd0645d1176"),
            "message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>",
            "from" : [
  ...

1.8) List links to sorted documents




1.9) Make it searchable...
If you have a list, search is easy with ElasticSearch and Wonderdog...

/* Load ElasticSearch integration */
register '/me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar';
register '/me/elasticsearch-0.18.6/lib/*';
define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();


emails = load '/me/tmp/emails' using AvroStorage();
store emails into 'es://email/email?json=false&size=1000' using
  ElasticSearch('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins');


Test it with curl:
 curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1'



ElasticSearch has no security features. Take note. Isolate.

From now on we speed up...




Don’t worry, it's in the book and on the blog.

                             http://hortonworks.com/blog/




2) Create Simple Charts




2) Create Simple Tables and Charts




2) Create Simple Charts

• Start with an HTML table on general principle.

• Then use nvd3.js - reusable charts for d3.js

• Aggregating & displaying by properties is the first step in entity resolution
• Start extracting entities. Ex: people, places, topics, time series

• Group documents by entities, rank and count.

• Publish top N, time series, etc.

• Fill a page with charts.

• Add a chart to your event page.
2.1) Top N (of anything) in Pig

pig -l /tmp -x local -v -w

top_things = foreach (group things by key) {
  sorted = order things by arbitrary_rank desc;
  top_10_things = limit sorted 10;
  generate group as key, top_10_things as top_10_things;
};

store top_things into '$mongourl' using MongoStorage();




Remember, this is the same structure the browser gets as json.

                       This would make a good Pig Macro.

2.2) Time Series (of anything) in Pig
pig -l /tmp -x local -v -w

/* Group by our key and date rounded to the month, get a total */
things_by_month = foreach (group things by (key, ISOToMonth(datetime)))
  generate flatten(group) as (key, month),
           COUNT_STAR(things) as total;

/* Sort our totals per key by month to get a time series */
things_timeseries = foreach (group things_by_month by key) {
  timeseries = order things_by_month by month;
  generate group as key, timeseries as timeseries;
};

store things_timeseries into '$mongourl' using MongoStorage();



                                  Yet another good Pig Macro.


Data processing in our stack

A new feature in our application might begin at any layer... great!




(Cartoon: "I'm creative! I know Pig!" / "I'm creative too! I <3 Javascript!" / "omghi2u! where r my legs? send halp")




    Any team member can add new features, no problemo!
Data processing in our stack
... but we shift the data-processing towards batch, as we are able.




                                                        See real example here.
                               Ex: Overall total emails calculated in each layer
3) Exploring with Reports




3) Exploring with Reports




3.0) From charts to reports...

• Extract entities from properties we aggregated by in charts (Step 2)

• Each entity gets its own type of web page

• Each unique entity gets its own web page

• Link to entities as they appear in atomic event documents (Step 1)

• Link the most related entities together, within and across types.

• More visualizations!

• Parameterize results via forms.



3.1) Looks like this...




3.2) Cultivate common keyspaces




3.3) Get people clicking. Learn.

• Explore this web of generated pages, charts and links!

• Everyone on the team gets to know your data.

• Keep trying out different charts, metrics, entities, links.

• See what's interesting.

• Figure out what data needs cleaning and clean it.

• Start thinking about predictions & recommendations.


    ‘People’ could be just your team, if data is sensitive.

4) Predictions and Recommendations




4.0) Preparation

• We’ve already extracted entities, their properties and relationships

• Our charts show where our signal is rich

• We’ve cleaned our data to make it presentable

• The entire team has an intuitive understanding of the data

• They got that understanding by exploring the data

• We are all on the same page!




4.2) Think in different perspectives

• Networks



• Time Series / Distributions



• Natural Language Processing



• Conditional Probabilities / Bayesian Inference



• Check out Chapter 2 of the book... here.
4.3) Networks




4.3.1) Weighted Email Networks in Pig

DEFINE header_pairs(email, col1, col2) RETURNS pairs {
  filtered = FILTER $email BY ($col1 IS NOT NULL) AND ($col2 IS NOT NULL);
  flat = FOREACH filtered GENERATE FLATTEN($col1) AS $col1, FLATTEN($col2) AS $col2;
  $pairs = FOREACH flat GENERATE LOWER($col1) AS ego1, LOWER($col2) AS ego2;
}

/* Get email address pairs for each type of connection, and union them together */
emails = LOAD '/me/Data/enron.avro' USING AvroStorage();
from_to = header_pairs(emails, from, to);
from_cc = header_pairs(emails, from, cc);
from_bcc = header_pairs(emails, from, bcc);
pairs = UNION from_to, from_cc, from_bcc;

/* Get a count of emails over these edges. */
pair_groups = GROUP pairs BY (ego1, ego2);
sent_counts = FOREACH pair_groups GENERATE FLATTEN(group) AS (ego1, ego2), COUNT_STAR(pairs) AS total;
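Before handing edges to Gephi (next up), a quick local sanity check, assuming networkx and a couple of illustrative (ego1, ego2, total) rows exported from sent_counts:

import networkx as nx

# Illustrative rows only; real edges come from the sent_counts relation above
edges = [
    ("bob.dobbs@enron.com", "connie@enron.com", 37),
    ("connie@enron.com", "bob.dobbs@enron.com", 14),
]

G = nx.DiGraph()
for ego1, ego2, total in edges:
    G.add_edge(ego1, ego2, weight=total)  # email count becomes edge weight

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
nx.write_gexf(G, "email_network.gexf")  # GEXF files open directly in Gephi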




4.3.2) Networks Viz with Gephi




4.3.3) Gephi = Easy




4.3.4) Social Network Analysis




4.4) Time Series & Distributions


pig -l /tmp -x local -v -w


/* Count things per day */
things_per_day = foreach (group things by (key, ISOToDay(datetime)))
  generate flatten(group) as (key, day),
           COUNT_STAR(things) as total;

/* Sort our totals per key by day to get a sorted time series */
things_timeseries = foreach (group things_per_day by key) {
  timeseries = order things_per_day by day;
  generate group as key, timeseries as timeseries;
};


store things_timeseries into '$mongourl' using MongoStorage();




4.4.1) Smooth Sparse Data




See here.
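One way to do it, sketched in Python, assuming daily totals arrive as a sparse dict of date -> count and that a centered moving average is smoothing enough:

from datetime import date, timedelta

def smooth(daily_totals, window=7):
    days = sorted(daily_totals)
    start, end = days[0], days[-1]
    # Densify: give every calendar day a value, absent days count as zero
    span = [start + timedelta(d) for d in range((end - start).days + 1)]
    dense = [daily_totals.get(d, 0) for d in span]
    # Centered moving average over the dense series
    half = window // 2
    out = []
    for i in range(len(dense)):
        vals = dense[max(0, i - half):i + half + 1]
        out.append((span[i], sum(vals) / float(len(vals))))
    return out

print(smooth({date(2001, 1, 1): 5, date(2001, 1, 4): 9, date(2001, 1, 8): 2}, window=3))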
4.4.2) Regress to find Trends
JRuby Linear Regression UDF      Pig to use the UDF




                                 Trend Line in your Application
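The UDF on the slide isn't legible here; as a stand-in, the same idea in Python, assuming numpy (a least-squares line fit over day-indexed totals):

import numpy as np

days   = np.array([0, 1, 2, 3, 4, 5, 6])         # days since start of the series
totals = np.array([12, 15, 11, 18, 21, 19, 25])  # e.g. emails per day (made up)

slope, intercept = np.polyfit(days, totals, 1)  # degree-1 fit = linear regression
trend = slope * days + intercept                # y-values for the trend line
print("slope=%.2f intercept=%.2f" % (slope, intercept))
print(trend.round(1))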




4.5.1) Natural Language Processing



 import 'tfidf.macro';
 my_tf_idf_scores = tf_idf(id_body, 'message_id', 'body');

 /* Get the top 10 TF*IDF scores per message */
 per_message_cassandra = foreach (group my_tf_idf_scores by message_id) {
   sorted = order my_tf_idf_scores by value desc;
   top_10_topics = limit sorted 10;
   generate group, top_10_topics.(score, value);
 };




Example with code here and macro here.
4.5.2) NLP: Extract Topics!




4.5.3) NLP for All: Extract Topics!


• TF-IDF in Pig - 2 lines of code with Pig Macros:
• http://hortonworks.com/blog/pig-macro-for-tf-idf-makes-topic-summarization-2-lines-of-pig/



• LDA with Pig and the Lucene Tokenizer:
• http://thedatachef.blogspot.be/2012/03/topic-discovery-with-apache-pig-and.html




4.6) Probability & Bayesian Inference




4.6.1) Gmail Suggested Recipients




4.6.1) Reproducing it with Pig...




4.6.2) Step 1: COUNT(From -> To)




4.6.2) Step 2: COUNT(From, To, Cc)/Total




 P(cc | to) = Probability of cc’ing someone, given that you’ve to’d someone
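A toy Python version of those two counting steps (hand-made records; the real thing runs the same counts in Pig over all emails):

from collections import Counter

emails = [
    {"tos": ["connie@enron.com"], "ccs": ["kate@enron.com"]},
    {"tos": ["connie@enron.com"], "ccs": []},
]

to_counts    = Counter()  # Step 1: COUNT(to)
to_cc_counts = Counter()  # Step 2: COUNT(to, cc)
for email in emails:
    for to in email["tos"]:
        to_counts[to] += 1
        for cc in email["ccs"]:
            to_cc_counts[(to, cc)] += 1

def p_cc_given_to(cc, to):
    return to_cc_counts[(to, cc)] / float(to_counts[to])

print(p_cc_given_to("kate@enron.com", "connie@enron.com"))  # 0.5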




4.6.3) Wait - Stop Here! It works!




                               They match...

4.7) Add predictions to reports




5) Enable new actions




Why doesn’t Kate reply to my emails?
• What time is best to catch her?

• Are they too long?

• Are they meant to be replied to (contain original content)?

• Are they nice? (sentiment analysis)

• Do I reply to her emails (reciprocity)?

• Do I cc the wrong people (my mom)?

Example: LinkedIn InMaps

  Shared at http://inmaps.linkedinlabs.com/share/Russell_Jurney/316288748096695765986412570341480077402




               <------ personalization drives engagement

Example: Packetpig and PacketLoop
snort_alerts = LOAD '$pcap'
  USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig');
countries = FOREACH snort_alerts
  GENERATE
    com.packetloop.packetpig.udf.geoip.Country(src) as country,
    priority;

countries = GROUP countries BY country;
countries = FOREACH countries
  GENERATE
    group,
    AVG(countries.priority) as average_severity;

STORE countries into 'output/choropleth_countries' using PigStorage(',');




                                                             Code here.
Example: Packetpig and PacketLoop




Thank You!
Questions & Answers

Slides: http://slidesha.re/T943VU

Follow: @hortonworks and @rjurney
Read: hortonworks.com/blog





 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 

Kürzlich hochgeladen (20)

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 

Agile Analytics Apps on HDP - Interactive Data Exploration

  • 22. Data Science Team • 3-4 team members with broad, diverse skill-sets that overlap • Transactional overhead dominates at 5+ people • Expert researchers: lend 25-50% of their time to teams • Creative workers. Run like a studio, not an assembly line • Total freedom... with goals and deliverables. • Work environment matters most © Hortonworks Inc. 2012 24
  • 23. How to get insight into product? • Back-end has gotten t-h-i-c-k-e-r • Generating $$$ insight can take 10-100x app dev • Timeline disjoint: analytics vs agile app-dev/design • How do you ship insights efficiently? • How do you collaborate on research vs developer timeline? © Hortonworks Inc. 2012 25
  • 24. The Wrong Way - Part One “We made a great design. Your job is to predict the future for it.” © Hortonworks Inc. 2012 26
  • 25. The Wrong Way - Part Two “What’s taking you so long to reliably predict the future?” © Hortonworks Inc. 2012 27
  • 26. The Wrong Way - Part Three “The users don’t understand what 86% true means.” © Hortonworks Inc. 2012 28
  • 27. The Wrong Way - Part Four GHJIAEHGIEhjagigehganbanbigaebjnain!!!!!RJ(@J?!! © Hortonworks Inc. 2012 29
  • 28. The Wrong Way - Inevitable Conclusion (image: a plane meets a mountain) © Hortonworks Inc. 2012 30
  • 29. Reminds me of... the waterfall model © Hortonworks Inc. 2012 :( 31
  • 30. Chief Problem You can’t design insight in analytics applications. You discover it. You discover by exploring. © Hortonworks Inc. 2012 32
  • 31. -> Strategy So make an app for exploring your data. Iterate and publish intermediate results. Which becomes a palette for what you ship. © Hortonworks Inc. 2012 33
  • 32. Data Design • Not the 1st query that = insight, it’s the 15th, or the 150th • Capturing “Ah ha!” moments • Slow to do those in batch... • Faster, better context in an interactive web application. • Pre-designed charts wind up terrible. So bad. • Easy to invest man-years in the wrong statistical models • Semantics of presenting predictions are complex, delicate • Opportunity lies at the intersection of data & design © Hortonworks Inc. 2012 34
  • 33. How do we get back to Agile? © Hortonworks Inc. 2012 35
  • 34. Statement of Principles (then tricks, with code) © Hortonworks Inc. 2012 36
  • 35. Setup an environment where... • Insights repeatedly produced • Iterative work shared with entire team • Interactive from day 0 • Data model is consistent end-to-end • Minimal impedance between layers • Scope and depth of insights grow • Insights form the palette for what you ship • Until the application pays for itself and more © Hortonworks Inc. 2012 37
  • 36. Value document > relation Most data is dirty. Most data is semi-structured or un-structured. Rejoice! © Hortonworks Inc. 2012 38
  • 37. Value document > relation Note: Hive/ArrayQL/NewSQL’s support of document/array types blurs this distinction. © Hortonworks Inc. 2012 39
  • 38. Relational Data = Legacy Format • Why JOIN? Storage is fundamentally cheap! • Duplicate that JOIN data in one big record type! • ETL once to document format on import, NOT every job • Not zero JOINs, but far fewer JOINs • Semi-structured documents preserve data’s actual structure • Column compressed document formats beat JOINs! (paper coming) © Hortonworks Inc. 2012 40
  • 39. Value imperative > declarative • We don’t know what we want to SELECT. • Data is dirty - check each step, clean iteratively. • 85% of a data scientist’s time is spent munging. See: ETL. • Imperative is optimized for our process. • Process = iterative, snowballing insight • Efficiency matters, self optimize © Hortonworks Inc. 2012 41
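To make “check each step, clean iteratively” concrete, here is a minimal Pig sketch of that loop; the path, field names, and null checks are illustrative assumptions, not from the deck:

raw = load '/enron/emails.avro' using AvroStorage();

/* Step 1: drop records missing the field we need */
has_from = filter raw by from is not null;

/* Inspect before building further -- in grunt: illustrate has_from */

/* Step 2: one more small, checkable transformation */
cleaned = foreach has_from generate message_id, LOWER(from.address) as from_address, body;

Each step earns its place only after you have looked at its output.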
  • 40. Value dataflow > SELECT © Hortonworks Inc. 2012 42
  • 41. Ex. dataflow: ETL + email sent count (I can’t read this either. Get a big version here.) © Hortonworks Inc. 2012 43
  • 42. Value Pig > Hive (for app-dev) • Pigs eat ANYTHING • Pig is optimized for refining data, as opposed to consuming it • Pig is imperative, iterative • Pig is dataflows, and SQLish (but not SQL) • Code modularization/re-use: Pig Macros • ILLUSTRATE speeds dev time (even UDFs) • Easy UDFs in Java, JRuby, Jython, Javascript • Pig Streaming = use any tool, period. • Easily prepare our data as it will appear in our app. • If you prefer Hive, use Hive. But actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive... See: HCatalog for Pig/Hive integration, and this post. © Hortonworks Inc. 2012 44
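As a taste of “Pig Streaming = use any tool”, here is a hedged sketch of streaming records through an external script; clean_body.py is a hypothetical script of your own, not part of the deck’s code:

/* Ship a script to the cluster and declare it as a streaming command */
define clean_body `clean_body.py` ship('clean_body.py');

raw = load '/enron/emails.tsv' as (message_id:chararray, body:chararray);

/* Every record flows through the script's stdin/stdout */
cleaned = stream raw through clean_body as (message_id:chararray, body:chararray);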
  • 43. Localhost vs Petabyte scale: same tools • Simplicity is essential to scalability: use the highest-level tools we can • Prepare a good sample - tricky with joins, easy with documents • Local mode: pig -l /tmp -x local -v -w • Frequent use of ILLUSTRATE • 1st: Iterate, debug & publish locally • 2nd: Run on cluster, publish to team/customer • Consider skipping Object-Relational-Mapping (ORM) • We do not trust ‘databases,’ only HDFS @ n=3. • Everything we serve in our app is re-creatable via Hadoop. © Hortonworks Inc. 2012 45
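A minimal sketch of that sampling workflow, assuming the Enron Avro documents from later slides (paths are illustrative). Because each document is self-contained, a random sample stays internally consistent, with no JOIN partners to chase down:

/* Build a ~1% sample once, on the cluster... */
emails = load '/enron/emails.avro' using AvroStorage();
sampled = sample emails 0.01;
store sampled into '/tmp/emails_sample.avro' using AvroStorage();

/* ...then iterate against it in local mode: pig -l /tmp -x local -v -w */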
  • 44. Data-Value Pyramid Climb it. Do not skip steps. See here. © Hortonworks Inc. 2012 46
  • 45. 0/1) Display atomic records on the web © Hortonworks Inc. 2012 47
  • 46. 0.0) Document-serialize events • Protobuf • Thrift • JSON • Avro - I use Avro because the schema is onboard. © Hortonworks Inc. 2012 48
  • 47. 0.1) Documents via Relation ETL

enron_messages = load '/enron/enron_messages.tsv' as (
  message_id:chararray,
  sql_date:chararray,
  from_address:chararray,
  from_name:chararray,
  subject:chararray,
  body:chararray
);

enron_recipients = load '/enron/enron_recipients.tsv' as (
  message_id:chararray, reciptype:chararray, address:chararray, name:chararray);

split enron_recipients into
  tos IF reciptype=='to', ccs IF reciptype=='cc', bccs IF reciptype=='bcc';

headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10;
with_headers = join headers by group, enron_messages by message_id parallel 10;

emails = foreach with_headers generate
  enron_messages::message_id as message_id,
  CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date,
  TOTUPLE(enron_messages::from_address, enron_messages::from_name)
    as from:tuple(address:chararray, name:chararray),
  enron_messages::subject as subject,
  enron_messages::body as body,
  headers::tos.(address, name) as tos,
  headers::ccs.(address, name) as ccs,
  headers::bccs.(address, name) as bccs;

store emails into '/enron/emails.avro' using AvroStorage();

Example here. © Hortonworks Inc. 2012 49
  • 48. 0.2) Serialize events from streams

class GmailSlurper(object):
  ...
  def init_imap(self, username, password):
    self.username = username
    self.password = password
    try:
      imap.shutdown()
    except:
      pass
    self.imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)
    self.imap.login(username, password)
    self.imap.is_readonly = True
  ...
  def write(self, record):
    self.avro_writer.append(record)
  ...
  def slurp(self):
    if(self.imap and self.imap_folder):
      for email_id in self.id_list:
        (status, email_hash, charset) = self.fetch_email(email_id)
        if(status == 'OK' and charset and 'thread_id' in email_hash and 'froms' in email_hash):
          print email_id, charset, email_hash['thread_id']
          self.write(email_hash)

Scrape your own gmail in Python and Ruby. © Hortonworks Inc. 2012 50
  • 49. 0.3) ETL Logs

log_data = LOAD 'access_log'
  USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader
  AS (remoteAddr, remoteLogname, user, time, method, uri, proto, bytes);

© Hortonworks Inc. 2012 51
  • 50. 1) Plumb atomic events -> browser (Example stack that enables high productivity) © Hortonworks Inc. 2012 52
  • 51. Lots of Stack Options with Examples • Pig with Voldemort, Ruby, Sinatra: example • Pig with ElasticSearch: example • Pig with MongoDB, Node.js: example • Pig with Cassandra, Python Streaming, Flask: example • Pig with HBase, JRuby, Sinatra: example • Pig with Hive via HCatalog: example (trivial on HDP) • Up next: Accumulo, Redis, MySQL, etc. © Hortonworks Inc. 2012 53
  • 52. 1.1) cat our Avro serialized events

me$ cat_avro ~/Data/enron.avro
{
  u'bccs': [],
  u'body': u'scamming people, blah blah',
  u'ccs': […],
  u'date': u'…28T01:50:00.000Z',
  u'from': {u'address': u'bob.dobbs@enron.com', u'name': u'J.R. Bob Dobbs'},
  u'message_id': u'<1731.10095812390082.JavaMail.evans@thyme>',
  u'subject': u'Re: Enron trade for frop futures',
  u'tos': [ {u'address': u'connie@enron.com', u'name': None} ]
}

Get cat_avro in python, ruby © Hortonworks Inc. 2012 54
  • 53. 1.2) Load our events in Pig

me$ pig -l /tmp -x local -v -w
grunt> enron_emails = LOAD '/enron/emails.avro' USING AvroStorage();
grunt> describe enron_emails

emails: {
  message_id: chararray,
  datetime: chararray,
  from: tuple(address:chararray, name:chararray),
  subject: chararray,
  body: chararray,
  tos: {to: (address:chararray, name:chararray)},
  ccs: {cc: (address:chararray, name:chararray)},
  bccs: {bcc: (address:chararray, name:chararray)}
}

© Hortonworks Inc. 2012 55
  • 54. 1.3) ILLUSTRATE our events in Pig

grunt> illustrate enron_emails

emails:
| message_id: chararray | <1731.10095812390082.JavaMail.evans@thyme> |
| datetime: chararray | 2001-01-09T06:38:00.000Z |
| from: tuple(address:chararray,name:chararray) | (bob.dobbs@enron.com, J.R. Bob Dobbs) |
| subject: chararray | Re: Enron trade for frop futures |
| body: chararray | scamming people, blah blah |
| tos: bag{to:tuple(address:chararray,name:chararray)} | {(connie@enron.com,)} |
| ccs: bag{cc:tuple(address:chararray,name:chararray)} | {} |
| bccs: bag{bcc:tuple(address:chararray,name:chararray)} | {} |

Upgrade to Pig 0.10+ © Hortonworks Inc. 2012 56
  • 55. 1.4) Publish our events to a ‘database’

From Avro to MongoDB in one command:

pig -l /tmp -x local -v -w -param avros=enron.avro -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig

Which does this:

/* MongoDB libraries and configuration */
register /me/mongo-hadoop/mongo-2.7.3.jar
register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar

/* Set speculative execution off to avoid chance of duplicate records in MongoDB */
set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false

define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */

/* By default, lets have 5 reducers */
set default_parallel 5

avros = load '$avros' using AvroStorage();
store avros into '$mongourl' using MongoStorage();

Full instructions here. © Hortonworks Inc. 2012 57
  • 56. 1.5) Check events in our ‘database’

$ mongo enron
MongoDB shell version: 2.0.2
connecting to: enron
> show collections
emails
> db.emails.findOne({message_id: "<1731.10095812390082.JavaMail.evans@thyme>"})
{
  "_id" : ObjectId("502b4ae703643a6a49c8d180"),
  "message_id" : "<1731.10095812390082.JavaMail.evans@thyme>",
  "date" : "2001-01-09T06:38:00.000Z",
  "from" : { "address" : "bob.dobbs@enron.com", "name" : "J.R. Bob Dobbs" },
  "subject" : "Re: Enron trade for frop futures",
  "body" : "Scamming more people...",
  "tos" : [ { "address" : "connie@enron", "name" : null } ],
  "ccs" : [ ],
  "bccs" : [ ]
}

© Hortonworks Inc. 2012 58
  • 57. 1.6) Publish events on the web

require 'rubygems'
require 'sinatra'
require 'mongo'
require 'json'

connection = Mongo::Connection.new
database = connection['agile_data']
collection = database['emails']

get '/email/:message_id' do |message_id|
  data = collection.find_one({:message_id => message_id})
  JSON.generate(data)
end

© Hortonworks Inc. 2012 59
  • 58. 1.6) Publish events on the web © Hortonworks Inc. 2012 60
  • 59. What’s the point? • A designer can work against real data. • An application developer can work against real data. • A product manager can think in terms of real data. • Entire team is grounded in reality! • You’ll see how ugly your data really is. • You’ll see how much work you have yet to do. • Ship early and often! • Feels agile, don’t it? Keep it up! © Hortonworks Inc. 2012 61
  • 60. 1.7) Wrap events with Bootstrap

<link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">
</head>
<body>
  <div class="container" style="margin-top: 100px;">
    <table class="table table-striped table-bordered table-condensed">
      <thead>
        {% for key in data['keys'] %}
          <th>{{ key }}</th>
        {% endfor %}
      </thead>
      <tbody>
        <tr>
          {% for value in data['values'] %}
            <td>{{ value }}</td>
          {% endfor %}
        </tr>
      </tbody>
    </table>
  </div>
</body>

Complete example here with code here. © Hortonworks Inc. 2012 62
  • 61. 1.7) Wrap events with Bootstrap © Hortonworks Inc. 2012 63
  • 62. Refine. Add links between documents. Not the Mona Lisa, but coming along... See: here © Hortonworks Inc. 2012 64
  • 63. 1.8) List links to sorted events

Use Pig, serve/cache a bag/array of email documents:

pig -l /tmp -x local -v -w

emails_per_user = foreach (group emails by from.address) {
  sorted = order emails by date;
  last_1000 = limit sorted 1000;
  generate group as from_address, last_1000 as emails;
};

store emails_per_user into '$mongourl' using MongoStorage();

Use your ‘database’, if it can sort.

$ mongo enron
> db.emails.ensureIndex({message_id: 1})
> db.emails.find().sort({date:-1}).limit(10).pretty()
{
  "_id" : ObjectId("4f7a5da2414e4dd0645d1176"),
  "message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>",
  "from" : [
...

© Hortonworks Inc. 2012 66
  • 64. 1.8) List links to sorted documents © Hortonworks Inc. 2012 67
  • 65. 1.9) Make it searchable... If you have a list, search is easy with ElasticSearch and Wonderdog...

/* Load ElasticSearch integration */
register '/me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar';
register '/me/elasticsearch-0.18.6/lib/*';
define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();

emails = load '/me/tmp/emails' using AvroStorage();
store emails into 'es://email/email?json=false&size=1000' using ElasticSearch(
  '/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins');

Test it with curl:

curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1'

ElasticSearch has no security features. Take note. Isolate.

© Hortonworks Inc. 2012 68
  • 66. From now on we speed up... Don’t worry, its in the book and on the blog. http://hortonworks.com/blog/ © Hortonworks Inc. 2012 69
  • 67. 2) Create Simple Charts © Hortonworks Inc. 2012 70
  • 68. 2) Create Simple Tables and Charts © Hortonworks Inc. 2012 71
  • 69. 2) Create Simple Charts • Start with an HTML table on general principle. • Then use nvd3.js - reusable charts for d3.js • Aggregating by properties & displaying them is the first step in entity resolution • Start extracting entities. Ex: people, places, topics, time series • Group documents by entities, rank and count. • Publish top N, time series, etc. • Fill a page with charts. • Add a chart to your event page. © Hortonworks Inc. 2012 72
  • 70. 2.1) Top N (of anything) in Pig

pig -l /tmp -x local -v -w

top_things = foreach (group things by key) {
  sorted = order things by arbitrary_rank desc;
  top_10_things = limit sorted 10;
  generate group as key, top_10_things as top_10_things;
};

store top_things into '$mongourl' using MongoStorage();

Remember, this is the same structure the browser gets as json. This would make a good Pig Macro (see the sketch below). © Hortonworks Inc. 2012 73
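A sketch of what that macro might look like; the macro name, file, and parameters are hypothetical, not from the deck (Pig macros substitute parameters textually, so field names can be passed in):

/* top_n.macro -- hypothetical macro file */
DEFINE top_n(things, key_field, rank_field, n) RETURNS result {
  $result = foreach (group $things by $key_field) {
    sorted = order $things by $rank_field desc;
    top = limit sorted $n;
    generate group as key, top as top_things;
  };
};

/* Usage: */
import 'top_n.macro';
top_things = top_n(things, key, arbitrary_rank, 10);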
  • 71. 2.2) Time Series (of anything) in Pig

pig -l /tmp -x local -v -w

/* Group by our key and date rounded to the month, get a total */
things_by_month = foreach (group things by (key, ISOToMonth(datetime)))
  generate flatten(group) as (key, month), COUNT_STAR(things) as total;

/* Sort our totals per key by month to get a time series */
things_timeseries = foreach (group things_by_month by key) {
  timeseries = order things_by_month by month;
  generate group as key, timeseries as timeseries;
};

store things_timeseries into '$mongourl' using MongoStorage();

Yet another good Pig Macro. © Hortonworks Inc. 2012 74
  • 72. Data processing in our stack A new feature in our application might begin at any layer... great! omghi2u! I’m creative! I’m creative too! where r my legs? I know Pig! I <3 Javascript! send halp Any team member can add new features, no problemo! © Hortonworks Inc. 2012 75
  • 73. Data processing in our stack ... but we shift the data-processing towards batch, as we are able. See real example here. Ex: Overall total emails calculated in each layer © Hortonworks Inc. 2012 76
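For the overall-total example, the batch end of that spectrum is tiny in Pig; a hedged sketch, reusing the deck’s $mongourl convention and the Enron path as assumptions:

emails = load '/enron/emails.avro' using AvroStorage();

/* group ... all puts every record in one bag, so one COUNT gives the overall total */
total = foreach (group emails all) generate COUNT_STAR(emails) as total_emails;

store total into '$mongourl' using MongoStorage();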
  • 74. 3) Exploring with Reports © Hortonworks Inc. 2012 77
  • 75. 3) Exploring with Reports © Hortonworks Inc. 2012 78
  • 76. 3.0) From charts to reports... • Extract entities from the properties we aggregated by in charts (Step 2) • Each entity type gets its own kind of web page • Each unique entity gets its own web page • Link to entities as they appear in atomic event documents (Step 1) • Link the most related entities together, within and between types. • More visualizations! • Parameterize results via forms. © Hortonworks Inc. 2012 79
  • 77. 3.1) Looks like this... © Hortonworks Inc. 2012 80
  • 78. 3.2) Cultivate common keyspaces © Hortonworks Inc. 2012 81
  • 79. 3.3) Get people clicking. Learn. • Explore this web of generated pages, charts and links! • Everyone on the team gets to know your data. • Keep trying out different charts, metrics, entities, links. • See what’s interesting. • Figure out what data needs cleaning and clean it. • Start thinking about predictions & recommendations. ‘People’ could be just your team, if data is sensitive. © Hortonworks Inc. 2012 82
  • 80. 4) Predictions and Recommendations © Hortonworks Inc. 2012 83
  • 81. 4.0) Preparation • We’ve already extracted entities, their properties and relationships • Our charts show where our signal is rich • We’ve cleaned our data to make it presentable • The entire team has an intuitive understanding of the data • They got that understanding by exploring the data • We are all on the same page! © Hortonworks Inc. 2012 84
  • 82. 4.2) Think in different perspectives • Networks • Time Series / Distributions • Natural Language Processing • Conditional Probabilities / Bayesian Inference • Check out Chapter 2 of the book... see here. © Hortonworks Inc. 2012 85
  • 83. 4.3) Networks © Hortonworks Inc. 2012 86
  • 84. 4.3.1) Weighted Email Networks in Pig

DEFINE header_pairs(email, col1, col2) RETURNS pairs {
  filtered = FILTER $email BY ($col1 IS NOT NULL) AND ($col2 IS NOT NULL);
  flat = FOREACH filtered GENERATE FLATTEN($col1) AS $col1, FLATTEN($col2) AS $col2;
  $pairs = FOREACH flat GENERATE LOWER($col1) AS ego1, LOWER($col2) AS ego2;
};

/* Get email address pairs for each type of connection, and union them together */
emails = LOAD '/me/Data/enron.avro' USING AvroStorage();
from_to = header_pairs(emails, from, to);
from_cc = header_pairs(emails, from, cc);
from_bcc = header_pairs(emails, from, bcc);
pairs = UNION from_to, from_cc, from_bcc;

/* Get a count of emails over these edges. */
pair_groups = GROUP pairs BY (ego1, ego2);
sent_counts = FOREACH pair_groups GENERATE FLATTEN(group) AS (ego1, ego2), COUNT_STAR(pairs) AS total;

© Hortonworks Inc. 2012 87
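Getting those weighted edges into Gephi (next slides) can be as simple as a CSV dump; the output path is an assumption, and Gephi’s spreadsheet importer expects Source, Target, Weight columns:

/* Dump the weighted edge list for Gephi's "Import Spreadsheet" */
store sent_counts into '/enron/email_edges' using PigStorage(',');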
  • 85. 4.3.2) Networks Viz with Gephi © Hortonworks Inc. 2012 88
  • 86. 4.3.3) Gephi = Easy © Hortonworks Inc. 2012 89
  • 87. 4.3.4) Social Network Analysis © Hortonworks Inc. 2012 90
  • 88. 4.4) Time Series & Distributions

pig -l /tmp -x local -v -w

/* Count things per day */
things_per_day = foreach (group things by (key, ISOToDay(datetime)))
  generate flatten(group) as (key, day), COUNT_STAR(things) as total;

/* Sort our totals per key by day to get a sorted time series */
things_timeseries = foreach (group things_per_day by key) {
  timeseries = order things_per_day by day;
  generate group as key, timeseries as timeseries;
};

store things_timeseries into '$mongourl' using MongoStorage();

© Hortonworks Inc. 2012 91
  • 89. 4.4.1) Smooth Sparse Data © Hortonworks Inc. 2012 See here. 92
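The deck’s smoothing code lives in the slide image; one simple stand-in, sketched here, is to aggregate sparse daily counts into coarser buckets (weeks), assuming the piggybank datetime UDFs are registered:

define ISOToWeek org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToWeek();

/* Fewer, fuller buckets smooth a sparse series */
things_by_week = foreach (group things by (key, ISOToWeek(datetime)))
  generate flatten(group) as (key, week), COUNT_STAR(things) as total;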
  • 90. 4.4.2) Regress to find Trends JRuby Linear Regression UDF Pig to use the UDF Trend Line in your Application © Hortonworks Inc. 2012 93
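The UDF itself is in the slide image, so here is only a hedged sketch of the wiring: regress.rb and linear_regress are hypothetical names for a JRuby UDF that fits a line to a bag of (day, total) points:

/* Pig 0.10+ can register JRuby files directly */
register 'regress.rb' using jruby as regress;

trends = foreach (group things_per_day by key) {
  sorted = order things_per_day by day;
  generate group as key,
    flatten(regress.linear_regress(sorted.(day, total))) as (slope:double, intercept:double);
};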
  • 91. 4.5.1) Natural Language Processing

import 'tfidf.macro';
my_tf_idf_scores = tf_idf(id_body, 'message_id', 'body');

/* Get the top 10 Tf*Idf scores per message */
per_message_cassandra = foreach (group my_tf_idf_scores by message_id) {
  sorted = order my_tf_idf_scores by value desc;
  top_10_topics = limit sorted 10;
  generate group, top_10_topics.(score, value);
};

Example with code here and macro here. © Hortonworks Inc. 2012 94
  • 92. 4.5.2) NLP: Extract Topics! © Hortonworks Inc. 2012 95
  • 93. 4.5.3) NLP for All: Extract Topics! • TF-IDF in Pig - 2 lines of code with Pig Macros: http://hortonworks.com/blog/pig-macro-for-tf-idf-makes-topic-summarization-2-lines-of-pig/ • LDA with Pig and the Lucene Tokenizer: http://thedatachef.blogspot.be/2012/03/topic-discovery-with-apache-pig-and.html © Hortonworks Inc. 2012 96
  • 94. 4.6) Probability & Bayesian Inference © Hortonworks Inc. 2012 97
  • 95. 4.6.1) Gmail Suggested Recipients © Hortonworks Inc. 2012 98
  • 96. 4.6.1) Reproducing it with Pig... © Hortonworks Inc. 2012 99
  • 97. 4.6.2) Step 1: COUNT(From -> To) © Hortonworks Inc. 2012 100
  • 98. 4.6.2) Step 2: COUNT(From, To, Cc)/Total P(cc | to) = Probability of cc’ing someone, given that you’ve to’d someone © Hortonworks Inc. 2012 101
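The counting isn’t spelled out in the transcript, so here is a hedged Pig sketch of both steps against the emails relation; field names follow the earlier schema, everything else is illustrative:

/* Pair each To with each Cc on the same message (two FLATTENs = cross product) */
to_cc = foreach emails generate flatten(tos.address) as to_addr, flatten(ccs.address) as cc_addr;

/* COUNT(To, Cc): how often each pair appears together */
pair_counts = foreach (group to_cc by (to_addr, cc_addr))
  generate flatten(group) as (to_addr, cc_addr), COUNT_STAR(to_cc) as together;

/* COUNT(To): how often each address is To'd at all */
tos_flat = foreach emails generate flatten(tos.address) as to_addr;
to_totals = foreach (group tos_flat by to_addr)
  generate group as to_addr, COUNT_STAR(tos_flat) as total;

/* P(cc | to) = pair count / To total */
with_totals = join pair_counts by to_addr, to_totals by to_addr;
p_cc_given_to = foreach with_totals generate pair_counts::to_addr as to_addr, cc_addr,
  (double)together / (double)total as p_cc_given_to;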
  • 99. 4.6.3) Wait - Stop Here! It works! They match... © Hortonworks Inc. 2012 102
  • 100. 4.4) Add predictions to reports © Hortonworks Inc. 2012 103
  • 101. 5) Enable new actions © Hortonworks Inc. 2012 104
  • 102. Why doesn’t Kate reply to my emails? • What time is best to catch her? • Are they too long? • Are they meant to be replied to (contain original content)? • Are they nice? (sentiment analysis) • Do I reply to her emails (reciprocity)? • Do I cc the wrong people (my mom)? © Hortonworks Inc. 2012 105
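Reciprocity, at least, is cheap to measure; a hedged sketch against the same emails relation (note the inner join, so pairs who never reply at all drop out and deserve separate handling):

/* Directed send counts between address pairs */
sends = foreach emails generate from.address as sender, flatten(tos.address) as recipient;
send_counts = foreach (group sends by (sender, recipient))
  generate flatten(group) as (sender, recipient), COUNT_STAR(sends) as sent;

/* Flip the edges and join to compare both directions */
flipped = foreach send_counts generate recipient as sender, sender as recipient, sent as got_back;
reciprocity = join send_counts by (sender, recipient), flipped by (sender, recipient);
ratios = foreach reciprocity generate
  send_counts::sender as sender, send_counts::recipient as recipient,
  (double)flipped::got_back / (double)send_counts::sent as reply_ratio;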
  • 103. Example: LinkedIn InMaps Shared at http://inmaps.linkedinlabs.com/share/Russell_Jurney/316288748096695765986412570341480077402 <------ personalization drives engagement © Hortonworks Inc. 2012 106
  • 104. Example: Packetpig and PacketLoop

snort_alerts = LOAD '$pcap'
  USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig');

countries = FOREACH snort_alerts
  GENERATE com.packetloop.packetpig.udf.geoip.Country(src) as country, priority;

countries = GROUP countries BY country;

countries = FOREACH countries
  GENERATE group, AVG(countries.priority) as average_severity;

STORE countries into 'output/choropleth_countries' using PigStorage(',');

Code here. © Hortonworks Inc. 2012 107
  • 105. Example: Packetpig and PacketLoop © Hortonworks Inc. 2012 108
  • 106. Thank You! Questions & Answers Slides: http://slidesha.re/T943VU Follow: @hortonworks and @rjurney Read: hortonworks.com/blog © Hortonworks Inc. 2012 109