Agile Analytics Applications
Russell Jurney (@rjurney) - Hadoop Evangelist @Hortonworks

Formerly Viz, Data Science at Ning, LinkedIn

HBase Dashboards, Career Explorer, InMaps




© Hortonworks Inc. 2012                                                      1
Agile Data - The Book (March, 2013)

                              Read it now on OFPS



                                 A philosophy,
                                 not the only way



                             But still, it's good! Really!
   © Hortonworks Inc. 2012                             2
Situation in Brief




    © Hortonworks Inc. 2012   3
Agile Application Development
• LAMP stack mature
• Post-Rails frameworks to choose from
• We enable rapid feedback and agility




                                   + NoSQL



      © Hortonworks Inc. 2012                4
Data Warehousing




   © Hortonworks Inc. 2012   5
Scientific Computing / HPC
  • ‘Smart kid’ only: MPI, Globus, etc.
  • Math heavy




Tubes and Mercury (old school)      Cores and Spindles (new school)

          © Hortonworks Inc. 2012                                     6
Data Science?

[Pie chart: Data Science = roughly 1/3 Application Development, 1/3 Data Warehousing, 1/3 Scientific Computing / HPC]
     © Hortonworks Inc. 2012                                                  7
Data Center as Computer
    • Warehouse Scale Computers and applications




“A key challenge for architects of WSCs is to smooth out these discrepancies in a cost efficient manner.”
Click here.

              © Hortonworks Inc. 2012                                                                 8
Hadoop to the Rescue!
             Big data refinery / Modernize ETL
[Diagram: new data sources - audio/video/images, docs/text/XML, web logs/clicks, social graph/feeds, sensors/devices/RFID, spatial/GPS, events - flow into HDFS and the Big Data Refinery (Apache Hadoop). The refinery feeds Business Transactions & Interactions (SQL, NoSQL, NewSQL behind web, mobile, CRM, ERP, SCM) and, via ETL, the EDW/MPP systems behind Business Intelligence & Analytics (dashboards, reports, visualization).]



   © Hortonworks Inc. 2012                                    I stole this slide from Eric.          9
Hadoop to the Rescue!
• Even a department can afford a Hadoop cluster, let alone an entire org
• Dump all your data in one place: HDFS
• JOIN like crazy!

• ETL like whoa!
• An army of mappers and reducers at your command
• Now what?




      © Hortonworks Inc. 2012                              10
Analytics Apps: It takes a Team
• Broad skill-set to make useful apps
• Basically nobody has them all
• Application development is inherently collaborative




      © Hortonworks Inc. 2012                           11
How to get insight into product?
• Back-end has gotten t-h-i-c-k-e-r
• Generating $$$ insight can take 10-100x app dev
• Timeline disjoint: analytics vs agile app-dev/design

• How do you ship insights efficiently?
• How do you collaborate on research vs developer timeline?




        © Hortonworks Inc. 2012                               12
The Wrong Way - Part One




“We made a great design. Your job is to predict the future for it.”




        © Hortonworks Inc. 2012                                 13
The Wrong Way - Part Two




“What's taking you so long to reliably predict the future?”




     © Hortonworks Inc. 2012                                 14
The Wrong Way - Part Three




  “The users don’t understand what 86% true means.”




    © Hortonworks Inc. 2012                           15
The Wrong Way - Part Four




 GHJIAEHGIEhjagigehganbanbigaebjnain!!!!!RJ(@J?!!




   © Hortonworks Inc. 2012                          16
The Wrong Way - Inevitable Conclusion




                             [Image: plane, meet mountain]


   © Hortonworks Inc. 2012                      17
Reminds me of...




   © Hortonworks Inc. 2012   18
Chief Problem


You can’t design insight in analytics applications.


                               You discover it.


                      You discover by exploring.


     © Hortonworks Inc. 2012                       19
-> Strategy


   So make an app for exploring your data.


  Iterate and publish intermediate results.


 Which becomes a palette for what you ship.


    © Hortonworks Inc. 2012                   20
Data Design

• It's not the 1st query that = insight, it's the 15th, or the 150th
• Capturing “Ah ha!” moments
• Slow to do those in batch...

• Faster, better context in an interactive web application.
• Pre-designed charts wind up terrible. So bad.
• Easy to invest man-years in the wrong statistical models
• Semantics of presenting predictions are complex, delicate

• Opportunity lies at intersection of data & design


      © Hortonworks Inc. 2012                                    21
How do we get back to Agile?




   © Hortonworks Inc. 2012     22
Statement of Principles




                              (then tricks, with code)




    © Hortonworks Inc. 2012                              23
Set up an environment where...
• Insights repeatedly produced
• Iterative work shared with entire team
• Interactive from day 0

• Data model is consistent end-to-end
• Minimal impedance between layers
• Scope and depth of insights grow
• Insights form the palette for what you ship

• Until the application pays for itself and more


      © Hortonworks Inc. 2012                      24
Value document > relation




Most data is dirty. Most data is semi-structured or unstructured. Rejoice!
        © Hortonworks Inc. 2012                                         25
Value document > relation




Note: Hive/ArrayQL/NewSQL's support of document/array types blurs this distinction.
         © Hortonworks Inc. 2012                                                26
Value imperative > declarative
• We don’t know what we want to SELECT.
• Data is dirty - check each step, clean iteratively (see the sketch below).
• 85% of a data scientist’s time is spent munging. See: ETL.

• Imperative style is optimized for our process.
• Process = iterative, snowballing insight
• Efficiency matters - optimize for your own time.
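
A minimal sketch of that check-each-step loop (mine, not the deck's - the path and field list are illustrative):

raw = load '/enron/enron_messages.tsv' as (message_id:chararray, from_address:chararray);
illustrate raw;      -- eyeball a sample before trusting it
cleaned = filter raw by from_address matches '.+@.+';  -- drop malformed addresses
illustrate cleaned;  -- verify the cleaning step did what we think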




      © Hortonworks Inc. 2012                             27
Value dataflow > SELECT




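A toy contrast (my sketch, assuming the emails relation from the later steps): a dataflow names every intermediate step, so each alias can be inspected or reused - unlike one opaque SELECT.

filtered = filter emails by date is not null;
by_user  = group filtered by from.address;
sent     = foreach by_user generate group as address, COUNT_STAR(filtered) as total;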
   © Hortonworks Inc. 2012   28
Example dataflow: ETL + email sent count




     © Hortonworks Inc. 2012   (I can’t read this either. Get a big version here.)   29
Value Pig > Hive (for app-dev)
• Pigs eat ANYTHING
• Pig is optimized for refining data, as opposed to consuming it
• Pig is imperative, iterative
• Pig is dataflows, and SQLish (but not SQL)
• Code modularization/re-use: Pig Macros
• ILLUSTRATE speeds dev time (even UDFs)
• Easy UDFs in Java, JRuby, Jython, Javascript
• Pig Streaming = use any tool, period.
• Easily prepare our data as it will appear in our app.
• If you prefer Hive, use Hive.
But actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive...
                 See: HCatalog for Pig/Hive integration, and this post.

        © Hortonworks Inc. 2012                                                       30
Localhost vs Petabyte scale: same tools
• Simplicity is essential to scalability: use the highest-level tools we can
• Prepare a good sample - tricky with joins, easy with documents
  (see the sketch below)
• Local mode: pig -l /tmp -x local -v -w
• Frequent use of ILLUSTRATE
• 1st: Iterate, debug & publish locally
• 2nd: Run on cluster, publish to team/customer
• Consider skipping Object-Relational-Mapping (ORM)
• We do not trust ‘databases,’ only HDFS @ n=3.
• Everything we serve in our app is re-creatable via Hadoop.

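A minimal sampling sketch, assuming the emails.avro output from step 0.1 - sampling whole documents sidesteps the broken joins you get when sampling normalized tables independently:

emails  = load '/enron/emails.avro' using AvroStorage();
sampled = sample emails 0.01;  -- keep roughly 1% of documents for local iteration
store sampled into '/tmp/enron_sample.avro' using AvroStorage();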
      © Hortonworks Inc. 2012                                   31
Data-Value Pyramid




                Climb it. Do not skip steps. See here.

   © Hortonworks Inc. 2012                               32
0/1) Display atomic records on the web




    © Hortonworks Inc. 2012              33
0.0) Document-serialize events
• Protobuf
• Thrift
• JSON

• Avro - I use Avro because the schema is onboard.




       © Hortonworks Inc. 2012                       34
0.1) Documents via Relation ETL
enron_messages = load '/enron/enron_messages.tsv' as (
     message_id:chararray,
     sql_date:chararray,
     from_address:chararray,
     from_name:chararray,
     subject:chararray,
     body:chararray
);
 
enron_recipients = load '/enron/enron_recipients.tsv' as (
     message_id:chararray,
     reciptype:chararray,
     address:chararray,
     name:chararray
);
 
split enron_recipients into tos IF reciptype=='to', ccs IF reciptype=='cc', bccs IF reciptype=='bcc';
 
headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10;
with_headers = join headers by group, enron_messages by message_id parallel 10;
emails = foreach with_headers generate enron_messages::message_id as message_id,
                                      CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date,
                                      TOTUPLE(enron_messages::from_address, enron_messages::from_name) as from:tuple(address:chararray, name:chararray),
                                      enron_messages::subject as subject,
                                      enron_messages::body as body,
                                      headers::tos.(address, name) as tos,
                                      headers::ccs.(address, name) as ccs,
                                      headers::bccs.(address, name) as bccs;


store emails into '/enron/emails.avro' using AvroStorage();


                                                                                                                    Example here.
                      © Hortonworks Inc. 2012                                                                                                  35
0.2) Serialize events from streams
class GmailSlurper(object):
  ...
  def init_imap(self, username, password):
    self.username = username
    self.password = password
    try:
      self.imap.shutdown()
    except:
      pass
    self.imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)
    self.imap.login(username, password)
    self.imap.is_readonly = True
  ...
  def write(self, record):
    self.avro_writer.append(record)
  ...
  def slurp(self):
    if(self.imap and self.imap_folder):
      for email_id in self.id_list:
        (status, email_hash, charset) = self.fetch_email(email_id)
        if(status == 'OK' and charset and 'thread_id' in email_hash and 'froms' in email_hash):
          print email_id, charset, email_hash['thread_id']
          self.write(email_hash)




               © Hortonworks Inc. 2012   Scrape your own gmail in Python and Ruby.                36
0.3) ETL Logs



log_data = LOAD 'access_log'
  USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader
  AS (remoteAddr,
    remoteLogname,
    user,
    time,
    method,
    uri,
    proto,
    bytes);

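A hedged continuation (not on the slide - the path and field choice are illustrative): project the fields a page will need and store them, mirroring the email pipeline.

log_events = foreach log_data generate remoteAddr as ip, time, method, uri, bytes;
store log_events into '/logs/events.avro' using AvroStorage();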



      © Hortonworks Inc. 2012                                         37
1) Plumb atomic events -> browser




      (Example stack that enables high productivity)

   © Hortonworks Inc. 2012                             38
Lots of Stack Options with Examples
• Pig with Voldemort, Ruby, Sinatra: example
• Pig with ElasticSearch: example
• Pig with MongoDB, Node.js: example

• Pig with Cassandra, Python Streaming, Flask: example
• Pig with HBase, JRuby, Sinatra: example
• Pig with Hive via HCatalog: example (trivial on HDP)
• Up next: Accumulo, Redis, MySQL, etc.




      © Hortonworks Inc. 2012                            39
1.1) cat our Avro serialized events

me$ cat_avro ~/Data/enron.avro
{
    u'bccs': [],
    u'body': u'scamming people, blah blah',
    u'ccs': [],
    u'date': u'2000-08-28T01:50:00.000Z',
    u'from': {u'address': u'bob.dobbs@enron.com', u'name': None},
    u'message_id': u'<1731.10095812390082.JavaMail.evans@thyme>',
    u'subject': u'Re: Enron trade for frop futures',
    u'tos': [
      {u'address': u'connie@enron.com', u'name': None}
    ]
}



        © Hortonworks Inc. 2012   Get cat_avro in python, ruby   40
1.2) Load our events in Pig

me$ pig -l /tmp -x local -v -w
grunt> enron_emails = LOAD '/enron/emails.avro' USING AvroStorage();
grunt> describe enron_emails

emails: {
  message_id: chararray,
  datetime: chararray,
  from: tuple(address: chararray,name: chararray),
  subject: chararray,
  body: chararray,
  tos: {to: (address: chararray,name: chararray)},
  ccs: {cc: (address: chararray,name: chararray)},
  bccs: {bcc: (address: chararray,name: chararray)}
}

 



       © Hortonworks Inc. 2012                                    41
1.3) ILLUSTRATE our events in Pig
grunt> illustrate enron_emails
 



---------------------------------------------------------------------------
| emails |
| message_id:chararray |
| datetime:chararray |
| from:tuple(address:chararray,name:chararray) |
| subject:chararray |
| body:chararray |
| tos:bag{to:tuple(address:chararray,name:chararray)} |
| ccs:bag{cc:tuple(address:chararray,name:chararray)} |
| bccs:bag{bcc:tuple(address:chararray,name:chararray)} |
---------------------------------------------------------------------------
|        |
| <1731.10095812390082.JavaMail.evans@thyme> |
| 2001-01-09T06:38:00.000Z |
| (bob.dobbs@enron.com, J.R. Bob Dobbs) |
| Re: Enron trade for frop futures |
| scamming people, blah blah |
| {(connie@enron.com,)} |
| {} |
| {} |
                                                   Upgrade to Pig 0.10+
         © Hortonworks Inc. 2012                                              42
1.4) Publish our events to a ‘database’
From Avro to MongoDB in one command:
pig -l /tmp -x local -v -w -param avros=enron.avro \
   -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig


Which does this:
/* MongoDB libraries and configuration */
register /me/mongo-hadoop/mongo-2.7.3.jar
register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar

/* Set speculative execution off to avoid chance of duplicate records in Mongo */
set mapred.map.tasks.speculative.execution false
set mapred.reduce.tasks.speculative.execution false
define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */

/* By default, let's have 5 reducers */
set default_parallel 5

avros = load '$avros' using AvroStorage();
store avros into '$mongourl' using MongoStorage();


          © Hortonworks Inc. 2012                Full instructions here.     43
1.5) Check events in our ‘database’
$ mongo enron

MongoDB shell version: 2.0.2
connecting to: enron

> show collections
emails
system.indexes

> db.emails.findOne({message_id: "<1731.10095812390082.JavaMail.evans@thyme>"})
{
    "_id" : ObjectId("502b4ae703643a6a49c8d180"),
    "message_id" : "<1731.10095812390082.JavaMail.evans@thyme>",
    "date" : "2001-01-09T06:38:00.000Z",
    "from" : { "address" : "bob.dobbs@enron.com", "name" : "J.R. Bob Dobbs" },
    "subject" : "Re: Enron trade for frop futures",
    "body" : "Scamming more people...",
    "tos" : [ { "address" : "connie@enron.com", "name" : null } ],
    "ccs" : [ ],
    "bccs" : [ ]
}




          © Hortonworks Inc. 2012                                             44
1.6) Publish events on the web

require    'rubygems'
require    'sinatra'
require    'mongo'
require    'json'

connection = Mongo::Connection.new
database = connection['agile_data']
collection = database['emails']

get '/email/:message_id' do |message_id|
  data = collection.find_one({:message_id => message_id})
  JSON.generate(data)
end




          © Hortonworks Inc. 2012                     45
1.6) Publish events on the web




    © Hortonworks Inc. 2012      46
What's the point?
• A designer can work against real data.
• An application developer can work against real data.
• A product manager can think in terms of real data.

• Entire team is grounded in reality!
• You’ll see how ugly your data really is.
• You’ll see how much work you have yet to do.
• Ship early and often!

• Feels agile, don’t it? Keep it up!


      © Hortonworks Inc. 2012                            47
1.7) Wrap events with Bootstrap
<link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet">
</head>
<body>
<div class="container" style="margin-top: 100px;">
  <table class="table table-striped table-bordered table-condensed">
    <thead>
    {% for key in data['keys'] %}
         <th>{{ key }}</th>
    {% endfor %}
    </thead>
    <tbody>
         <tr>
         {% for value in data['values'] %}
           <td>{{ value }}</td>
         {% endfor %}
         </tr>
    </tbody>
  </table>
</div>
</body>                              Complete example here with code here.
           © Hortonworks Inc. 2012                                               48
1.7) Wrap events with Bootstrap




    © Hortonworks Inc. 2012       49
Refine. Add links between documents.




                         Not the Mona Lisa, but coming along... See: here
   © Hortonworks Inc. 2012                                                  50
The Mona Lisa. In pure CSS.




   © Hortonworks Inc. 2012    See: here   51
1.8) List links to sorted events
Use Pig, serve/cache a bag/array of email documents:
pig -l /tmp -x local -v -w


emails_per_user = foreach (group emails by from.address) {
      sorted = order emails by date;
      last_1000 = limit sorted 1000;
      generate group as from_address, last_1000 as emails;
      };


store emails_per_user into '$mongourl' using MongoStorage();

Use your ‘database’, if it can sort.
mongo enron
> db.emails.ensureIndex({message_id: 1})
> db.emails.find().sort({date: -1}).limit(10).pretty()
  {
      "_id" : ObjectId("4f7a5da2414e4dd0645d1176"),
      "message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>",
      "from" : [
  ...

               © Hortonworks Inc. 2012                                                        52
1.8) List links to sorted documents




    © Hortonworks Inc. 2012           53
1.9) Make it searchable...
If you have a list, search is easy with ElasticSearch and Wonderdog...

/* Load ElasticSearch integration */
register '/me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar';
register '/me/elasticsearch-0.18.6/lib/*';
define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage();


emails = load '/me/tmp/emails' using AvroStorage();
store emails into 'es://email/email?json=false&size=1000' using ElasticSearch(
    '/me/elasticsearch-0.18.6/config/elasticsearch.yml',
    '/me/elasticsearch-0.18.6/plugins');




Test it with curl:
 curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1'



ElasticSearch has no security features. Take note. Isolate.

          © Hortonworks Inc. 2012                                                     54
From now on we speed up...




         Don’t worry, it's in the book and on the blog.

                             http://hortonworks.com/blog/




   © Hortonworks Inc. 2012                                  55
2) Create Simple Charts




   © Hortonworks Inc. 2012   56
2) Create Simple Tables and Charts




   © Hortonworks Inc. 2012           57
2) Create Simple Charts
• Start with an HTML table on general principle.
• Then use nvd3.js - reusable charts for d3.js
• Aggregating by properties & displaying them is the first step in entity resolution
• Start extracting entities. Ex: people, places, topics, time series
• Group documents by entities, rank and count (see the sketch below).

• Publish top N, time series, etc.
• Fill a page with charts.
• Add a chart to your event page.
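
A minimal sketch of that group/rank/count step, assuming the emails relation from 1.2 - the kind of per-entity aggregate a simple chart consumes:

sent_counts = foreach (group emails by from.address)
  generate group as address, COUNT_STAR(emails) as total;
store sent_counts into '$mongourl' using MongoStorage();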
        © Hortonworks Inc. 2012                                  58
2.1) Top N (of anything) in Pig


pig -l /tmp -x local -v -w

top_things = foreach (group things by key) {
  sorted = order things by arbitrary_rank desc;
  top_10_things = limit sorted 10;
  generate group as key, top_10_things as top_10_things;
  };
store top_things into '$mongourl' using MongoStorage();


Remember, this is the same structure the browser gets as JSON.

                       This would make a good Pig Macro.
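
A sketch of that macro (names are mine; assumes Pig 0.9+ macro syntax):

DEFINE top_n_per_key(things, n) RETURNS top_things {
  $top_things = foreach (group $things by key) {
    sorted = order $things by arbitrary_rank desc;
    top = limit sorted $n;
    generate group as key, top as top_n_things;
  };
};

/* usage: import 'top_n.macro'; top_things = top_n_per_key(things, 10); */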

       © Hortonworks Inc. 2012                             59
2.2) Time Series (of anything) in Pig
pig -l /tmp -x local -v -w

/* Group by our key and date rounded to the month, get a total */
things_by_month = foreach (group things by (key, ISOToMonth(datetime)))
   generate flatten(group) as (key, month),
            COUNT_STAR(things) as total;

/* Sort our totals per key by month to get a time series */
things_timeseries = foreach (group things_by_month by key) {
   timeseries = order things_by_month by month;
   generate group as key, timeseries as timeseries;
   };

store things_timeseries into '$mongourl' using MongoStorage();



                                  Yet another good Pig Macro.


        © Hortonworks Inc. 2012                                    60
Data processing in our stack

A new feature in our application might begin at any layer... great!




          “I’m creative! I know Pig!”
          “I’m creative too! I <3 Javascript!”
          “omghi2u! where r my legs? send halp”




    Any team member can add new features, no problemo!
        © Hortonworks Inc. 2012                                            61
Data processing in our stack

... but we shift the data-processing towards batch, as we are able.




                                                       See real example here.
                               Ex: Overall total emails calculated in each layer
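
A hedged sketch of the batch end of that example - the overall email total computed once in Pig and published, so the web tier just reads it:

total = foreach (group emails all) generate COUNT_STAR(emails) as total_emails;
store total into '$mongourl' using MongoStorage();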
         © Hortonworks Inc. 2012                                             62
3) Exploring with Reports




    © Hortonworks Inc. 2012   63
3) Exploring with Reports




    © Hortonworks Inc. 2012   64
3.0) From charts to reports...

• Extract entities from properties we aggregated by in charts (Step 2)

• Each entity gets its own type of web page

• Each unique entity gets its own web page
• Link to entities as they appear in atomic event documents (Step 1)

• Link the most related entities together, both within and between types.

• More visualizations!

• Parameterize results via forms.




          © Hortonworks Inc. 2012                                  65
3.1) Looks like this...




    © Hortonworks Inc. 2012   66
3.2) Cultivate common keyspaces




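One way to cultivate a shared key (a sketch, assuming the emails relation): normalize the entity key once in Pig so every page type links on the same string.

keyed = foreach emails generate LOWER(from.address) as user_key, message_id, subject;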
   © Hortonworks Inc. 2012        67
3.3) Get people clicking. Learn.
• Explore this web of generated pages, charts and links!
• Everyone on the team gets to know your data.
• Keep trying out different charts, metrics, entities, links.

• See what's interesting.
• Figure out what data needs cleaning and clean it.
• Start thinking about predictions & recommendations.



    ‘People’ could be just your team, if data is sensitive.

      © Hortonworks Inc. 2012                                   68
4) Predictions and Recommendations




   © Hortonworks Inc. 2012           69
4.0) Preparation
• We’ve already extracted entities, their properties and relationships

• Our charts show where our signal is rich

• We’ve cleaned our data to make it presentable
• The entire team has an intuitive understanding of the data

• They got that understanding by exploring the data

• We are all on the same page!




         © Hortonworks Inc. 2012                                   70
4.1) Smooth sparse data




   © Hortonworks Inc. 2012   See here.   71
4.2) Think in different perspectives
• Networks
• Time Series
• Distributions

• Natural Language
• Probability / Bayes




      © Hortonworks Inc. 2012   See here.   72
4.3) Sink more time in deeper analysis
TF-IDF
import 'tfidf.macro';
my_tf_idf_scores = tf_idf(id_body, 'message_id', 'body');


/* Get the top 10 Tf*Idf scores per message */
per_message_cassandra = foreach (group my_tf_idf_scores by message_id) {
  sorted = order my_tf_idf_scores by value desc;
  top_10_topics = limit sorted 10;
  generate group, top_10_topics.(score, value);
};



Probability / Bayes
sent_replies = join sent_counts by (from, to), reply_counts by (from, to);
reply_ratios = foreach sent_replies generate sent_counts::from as from,
                                             sent_counts::to as to,
                                             (float)reply_counts::total/(float)sent_counts::total as ratio;
reply_ratios = foreach reply_ratios generate from, to, (ratio > 1.0 ? 1.0 : ratio) as ratio;
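
A hedged variant (mine, not the slide's) that ties back to 4.1: Laplace-smooth the ratio with pseudo-counts so pairs with only a few emails don't pin to 0.0 or 1.0.

smoothed_ratios = foreach sent_replies generate
    sent_counts::from as from,
    sent_counts::to as to,
    ((float)reply_counts::total + 1.0f) / ((float)sent_counts::total + 2.0f) as ratio;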




            © Hortonworks Inc. 2012   Example with code here and macro here.                 73
4.4) Add predictions to reports




    © Hortonworks Inc. 2012       74
5) Enable new actions




   © Hortonworks Inc. 2012   75
Example: Packetpig and PacketLoop
snort_alerts = LOAD '$pcap'
  USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig');

countries = FOREACH snort_alerts
  GENERATE
    com.packetloop.packetpig.udf.geoip.Country(src) as country,
    priority;

countries = GROUP countries BY country;

countries = FOREACH countries
  GENERATE
    group,
    AVG(countries.priority) as average_severity;

STORE countries into 'output/choropleth_countries' using PigStorage(',');




                                                                     Code here.
           © Hortonworks Inc. 2012                                                76
Example: Packetpig and PacketLoop




   © Hortonworks Inc. 2012          77
Hortonworks Data Platform
• Simplify deployment to get started quickly and easily
• Monitor, manage any size cluster with familiar console and tools
• Only platform to include data integration services to interact with any data
• Metadata services open the platform for integration with existing applications
• Dependable high availability architecture
• Tested at scale to future-proof your cluster growth

 Reduce risks and cost of adoption
 Lower the total cost to administer and provision
 Integrate with your existing ecosystem


      © Hortonworks Inc. 2012                                                     78
Hortonworks Training
                            The expert source for
                            Apache Hadoop training & certification

Role-based Developer and Administration training
  –   Coursework built and maintained by the core Apache Hadoop development team.
  –   The “right” course, with the most extensive and realistic hands-on materials
  –   Provide an immersive experience into real-world Hadoop scenarios
  –   Public and Private courses available



Comprehensive Apache Hadoop Certification
  – Become a trusted and valuable
    Apache Hadoop expert




        © Hortonworks Inc. 2012                                                      79
Next Steps?

1                                 Download Hortonworks Data Platform
                                  hortonworks.com/download




2   Use the getting started guide
    hortonworks.com/get-started




3   Learn more… get support

       Hortonworks Training
       • Expert role-based training
       • Courses for admins, developers and operators
       • Certification program
       • Custom onsite options
       hortonworks.com/training

       Hortonworks Support
       • Full lifecycle technical support across four service levels
       • Delivered by Apache Hadoop Experts/Committers
       • Forward-compatible
       hortonworks.com/support


        © Hortonworks Inc. 2012                                                                   80
Thank You!
Questions & Answers

Follow: @hortonworks and @rjurney
Read: hortonworks.com/blog




     © Hortonworks Inc. 2012        81

Weitere ähnliche Inhalte

Was ist angesagt?

Hadoop Operations, Innovations and Enterprise Readiness with Hortonworks Data...
Hadoop Operations, Innovations and Enterprise Readiness with Hortonworks Data...Hadoop Operations, Innovations and Enterprise Readiness with Hortonworks Data...
Hadoop Operations, Innovations and Enterprise Readiness with Hortonworks Data...Hortonworks
 
Hadoop crashcourse v3
Hadoop crashcourse v3Hadoop crashcourse v3
Hadoop crashcourse v3Hortonworks
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseHortonworks
 
Bigger Data For Your Budget
Bigger Data For Your BudgetBigger Data For Your Budget
Bigger Data For Your BudgetHortonworks
 
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...Hortonworks
 
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache HadoopRescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache HadoopHortonworks
 
Introduction to the Hortonworks YARN Ready Program
Introduction to the Hortonworks YARN Ready ProgramIntroduction to the Hortonworks YARN Ready Program
Introduction to the Hortonworks YARN Ready ProgramHortonworks
 
Introduction to Hortonworks Data Platform for Windows
Introduction to Hortonworks Data Platform for WindowsIntroduction to Hortonworks Data Platform for Windows
Introduction to Hortonworks Data Platform for WindowsHortonworks
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataHortonworks
 
Enabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNEnabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNDataWorks Summit
 
Pig Out to Hadoop
Pig Out to HadoopPig Out to Hadoop
Pig Out to HadoopHortonworks
 
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFSDiscover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFSHortonworks
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez Hortonworks
 
Introduction to Hortonworks Data Platform
Introduction to Hortonworks Data PlatformIntroduction to Hortonworks Data Platform
Introduction to Hortonworks Data PlatformHortonworks
 
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...Hortonworks
 
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big DataCombine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big DataHortonworks
 
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.nextDiscover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.nextHortonworks
 
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Innovative Management Services
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNHortonworks
 
YARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache HadoopYARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache HadoopHortonworks
 

Was ist angesagt? (20)

Hadoop Operations, Innovations and Enterprise Readiness with Hortonworks Data...
Hadoop Operations, Innovations and Enterprise Readiness with Hortonworks Data...Hadoop Operations, Innovations and Enterprise Readiness with Hortonworks Data...
Hadoop Operations, Innovations and Enterprise Readiness with Hortonworks Data...
 
Hadoop crashcourse v3
Hadoop crashcourse v3Hadoop crashcourse v3
Hadoop crashcourse v3
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical Enterprise
 
Bigger Data For Your Budget
Bigger Data For Your BudgetBigger Data For Your Budget
Bigger Data For Your Budget
 
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
Optimizing your Modern Data Architecture - with Attunity, RCG Global Services...
 
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache HadoopRescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
Rescue your Big Data from Downtime with HP Operations Bridge and Apache Hadoop
 
Introduction to the Hortonworks YARN Ready Program
Introduction to the Hortonworks YARN Ready ProgramIntroduction to the Hortonworks YARN Ready Program
Introduction to the Hortonworks YARN Ready Program
 
Introduction to Hortonworks Data Platform for Windows
Introduction to Hortonworks Data Platform for WindowsIntroduction to Hortonworks Data Platform for Windows
Introduction to Hortonworks Data Platform for Windows
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big Data
 
Enabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARNEnabling Diverse Workload Scheduling in YARN
Enabling Diverse Workload Scheduling in YARN
 
Pig Out to Hadoop
Pig Out to HadoopPig Out to Hadoop
Pig Out to Hadoop
 
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFSDiscover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez
 
Introduction to Hortonworks Data Platform
Introduction to Hortonworks Data PlatformIntroduction to Hortonworks Data Platform
Introduction to Hortonworks Data Platform
 
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...
 
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big DataCombine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data
 
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.nextDiscover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
 
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
Open-BDA Hadoop Summit 2014 - Mr. Slim Baltagi (Building a Modern Data Archit...
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
 
YARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache HadoopYARN: Future of Data Processing with Apache Hadoop
YARN: Future of Data Processing with Apache Hadoop
 

Andere mochten auch

Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks
 
Hortonworks Technical Workshop: HBase For Mission Critical Applications
Hortonworks Technical Workshop: HBase For Mission Critical ApplicationsHortonworks Technical Workshop: HBase For Mission Critical Applications
Hortonworks Technical Workshop: HBase For Mission Critical ApplicationsHortonworks
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsHortonworks
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHortonworks
 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitSaptak Sen
 
Hortonworks Sandbox Startup Guide for VirtualBox
Hortonworks Sandbox Startup Guide for VirtualBoxHortonworks Sandbox Startup Guide for VirtualBox
Hortonworks Sandbox Startup Guide for VirtualBoxHortonworks
 
Operationalized Analytics in the Enterprise
Operationalized Analytics in the EnterpriseOperationalized Analytics in the Enterprise
Operationalized Analytics in the EnterpriseRon Bodkin
 
The Benefits of Predictive and Proactive Support for an Enterprise Data Hub
The Benefits of Predictive and Proactive Support for an Enterprise Data HubThe Benefits of Predictive and Proactive Support for an Enterprise Data Hub
The Benefits of Predictive and Proactive Support for an Enterprise Data HubCloudera, Inc.
 
Cloudera security and enterprise license by Athemaster(繁中)
Cloudera security and enterprise license by Athemaster(繁中)Cloudera security and enterprise license by Athemaster(繁中)
Cloudera security and enterprise license by Athemaster(繁中)Athemaster Co., Ltd.
 
Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks
 
Apache hadoop and cdh(cloudera distribution) introduction 基本介紹
Apache hadoop and cdh(cloudera distribution) introduction 基本介紹Apache hadoop and cdh(cloudera distribution) introduction 基本介紹
Apache hadoop and cdh(cloudera distribution) introduction 基本介紹Anna Yen
 
Log analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and KibanaLog analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and KibanaAvinash Ramineni
 
Hortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
Hortonworks Technical Workshop: Real Time Monitoring with Apache HadoopHortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
Hortonworks Technical Workshop: Real Time Monitoring with Apache HadoopHortonworks
 
Machine Learning Loves Hadoop
Machine Learning Loves HadoopMachine Learning Loves Hadoop
Machine Learning Loves HadoopCloudera, Inc.
 
Hortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks
 
2014 年十大商业智能趋势
2014 年十大商业智能趋势2014 年十大商业智能趋势
2014 年十大商业智能趋势Tableau Software
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with sparkHortonworks
 
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
Hortonworks Technical Workshop:   HDP everywhere - cloud considerations using...Hortonworks Technical Workshop:   HDP everywhere - cloud considerations using...
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...Hortonworks
 
Hortonworks Technical Workshop: Apache Ambari
Hortonworks Technical Workshop:   Apache AmbariHortonworks Technical Workshop:   Apache Ambari
Hortonworks Technical Workshop: Apache AmbariHortonworks
 

Andere mochten auch (20)

Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3Hortonworks Technical Workshop: What's New in HDP 2.3
Hortonworks Technical Workshop: What's New in HDP 2.3
 
Hortonworks Technical Workshop: HBase For Mission Critical Applications
Hortonworks Technical Workshop: HBase For Mission Critical ApplicationsHortonworks Technical Workshop: HBase For Mission Critical Applications
Hortonworks Technical Workshop: HBase For Mission Critical Applications
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
 
Apache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop SummitApache Spark Workshop at Hadoop Summit
Apache Spark Workshop at Hadoop Summit
 
Hortonworks Sandbox Startup Guide for VirtualBox
Hortonworks Sandbox Startup Guide for VirtualBoxHortonworks Sandbox Startup Guide for VirtualBox
Hortonworks Sandbox Startup Guide for VirtualBox
 
Operationalized Analytics in the Enterprise
Operationalized Analytics in the EnterpriseOperationalized Analytics in the Enterprise
Operationalized Analytics in the Enterprise
 
The Benefits of Predictive and Proactive Support for an Enterprise Data Hub
The Benefits of Predictive and Proactive Support for an Enterprise Data HubThe Benefits of Predictive and Proactive Support for an Enterprise Data Hub
The Benefits of Predictive and Proactive Support for an Enterprise Data Hub
 
Cloudera security and enterprise license by Athemaster(繁中)
Cloudera security and enterprise license by Athemaster(繁中)Cloudera security and enterprise license by Athemaster(繁中)
Cloudera security and enterprise license by Athemaster(繁中)
 
Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search
 
Apache hadoop and cdh(cloudera distribution) introduction 基本介紹
Apache hadoop and cdh(cloudera distribution) introduction 基本介紹Apache hadoop and cdh(cloudera distribution) introduction 基本介紹
Apache hadoop and cdh(cloudera distribution) introduction 基本介紹
 
Log analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and KibanaLog analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and Kibana
 
Solving Big Data Problems using Hortonworks
Solving Big Data Problems using Hortonworks Solving Big Data Problems using Hortonworks
Solving Big Data Problems using Hortonworks
 
Hortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
Hortonworks Technical Workshop: Real Time Monitoring with Apache HadoopHortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
Hortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
 
Machine Learning Loves Hadoop
Machine Learning Loves HadoopMachine Learning Loves Hadoop
Machine Learning Loves Hadoop
 
Hortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices WorkshopHortonworks Technical Workshop - Operational Best Practices Workshop
Hortonworks Technical Workshop - Operational Best Practices Workshop
 
2014 年十大商业智能趋势
2014 年十大商业智能趋势2014 年十大商业智能趋势
2014 年十大商业智能趋势
 
Hortonworks tech workshop in-memory processing with spark
Hortonworks tech workshop   in-memory processing with sparkHortonworks tech workshop   in-memory processing with spark
Hortonworks tech workshop in-memory processing with spark
 
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
Hortonworks Technical Workshop:   HDP everywhere - cloud considerations using...Hortonworks Technical Workshop:   HDP everywhere - cloud considerations using...
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
 
Hortonworks Technical Workshop: Apache Ambari
Hortonworks Technical Workshop:   Apache AmbariHortonworks Technical Workshop:   Apache Ambari
Hortonworks Technical Workshop: Apache Ambari
 

Ähnlich wie Hortonworks: Agile Analytics Applications

Utrecht NL-HUG/Data Science-NL - Agile Data Slides
Utrecht NL-HUG/Data Science-NL - Agile Data SlidesUtrecht NL-HUG/Data Science-NL - Agile Data Slides
Utrecht NL-HUG/Data Science-NL - Agile Data SlidesHortonworks
 
Paris HUG - Agile Analytics Applications on Hadoop
Paris HUG - Agile Analytics Applications on HadoopParis HUG - Agile Analytics Applications on Hadoop
Paris HUG - Agile Analytics Applications on HadoopHortonworks
 
UK - Agile Data Applications on Hadoop
UK - Agile Data Applications on HadoopUK - Agile Data Applications on Hadoop
UK - Agile Data Applications on HadoopHortonworks
 
Orange County HUG - Agile Data on HDP
Orange County HUG - Agile Data on HDPOrange County HUG - Agile Data on HDP
Orange County HUG - Agile Data on HDPHortonworks
 
LA HUG - Agile Analytics Applications on HDP
LA HUG - Agile Analytics Applications on HDPLA HUG - Agile Analytics Applications on HDP
LA HUG - Agile Analytics Applications on HDPHortonworks
 
Hadoop as data refinery
Hadoop as data refineryHadoop as data refinery
Hadoop as data refinerySteve Loughran
 
Hadoop as Data Refinery - Steve Loughran
Hadoop as Data Refinery - Steve LoughranHadoop as Data Refinery - Steve Loughran
Hadoop as Data Refinery - Steve LoughranJAX London
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopPOSSCON
 
Why hadoop for data science?
Why hadoop for data science?Why hadoop for data science?
Why hadoop for data science?Hortonworks
 
Building a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise HadoopBuilding a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise HadoopSlim Baltagi
 
Create a Smarter Data Lake with HP Haven and Apache Hadoop
Create a Smarter Data Lake with HP Haven and Apache HadoopCreate a Smarter Data Lake with HP Haven and Apache Hadoop
Create a Smarter Data Lake with HP Haven and Apache HadoopHortonworks
 
Storm Demo Talk - Colorado Springs May 2015
Storm Demo Talk - Colorado Springs May 2015Storm Demo Talk - Colorado Springs May 2015
Storm Demo Talk - Colorado Springs May 2015Mac Moore
 
Hortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - WebinarHortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - WebinarHortonworks
 
Hortonworks Big Data & Hadoop
Hortonworks Big Data & HadoopHortonworks Big Data & Hadoop
Hortonworks Big Data & HadoopMark Ginnebaugh
 
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUGReal-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUGskumpf
 
Hadoop: today and tomorrow
Hadoop: today and tomorrowHadoop: today and tomorrow
Hadoop: today and tomorrowSteve Loughran
 
Storm Demo Talk - Denver Apr 2015
Storm Demo Talk - Denver Apr 2015Storm Demo Talk - Denver Apr 2015
Storm Demo Talk - Denver Apr 2015Mac Moore
 
SoCal BigData Day
SoCal BigData DaySoCal BigData Day
SoCal BigData DayJohn Park
 
High Speed Continuous & Reliable Data Ingest into Hadoop
High Speed Continuous & Reliable Data Ingest into HadoopHigh Speed Continuous & Reliable Data Ingest into Hadoop
High Speed Continuous & Reliable Data Ingest into HadoopDataWorks Summit
 
High Speed Smart Data Ingest
High Speed Smart Data IngestHigh Speed Smart Data Ingest
High Speed Smart Data IngestDataWorks Summit
 

Ähnlich wie Hortonworks: Agile Analytics Applications (20)

Utrecht NL-HUG/Data Science-NL - Agile Data Slides
Utrecht NL-HUG/Data Science-NL - Agile Data SlidesUtrecht NL-HUG/Data Science-NL - Agile Data Slides
Utrecht NL-HUG/Data Science-NL - Agile Data Slides
 
Paris HUG - Agile Analytics Applications on Hadoop
Paris HUG - Agile Analytics Applications on HadoopParis HUG - Agile Analytics Applications on Hadoop
Paris HUG - Agile Analytics Applications on Hadoop
 
UK - Agile Data Applications on Hadoop
UK - Agile Data Applications on HadoopUK - Agile Data Applications on Hadoop
UK - Agile Data Applications on Hadoop
 
Orange County HUG - Agile Data on HDP
Orange County HUG - Agile Data on HDPOrange County HUG - Agile Data on HDP
Orange County HUG - Agile Data on HDP
 
LA HUG - Agile Analytics Applications on HDP
LA HUG - Agile Analytics Applications on HDPLA HUG - Agile Analytics Applications on HDP
LA HUG - Agile Analytics Applications on HDP
 
Hadoop as data refinery
Hadoop as data refineryHadoop as data refinery
Hadoop as data refinery
 
Hadoop as Data Refinery - Steve Loughran
Hadoop as Data Refinery - Steve LoughranHadoop as Data Refinery - Steve Loughran
Hadoop as Data Refinery - Steve Loughran
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Why hadoop for data science?
Why hadoop for data science?Why hadoop for data science?
Why hadoop for data science?
 
Building a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise HadoopBuilding a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise Hadoop
 
Create a Smarter Data Lake with HP Haven and Apache Hadoop
Create a Smarter Data Lake with HP Haven and Apache HadoopCreate a Smarter Data Lake with HP Haven and Apache Hadoop
Create a Smarter Data Lake with HP Haven and Apache Hadoop
 
Storm Demo Talk - Colorado Springs May 2015
Storm Demo Talk - Colorado Springs May 2015Storm Demo Talk - Colorado Springs May 2015
Storm Demo Talk - Colorado Springs May 2015
 
Hortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - WebinarHortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - Webinar
 
Hortonworks Big Data & Hadoop
Hortonworks Big Data & HadoopHortonworks Big Data & Hadoop
Hortonworks Big Data & Hadoop
 
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUGReal-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
Real-Time Processing in Hadoop for IoT Use Cases - Phoenix HUG
 
Hadoop: today and tomorrow
Hadoop: today and tomorrowHadoop: today and tomorrow
Hadoop: today and tomorrow
 
Storm Demo Talk - Denver Apr 2015
Storm Demo Talk - Denver Apr 2015Storm Demo Talk - Denver Apr 2015
Storm Demo Talk - Denver Apr 2015
 
SoCal BigData Day
SoCal BigData DaySoCal BigData Day
SoCal BigData Day
 
High Speed Continuous & Reliable Data Ingest into Hadoop
High Speed Continuous & Reliable Data Ingest into HadoopHigh Speed Continuous & Reliable Data Ingest into Hadoop
High Speed Continuous & Reliable Data Ingest into Hadoop
 
High Speed Smart Data Ingest
High Speed Smart Data IngestHigh Speed Smart Data Ingest
High Speed Smart Data Ingest
 

Hortonworks: Agile Analytics Applications

  • 1. Agile Analytics Applications Russell Jurney (@rjurney) - Hadoop Evangelist @Hortonworks Formerly Viz, Data Science at Ning, LinkedIn HBase Dashboards, Career Explorer, InMaps © Hortonworks Inc. 2012 1
  • 2. Agile Data - The Book (March, 2013) Read it now on OFPS A philosophy, not the only way But still, its good! Really! © Hortonworks Inc. 2012 2
  • 3. Situation in Brief © Hortonworks Inc. 2012 3
  • 4. Agile Application Development • LAMP stack mature • Post-Rails frameworks to choose from • We enable rapid feedback and agility + NoSQL © Hortonworks Inc. 2012 4
  • 5. Data Warehousing © Hortonworks Inc. 2012 5
  • 6. Scientific Computing / HPC • ‘Smart kid’ only: MPI, Globus, etc. • Math heavy Tubes and Mercury (old school) Cores and Spindles (new school) © Hortonworks Inc. 2012 6
  • 7. Data Science? Application Data Warehousing Development 33% 33% 33% Scientific Computing / HPC © Hortonworks Inc. 2012 7
  • 8. Data Center as Computer • Warehouse Scale Computers and applications “A key challenge for architects of WSCs is to smooth out these discrepancies in a cost efficient manner.” Click here. © Hortonworks Inc. 2012 8
  • 9. Hadoop to the Rescue! Big data refinery / Modernize ETL Audio, Web, Mobile, CRM, Video, ERP, SCM, … Images New Data Business Transactions Docs, Sources Text, & Interactions XML HDFS Web Logs, Clicks Big Data Social, Refinery SQL NoSQL NewSQL Graph, Feeds ETL EDW MPP NewSQL Sensors, Devices, RFID Business Spatial, GPS Apache Hadoop Intelligence & Analytics Events, Other Dashboards, Reports, Visualization, … Page 7 © Hortonworks Inc. 2012 I stole this slide from Eric. 9
  • 10. Hadoop to the Rescue! • A department can afford a Hadoop cluster, let alone an org • Dump all your data in one place: HDFS • JOIN like crazy! • ETL like whoah! • An army of mappers and reducers at your command • Now what? © Hortonworks Inc. 2012 10
  • 11. Analytics Apps: It takes a Team • Broad skill-set to make useful apps • Basically nobody has them all • Application development is inherently collaborative © Hortonworks Inc. 2012 11
  • 12. How to get insight into product? • Back-end has gotten t-h-i-c-k-e-r • Generating $$$ insight can take 10-100x app dev • Timeline disjoint: analytics vs agile app-dev/design • How do you ship insights efficiently? • How do you collaborate on research vs developer timeline? © Hortonworks Inc. 2012 12
  • 13. The Wrong Way - Part One “We made a great design. Your job is to predict the future for it.” © Hortonworks Inc. 2012 13
  • 14. The Wrong Way - Part Two “What’s taking you so long to reliably predict the future?” © Hortonworks Inc. 2012 14
  • 15. The Wrong Way - Part Three “The users don’t understand what 86% true means.” © Hortonworks Inc. 2012 15
  • 16. The Wrong Way - Part Four GHJIAEHGIEhjagigehganbanbigaebjnain!!!!!RJ(@J?!! © Hortonworks Inc. 2012 16
  • 17. The Wrong Way - Inevitable Conclusion [image: plane, meet mountain] © Hortonworks Inc. 2012 17
  • 18. Reminds me of... © Hortonworks Inc. 2012 18
  • 19. Chief Problem You can’t design insight in analytics applications. You discover it. You discover by exploring. © Hortonworks Inc. 2012 19
  • 20. -> Strategy So make an app for exploring your data. Iterate and publish intermediate results. Which becomes a palette for what you ship. © Hortonworks Inc. 2012 20
  • 21. Data Design • Not the 1st query that = insight, it’s the 15th, or the 150th • Capturing “Ah ha!” moments • Slow to do those in batch... • Faster, better context in an interactive web application. • Pre-designed charts wind up terrible. So bad. • Easy to invest man-years in the wrong statistical models • Semantics of presenting predictions are complex, delicate • Opportunity lies at intersection of data & design © Hortonworks Inc. 2012 21
  • 22. How do we get back to Agile? © Hortonworks Inc. 2012 22
  • 23. Statement of Principles (then tricks, with code) © Hortonworks Inc. 2012 23
  • 24. Set up an environment where... • Insights repeatedly produced • Iterative work shared with entire team • Interactive from day 0 • Data model is consistent end-to-end • Minimal impedance between layers • Scope and depth of insights grow • Insights form the palette for what you ship • Until the application pays for itself and more © Hortonworks Inc. 2012 24
  • 25. Value document > relation Most data is dirty. Most data is semi-structured or un-structured. Rejoice! © Hortonworks Inc. 2012 25
  • 26. Value document > relation Note: Hive/ArrayQL/NewSQL’s support of documents/array types blur this distinction. © Hortonworks Inc. 2012 26
  • 27. Value imperative > declarative • We don’t know what we want to SELECT. • Data is dirty - check each step, clean iteratively. • 85% of data scientist’s time spent munging. See: ETL. • Imperative is optimized for our process. • Process = iterative, snowballing insight • Efficiency matters, self optimize © Hortonworks Inc. 2012 27
  • 28. Value dataflow > SELECT © Hortonworks Inc. 2012 28
  • 29. Example dataflow: ETL + email sent count © Hortonworks Inc. 2012 (I can’t read this either. Get a big version here.) 29
  • 30. Value Pig > Hive (for app-dev) • Pigs eat ANYTHING • Pig is optimized for refining data, as opposed to consuming it • Pig is imperative, iterative • Pig is dataflows, and SQLish (but not SQL) • Code modularization/re-use: Pig Macros • ILLUSTRATE speeds dev time (even UDFs) • Easy UDFs in Java, JRuby, Jython, Javascript • Pig Streaming = use any tool, period. • Easily prepare our data as it will appear in our app. • If you prefer Hive, use Hive. But actually, I wish Pig and Hive were one tool. Pig, then Hive, then Pig, then Hive... See: HCatalog for Pig/Hive integration, and this post. © Hortonworks Inc. 2012 30
  • 31. Localhost vs Petabyte scale: same tools • Simplicity essential to scalability: highest level tools we can • Prepare a good sample - tricky with joins, easy with documents • Local mode: pig -l /tmp -x local -v -w • Frequent use of ILLUSTRATE • 1st: Iterate, debug & publish locally • 2nd: Run on cluster, publish to team/customer • Consider skipping Object-Relational-Mapping (ORM) • We do not trust ‘databases,’ only HDFS @ n=3. • Everything we serve in our app is re-creatable via Hadoop. © Hortonworks Inc. 2012 31
  • 32. Data-Value Pyramid Climb it. Do not skip steps. See here. © Hortonworks Inc. 2012 32
  • 33. 0/1) Display atomic records on the web © Hortonworks Inc. 2012 33
  • 34. 0.0) Document-serialize events • Protobuf • Thrift • JSON • Avro - I use Avro because the schema is onboard. © Hortonworks Inc. 2012 34
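  For context on the schema being “onboard”: an Avro data file embeds its schema, so any reader can decode records with no out-of-band metadata. A minimal Python sketch, assuming a hypothetical email.avsc schema whose fields match the record below:

    import avro.schema
    from avro.datafile import DataFileWriter
    from avro.io import DatumWriter

    # The schema is written into emails.avro itself ("onboard"),
    # so downstream readers need nothing but the file.
    schema = avro.schema.parse(open('email.avsc').read())  # avro.schema.Parse in newer releases
    writer = DataFileWriter(open('emails.avro', 'wb'), DatumWriter(), schema)
    writer.append({
        'message_id': '<1731.10095812390082.JavaMail.evans@thyme>',
        'from': {'address': 'bob.dobbs@enron.com', 'name': None},
        'subject': 'Re: Enron trade for frop futures',
        'body': 'scamming people, blah blah',
    })
    writer.close()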
  • 35. 0.1) Documents via Relation ETL enron_messages = load '/enron/enron_messages.tsv' as ( message_id:chararray, sql_date:chararray, from_address:chararray, from_name:chararray, subject:chararray, body:chararray );   enron_recipients = load '/enron/enron_recipients.tsv' as ( message_id:chararray, reciptype:chararray, address:chararray, name:chararray);   split enron_recipients into tos IF reciptype=='to', ccs IF reciptype=='cc', bccs IF reciptype=='bcc';   headers = cogroup tos by message_id, ccs by message_id, bccs by message_id parallel 10; with_headers = join headers by group, enron_messages by message_id parallel 10; emails = foreach with_headers generate enron_messages::message_id as message_id, CustomFormatToISO(enron_messages::sql_date, 'yyyy-MM-dd HH:mm:ss') as date, TOTUPLE(enron_messages::from_address, enron_messages::from_name) as from:tuple(address:chararray, name:chararray), enron_messages::subject as subject, enron_messages::body as body, headers::tos.(address, name) as tos, headers::ccs.(address, name) as ccs, headers::bccs.(address, name) as bccs; store emails into '/enron/emails.avro' using AvroStorage( Example here. © Hortonworks Inc. 2012 35
  • 36. 0.2) Serialize events from streams class GmailSlurper(object): ...   def init_imap(self, username, password):     self.username = username     self.password = password     try:       self.imap.shutdown()     except:       pass     self.imap = imaplib.IMAP4_SSL('imap.gmail.com', 993)     self.imap.login(username, password)     self.imap.is_readonly = True ...   def write(self, record):     self.avro_writer.append(record) ...   def slurp(self):     if(self.imap and self.imap_folder):       for email_id in self.id_list:         (status, email_hash, charset) = self.fetch_email(email_id)         if(status == 'OK' and charset and 'thread_id' in email_hash and 'froms' in email_hash):           print email_id, charset, email_hash['thread_id']           self.write(email_hash) © Hortonworks Inc. 2012 Scrape your own gmail in Python and Ruby. 36
  • 37. 0.3) ETL Logs log_data = LOAD 'access_log' USING org.apache.pig.piggybank.storage.apachelog.CommonLogLoader AS (remoteAddr, remoteLogname, user, time, method, uri, proto, bytes); © Hortonworks Inc. 2012 37
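  To prototype the same log ETL locally before reaching for the Piggybank loader, a hedged Python sketch of common-log parsing; the regex covers the standard fields, and real logs may need tweaks:

    import re

    # Apache common log format: host ident user [time] "request" status bytes
    COMMON_LOG = re.compile(
        r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+)$')

    def parse_line(line):
        m = COMMON_LOG.match(line)
        if m is None:
            return None  # dirty data: check each step, clean iteratively
        addr, logname, user, time, method, uri, proto, status, nbytes = m.groups()
        return {'remoteAddr': addr, 'remoteLogname': logname, 'user': user,
                'time': time, 'method': method, 'uri': uri, 'proto': proto,
                'status': int(status), 'bytes': 0 if nbytes == '-' else int(nbytes)}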
  • 38. 1) Plumb atomic events -> browser (Example stack that enables high productivity) © Hortonworks Inc. 2012 38
  • 39. Lots of Stack Options with Examples • Pig with Voldemort, Ruby, Sinatra: example • Pig with ElasticSearch: example • Pig with MongoDB, Node.js: example • Pig with Cassandra, Python Streaming, Flask: example • Pig with HBase, JRuby, Sinatra: example • Pig with Hive via HCatalog: example (trivial on HDP) • Up next: Accumulo, Redis, MySQL, etc. © Hortonworks Inc. 2012 39
  • 40. 1.1) cat our Avro serialized events me$ cat_avro ~/Data/enron.avro { u'bccs': [], u'body': u'scamming people, blah blah', u'ccs': [], u'date': u'2000-08-28T01:50:00.000Z', u'from': {u'address': u'bob.dobbs@enron.com', u'name': None}, u'message_id': u'<1731.10095812390082.JavaMail.evans@thyme>', u'subject': u'Re: Enron trade for frop futures', u'tos': [ {u'address': u'connie@enron.com', u'name': None} ] } © Hortonworks Inc. 2012 Get cat_avro in python, ruby 40
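  Reading the file back is just as short; the core of a cat_avro-style tool looks like this (a sketch, not the linked utility’s actual source):

    from avro.datafile import DataFileReader
    from avro.io import DatumReader

    # No .avsc needed: the schema travels with the file.
    reader = DataFileReader(open('emails.avro', 'rb'), DatumReader())
    for record in reader:
        print(record)
    reader.close()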
  • 41. 1.2) Load our events in Pig me$ pig -l /tmp -x local -v -w grunt> enron_emails = LOAD '/enron/emails.avro' USING AvroStorage(); grunt> describe enron_emails emails: { message_id: chararray, datetime: chararray, from:tuple(address:chararray,name:chararray) subject: chararray, body: chararray, tos: {to: (address: chararray,name: chararray)}, ccs: {cc: (address: chararray,name: chararray)}, bccs: {bcc: (address: chararray,name: chararray)} }   © Hortonworks Inc. 2012 41
  • 42. 1.3) ILLUSTRATE our events in Pig grunt> illustrate enron_emails   --------------------------------------------------------------------------- | emails | | message_id:chararray | | datetime:chararray | | from:tuple(address:chararray,name:chararray) | | subject:chararray | | body:chararray | | tos:bag{to:tuple(address:chararray,name:chararray)} | | ccs:bag{cc:tuple(address:chararray,name:chararray)} | | bccs:bag{bcc:tuple(address:chararray,name:chararray)} | --------------------------------------------------------------------------- | | | <1731.10095812390082.JavaMail.evans@thyme> | | 2001-01-09T06:38:00.000Z | | (bob.dobbs@enron.com, J.R. Bob Dobbs) | | Re: Enron trade for frop futures | | scamming people, blah blah | | {(connie@enron.com,)} | | {} | | {} | Upgrade to Pig 0.10+ © Hortonworks Inc. 2012 42
  • 43. 1.4) Publish our events to a ‘database’ From Avro to MongoDB in one command: pig -l /tmp -x local -v -w -param avros=enron.avro -param mongourl='mongodb://localhost/enron.emails' avro_to_mongo.pig Which does this: /* MongoDB libraries and configuration */ register /me/mongo-hadoop/mongo-2.7.3.jar register /me/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar register /me/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar /* Set speculative execution off to avoid chance of duplicate records in Mongo */ set mapred.map.tasks.speculative.execution false set mapred.reduce.tasks.speculative.execution false define MongoStorage com.mongodb.hadoop.pig.MongoStorage(); /* Shortcut */ /* By default, let’s have 5 reducers */ set default_parallel 5 avros = load '$avros' using AvroStorage(); store avros into '$mongourl' using MongoStorage(); © Hortonworks Inc. 2012 Full instructions here. 43
  • 44. 1.5) Check events in our ‘database’ $ mongo enron MongoDB shell version: 2.0.2 connecting to: enron > show collections emails system.indexes > db.emails.findOne({message_id: "<1731.10095812390082.JavaMail.evans@thyme>"}) { "_id" : ObjectId("502b4ae703643a6a49c8d180"), "message_id" : "<1731.10095812390082.JavaMail.evans@thyme>", "date" : "2001-01-09T06:38:00.000Z", "from" : { "address" : "bob.dobbs@enron.com", "name" : "J.R. Bob Dobbs" }, "subject" : "Re: Enron trade for frop futures", "body" : "Scamming more people...", "tos" : [ { "address" : "connie@enron.com", "name" : null } ], "ccs" : [ ], "bccs" : [ ] } © Hortonworks Inc. 2012 44
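  The same sanity check scripted from Python with pymongo, if you prefer that to the mongo shell (a sketch; database and collection names match the example above):

    from pymongo import MongoClient

    emails = MongoClient('mongodb://localhost')['enron']['emails']
    doc = emails.find_one(
        {'message_id': '<1731.10095812390082.JavaMail.evans@thyme>'})
    print(doc['from']['address'] if doc else 'not found')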
  • 45. 1.6) Publish events on the web require 'rubygems' require 'sinatra' require 'mongo' require 'json' connection = Mongo::Connection.new database = connection['agile_data'] collection = database['emails'] get '/email/:message_id' do |message_id| data = collection.find_one({:message_id => message_id}) JSON.generate(data) end © Hortonworks Inc. 2012 45
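  For teams on the Flask variant of the stack (see the examples list above), a sketch of the same endpoint in Python; it mirrors the Sinatra code and assumes the same agile_data.emails collection:

    import json
    from flask import Flask
    from pymongo import MongoClient

    app = Flask(__name__)
    collection = MongoClient('mongodb://localhost')['agile_data']['emails']

    @app.route('/email/<message_id>')
    def email(message_id):
        data = collection.find_one({'message_id': message_id})
        if data:
            data.pop('_id', None)  # ObjectId is not JSON-serializable
        return json.dumps(data)

    if __name__ == '__main__':
        app.run(debug=True)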
  • 46. 1.6) Publish events on the web © Hortonworks Inc. 2012 46
  • 47. What’s the point? • A designer can work against real data. • An application developer can work against real data. • A product manager can think in terms of real data. • Entire team is grounded in reality! • You’ll see how ugly your data really is. • You’ll see how much work you have yet to do. • Ship early and often! • Feels agile, don’t it? Keep it up! © Hortonworks Inc. 2012 47
  • 48. 1.7) Wrap events with Bootstrap <link href="/static/bootstrap/docs/assets/css/bootstrap.css" rel="stylesheet"> </head> <body> <div class="container" style="margin-top: 100px;"> <table class="table table-striped table-bordered table-condensed"> <thead> {% for key in data['keys'] %} <th>{{ key }}</th> {% endfor %} </thead> <tbody> <tr> {% for value in data['values'] %} <td>{{ value }}</td> {% endfor %} </tr> </tbody> </table> </div> </body> Complete example here with code here. © Hortonworks Inc. 2012 48
  • 49. 1.7) Wrap events with Bootstrap © Hortonworks Inc. 2012 49
  • 50. Refine. Add links between documents. Not the Mona Lisa, but coming along... See: here © Hortonworks Inc. 2012 50
  • 51. The Mona Lisa. In pure CSS. © Hortonworks Inc. 2012 See: here 51
  • 52. 1.8) List links to sorted events Use Pig, serve/cache a bag/array of email documents: pig -l /tmp -x local -v -w emails_per_user = foreach (group emails by from.address) { sorted = order emails by date; last_1000 = limit sorted 1000; generate group as from_address, last_1000 as emails; }; store emails_per_user into '$mongourl' using MongoStorage(); Use your ‘database’, if it can sort. mongo enron > db.emails.ensureIndex({message_id: 1}) > db.emails.find().sort({date:0}).limit(10).pretty() { "_id" : ObjectId("4f7a5da2414e4dd0645d1176"), "message_id" : "<CA+bvURyn-rLcH_JXeuzhyq8T9RNq+YJ_Hkvhnrpk8zfYshL-wA@mail.gmail.com>", "from" : [ ... © Hortonworks Inc. 2012 52
  • 53. 1.8) List links to sorted documents © Hortonworks Inc. 2012 53
  • 54. 1.9) Make it searchable... If you have a list, search is easy with ElasticSearch and Wonderdog... /* Load ElasticSearch integration */ register '/me/wonderdog/target/wonderdog-1.0-SNAPSHOT.jar'; register '/me/elasticsearch-0.18.6/lib/*'; define ElasticSearch com.infochimps.elasticsearch.pig.ElasticSearchStorage(); emails = load '/me/tmp/emails' using AvroStorage(); store emails into 'es://email/email?json=false&size=1000' using ElasticSearch('/me/elasticsearch-0.18.6/config/elasticsearch.yml', '/me/elasticsearch-0.18.6/plugins'); Test it with curl: curl -XGET 'http://localhost:9200/email/email/_search?q=hadoop&pretty=true&size=1' ElasticSearch has no security features. Take note. Isolate. © Hortonworks Inc. 2012 54
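  The same smoke test from Python with requests, equivalent to the curl call above:

    import requests

    # Top hit for 'hadoop' in the email index
    r = requests.get('http://localhost:9200/email/email/_search',
                     params={'q': 'hadoop', 'pretty': 'true', 'size': 1})
    print(r.json()['hits']['total'])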
  • 55. From now on we speed up... Don’t worry, it’s in the book and on the blog. http://hortonworks.com/blog/ © Hortonworks Inc. 2012 55
  • 56. 2) Create Simple Charts © Hortonworks Inc. 2012 56
  • 57. 2) Create Simple Tables and Charts © Hortonworks Inc. 2012 57
  • 58. 2) Create Simple Charts • Start with an HTML table on general principle. • Then use nvd3.js - reusable charts for d3.js • Aggregating by properties & displaying them is the first step in entity resolution • Start extracting entities. Ex: people, places, topics, time series • Group documents by entities, rank and count. • Publish top N, time series, etc. • Fill a page with charts. • Add a chart to your event page. © Hortonworks Inc. 2012 58
  • 59. 2.1) Top N (of anything) in Pig pig -l /tmp -x local -v -w top_things = foreach (group things by key) { sorted = order things by arbitrary_rank desc; top_10_things = limit sorted 10; generate group as key, top_10_things as top_10_things; }; store top_things into '$mongourl' using MongoStorage(); Remember, this is the same structure the browser gets as json. This would make a good Pig Macro. © Hortonworks Inc. 2012 59
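  The same top-N-per-key pattern in plain Python, handy for eyeballing a local sample before the Pig job runs at scale (a sketch over hypothetical record dicts with 'key' and 'arbitrary_rank' fields):

    import heapq
    from itertools import groupby
    from operator import itemgetter

    def top_n_per_key(things, n=10):
        # Mirror of Pig's group / order ... desc / limit pattern
        things = sorted(things, key=itemgetter('key'))
        for key, group in groupby(things, key=itemgetter('key')):
            yield key, heapq.nlargest(n, group, key=itemgetter('arbitrary_rank'))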
  • 60. 2.2) Time Series (of anything) in Pig pig -l /tmp -x local -v -w /* Group by our key and date rounded to the month, get a total */ things_by_month = foreach (group things by (key, ISOToMonth(datetime))) generate flatten(group) as (key, month), COUNT_STAR(things) as total; /* Sort our totals per key by month to get a time series */ things_timeseries = foreach (group things_by_month by key) { timeseries = order things_by_month by month; generate group as key, timeseries as timeseries; }; store things_timeseries into '$mongourl' using MongoStorage(); Yet another good Pig Macro. © Hortonworks Inc. 2012 60
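  And the time-series logic in plain Python for local checking (a sketch; datetimes are assumed to be ISO strings, as in the emails, so the month is the first seven characters):

    from collections import Counter, defaultdict

    def monthly_timeseries(things):
        # Count records per (key, month), then emit a sorted series per key
        counts = Counter((t['key'], t['datetime'][:7]) for t in things)
        series = defaultdict(list)
        for (key, month), total in counts.items():
            series[key].append((month, total))
        return {key: sorted(points) for key, points in series.items()}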
  • 61. Data processing in our stack A new feature in our application might begin at any layer... great! omghi2u! I’m creative! I’m creative too! where r my legs? I know Pig! I <3 Javascript! send halp Any team member can add new features, no problemo! © Hortonworks Inc. 2012 61
  • 62. Data processing in our stack ... but we shift the data-processing towards batch, as we are able. See real example here. Ex: Overall total emails calculated in each layer © Hortonworks Inc. 2012 62
  • 63. 3) Exploring with Reports © Hortonworks Inc. 2012 63
  • 64. 3) Exploring with Reports © Hortonworks Inc. 2012 64
  • 65. 3.0) From charts to reports... • Extract entities from properties we aggregated by in charts (Step 2) • Each entity gets its own type of web page • Each unique entity gets its own web page • Link to entities as they appear in atomic event documents (Step 1) • Link the most related entities together, within and between types. • More visualizations! • Parameterize results via forms. © Hortonworks Inc. 2012 65
  • 66. 3.1) Looks like this... © Hortonworks Inc. 2012 66
  • 67. 3.2) Cultivate common keyspaces © Hortonworks Inc. 2012 67
  • 68. 3.3) Get people clicking. Learn. • Explore this web of generated pages, charts and links! • Everyone on the team gets to know your data. • Keep trying out different charts, metrics, entities, links. • See what’s interesting. • Figure out what data needs cleaning and clean it. • Start thinking about predictions & recommendations. ‘People’ could be just your team, if data is sensitive. © Hortonworks Inc. 2012 68
  • 69. 4) Predictions and Recommendations © Hortonworks Inc. 2012 69
  • 70. 4.0) Preparation • We’ve already extracted entities, their properties and relationships • Our charts show where our signal is rich • We’ve cleaned our data to make it presentable • The entire team has an intuitive understanding of the data • They got that understanding by exploring the data • We are all on the same page! © Hortonworks Inc. 2012 70
  • 71. 4.1) Smooth sparse data © Hortonworks Inc. 2012 See here. 71
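  One simple way to smooth a sparse series is a trailing moving average; a minimal sketch over the [(month, total), ...] series from step 2 (the linked post covers the fuller treatment):

    def moving_average(points, window=3):
        # Smooth a [(month, total), ...] series with a trailing mean
        smoothed = []
        for i in range(len(points)):
            vals = [total for _, total in points[max(0, i - window + 1):i + 1]]
            smoothed.append((points[i][0], float(sum(vals)) / len(vals)))
        return smoothed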
  • 72. 4.2) Think in different perspectives • Networks • Time Series • Distributions • Natural Language • Probability / Bayes © Hortonworks Inc. 2012 See here. 72
  • 73. 4.3) Sink more time in deeper analysis TF-IDF import 'tfidf.macro'; my_tf_idf_scores = tf_idf(id_body, 'message_id', 'body'); /* Get the top 10 Tf*Idf scores per message */ per_message_cassandra = foreach (group tfidf_all by message_id) { sorted = order tfidf_all by value desc; top_10_topics = limit sorted 10; generate group, top_10_topics.(score, value); } Probability / Bayes sent_replies = join sent_counts by (from, to), reply_counts by (from, to); reply_ratios = foreach sent_replies generate sent_counts::from as from, sent_counts::to as to, (float)reply_counts::total/(float)sent_counts::total as ratio; reply_ratios = foreach reply_ratios generate from, to, (ratio > 1.0 ? 1.0 : ratio) as ratio; © Hortonworks Inc. 2012 Example with code here and macro here. 73
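  The TF-IDF idea prototypes quickly in Python with scikit-learn before you sink time into running the Pig macro at scale (a sketch; bodies stands in for a list of email body strings):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    bodies = ['scamming people, blah blah', 'Re: Enron trade for frop futures']
    vectorizer = TfidfVectorizer(max_features=10000)
    tfidf = vectorizer.fit_transform(bodies)  # documents x terms, sparse

    # Top 10 scoring terms for the first message, like the Pig macro's output
    terms = np.array(vectorizer.get_feature_names_out())  # get_feature_names() on older scikit-learn
    row = tfidf[0].toarray().ravel()
    print(terms[row.argsort()[::-1][:10]])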
  • 74. 4.4) Add predictions to reports © Hortonworks Inc. 2012 74
  • 75. 5) Enable new actions © Hortonworks Inc. 2012 75
  • 76. Example: Packetpig and PacketLoop snort_alerts = LOAD '$pcap'   USING com.packetloop.packetpig.loaders.pcap.detection.SnortLoader('$snortconfig'); countries = FOREACH snort_alerts   GENERATE     com.packetloop.packetpig.udf.geoip.Country(src) as country,     priority; countries = GROUP countries BY country; countries = FOREACH countries   GENERATE     group,     AVG(countries.priority) as average_severity; STORE countries into 'output/choropleth_countries' using PigStorage(','); Code here. © Hortonworks Inc. 2012 76
  • 77. Example: Packetpig and PacketLoop © Hortonworks Inc. 2012 77
  • 78. Hortonworks Data Platform • Simplify deployment to get started quickly and easily • Monitor, manage any size cluster with familiar console and tools • Only platform to include data integration services to interact with any data • Metadata services open the platform for integration with existing applications • Dependable high availability architecture • Tested at scale to future-proof your cluster growth  Reduce risks and cost of adoption  Lower the total cost to administer and provision  Integrate with your existing ecosystem © Hortonworks Inc. 2012 78
  • 79. Hortonworks Training The expert source for Apache Hadoop training & certification Role-based Developer and Administration training – Coursework built and maintained by the core Apache Hadoop development team. – The “right” course, with the most extensive and realistic hands-on materials – Provide an immersive experience into real-world Hadoop scenarios – Public and Private courses available Comprehensive Apache Hadoop Certification – Become a trusted and valuable Apache Hadoop expert © Hortonworks Inc. 2012 79
  • 80. Next Steps? 1 Download Hortonworks Data Platform hortonworks.com/download 2 Use the getting started guide hortonworks.com/get-started 3 Learn more… get support Training (hortonworks.com/training): • Expert role-based training • Courses for admins, developers and operators • Delivered by Apache Hadoop Experts/Committers • Certification program • Custom onsite options Hortonworks Support (hortonworks.com/support): • Full lifecycle technical support across four service levels • Forward-compatible © Hortonworks Inc. 2012 80
  • 81. Thank You! Questions & Answers Follow: @hortonworks and @rjurney Read: hortonworks.com/blog © Hortonworks Inc. 2012 81

Editor's Notes

  78. Hortonworks Data Platform (HDP) is the only 100% open source Apache Hadoop distribution that provides a complete and reliable foundation for enterprises that want to build, deploy and manage big data solutions. It allows you to confidently capture, process and share data in any format, at scale on commodity hardware and/or in a cloud environment. As the foundation for the next generation enterprise data architecture, HDP delivers all of the necessary components to uncover business insights from the growing streams of data flowing into and throughout your business. HDP is a fully integrated data platform that includes the stable core functions of Apache Hadoop (HDFS and MapReduce), the baseline tools to process big data (Apache Hive, Apache HBase, Apache Pig) as well as a set of advanced capabilities (Apache Ambari, Apache HCatalog and High Availability) that make big data operational and ready for the enterprise. Run through the points on the left…