The Power of Hadoop to
   Transform Business

©MapR Technologies - Confidential   1
My Background

     University, Startups
       – Aptex, MusicMatch, ID Analytics, Veoh
       – big data since before it was big

     Open source
       – even before the internet
       – Apache Hadoop, Mahout, Zookeeper, Drill
       – bought the beer at first HUG

 MapR
 Founding member of Apache Drill


MapR Technologies

     Silicon Valley Startup
       – Top investors
       – Top technical and management team
            •    Google, Microsoft, EMC, NetApp, Oracle
     Enterprise quality distribution for Hadoop
     Many extensions to basic Hadoop function
     Strong supporter of Apache Drill



Philosophy First




                              What is History?



The study of the past

(what came before now)


What is the future?

        (it comes after now)


But the future also has a past!

Do you remember the future?

Some things turned out as expected

Many things are different!

Hadoop has a history

Hadoop also has a future

The Old Future of Hadoop

     Map-reduce and HDFS
       –   more and more, but not really different


     Eco-system additions
       –   Simpler programming (Hive and Pig)
       –   Key-value store
       –   Ad hoc query


     Stands apart from other computing
       –   Required by HDFS and other limitations




The New Future of Hadoop

     Real-time processing
       –   Combines real-time and long-time


     Integration with traditional IT
       –   No need to stand apart


     Integration with new technologies
       –   Solr, Node.js, Twisted all should interface directly


     Fast and flexible computation
       –   Drill logical plan language


Example #1: Search Abuse

History matrix

                                    One row per user

                                    One column per thing




Recommendation based on cooccurrence

                                    Cooccurrence gives item-item mapping

                                    One row and column per thing
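
A minimal sketch of how an item-item cooccurrence matrix could be derived from the user-by-thing history matrix described above. The names `history` and `cooccurrence` and the sample data are illustrative assumptions, not taken from the slides. It counts raw pairwise cooccurrence (the off-diagonal part of the history matrix transposed times itself); the slides do not say whether the counts are filtered or weighted afterwards.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(history):
    """Count how often two things appear in the same user's history.

    history: dict mapping user -> set of things (one row per user).
    Returns a dict mapping thing -> {other thing -> count}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for things in history.values():
        # Every unordered pair in one user's row cooccurs once.
        for a, b in combinations(sorted(things), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return {thing: dict(row) for thing, row in counts.items()}


# Hypothetical history matrix: one row per user, one column per thing.
history = {
    "alice": {"hadoop", "mahout", "solr"},
    "bob": {"hadoop", "solr"},
    "carol": {"mahout", "drill"},
}
cooc = cooccurrence(history)
```

Here `cooc["hadoop"]["solr"]` is 2 because two users have both things in their history, which is the item-item mapping the slide refers to.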




Cooccurrence matrix can also be implemented as a search index
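
One way to picture the search-index idea, as a sketch under stated assumptions: each item becomes a document whose indicator field lists the items it cooccurs with, and a recommendation is a query made of the user's history. The functions `build_index` and `recommend` and the sample data are hypothetical, and the scoring here is a plain overlap count; a real search engine such as Solr would rank with its own relevance scoring.

```python
from collections import defaultdict


def build_index(indicators):
    """Build an inverted index: indicator term -> set of item documents.

    indicators: dict mapping item -> iterable of cooccurring items.
    """
    index = defaultdict(set)
    for item, terms in indicators.items():
        for term in terms:
            index[term].add(item)
    return index


def recommend(index, user_history, limit=3):
    """Query the index with the user's history; rank items by matching terms."""
    scores = defaultdict(int)
    for term in user_history:
        for item in index.get(term, ()):
            if item not in user_history:  # do not recommend what the user already has
                scores[item] += 1
    # Highest score first, ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))[:limit]


# Hypothetical indicator fields, e.g. one row of a cooccurrence matrix per item.
indicators = {
    "hadoop": ["solr", "mahout"],
    "solr": ["hadoop"],
    "mahout": ["hadoop", "drill"],
    "drill": ["mahout"],
}
index = build_index(indicators)
result = recommend(index, {"hadoop"})
```

The point of the design is that the recommendation step becomes an ordinary search query, so it can be served by the same search infrastructure as the rest of the site.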




[Diagram: Solr indexing. Components shown: complete history, cooccurrence (Mahout), Solr indexing (multiple Solr indexers), item meta-data, index shards]

[Diagram: Solr search. Components shown: user history, web tier, Solr search (multiple Solr indexers), item meta-data, index shards]

Objective Results

     At a very large credit card company


     History is all transactions, all web interaction


     Processing time cut from 20 hours per day to 3


     Recommendation engine load time decreased from 8 hours to 3 minutes




Example #2: Web Technology

[Diagram: components shown: real-time data, fast analysis (Storm), analytic output, raw logs]

[Diagram: components shown: large analysis (map-reduce), analytic output, raw logs]

[Diagram: components shown: browser query, presentation tier (d3 + node.js), analytic output, raw logs]

Objective Results

     Real-time + long-time analysis is seamless


     Web tier can be rooted directly on Hadoop cluster


     No need to move data




Example #3: Apache Drill

Big Data Processing – Hadoop

                        Batch processing
  Query runtime         Minutes to hours
  Data volume           TBs to PBs
  Programming model     MapReduce
  Users                 Developers
  Google project        MapReduce
  Open source project   Hadoop MapReduce

Big Data Processing – Hadoop and Storm

                        Batch processing   Stream processing
  Query runtime         Minutes to hours   Never-ending
  Data volume           TBs to PBs         Continuous stream
  Programming model     MapReduce          DAG (pre-programmed)
  Users                 Developers         Developers
  Google project        MapReduce
  Open source project   Hadoop MapReduce   Storm or Apache S4

Big Data Processing – The missing part

                        Batch processing   Interactive analysis   Stream processing
  Query runtime         Minutes to hours                          Never-ending
  Data volume           TBs to PBs                                Continuous stream
  Programming model     MapReduce                                 DAG (pre-programmed)
  Users                 Developers                                Developers
  Google project        MapReduce
  Open source project   Hadoop MapReduce                          Storm and S4

Big Data Processing – The missing part

                        Batch processing   Interactive analysis      Stream processing
  Query runtime         Minutes to hours   Milliseconds to minutes   Never-ending
  Data volume           TBs to PBs         GBs to PBs                Continuous stream
  Programming model     MapReduce          Queries (ad hoc)          DAG (pre-programmed)
  Users                 Developers         Analysts and developers   Developers
  Google project        MapReduce
  Open source project   Hadoop MapReduce                             Storm and S4

Big Data Processing

                        Batch processing   Interactive analysis      Stream processing
  Query runtime         Minutes to hours   Milliseconds to minutes   Never-ending
  Data volume           TBs to PBs         GBs to PBs                Continuous stream
  Programming model     MapReduce          Queries                   DAG
  Users                 Developers         Analysts and developers   Developers
  Google project        MapReduce          Dremel
  Open source project   Hadoop MapReduce                             Storm and S4

Big Data Processing

                        Batch processing   Interactive analysis      Stream processing
  Query runtime         Minutes to hours   Milliseconds to minutes   Never-ending
  Data volume           TBs to PBs         GBs to PBs                Continuous stream
  Programming model     MapReduce          Queries                   DAG
  Users                 Developers         Analysts and developers   Developers
  Google project        MapReduce          Dremel
  Open source project   Hadoop MapReduce   Apache Drill              Storm and S4

Design Principles

     Flexible
       • Pluggable query languages
       • Extensible execution engine
       • Pluggable data formats
           • Column-based and row-based
           • Schema and schema-less
       • Pluggable data sources

     Easy
       • Unzip and run
       • Zero configuration
       • Reverse DNS not needed
       • IP addresses can change
       • Clear and concise log messages

     Dependable
       • No SPOF
       • Instant recovery from crashes

     Fast
       • C/C++ core with Java support
           • Google C++ style guide
       • Min latency and max throughput (limited only by hardware)

Simple Architecture

     [Diagram: stages shown: Interface, Transform, Optimize, Execute. Representations shown: query language, logical language, physical plan]

Standard Interfaces

     [Diagram: same stages as the previous slide (Interface, Transform, Optimize, Execute; query language, logical language, physical plan), with the standard interfaces called out: SQL 2003, Drill logical syntax, Scanner API]

Logical Plan Syntax:

     query:[
      {
        op:"sequence", do:[
          { op: "scan",
            memo: "initial_scan",
            ref: "donuts",
            source: "local-logs",
            selection: {data: "activity"}
          },
          { op: "transform",
            transforms: [ { ref: "donuts.quantity", expr: "donuts.sales"} ]
          },
          { op: "filter",
            expr: "donuts.ppu < 1.00"
          },
        …

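To make the sequence of operators above concrete, here is a small sketch that walks a scan, transform, filter sequence over in-memory records. It is not Drill's engine or API: `run_sequence`, the `sources` dictionary, and the sample records are hypothetical, and the expression strings from the slide are replaced by Python callables because parsing the expression language is out of scope here.

```python
def run_sequence(ops, sources):
    """Apply a list of operators in order, passing records from one to the next."""
    records = []
    for op in ops:
        kind = op["op"]
        if kind == "scan":
            # Read every record from the named source (copied so the source is untouched).
            records = [dict(r) for r in sources[op["source"]]]
        elif kind == "transform":
            # Add one derived field per transform entry.
            for t in op["transforms"]:
                for r in records:
                    r[t["ref"]] = t["expr"](r)
        elif kind == "filter":
            # Keep only records for which the predicate holds.
            records = [r for r in records if op["expr"](r)]
        else:
            raise ValueError(f"unsupported op: {kind}")
    return records


# Hypothetical data standing in for the "local-logs" source.
sources = {"local-logs": [
    {"sales": 10, "ppu": 0.55},
    {"sales": 3, "ppu": 1.25},
]}
plan = [
    {"op": "scan", "ref": "donuts", "source": "local-logs"},
    {"op": "transform",
     "transforms": [{"ref": "quantity", "expr": lambda r: r["sales"]}]},
    {"op": "filter", "expr": lambda r: r["ppu"] < 1.00},
]
rows = run_sequence(plan, sources)
```

Only the record with `ppu` below 1.00 survives the filter, and it carries the derived `quantity` field copied from `sales`.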
Logical Streaming Example

     Input records:   0 1 2 3 4
     Window frames:   0, 01, 012, 123, 234

      { @id: <refnum>, op: "window-frame",
        input: <input>,
        keys: [
          <name>,...
        ],
        ref: <name>,
        before: 2,
        after: here
      }

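The window frames listed on this slide can be reproduced with a few lines of code. This is a sketch that assumes `before: 2` means "include up to two preceding records" and `after: here` means "stop at the current record"; the function name `window_frames` is made up for illustration.

```python
def window_frames(values, before):
    """Return one frame per record: the record plus up to `before` predecessors."""
    frames = []
    for i in range(len(values)):
        start = max(0, i - before)  # clip at the start of the stream
        frames.append(values[start:i + 1])
    return frames


frames = window_frames([0, 1, 2, 3, 4], before=2)
```

For the input 0 1 2 3 4 this yields the frames 0, 01, 012, 123, 234, matching the slide.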
Logical Plan

     [Diagram, top to bottom: scan-json "table-1", filter exp1, flatten, aggregate exp2]

Execution Plan

     [Diagram: node1, node2 and node3 each run scan-json "table-1", filter exp1, flatten; a single aggregate exp2 follows]

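The shape of this execution plan (the same scan, filter, flatten steps on each node, then one aggregate) can be sketched as follows. Everything here is an illustrative assumption: the record layout with a `tags` list, the `ok` predicate, and counting as the aggregate stand in for the unspecified exp1 and exp2, and the nodes are simulated with a plain dictionary.

```python
from collections import Counter


def node_pipeline(records, keep):
    """Per-node work: filter records, then flatten each 'tags' list into rows."""
    rows = []
    for record in records:
        if keep(record):
            rows.extend(record["tags"])
    return rows


def aggregate(per_node_rows):
    """Single final step: combine the rows produced by every node."""
    totals = Counter()
    for rows in per_node_rows:
        totals.update(rows)
    return dict(totals)


# Hypothetical partitions of "table-1", one per node.
partitions = {
    "node1": [{"ok": True, "tags": ["a", "b"]}],
    "node2": [{"ok": False, "tags": ["a"]}, {"ok": True, "tags": ["b"]}],
    "node3": [{"ok": True, "tags": ["a", "c"]}],
}
per_node = [node_pipeline(rows, lambda r: r["ok"]) for rows in partitions.values()]
result = aggregate(per_node)
```

The per-node steps need no coordination, so only the final aggregate has to see data from all three nodes.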
Representing a DAG

     [Diagram: the aggregate operator (exp2), labeled 19, with its input labeled 18]

     { @id: 19, op: "aggregate",
       input: 18,
       type: <simple|running|repeat>,
       keys: [<name>,...],
       aggregations: [
         {ref: <name>, expr: <aggexpr> },...
       ]
     }

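Because each operator names its input by `@id`, a flat list of operators is enough to describe a DAG. The sketch below resolves such a list into an order in which every operator comes after its inputs. The function `execution_order`, the operators with ids 17 and 18, and the handling of `input` as either one id or a list of ids are assumptions for illustration; there is no cycle detection.

```python
def execution_order(ops):
    """Order operator ids so that each one appears after the ids it reads from."""
    by_id = {op["@id"]: op for op in ops}
    ordered, visited = [], set()

    def visit(op_id):
        if op_id in visited:
            return
        visited.add(op_id)
        inputs = by_id[op_id].get("input", [])
        if not isinstance(inputs, list):
            inputs = [inputs]  # a single upstream operator
        for upstream in inputs:
            visit(upstream)
        ordered.append(op_id)  # emitted only after all of its inputs

    for op_id in by_id:
        visit(op_id)
    return ordered


# Hypothetical plan fragment; only operator 19 appears on the slide.
ops = [
    {"@id": 19, "op": "aggregate", "input": 18},
    {"@id": 18, "op": "flatten", "input": 17},
    {"@id": 17, "op": "scan-json"},
]
order = execution_order(ops)
```

The operators can be listed in any order in the plan; the references alone determine that 17 runs before 18 and 18 before 19.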
Non-SQL queries

     [Diagram: components shown: two scan-json "table-1" operators, streaming k-means, ball k-means (parameter k), aggregate exp2, k-means join, cluster features]

The future is not what we thought it would be

It is better!



Get Involved!

     Tweet: #hcj13w #mapr @ted_dunning

Get Involved!

     Download these slides
       –   http://www.mapr.com/company/events/hcj-01-21-2013

     Join the Drill project
       – drill-dev-subscribe@incubator.apache.org
       – #apachedrill


     Contact me:
       – tdunning@maprtech.com
       – tdunning@apache.org
       – @ted_dunning


     Join MapR (in Japan!)
       –   jobs@mapr.com


©MapR Technologies - Confidential           56

Weitere ähnliche Inhalte

Ähnlich wie Hcj 2013-01-21

predictive-analytics-san-diego-2013-02-21
predictive-analytics-san-diego-2013-02-21predictive-analytics-san-diego-2013-02-21
predictive-analytics-san-diego-2013-02-21Ted Dunning
 
The power of hadoop in business
The power of hadoop in businessThe power of hadoop in business
The power of hadoop in businessMapR Technologies
 
Predictive Analytics San Diego
Predictive Analytics San DiegoPredictive Analytics San Diego
Predictive Analytics San DiegoMapR Technologies
 
How to Make Hadoop Easy, Dependable and Fast
How to Make Hadoop Easy, Dependable and FastHow to Make Hadoop Easy, Dependable and Fast
How to Make Hadoop Easy, Dependable and FastMapR Technologies
 
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...Codemotion
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to MahoutTed Dunning
 
Introduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGIntroduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGMapR Technologies
 
Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...DataWorks Summit
 
What is the past future tense of data?
What is the past future tense of data?What is the past future tense of data?
What is the past future tense of data?Ted Dunning
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...DataWorks Summit/Hadoop Summit
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Mathieu Dumoulin
 
DataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven OrganizationsDataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven OrganizationsEllen Friedman
 
Super-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapRSuper-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapRData Science London
 
Predictive Maintenance Using Recurrent Neural Networks
Predictive Maintenance Using Recurrent Neural NetworksPredictive Maintenance Using Recurrent Neural Networks
Predictive Maintenance Using Recurrent Neural NetworksJustin Brandenburg
 
London data science
London data scienceLondon data science
London data scienceTed Dunning
 
Hadoop World 2011: Big Data Architecture: Integrating Hadoop with Other Enter...
Hadoop World 2011: Big Data Architecture: Integrating Hadoop with Other Enter...Hadoop World 2011: Big Data Architecture: Integrating Hadoop with Other Enter...
Hadoop World 2011: Big Data Architecture: Integrating Hadoop with Other Enter...Cloudera, Inc.
 
CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016Mathieu Dumoulin
 
2012 Workshop, Introduction to LiDAR Workshop, Bruce Adey and Mark Stucky (Me...
2012 Workshop, Introduction to LiDAR Workshop, Bruce Adey and Mark Stucky (Me...2012 Workshop, Introduction to LiDAR Workshop, Bruce Adey and Mark Stucky (Me...
2012 Workshop, Introduction to LiDAR Workshop, Bruce Adey and Mark Stucky (Me...GIS in the Rockies
 

Ähnlich wie Hcj 2013-01-21 (20)

predictive-analytics-san-diego-2013-02-21
predictive-analytics-san-diego-2013-02-21predictive-analytics-san-diego-2013-02-21
predictive-analytics-san-diego-2013-02-21
 
The power of hadoop in business
The power of hadoop in businessThe power of hadoop in business
The power of hadoop in business
 
Predictive Analytics San Diego
Predictive Analytics San DiegoPredictive Analytics San Diego
Predictive Analytics San Diego
 
How to Make Hadoop Easy, Dependable and Fast
How to Make Hadoop Easy, Dependable and FastHow to Make Hadoop Easy, Dependable and Fast
How to Make Hadoop Easy, Dependable and Fast
 
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
Anomaly Detection in Telecom with Spark - Tugdual Grall - Codemotion Amsterda...
 
Introduction to map reduce
Introduction to map reduceIntroduction to map reduce
Introduction to map reduce
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to Mahout
 
Introduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUGIntroduction to Mahout given at Twin Cities HUG
Introduction to Mahout given at Twin Cities HUG
 
Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...
 
What is the past future tense of data?
What is the past future tense of data?What is the past future tense of data?
What is the past future tense of data?
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
 
DataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven OrganizationsDataOps: An Agile Method for Data-Driven Organizations
DataOps: An Agile Method for Data-Driven Organizations
 
Super-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapRSuper-Fast Clustering Report in MapR
Super-Fast Clustering Report in MapR
 
Predictive Maintenance Using Recurrent Neural Networks
Predictive Maintenance Using Recurrent Neural NetworksPredictive Maintenance Using Recurrent Neural Networks
Predictive Maintenance Using Recurrent Neural Networks
 
London data science
London data scienceLondon data science
London data science
 
Hadoop World 2011: Big Data Architecture: Integrating Hadoop with Other Enter...
Hadoop World 2011: Big Data Architecture: Integrating Hadoop with Other Enter...Hadoop World 2011: Big Data Architecture: Integrating Hadoop with Other Enter...
Hadoop World 2011: Big Data Architecture: Integrating Hadoop with Other Enter...
 
CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
2012 Workshop, Introduction to LiDAR Workshop, Bruce Adey and Mark Stucky (Me...
2012 Workshop, Introduction to LiDAR Workshop, Bruce Adey and Mark Stucky (Me...2012 Workshop, Introduction to LiDAR Workshop, Bruce Adey and Mark Stucky (Me...
2012 Workshop, Introduction to LiDAR Workshop, Bruce Adey and Mark Stucky (Me...
 

Mehr von Ted Dunning

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxTed Dunning
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with KubernetesTed Dunning
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in KubernetesTed Dunning
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forTed Dunning
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningTed Dunning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning LogisticsTed Dunning
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTed Dunning
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logisticsTed Dunning
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real DataTed Dunning
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteTed Dunning
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoopTed Dunning
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Ted Dunning
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data SecurelyTed Dunning
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeTed Dunning
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownTed Dunning
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopTed Dunning
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015Ted Dunning
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossibleTed Dunning
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningTed Dunning
 

Hcj 2013-01-21

  • 1. The Power of Hadoop to Transform Business
  • 2. My Background
      • University, Startups
          – Aptex, MusicMatch, ID Analytics, Veoh
          – big data since before it was big
      • Open source
          – even before the internet
          – Apache Hadoop, Mahout, Zookeeper, Drill
          – bought the beer at first HUG
      • MapR
      • Founding member of Apache Drill
  • 3. MapR Technologies
      • Silicon Valley Startup
          – Top investors
          – Top technical and management team
              • Google, Microsoft, EMC, NetApp, Oracle
      • Enterprise quality distribution for Hadoop
      • Many extensions to basic Hadoop function
      • Strong supporter of Apache Drill
  • 4. Philosophy First: What is History?
  • 5. The study of the past (what came before now)
  • 6. What is the future? (it comes after now)
  • 10. But the future also has a past!
  • 11. Do you remember the future?
  • 17. Some things turned out as expected
  • 18. Many things are different!
  • 19. Hadoop has a history
  • 20. Hadoop also has a future
  • 21. The Old Future of Hadoop
      • Map-reduce and HDFS
          – more and more, but not really different
      • Eco-system additions
          – Simpler programming (Hive and Pig)
          – Key-value store
          – Ad hoc query
      • Stands apart from other computing
          – Required by HDFS and other limitations
  • 22. The New Future of Hadoop
      • Real-time processing
          – Combines real-time and long-time
      • Integration with traditional IT
          – No need to stand apart
      • Integration with new technologies
          – Solr, Node.js, Twisted all should interface directly
      • Fast and flexible computation
          – Drill logical plan language
  • 23. Example #1: Search Abuse
  • 24. History matrix
      • One row per user
      • One column per thing
  • 25. Recommendation based on cooccurrence
      • Cooccurrence gives item-item mapping
      • One row and column per thing
  • 26. Cooccurrence matrix can also be implemented as a search index
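The history and cooccurrence matrices on slides 24 to 26 can be sketched in a few lines. This is a minimal illustration on made-up data, not the Mahout or Solr pipeline from the talk: the names `history`, `cooccurrence`, and `recommend` are assumptions, and raw counts stand in for whatever scoring a production system would apply.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical history matrix: one row per user, one column per thing.
history = {
    "alice": {"apple", "banana", "cherry"},
    "bob": {"apple", "banana"},
    "carol": {"banana", "cherry"},
}

def cooccurrence(history):
    """Count, for every ordered pair of distinct items, the users who touched both."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in history.values():
        for a, b in permutations(sorted(items), 2):
            counts[a][b] += 1
    return counts

def recommend(user_items, counts, top_n=3):
    """Score unseen items by summed cooccurrence with the user's history."""
    scores = defaultdict(int)
    for item in user_items:
        for other, n in counts[item].items():
            if other not in user_items:
                scores[other] += n
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]

counts = cooccurrence(history)
print(recommend({"apple"}, counts))
```

Each item's row of cooccurring items behaves like a document, and the user's history behaves like a query against it, which is one way to read the remark on slide 26 that the matrix can live in a search index.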
  • 27. [Diagram] Labels: complete history, cooccurrence (Mahout), item meta-data, Solr indexing, SolR indexers, index shards
  • 28. [Diagram] Labels: user history, web tier, item meta-data, Solr search, SolR indexers, index shards
  • 29. Objective Results
      • At a very large credit card company
      • History is all transactions, all web interaction
      • Processing time cut from 20 hours per day to 3
      • Recommendation engine load time decreased from 8 hours to 3 minutes
  • 30. Example #2: Web Technology
  • 31. [Diagram] Labels: real-time data, fast analysis (Storm), raw logs, analytic output
  • 32. [Diagram] Labels: large analysis (map-reduce), raw logs, analytic output
  • 33. [Diagram] Labels: browser query, presentation tier (d3 + node.js), raw logs, analytic output
  • 34. Objective Results
      • Real-time + long-time analysis is seamless
      • Web tier can be rooted directly on Hadoop cluster
      • No need to move data
  • 35. Example #3: Apache Drill
  • 36. Big Data Processing – Hadoop (batch processing)
      – Query runtime: minutes to hours
      – Data volume: TBs to PBs
      – Programming model: MapReduce
      – Users: developers
      – Google project: MapReduce
      – Open source project: Hadoop MapReduce
  • 37. Big Data Processing – Hadoop and Storm (batch processing / stream processing)
      – Query runtime: minutes to hours / never-ending
      – Data volume: TBs to PBs / continuous stream
      – Programming model: MapReduce / DAG (pre-programmed)
      – Users: developers / developers
      – Google project: MapReduce / (blank)
      – Open source project: Hadoop MapReduce / Storm or Apache S4
  • 38. Big Data Processing – The missing part (batch processing / interactive analysis / stream processing)
      – Query runtime: minutes to hours / (blank) / never-ending
      – Data volume: TBs to PBs / (blank) / continuous stream
      – Programming model: MapReduce / (blank) / DAG (pre-programmed)
      – Users: developers / (blank) / developers
      – Google project: MapReduce / (blank) / (blank)
      – Open source project: Hadoop MapReduce / (blank) / Storm and S4
  • 39. Big Data Processing – The missing part (batch processing / interactive analysis / stream processing)
      – Query runtime: minutes to hours / milliseconds to minutes / never-ending
      – Data volume: TBs to PBs / GBs to PBs / continuous stream
      – Programming model: MapReduce / queries (ad hoc) / DAG (pre-programmed)
      – Users: developers / analysts and developers / developers
      – Google project: MapReduce / (blank) / (blank)
      – Open source project: Hadoop MapReduce / (blank) / Storm and S4
  • 40. Big Data Processing (batch processing / interactive analysis / stream processing)
      – Query runtime: minutes to hours / milliseconds to minutes / never-ending
      – Data volume: TBs to PBs / GBs to PBs / continuous stream
      – Programming model: MapReduce / queries / DAG
      – Users: developers / analysts and developers / developers
      – Google project: MapReduce / Dremel / (blank)
      – Open source project: Hadoop MapReduce / (blank) / Storm and S4
  • 41. Big Data Processing (batch processing / interactive analysis / stream processing)
      – Query runtime: minutes to hours / milliseconds to minutes / never-ending
      – Data volume: TBs to PBs / GBs to PBs / continuous stream
      – Programming model: MapReduce / queries / DAG
      – Users: developers / analysts and developers / developers
      – Google project: MapReduce / Dremel / (blank)
      – Open source project: Hadoop MapReduce / Apache Drill / Storm and S4
  • 42. Design Principles
      • Flexible
          – Pluggable query languages
          – Extensible execution engine
          – Pluggable data formats
          – Column-based and row-based
          – Schema and schema-less
          – Pluggable data sources
      • Easy
          – Unzip and run
          – Zero configuration
          – Reverse DNS not needed
          – IP addresses can change
          – Clear and concise log messages
      • Dependable
          – No SPOF
          – Instant recovery from crashes
      • Fast
          – C/C++ core with Java support
          – Google C++ style guide
          – Min latency and max throughput (limited only by hardware)
  • 43. Simple Architecture
      [Diagram] Labels: query language, interface, transform, logical language, optimize, physical plan, execute
  • 44. Standard Interfaces
      [Diagram] Labels: query language (SQL 2003), interface, transform, logical language (Drill logical syntax), scanner API, optimize, physical plan, execute
  • 45. Logical Plan Syntax:
        query:[ { op:"sequence", do:[
            { op: "scan", memo: "initial_scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} },
            { op: "transform", transforms: [ { ref: "donuts.quanity", expr: "donuts.sales"} ] },
            { op: "filter", expr: "donuts.ppu < 1.00" },
            …
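The plan fragment on slide 45 is plain nested data, so it can be held and inspected like any other structure. The sketch below is a hypothetical Python rendering of only the three operators the slide shows (the slide elides the rest); `operator_names` is an illustrative helper, not part of Drill.

```python
# Hypothetical Python rendering of the fragment on slide 45. Only the three
# operators shown there are included, because the slide elides the rest.
plan = {
    "query": [
        {
            "op": "sequence",
            "do": [
                {"op": "scan", "memo": "initial_scan", "ref": "donuts",
                 "source": "local-logs", "selection": {"data": "activity"}},
                {"op": "transform",
                 "transforms": [{"ref": "donuts.quanity",  # spelling as on the slide
                                 "expr": "donuts.sales"}]},
                {"op": "filter", "expr": "donuts.ppu < 1.00"},
            ],
        }
    ]
}

def operator_names(plan):
    """Return operator names in the order a sequence would apply them."""
    names = []
    for step in plan["query"]:
        if step["op"] == "sequence":
            # A sequence applies its children one after another.
            names.extend(child["op"] for child in step["do"])
        else:
            names.append(step["op"])
    return names

print(operator_names(plan))
```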
  • 46. Logical Streaming Example
        { @id: <refnum>, op: "window-frame", input: <input>, keys: [ <name>,... ], ref: <name>, before: 2, after: here }
        [Diagram: input stream 0 1 2 3 4; resulting window frames 0, 01, 012, 123, 234]
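The frames drawn on slide 46 (0, 01, 012, 123, 234 for the input 0 1 2 3 4) are consistent with a window that reaches back two rows and ends at the current row. A minimal sketch under that reading, with `window_frames` as a hypothetical name and grouping by `keys` left out:

```python
def window_frames(rows, before=2):
    """Yield, for each row, the frame of up to `before` earlier rows plus the row itself."""
    for i in range(len(rows)):
        start = max(0, i - before)  # clamp at the start of the stream
        yield rows[start:i + 1]

frames = list(window_frames([0, 1, 2, 3, 4], before=2))
print(frames)
```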
  • 47. Logical Plan
      [Diagram] scan-json "table-1", filter exp1, flatten, aggregate exp2
  • 48. Execution Plan
      [Diagram] node1, node2, and node3 each run scan-json "table-1", filter exp1, flatten; a single aggregate exp2 follows
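The split on slide 48, with the scan, filter, and flatten steps repeated on every node and one aggregate at the end, can be imitated on toy data. Everything in this sketch is assumed for illustration: the records, the `ok` flag standing in for `exp1`, and a count per tag standing in for `exp2`.

```python
from collections import Counter

# Hypothetical partitions of "table-1", one per node; each record holds a list of tags.
partitions = {
    "node1": [{"ok": True, "tags": ["a", "b"]}, {"ok": False, "tags": ["c"]}],
    "node2": [{"ok": True, "tags": ["a"]}],
    "node3": [{"ok": True, "tags": ["b", "b"]}],
}

def local_pipeline(records):
    """scan -> filter -> flatten, run independently on one node."""
    kept = (r for r in records if r["ok"])           # stands in for filter exp1
    return [tag for r in kept for tag in r["tags"]]  # flatten the nested lists

def aggregate(per_node_outputs):
    """Single final aggregate over every node's output (stands in for exp2)."""
    totals = Counter()
    for output in per_node_outputs:
        totals.update(output)
    return dict(totals)

result = aggregate(local_pipeline(p) for p in partitions.values())
print(result)
```

Only the flattened, filtered rows cross node boundaries here, which is the point of pushing the first three operators down to where the data sits.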
  • 49. Representing a DAG
        [Diagram: node 18 feeds node 19, aggregate exp2]
        { @id: 19, op: "aggregate", input: 18, type: <simple|running|repeat>, keys: [<name>,...], aggregations: [ {ref: <name>, expr: <aggexpr> },... ] }
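Because each operator names its input by `@id`, as on slide 49, the operators can be listed in any order and the execution order recovered afterwards. The sketch below assumes a four-operator chain; only node 19 and its input 18 come from the slide, and the other ids and the helper name are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical operator list; only node 19 and its input 18 appear on the slide.
operators = [
    {"@id": 19, "op": "aggregate", "input": 18},
    {"@id": 18, "op": "flatten", "input": 17},
    {"@id": 17, "op": "filter", "input": 16},
    {"@id": 16, "op": "scan-json", "input": None},
]

def execution_order(operators):
    """Order operator ids so that every operator runs after its input."""
    # TopologicalSorter expects a mapping from node to its predecessors.
    graph = {
        op["@id"]: ([] if op["input"] is None else [op["input"]])
        for op in operators
    }
    return list(TopologicalSorter(graph).static_order())

print(execution_order(operators))
```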
  • 50. Non-SQL queries
      [Diagram] Labels: scan-json "table-1" (two scans), streaming k-means, ball k-means, join, aggregate exp2, cluster features
  • 51. Design Principles
      • Flexible
          – Pluggable query languages
          – Extensible execution engine
          – Pluggable data formats
          – Column-based and row-based
          – Schema and schema-less
          – Pluggable data sources
      • Easy
          – Unzip and run
          – Zero configuration
          – Reverse DNS not needed
          – IP addresses can change
          – Clear and concise log messages
      • Dependable
          – No SPOF
          – Instant recovery from crashes
      • Fast
          – C/C++ core with Java support
          – Google C++ style guide
          – Min latency and max throughput (limited only by hardware)
  • 52. The future is not what we thought it would be
  • 53. It is better!
  • 54. Get Involved! Tweet: #hcj13w #mapr @ted_dunning
  • 55. Get Involved!
      • Download these slides
          – http://www.mapr.com/company/events/hcj-01-21-2013
      • Join the Drill project
          – drill-dev-subscribe@incubator.apache.org
          – #apachedrill
      • Contact me:
          – tdunning@maprtech.com
          – tdunning@apache.org
          – @ted_dunning
      • Join MapR (in Japan!)
          – jobs@mapr.com