Log everything!
Dr. Stefan Schadwinkel and Mike Lohmann




Who we are.




               Dr. Stefan Schadwinkel                            Mike Lohmann
                       Analytics                                   Architecture
Author (heise.de, Cereb.Cortex, EJN, J.Neurophysiol.)   Author (PHPMagazin, IX, heise.de)




Agenda.




 §  What we do. What we need to do. What we are doing.

 §  Requirement: Log everything!

 §  Infrastructure and technologies.

 §  We want happy business users.




 	
  




Icans GmbH




PokerStrategy.com in numbers

 §  PokerStrategy.com: education since 2005
 §  19 languages
 §  6,000,000 registered users
 §  7,600,000 requests/day
 §  2,800,000 PI/day
 §  700,000 posts/day



Topics of this talk




- How to use existing technologies and standards.   - Out of the box solution

- Scalability and simplicity of the solution        - Ready to use scripts	
  

- "Good enough" for now!

- Showing the way from requirement to solution.

- Open-source Symfony2 (Sf2) bundles for logging.

- Live demo.




What we do.




 §  We teach poker.

 §  We create web applications.

 §  We serve millions of users in different countries, respecting a multitude of market rules.

 §  We make business decisions driven by complex data analytics.




What we need to do.




 §  We need to try out other teaching topics, fast.

 §  We need to gather data from all of these "try-outs", accumulate it, and base business decisions on its analysis.

 §  We need a bigger infrastructure to gather more data.

 §  We need to hire more (good) people! :)




What we are doing.




 §  We build ECF (Education Community Framework).

 §  We (can) log everything!

 §  We (now) use Amazon S3 and Amazon EMR as a scalable storage and MapReduce solution.

 §  We hire (good) people! :)




Requirement: Log everything.




 §  "Are you mad?!"

 §  "Be more specific, please!"

 §  "But what about the user's data?!"
      	
  




Logging Tools / Technologies




   Producer:   Symfony2 application, server and databases

   Transport:  Now: RabbitMQ, Erlang consumer
               Was: Flume

   Storage:    Now: S3 storage, Hadoop via Amazon EMR
               Was: virtualized in-house Hadoop

   Analytics:  MapReduce, Hive, BI via QlikView




15.10.12
Logging Infrastructure




 [Diagram: logging infrastructure, components by stage]

   Producer:   LB, reverse proxy, app servers 1-x, databases
   Transport:  RabbitMQ, consumer
   Storage:    S3, Hadoop cluster
   Analytics:  QlikView, Graylog, Zabbix



Producer




 [Diagram: producer flow]

   /Home → Page Controller → PageHit event → PageHit event listener → Logger::log()
   Monolog logger: Processor → Formatter → Handler
   Handler output: LogMessage as JSON → local RabbitMQ → Shovel
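
The listener → logger → processor → formatter → handler chain can be sketched with Python's standard logging module as a stand-in for Monolog. This is only an illustration under assumptions: the class names, the "app" field and the in-memory list (which stands in for the local RabbitMQ) are hypothetical and not taken from the talk or from the ICANS bundles.

```python
import json
import logging


class ContextProcessor(logging.Filter):
    """Processor stand-in: enriches every record with extra context."""

    def filter(self, record):
        record.app = "demo-app"  # hypothetical field, not from the talk
        return True


class JsonFormatter(logging.Formatter):
    """Formatter stand-in: turns a record into a JSON log message."""

    def format(self, record):
        return json.dumps({
            "event": record.getMessage(),
            "level": record.levelname,
            "app": record.app,
            "route": getattr(record, "route", None),
        }, sort_keys=True)


class ListHandler(logging.Handler):
    """Handler stand-in: a real one would publish to the local RabbitMQ."""

    def __init__(self, sink):
        super().__init__()
        self.sink = sink

    def emit(self, record):
        self.sink.append(self.format(record))


def build_logger(sink):
    """Wire processor, formatter and handler onto one logger."""
    logger = logging.getLogger("pagehit-sketch")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = ListHandler(sink)
    handler.addFilter(ContextProcessor())
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def on_page_hit(logger, route):
    """Event listener stand-in: called when a controller fires a PageHit."""
    logger.info("PageHit", extra={"route": route})


published = []
on_page_hit(build_logger(published), "/Home")
print(published[0])
```

The point of the layering is that the application code only fires an event; what gets attached, how it is serialized and where it is sent can each be swapped independently.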



Producer




 §  LoggingComponent: Provides interfaces, filters and handlers

 §  LoggingBundle: Glues everything together with Symfony2

  https://github.com/ICANS/IcansLoggingComponent
  https://github.com/ICANS/IcansLoggingBundle
  




Transport – First Try




  §  Hey, if we use Hadoop, why not use Flume?

      -  Part of the Ecosystem
      -  Central config
      -  Extensible via Plugins
      -  Flexible Flow Configuration
      -  How?: Flume Nodes → Flume Sinks




Transport – First Try




  §  But, .. wait!
        -  Ecosystem? Just like Hadoop version numbers…
        -  Admins say: Central config woes!
        -  Issues: multi-master, logical vs. physical nodes, Java heap space, etc.
        -  Will my plugin run with flume-ng?
        -  Ever tried to keep your complex flow and switch reliability levels?


        Read: Our admins still hate me …




Transport – Second Try




 §  RabbitMQ vs. Flume Nodes
     -  Each app server has its own local RabbitMQ
     -  The local RabbitMQ shovels its data to a central RabbitMQ cluster
     -  Similar to the Flume Node concept
     -  Decentralized config: Producers and consumers simply connect
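
The local-broker-plus-shovel idea is essentially store-and-forward. A minimal in-memory sketch, assuming hypothetical `LocalBroker` and `CentralBroker` classes that only imitate the behaviour and do not use any RabbitMQ API:

```python
from collections import deque


class CentralBroker:
    """Stand-in for the central RabbitMQ cluster."""

    def __init__(self):
        self.messages = []
        self.reachable = True

    def publish(self, message):
        if not self.reachable:
            raise ConnectionError("central broker unreachable")
        self.messages.append(message)


class LocalBroker:
    """Stand-in for the RabbitMQ instance running on each app server."""

    def __init__(self, central):
        self.central = central
        self.buffer = deque()

    def publish(self, message):
        # The application only ever talks to its local broker.
        self.buffer.append(message)

    def shovel(self):
        """Forward buffered messages; keep them if the central side is down."""
        moved = 0
        while self.buffer:
            try:
                self.central.publish(self.buffer[0])
            except ConnectionError:
                break
            self.buffer.popleft()
            moved += 1
        return moved


central = CentralBroker()
local = LocalBroker(central)

central.reachable = False
local.publish("page hit 1")
local.publish("page hit 2")
moved_while_down = local.shovel()   # nothing is lost, nothing is moved

central.reachable = True
moved_after_recovery = local.shovel()
print(moved_while_down, moved_after_recovery, central.messages)
```

The application never blocks on the central cluster: it hands the message to the local broker and moves on, and forwarding catches up once the connection is back.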




Transport – Second Try




 §  But, .. wait! We still need Sinks.
      -  Custom crafted RabbitMQ consumers
      -  We could write them in PHP, but ..


      -  Erlang, teh awesome!
            -  Battle-hardened OTP framework.
            -  „Let it crash!“ .. and recover.
            -  Hot code change. If you want.


      Read: Runs forever.




Storage – First Try




 §  Use out-of-the-box Hadoop (Cloudera)

 §  But:
       -  Virtualized infrastructure
       -  Unknown usage patterns
       -  Must be cost-effective
       -  Major Hadoop version upgrades




Storage – Second Try




 §  Use Amazon Web Services

 §  Provides flexible virtualized infrastructure

 §  Cost-effective storage: S3

 §  Hadoop on demand: EMR




Storage – Amazon S3




 §  Erlang RabbitMQ consumer simply copies the incoming data to S3

       - Easy: exchange "hadoop" command with "s3cmd"
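
What such a consumer does can be sketched in Python (the talk's consumer is written in Erlang and its code is not shown, so everything here is an assumption): collect incoming messages, write them out as small compressed chunks, and hand each chunk to an upload function. The `upload` callable stands in for the `s3cmd` call; the key layout and chunk size are hypothetical.

```python
import gzip
import json


class ChunkingConsumer:
    """Collects messages and uploads them as small gzip-compressed chunks."""

    def __init__(self, upload, chunk_size=3):
        self.upload = upload          # callable(key, data); stand-in for s3cmd
        self.chunk_size = chunk_size
        self.pending = []
        self.chunks_written = 0

    def on_message(self, message):
        self.pending.append(message)
        if len(self.pending) >= self.chunk_size:
            self.flush()

    def flush(self):
        """Write all pending messages as one chunk, one JSON document per line."""
        if not self.pending:
            return
        body = "\n".join(json.dumps(m, sort_keys=True) for m in self.pending)
        key = "incoming/chunk-%05d.json.gz" % self.chunks_written
        self.upload(key, gzip.compress(body.encode("utf-8")))
        self.chunks_written += 1
        self.pending = []


store = {}  # in-memory stand-in for the S3 bucket
consumer = ChunkingConsumer(store.__setitem__, chunk_size=2)
for n in range(3):
    consumer.on_message({"event": "PageHit", "n": n})
consumer.flush()  # write the final, partially filled chunk
print(sorted(store))
```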




Storage – Amazon S3




 §  S3 bucket receives many small, compressed log file chunks

 §  Amazon provides S3DistCp, which does distributed data copy:

       -  Aggregate many small files into partitioned large chunks
       -  Change compression
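
The two operations named above (aggregate small files per partition, change the compression) can be illustrated locally in a few lines of Python. This is not how S3DistCp is invoked and it is not distributed; the `compact` helper, the key layout and the use of bz2 as the target codec are assumptions made for the example.

```python
import bz2
import gzip
from collections import defaultdict


def compact(chunks, group_of):
    """Merge small gzip chunks into one bz2-compressed blob per group.

    chunks:   dict mapping object key -> gzip-compressed bytes
    group_of: function mapping an object key to its group (partition)
    """
    grouped = defaultdict(list)
    for key in sorted(chunks):
        grouped[group_of(key)].append(gzip.decompress(chunks[key]))
    return {group: bz2.compress(b"".join(parts))
            for group, parts in grouped.items()}


# Hypothetical small chunks, grouped by their directory part.
small_chunks = {
    "2012/10/01/a.gz": gzip.compress(b"hit 1\n"),
    "2012/10/01/b.gz": gzip.compress(b"hit 2\n"),
    "2012/10/02/a.gz": gzip.compress(b"hit 3\n"),
}
merged = compact(small_chunks, lambda key: key.rsplit("/", 1)[0])
print(sorted(merged))
```

Fewer, larger files matter because Hadoop tends to handle a small number of big inputs far better than a large number of tiny ones.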




Analytics




 §  We want happy business users.

 §  We want to answer questions.
        -  People want answers to questions they have. Now.
        -  No, they couldn't tell you that question yesterday. If they had known, they would have already asked for the answer. Yesterday.

 §  We also want data-driven applications.
        -  Production system analysis.
        -  Fraud prevention.
        -  Recommendations.
        -  Social metrics for our users.
 	
  


Analytics




 §  Remember MapReduce.

        -  Custom Jobs.
            -  Streaming: Use your favorite.
            -  Java API: Cascading. Use your favorite: Java, Groovy, Clojure, Scala.

        -  Data Queries.
            -  Hive: similar to SQL.
            -  Pig: Data flow.
            -  Cascalog: Datalog-like QL using Clojure and Cascading.
 	
  




Analytics




 §  Cascalog is Clojure, Clojure is Lisp


  (?<- (stdout) [?person] (age ?person ?age) … (< ?age 30))

  ?<-                   Query operator
  (stdout)              Cascading output tap
  [?person]             Columns of the dataset generated by the query
  (age ?person ?age)    "Generator"
  (< ?age 30)           "Predicate"

 §  Generators and predicates: as many as you want
 §  Both can be any Clojure function
 §  Clojure can call anything that is available within a JVM
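
For readers who do not know Clojure, the same query reads roughly as follows in plain Python. This is a sketch of the semantics only, not of Cascalog or Hadoop: the `AGE` rows are made-up sample data playing the role of the generator, and the comparison is the predicate.

```python
# Made-up sample rows; in Cascalog the "age" generator would supply them.
AGE = [("alice", 28), ("bob", 33), ("chris", 40), ("david", 25)]


def people_younger_than(age_rows, limit=30):
    """Return the ?person column for every row whose ?age passes the predicate."""
    return [person for person, age in age_rows if age < limit]


for person in people_younger_than(AGE):
    print(person)  # printing plays the role of the (stdout) output tap
```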




Analytics




 §  We use Cascalog to preprocess and organize that incoming flow of log messages:




Analytics




 §  Let's run the Cascalog processing on Amazon EMR:


    ./elastic-mapreduce --create --name "Log Message Compaction" \
    --bootstrap-action s3://[BUCKET]/mapreduce/configure-daemons \
    --num-instances $NUM \
    --slave-instance-type m1.large \
    --master-instance-type m1.large \
    --jar s3://[BUCKET]/mapreduce/compaction/icans-cascalog.jar \
    --step-action TERMINATE_JOB_FLOW \
    --step-name "Cascalog" \
    --main-class icans.cascalogjobs.processing.compaction \
    --args "s3://[BUCKET]/incoming/*/*/*/","s3://[BUCKET]/icanslog","s3://[BUCKET]/icanslog-error"




Analytics




 §  After the Cascalog query we have:


   s3://[BUCKET]/icanslog/[WEBSITE]/icans.content/year=2012/month=10/day=01/part-00000.lzo

 §  Hive partitioning!
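
The `year=.../month=.../day=...` directory names follow the `key=value` convention that Hive uses for partitioned tables. A small sketch of building and reading such keys; the helper names and the "example" website value are hypothetical, and only the path layout comes from the slide.

```python
from datetime import date


def partition_prefix(website, log_type, day):
    """Build the key prefix for one website, log type and day."""
    return "icanslog/%s/%s/year=%04d/month=%02d/day=%02d/" % (
        website, log_type, day.year, day.month, day.day)


def parse_partitions(key):
    """Extract the key=value partition columns from an object key."""
    return dict(part.split("=", 1) for part in key.split("/") if "=" in part)


prefix = partition_prefix("example", "icans.content", date(2012, 10, 1))
print(prefix)
print(parse_partitions(prefix + "part-00000.lzo"))
```

Because the partition values are encoded in the path, a query that filters on year, month or day only needs to read the matching directories.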
  




Analytics




 §  Now we can access the log data within Hive:




Analytics




 §  Now we can run Hive queries on the [WEBSITE]_icanslog_content table!

 §  But we also want to store the result to S3.




Analytics




 §  Now, get the stats:




Analytics




 §  We can now simply copy the data from S3 and import it into any local analytical tool, like:

    -  Excel (It must really make business people happy…)
    -  QlikView (Anyone can be happy with it…)
    -  R (If I want an answer…)




Merci.




         ?
         Questions




Contacts.




       Dr. Stefan Schadwinkel               Mike Lohmann
  stefan.schadwinkel@icans-gmbh.com   mike.lohmann@icans-gmbh.com
            ICANS_StScha                     mikelohmann




Tools/Technologies




ICANS GmbH
Valentinskamp 18
20354 Hamburg
Germany


Phone:   +49 40 22 63 82 9-0
Fax:     +49 40 38 67 15 92


Web: www.icans-gmbh.com
	
  




