Common and Unique Use Cases for Apache Hadoop

August 30, 2011
  
Agenda

•  What is Apache Hadoop?
•  Log Processing
•  Catching `Osama’
•  Extract Transform Load (ETL)
•  Analytics in HBase
•  Machine Learning
•  Final Thoughts

Copyright 2011 Cloudera Inc. All rights reserved
  
Exploding Data Volumes

•  Online
       •  Web-ready devices
       •  Social media
       •  Digital content
       •  Smart grids

•  Enterprise
       •  Transactions
       •  R&D data
       •  Operational (control) data

[Chart labels: Complex, Unstructured; Relational]

2,500 exabytes of new information in 2012 with Internet as primary driver

Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year

Source: An IDC White Paper - sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009

                                                                                       	
  
  
Origin of Hadoop
How does an elephant sneak up on you?

[Timeline graphic, 2002 to 2010. Milestones, left to right:]
•  Open Source Web Crawler project created by Doug Cutting
•  Publishes MapReduce, GFS Paper
•  Open Source MapReduce & HDFS project created by Doug Cutting
•  Runs 4,000 Node Hadoop Cluster
•  Hadoop wins Terabyte sort benchmark
•  Launches SQL Support for Hadoop
•  Releases CDH3 and Cloudera Enterprise
  




  
What is Apache Hadoop?
Open Source Storage and Processing Engine

[Diagram: MapReduce; Hadoop Distributed File System (HDFS)]

•  Consolidates Everything
       •  Move complex and relational data into a single repository

•  Stores Inexpensively
       •  Keep raw data always available
       •  Use commodity hardware

•  Processes at the Source
       •  Eliminate ETL bottlenecks
       •  Mine data first, govern later
  


  
What is Apache Hadoop?
The Standard Way Big Data Gets Done

•  Hadoop is Flexible:
       •  Structured, unstructured
       •  Schema, no schema
       •  High volume, merely terabytes
       •  All kinds of analytic applications

•  Hadoop is Open: 100% Apache-licensed open source

•  Hadoop is Scalable: Proven at petabyte scale

•  Benefits:
       •  Controls costs by storing data more affordably per terabyte than any other platform
       •  Drives revenue by extracting value from data that was previously out of reach
  


  
What is Apache Hadoop?
The Importance of Being Open

No Lock-In - Investments in skills, services & hardware are preserved regardless of vendor choice

Community Development - Hadoop & related projects are expanding at a rapid pace

Rich Ecosystem - Dozens of complementary software, hardware and services firms
  	
  

  
Log Processing
A Perfect Fit

•  Common uses of logs
       •  Find or count events (grep)
              grep "ERROR" file
              grep -c "ERROR" file

       •  Calculate metrics (performance or user behavior analysis)
              awk '{sums[$1]+=$2; counts[$1]+=1} END {for(k in counts) {print sums[k]/counts[k]}}'

       •  Investigate user sessions
              grep "USER" files … | sort | less
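
The awk one-liner above computes a per-key average: column 1 is the key and column 2 is the value. As a rough illustration, the same logic might look like this in Python. The sample log lines (a URL path and a response time) are made up for the example and are not from the deck.

```python
from collections import defaultdict


def average_by_key(lines):
    """Mean of column 2 grouped by column 1, mirroring the awk one-liner."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            # Assumption: malformed lines are skipped (awk would count them as 0).
            continue
        key, value = fields[0], float(fields[1])
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in counts}


# Hypothetical sample data: "<path> <response time in ms>"
sample_log = [
    "/index.html 120",
    "/index.html 80",
    "/search 300",
]
averages = average_by_key(sample_log)
print(averages)
```

Like the awk version, this runs in a single process over a single stream, which is exactly the limitation the next slides address.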
  
Log Processing
A Perfect Fit

•  Shoot…too much data
       •  Homegrown parallel processing is often done on a per-file basis, because it's easy
              •  No parallelism on a single large file

[Diagram: Task 0 reads one access_log; Task 1 and Task 2 each read a separate access_log]
  
Log Processing
A Perfect Fit

•  MapReduce to the rescue!
       •  Processing is done per unit of data

[Diagram: access_log divided into four units, one task per unit]

       Task 0        Task 1         Task 2          Task 3
       0-64MB        64-128MB       128-192MB       192-256MB

Each task is responsible for a unit of data
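
To make "processing is done per unit of data" concrete, here is a minimal sketch of how a 256 MB file could be cut into 64 MB byte ranges, one per task, as in the diagram. The function name `split_ranges` is hypothetical, and the sketch ignores details a real input format has to handle, such as records that straddle a range boundary.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the unit size shown on the slide


def split_ranges(file_size, block_size=BLOCK_SIZE):
    """Return (task_id, start_byte, end_byte) ranges that cover the file."""
    ranges = []
    start = 0
    task_id = 0
    while start < file_size:
        end = min(start + block_size, file_size)  # last range may be shorter
        ranges.append((task_id, start, end))
        start = end
        task_id += 1
    return ranges


splits = split_ranges(256 * 1024 * 1024)
for task_id, start, end in splits:
    print(f"Task {task_id}: {start // 2**20}-{end // 2**20}MB")
```

Because each range is independent, a single large file no longer limits parallelism to one task.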
  
Log Processing
A Perfect Fit

•  Network or disk are bottlenecks
       •  Reading 100GB of data
              •  14 minutes with 1GbE network connection
              •  22 minutes on standard disk drive

[Diagram: grep reads access_log over a link labeled "Bandwidth is limited"]
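
The two timings can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes decimal units, a 1GbE link that moves at most 125 MB/second before protocol overhead, and the 75 MB/second per-drive figure the deck quotes for cluster nodes; none of these assumptions is stated on this slide.

```python
def minutes_to_read(size_gb, throughput_mb_per_s):
    """Minutes needed to read size_gb at a sustained rate (decimal units)."""
    seconds = (size_gb * 1000) / throughput_mb_per_s
    return seconds / 60


# Assumption: 1 GbE = 1000 Mbit/s = 125 MB/s, ignoring protocol overhead.
network_minutes = minutes_to_read(100, 125)
# Assumption: a single disk sustains about 75 MB/s.
disk_minutes = minutes_to_read(100, 75)

print(f"network: {network_minutes:.1f} min, disk: {disk_minutes:.1f} min")
```

The network figure comes out slightly under the slide's 14 minutes, which is consistent with some overhead on a real link.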
  
Log Processing
A Perfect Fit

•  Hadoop to the rescue!
       •  Eliminates network bottleneck, data is on local disk
       •  Data is read from many, many disks in parallel

[Diagram: Physical Machines]

       NodeA          NodeX          NodeY          NodeZ
       Task 0         Task 1         Task 2         Task 3
       0-64MB         64-128MB       128-192MB      192-256MB
  
Log Processing
A Perfect Fit

•  Hadoop currently scales to 4,000 nodes
       •  Goal for next release is 10,000 nodes

•  Nodes typically have 12 hard drives

•  A single hard drive has throughput of about 75MB/second

•  12 Hard Drives * 75 MB/second * 4000 Nodes = 3.4 TB/second
       •  That's bytes, not bits

•  That's enough bandwidth to read 1PB (1000 TB) in 5 minutes
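
The aggregate figure can be checked with a few lines of arithmetic. One assumption is needed to land on 3.4: the product is 3,600,000 MB/second, which is 3.6 TB/second in decimal units, so the slide's number appears to divide by 1024 twice.

```python
DRIVES_PER_NODE = 12
MB_PER_SECOND_PER_DRIVE = 75
NODES = 4000

total_mb_per_s = DRIVES_PER_NODE * MB_PER_SECOND_PER_DRIVE * NODES

# Assumption: binary conversion (MB -> GB -> TB by 1024) matches the slide's 3.4.
tb_per_s = total_mb_per_s / 1024 / 1024

# Time to read 1 PB, taken as 1000 TB as on the slide.
minutes_for_1pb = 1000 / tb_per_s / 60

print(f"{total_mb_per_s} MB/s = {tb_per_s:.1f} TB/s, 1 PB in {minutes_for_1pb:.1f} min")
```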
  
  
Catching `Osama’
Embarrassingly Parallel

•  You have a few billion images of faces with geo-tags
       •  Tremendous storage problem
       •  Tremendous processing problem
              •  Bandwidth
              •  Coordination
  
Catching `Osama’
Embarrassingly Parallel

•  Store the images in Hadoop

•  When processing, Hadoop will read the images from local disk, thousands of local disks spread throughout the cluster

•  Use a map-only job to compare input images against the `needle’ image
  
Catching `Osama’
Embarrassingly Parallel

[Diagram: images are stored in Sequence Files and read by Map Task 0 and Map Task 1; tasks have a copy of the `needle’; the output is the faces `matching’ the needle]
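
The shape of such a map-only job can be sketched in plain Python. Everything specific here is an assumption made for illustration: the deck does not say how faces are compared, so each image is represented by a short made-up feature vector, cosine similarity stands in for a real face matcher, and `map_task` and the threshold are hypothetical.

```python
import math


def similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def map_task(records, needle, threshold=0.95):
    """Map-only step: emit (image_id, geo_tag) for records resembling the needle.

    records: iterable of (image_id, geo_tag, features) tuples.
    Each record is judged on its own, so no reduce phase is needed.
    """
    for image_id, geo_tag, features in records:
        if similarity(features, needle) >= threshold:
            yield image_id, geo_tag


# Hypothetical data: every task receives its own split plus a copy of the needle.
needle = [0.9, 0.1, 0.4]
split_0 = [("img-001", "44.98,-93.27", [0.9, 0.1, 0.4]),
           ("img-002", "40.71,-74.01", [0.1, 0.9, 0.2])]
split_1 = [("img-003", "51.51,-0.13", [0.88, 0.12, 0.41])]

matches = [hit for split in (split_0, split_1) for hit in map_task(split, needle)]
print(matches)
```

The tasks never talk to each other, which is what makes the problem embarrassingly parallel.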
  
  
Extract Transform Load (ETL)
Everyone is doing it

•  One of the most common use cases I see is replacing ETL processes

•  Hadoop is a huge sink of cheap storage and processing

•  Aggregates are built in Hadoop and exported

•  Apache Hive provides SQL-like querying on raw data
  
Extract Transform Load (ETL)
Everyone is doing it

[Diagram: the `Real’ Time System (Website), backed by an Online DB, feeds the Data Warehouse (Analytical DB and Business Intelligence Applications) through ETL. The ETL step is annotated "Much blood shed, here".]
  
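Extract Transform Load (ETL)
Everyone is doing it

[Diagram: the `Real’ Time System (Website), backed by an Online DB, and the Data Warehouse (Analytical DB and Business Intelligence Applications), with Hadoop in between. Data is imported from the Online DB into Hadoop and exported from Hadoop to the Analytical DB.]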
  
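Extract Transform Load (ETL)
Everyone is doing it

[Diagram: the `Real’ Time System (Website), backed by an Online DB, and the Data Warehouse (Analytical DB and Business Intelligence Applications), with Hadoop in between. Apache Sqoop sits between the Online DB and Hadoop, and again between Hadoop and the Analytical DB.]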
  
  
Analytics in HBase
Scaling writes

•  Analytics is often simply counting things

•  Facebook chose HBase to store its massive counter infrastructure (more later)

•  How might one implement a counter infrastructure in HBase?
  
Analytics in HBase
Scaling writes

`Like’ button IMG request sends HTTP request to Facebook servers, which increments several counters

User & Content Type Counters
       User            Content     Counter
       brock@me.com    NEWS        5431
       brock@me.com    TECH        79310
       brock@me.com    SHOPPING    59
       tom@him.com     SPORTS      94214

Individual Page Counters
       URL                         Counter
       com.cloudera/blog/…         154
       com.cloudera/downloads/…    923621
       com.cloudera/resources/…    2138
  
Analytics in HBase
Scaling writes

Host is reversed in URL as part of the key

Individual Page Counters
       URL                         Counter
       com.cloudera/blog/…         154
       com.cloudera/downloads/…    923621
       com.cloudera/resources/…    2138

•  Data is physically stored in sorted order

•  Scanning all `com.cloudera’ counters results in sequential I/O
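
A small sketch shows why reversing the host helps: once keys are sorted, every page of one domain sits in one contiguous run, so a prefix scan touches neighbouring entries only. A sorted Python list stands in for HBase's sorted storage here, and the helper names and example URLs are hypothetical (the slide elides the real paths).

```python
from bisect import bisect_left
from urllib.parse import urlsplit


def row_key(url):
    """Build a key with the host labels reversed, e.g. com.cloudera/blog/post."""
    parts = urlsplit(url)
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    return reversed_host + parts.path


def prefix_scan(sorted_keys, prefix):
    """Return the contiguous run of keys that start with prefix."""
    start = bisect_left(sorted_keys, prefix)
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]


# Hypothetical URLs, for illustration only.
urls = [
    "http://cloudera.com/downloads/cdh",
    "http://example.org/index",
    "http://cloudera.com/blog/post",
    "http://apache.org/hadoop",
]
keys = sorted(row_key(u) for u in urls)
cloudera_keys = prefix_scan(keys, "com.cloudera/")
print(cloudera_keys)
```

With unreversed hosts, subdomains of the same site would sort far apart and the same scan would jump around the key space.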
  
Facebook Analytics
Scaling writes

•  Real-time counters of URLs shared, links “liked”, impressions generated

•  20 billion events/day (200K events/sec)

•  ~30 second latency from click to count

•  Heavy use of incrementColumnValue API for consistent counters

•  Tried MySQL, Cassandra, settled on HBase

http://tiny.cloudera.com/hbase-fb-analytics
  
  
Machine Learning
Apache Mahout

Text Clustering on Google News

Machine Learning
Apache Mahout

Collaborative Filtering on Amazon

Machine Learning
Apache Mahout

Classification in GMail

Machine Learning
Apache Mahout

•  Apache Mahout implements
       •  Collaborative Filtering
       •  Classification
       •  Clustering
       •  Frequent itemset

•  More coming with the integration of MapReduce.Next
  
  
Final Thoughts
Use the right tool

•  Other use cases
       •  OpenTSDB, an open, distributed, scalable Time Series Database (TSDB)
       •  Building Search Indexes (canonical use case)
       •  Facebook Messaging
       •  Cheap and Deep Storage, e.g. archiving emails for SOX compliance
       •  Audit Logging

•  Non-Use Cases
       •  Data processing is handled by one beefy server
       •  Data requires transactions
  
About the Presenter

•  Brock Noland
•  brock@cloudera.com
•  http://twitter.com/brocknoland
•  TC-HUG http://tch.ug
  

Weitere ähnliche Inhalte

Was ist angesagt?

The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...lucenerevolution
 
Deploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopDeploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopGeorge Ang
 
Architecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataArchitecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataRichard McDougall
 
Apache Hadoop Now Next and Beyond
Apache Hadoop Now Next and BeyondApache Hadoop Now Next and Beyond
Apache Hadoop Now Next and BeyondDataWorks Summit
 
Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hortonworks
 
Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010Gavin Heavyside
 
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...yaevents
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwielerlucenerevolution
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldRichard McDougall
 
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...Krishnan Parasuraman
 
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)Eric Baldeschwieler
 
Cloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championCloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championAmeet Paranjape
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfsTrendProgContest13
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Jonathan Seidman
 
Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Hortonworks
 

Was ist angesagt? (20)

Introduction to h base
Introduction to h baseIntroduction to h base
Introduction to h base
 
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
 
Deploying Grid Services Using Hadoop
Deploying Grid Services Using HadoopDeploying Grid Services Using Hadoop
Deploying Grid Services Using Hadoop
 
10c introduction
10c introduction10c introduction
10c introduction
 
Architecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big DataArchitecting Virtualized Infrastructure for Big Data
Architecting Virtualized Infrastructure for Big Data
 
Hadoop on Virtual Machines
Hadoop on Virtual MachinesHadoop on Virtual Machines
Hadoop on Virtual Machines
 
Apache Hadoop Now Next and Beyond
Apache Hadoop Now Next and BeyondApache Hadoop Now Next and Beyond
Apache Hadoop Now Next and Beyond
 
Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)Hdp r-google charttools-webinar-3-5-2013 (2)
Hdp r-google charttools-webinar-3-5-2013 (2)
 
Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010Introduction to Hadoop - ACCU2010
Introduction to Hadoop - ACCU2010
 
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...Контроль зверей: инструменты для управления и мониторинга распределенных сист...
Контроль зверей: инструменты для управления и мониторинга распределенных сист...
 
hadoop_module6
hadoop_module6hadoop_module6
hadoop_module6
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworld
 
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
 
Treasure Data and Heroku
Treasure Data and HerokuTreasure Data and Heroku
Treasure Data and Heroku
 
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
Hadoop - Where did it come from and what's next? (Pasadena Sept 2014)
 
Cloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a championCloud Austin Meetup - Hadoop like a champion
Cloud Austin Meetup - Hadoop like a champion
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
 
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
Integrating Hadoop Into the Enterprise – Hadoop Summit 2012
 
Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012Writing Yarn Applications Hadoop Summit 2012
Writing Yarn Applications Hadoop Summit 2012
 

Andere mochten auch

Pr2reachingbuyers 1208953924235198-8(1)
Pr2reachingbuyers 1208953924235198-8(1)Pr2reachingbuyers 1208953924235198-8(1)
Pr2reachingbuyers 1208953924235198-8(1)Sadiq Nosairat
 
E24:00 SWAPeritif June 2010
E24:00 SWAPeritif June 2010E24:00 SWAPeritif June 2010
E24:00 SWAPeritif June 2010Patrizia De Luca
 
Kicking it up a notch!
Kicking it up a notch!Kicking it up a notch!
Kicking it up a notch!arusso910
 
Mediacoach Hilde Lingier, ASO De Pinte
Mediacoach Hilde Lingier, ASO De PinteMediacoach Hilde Lingier, ASO De Pinte
Mediacoach Hilde Lingier, ASO De PinteHilde Lingier
 
Social network and modern war
Social network and modern warSocial network and modern war
Social network and modern warJons Song
 
Digitaal prentenboek Antonio en Lieze project mediawijsheid-kopie
Digitaal prentenboek Antonio en Lieze project mediawijsheid-kopieDigitaal prentenboek Antonio en Lieze project mediawijsheid-kopie
Digitaal prentenboek Antonio en Lieze project mediawijsheid-kopieHilde Lingier
 
ABC van de sociale media --
ABC van de sociale media --ABC van de sociale media --
ABC van de sociale media --Hilde Lingier
 
Bib op school: mediawijsheid en nog meer...
Bib op school: mediawijsheid en nog meer...Bib op school: mediawijsheid en nog meer...
Bib op school: mediawijsheid en nog meer...Hilde Lingier
 
Межтерриториальный сетевой проект "Россия - родина моя!"
Межтерриториальный сетевой проект "Россия - родина моя!"Межтерриториальный сетевой проект "Россия - родина моя!"
Межтерриториальный сетевой проект "Россия - родина моя!"geoledi
 
Broiler Chicken Catching Procedure
Broiler Chicken Catching ProcedureBroiler Chicken Catching Procedure
Broiler Chicken Catching ProcedureSherwin Camba
 
Nutrition management Broiler
Nutrition management BroilerNutrition management Broiler
Nutrition management BroilerSherwin Camba
 
Finance project(final)
Finance project(final)Finance project(final)
Finance project(final)daemons123
 

Commonanduniqueusecases 110831113310-phpapp01

  • 1. Common and Unique Use Cases for Apache Hadoop
      August 30, 2011
  • 2. Agenda
      • What is Apache Hadoop?
      • Log Processing
      • Catching 'Osama'
      • Extract Transform Load (ETL)
      • Analytics in HBase
      • Machine Learning
      • Final Thoughts
      Copyright 2011 Cloudera Inc. All rights reserved
  • 3. Exploding Data Volumes
      • Online (complex, unstructured)
          • Web-ready devices
          • Social media
          • Digital content
          • Smart grids
      • Enterprise (relational)
          • Transactions
          • R&D data
          • Operational (control) data
      • 2,500 exabytes of new information in 2012 with Internet as primary driver
      • Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 "zettabytes" this year
      Source: An IDC White Paper, sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009
  • 4. Origin of Hadoop: How does an elephant sneak up on you?
      Timeline, 2002 to 2010 (the organization logos and the exact year of each event did not survive extraction):
      • Open Source, Web Crawler project created by Doug Cutting
      • Publishes MapReduce, GFS Paper
      • Open Source, MapReduce & HDFS project created by Doug Cutting
      • Runs 4,000 Node Hadoop Cluster
      • Hadoop wins Terabyte sort benchmark
      • Launches SQL Support for Hadoop
      • Releases CDH3 and Cloudera Enterprise
  • 5. What is Apache Hadoop? Open Source Storage and Processing Engine
      • Consolidates Everything
          • Move complex and relational data into a single repository
      • Stores Inexpensively
          • Keep raw data always available
          • Use commodity hardware
      • Processes at the Source
          • Eliminate ETL bottlenecks
          • Mine data first, govern later
      Diagram: MapReduce on top of the Hadoop Distributed File System (HDFS)
  • 6. What is Apache Hadoop? The Standard Way Big Data Gets Done
      • Hadoop is Flexible:
          • Structured, unstructured
          • Schema, no schema
          • High volume, merely terabytes
          • All kinds of analytic applications
      • Hadoop is Open: 100% Apache-licensed open source
      • Hadoop is Scalable: Proven at petabyte scale
      • Benefits:
          • Controls costs by storing data more affordably per terabyte than any other platform
          • Drives revenue by extracting value from data that was previously out of reach
  • 7. What is Apache Hadoop? The Importance of Being Open
      • No Lock-In: Investments in skills, services & hardware are preserved regardless of vendor choice
      • Community Development: Hadoop & related projects are expanding at a rapid pace
      • Rich Ecosystem: Dozens of complementary software, hardware and services firms
  • 9. Log Processing: A Perfect Fit
      • Common uses of logs
          • Find or count events (grep)
              grep "ERROR" file
              grep -c "ERROR" file
          • Calculate metrics (performance or user behavior analysis)
              awk '{sums[$1]+=$2; counts[$1]+=1} END {for(k in counts) {print sums[k]/counts[k]}}'
          • Investigate user sessions
              grep "USER" files … | sort | less
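The awk one-liner above averages a metric per key. The same calculation can be phrased as a map step followed by a reduce step, which is the shape a MapReduce job takes. The following is a minimal sketch, not the presenter's code; it assumes whitespace-separated log lines with the key in column 1 and a numeric value in column 2, and the function names and sample lines are hypothetical.

```python
from collections import defaultdict


def map_line(line):
    """Emit a (key, value) pair: column 1 is the key, column 2 the metric."""
    fields = line.split()
    return fields[0], float(fields[1])


def reduce_average(pairs):
    """Group values by key and return the mean value for each key."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for key, value in pairs:
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in counts}


# Hypothetical sample: request path followed by response time in ms.
log_lines = ["/index.html 120", "/index.html 80", "/about.html 50"]
averages = reduce_average(map_line(line) for line in log_lines)
print(averages)
```

Because each line is mapped independently, the map step can run on any slice of the log; only the reduce step needs to see all values for a given key.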
  • 10. Log Processing: A Perfect Fit
      • Shoot… too much data
      • Homegrown parallel processing is often done on a per-file basis, because it's easy
      • No parallelism on a single large file
      Diagram: Task 0, Task 1 and Task 2 each read one whole access_log file
  • 11. Log Processing: A Perfect Fit
      • MapReduce to the rescue!
      • Processing is done per unit of data
      Diagram: a single access_log divided among four tasks
          Task 0    0-64MB
          Task 1    64-128MB
          Task 2    128-192MB
          Task 3    192-256MB
      Each task is responsible for a unit of data
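The division shown on this slide can be sketched as a function that cuts a file into fixed-size byte ranges, one per task. This is an illustrative sketch only: the 64 MB unit comes from the slide, while the function name and the (offset, length) representation are assumptions, and Hadoop's real split logic also handles record boundaries.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the unit of data shown on the slide


def input_splits(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) byte ranges covering a file of file_size bytes."""
    splits = []
    offset = 0
    while offset < file_size:
        # The final range may be shorter than a full block.
        length = min(block_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


# A 256 MB access_log yields four ranges, one per task.
for task_id, (offset, length) in enumerate(input_splits(256 * 1024 * 1024)):
    print(f"Task {task_id}: bytes {offset} to {offset + length - 1}")
```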
  • 12. Log Processing: A Perfect Fit
      • Network or disk are bottlenecks
      • Reading 100GB of data
          • 14 minutes with 1GbE network connection
          • 22 minutes on standard disk drive
      Diagram: access_log feeding grep, annotated "Bandwidth is limited"
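The two read times can be checked with simple arithmetic. The sketch below assumes decimal units (100 GB = 100,000 MB), the theoretical 1 Gbit/s = 125 MB/s line rate, and the 75 MB/second per-drive throughput quoted elsewhere in the deck. The line-rate result comes out slightly under the slide's 14 minutes, which is plausible once protocol overhead is counted, though the slide does not state how its figure was derived.

```python
def minutes_to_read(total_mb, throughput_mb_per_s):
    """Minutes needed to read total_mb at a sustained throughput in MB/s."""
    return total_mb / throughput_mb_per_s / 60


DATA_MB = 100 * 1000  # 100 GB, decimal megabytes (assumption)

network_minutes = minutes_to_read(DATA_MB, 1000 / 8)  # 1 Gbit/s = 125 MB/s
disk_minutes = minutes_to_read(DATA_MB, 75)           # 75 MB/s per drive

print(f"1GbE network: {network_minutes:.1f} minutes")
print(f"Single disk:  {disk_minutes:.1f} minutes")
```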
  • 13. Log Processing: A Perfect Fit
      • Hadoop to the rescue!
      • Eliminates network bottleneck, data is on local disk
      • Data is read from many, many disks in parallel
      Diagram: physical machines, each running one task on a locally stored range
          NodeA    Task 0    0-64MB
          NodeX    Task 1    64-128MB
          NodeY    Task 2    128-192MB
          NodeZ    Task 3    192-256MB
  • 14. Log Processing: A Perfect Fit
      • Hadoop currently scales to 4,000 nodes
          • Goal for next release is 10,000 nodes
      • Nodes typically have 12 hard drives
      • A single hard drive has throughput of about 75MB/second
      • 12 Hard Drives * 75 MB/second * 4000 Nodes = 3.4 TB/second
          • That's bytes, not bits
      • That's enough bandwidth to read 1PB (1000 TB) in 5 minutes
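The aggregate figure can be reproduced as follows. This sketch assumes the slide converts megabytes to terabytes with binary units (1 TB = 1024 * 1024 MB), which is the reading that yields 3.4 rather than 3.6; the slide itself does not say which convention it uses.

```python
DRIVES_PER_NODE = 12
MB_PER_SECOND_PER_DRIVE = 75
NODES = 4000

# Every drive in the cluster reading at full speed simultaneously.
aggregate_mb_per_s = DRIVES_PER_NODE * MB_PER_SECOND_PER_DRIVE * NODES

# Assumption: binary conversion, 1 TB = 1024 * 1024 MB.
aggregate_tb_per_s = aggregate_mb_per_s / (1024 * 1024)

# Time to read 1 PB, taken as 1000 TB as on the slide.
minutes_for_1pb = 1000 / aggregate_tb_per_s / 60

print(f"{aggregate_tb_per_s:.1f} TB/second")
print(f"{minutes_for_1pb:.1f} minutes to read 1 PB")
```

This is an upper bound: it assumes perfectly balanced data and no contention, so real jobs would see less.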
  • 16. Catching 'Osama': Embarrassingly Parallel
      • You have a few billion images of faces with geo-tags
      • Tremendous storage problem
      • Tremendous processing problem
          • Bandwidth
          • Coordination
  • 17. Catching 'Osama': Embarrassingly Parallel
      • Store the images in Hadoop
      • When processing, Hadoop will read the images from local disk, thousands of local disks spread throughout the cluster
      • Use Map only job to compare input images against 'needle' image
  • 18. Catching 'Osama': Embarrassingly Parallel
      Diagram:
      • Store images in Sequence Files
      • Map Task 0, Map Task 1, …: tasks have copy of 'needle'
      • Output faces 'matching' needle
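The map-only pattern described on these slides can be sketched in a few lines: every task holds a copy of the needle, examines one record at a time, and emits only the matches, so no reduce step is needed. Everything concrete below is an assumption made for illustration: the record layout, the toy `similarity` function (a stand-in for a real face matcher), the threshold, and the sample data.

```python
def similarity(a, b):
    """Toy stand-in for a face matcher: fraction of positions where two
    equal-length feature vectors agree."""
    agreeing = sum(1 for x, y in zip(a, b) if x == y)
    return agreeing / len(a)


def map_image(record, needle, threshold=0.75):
    """Map-only step: yield (image_id, geo_tag) when the record matches the needle."""
    image_id, geo_tag, features = record
    if similarity(features, needle) >= threshold:
        yield image_id, geo_tag


# Hypothetical data: (image id, (latitude, longitude), feature vector).
needle = [1, 0, 1, 1]
records = [
    ("img-001", (44.98, -93.27), [1, 0, 1, 1]),
    ("img-002", (51.51, -0.13), [0, 1, 0, 0]),
    ("img-003", (34.05, -118.24), [1, 0, 1, 0]),
]

matches = [out for record in records for out in map_image(record, needle)]
print(matches)
```

Since each record is judged independently of every other, the work divides across tasks with no coordination between them, which is what makes the problem embarrassingly parallel.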
  • 20. Extract Transform Load (ETL): Everyone is doing it
      • One of the most common use cases I see is replacing ETL processes
      • Hadoop is a huge sink of cheap storage and processing
      • Aggregates built in Hadoop and exported
      • Apache Hive provides SQL-like querying on raw data
  • 21. Extract Transform Load (ETL): Everyone is doing it
      Diagram: 'Real' Time System (Website) with Online DB -> ETL -> Data Warehouse with Analytical DB -> Business Intelligence Applications
      Annotation: Much blood shed, here
  • 22. Extract Transform Load (ETL): Everyone is doing it
      Diagram: 'Real' Time System (Website) with Online DB -> Import -> Hadoop -> Export -> Data Warehouse with Analytical DB -> Business Intelligence Applications
  • 23. Extract Transform Load (ETL): Everyone is doing it
      Diagram: 'Real' Time System (Website) with Online DB -> Apache Sqoop -> Hadoop -> Apache Sqoop -> Data Warehouse with Analytical DB -> Business Intelligence Applications
  • 25. Analytics in HBase: Scaling writes
      • Analytics is often simply counting things
      • Facebook chose HBase to store its massive counter infrastructure (more later)
      • How might one implement a counter infrastructure in HBase?
  • 26. Analytics in HBase: Scaling writes
      • 'Like' button IMG request sends an HTTP request to Facebook servers, which increment several counters
      • User & Content Type Counters
          User            Content     Counter
          brock@me.com    NEWS        5431
          brock@me.com    TECH        79310
          brock@me.com    SHOPPING    59
          tom@him.com     SPORTS      94214
      • Individual Page Counters
          URL                          Counter
          com.cloudera/blog/…          154
          com.cloudera/downloads/…     923621
          com.cloudera/resources/…     2138
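The two counter tables can be mimicked in memory to show what a single 'Like' event does: it increments one counter per table. This is only a sketch of the data model. Plain `collections.Counter` objects stand in for HBase tables, nothing here calls the HBase API, and the function name, keys and page paths are hypothetical.

```python
from collections import Counter

# In-memory stand-ins for the two tables on the slide (not HBase).
user_content_counters = Counter()  # key: (user, content type)
page_counters = Counter()          # key: URL with the host reversed


def record_like(user, content_type, page_key, amount=1):
    """Apply one 'Like' event: bump the user/content counter and the page counter."""
    user_content_counters[(user, content_type)] += amount
    page_counters[page_key] += amount


record_like("brock@me.com", "NEWS", "com.cloudera/blog/post")
record_like("brock@me.com", "NEWS", "com.cloudera/blog/post")
record_like("tom@him.com", "SPORTS", "com.cloudera/downloads/file")

print(user_content_counters)
print(page_counters)
```

In a real deployment the increments would have to be atomic on the server side so that concurrent events are not lost; the in-memory version sidesteps that concern entirely.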
  • 27. Analytics in HBase: Scaling writes
      • Individual Page Counters (host is reversed in URL as part of the key)
          URL                          Counter
          com.cloudera/blog/…          154
          com.cloudera/downloads/…     923621
          com.cloudera/resources/…     2138
      • Data is physically stored in sorted order
      • Scanning all 'com.cloudera' counters results in sequential I/O
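The key design on this slide can be illustrated with a small helper that reverses the host labels of a URL. Because keys are stored in sorted order, all pages of one domain then sit next to each other. The helper name and the sample URLs are hypothetical, and the sketch ignores ports, query strings and other details a production key builder would need to handle.

```python
def row_key(url):
    """Build a key with the host labels reversed,
    e.g. http://cloudera.com/blog/b -> com.cloudera/blog/b."""
    without_scheme = url.split("://", 1)[-1]
    host, _, path = without_scheme.partition("/")
    reversed_host = ".".join(reversed(host.split(".")))
    return f"{reversed_host}/{path}"


# Hypothetical URLs from two different domains, deliberately interleaved.
urls = [
    "http://cloudera.com/downloads/a",
    "http://example.org/index",
    "http://cloudera.com/blog/b",
]

# Sorting mimics the physical storage order of the keys.
keys = sorted(row_key(url) for url in urls)
print(keys)
```

After sorting, both `com.cloudera` keys are adjacent, so a scan over that prefix reads one contiguous range instead of jumping around the table.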
  • 28. Facebook Analytics: Scaling writes
      • Real-time counters of URLs shared, links "liked", impressions generated
      • 20 billion events/day (200K events/sec)
      • ~30 second latency from click to count
      • Heavy use of incrementColumnValue API for consistent counters
      • Tried MySQL, Cassandra, settled on HBase
      http://tiny.cloudera.com/hbase-„-analytics
  • 30. Machine Learning: Apache Mahout
      Text Clustering on Google News
  • 31. Machine Learning: Apache Mahout
      Collaborative Filtering on Amazon
  • 32. Machine Learning: Apache Mahout
      Classification in GMail
  • 33. Machine Learning: Apache Mahout
      • Apache Mahout implements
          • Collaborative Filtering
          • Classification
          • Clustering
          • Frequent itemset
      • More coming with the integration of MapReduce.Next
  • 35. Final Thoughts: Use the right tool
      • Other use cases
          • OpenTSDB, an open, distributed, scalable Time Series Database (TSDB)
          • Building Search Indexes (canonical use case)
          • Facebook Messaging
          • Cheap and Deep Storage, e.g. archiving emails for SOX compliance
          • Audit Logging
      • Non-Use Cases
          • Data processing is handled by one beefy server
          • Data requires transactions
  • 36. About the Presenter
      • Brock Noland
      • brock@cloudera.com
      • http://twitter.com/brocknoland
      • TC-HUG http://tch.ug