Respect for the elephant - Hadoop

Aditya Sakhuja
aditya@sakhuja.us
  
                           	
  
Whoami

•  Software Engineer @ Yahoo Inc.
•  Web Search -> Cloud Platforms -> Display Ads Serving
•  http://linkedin.com/in/adityasakhuja
  
	
  




9/24/11 PyCon UK 2011
  
Agenda

•  Motivation
•  History
•  Ecosystem
•  Daemon processes / High Level View
•  Map Reduce Data Flow
•  HDFS Architecture / Replication
•  Can / Cannot
•  Getting started yourself
•  Demo
•  Companies Involved
•  Q&A
  

  
Motivation

•  ‘Traditional’ large-scale computing systems - problems
•  Desired features in an improved system
•  How Hadoop addresses them
  




  
‘Traditional’ large-scale computing systems - problems

•  Designed for CPU-intensive rather than data-intensive workloads
•  MPI, PVM, RPCs - parallel computation frameworks
•  Programming for traditional distributed systems is complex
         –  Data exchange requires synchronization
         –  Temporal dependencies are complicated
         –  It is difficult to deal with partial failures of the system
•  Data is typically stored on a SAN
•  Data is brought to the compute nodes at runtime
  

  
Desired Features in a Large-Scale Data System

•  Data driven
         –  A new, improved system should avoid data bottlenecks
•  Scalable
•  Consistent
•  Recoverable (data / processor)
•  Partial failure support
  


  
What Hadoop offers

•  Provides a high-level programming model
         –  No need to worry about locking, temporal dependencies, sockets, ...
•  And everything on the desired-features list (previous slide) :)
  
	
  



  
History

•  Hadoop is based on work done by Google in the late 1990s/early 2000s
•  Specifically, on papers describing the Google File System (GFS), published in 2003, and Map/Reduce, published in 2004
•  Hadoop MapReduce NextGeneration - 2011
         –  http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
  

  
Apache Hadoop Ecosystem

•  Hadoop Common: The common utilities that support the other Hadoop subprojects.
•  Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
•  Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.

Other Hadoop-related projects at Apache include:

•  Cassandra™: A scalable multi-master database with no single points of failure.
•  HBase™: A scalable, distributed database that supports structured data storage for large tables.
•  Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
•  Mahout™: A scalable machine learning and data mining library.
•  Pig™: A high-level data-flow language and execution framework for parallel computation.

Source: http://hadoop.apache.org/
  	
  
  
Hadoop Key Daemon Processes

•  NameNode
•  Secondary NameNode
•  DataNode
•  JobTracker
•  TaskTracker
  




  
High-level Hadoop cluster view
  




  
MapReduce Data Flow
  




  
HDFS Architecture
  




  
HDFS Replication
  




  
Map Reduce Program Components

•  MapReduce programs generally consist of three portions
         –  The Mapper
         –  The Reducer
         –  The driver code
•  Additional components:
         –  Combiner (often the same code as the Reducer)
         –  Custom Partitioner
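
How the components listed on this slide fit together can be sketched as a small in-process word count. This is only an illustration, not Hadoop's Java API: the function names (mapper, combiner, partitioner, reducer, driver), the CRC32-based partitioning and the list-of-lists input format are assumptions made for the example.

```python
from collections import defaultdict
from itertools import groupby
from zlib import crc32


def mapper(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Reducer: sum all counts seen for one key."""
    return word, sum(counts)


# Summation is associative and commutative, so the reducer can double
# as the combiner (the "often the same code" case from the slide).
combiner = reducer


def partitioner(key, num_reducers):
    """Custom partitioner (hypothetical): route a key with a stable hash."""
    return crc32(key.encode("utf-8")) % num_reducers


def group_by_key(pairs):
    """Sort pairs by key and yield (key, [values]), like shuffle/sort."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, [value for _, value in group]


def driver(splits, num_reducers=2):
    """Driver: wire the pieces together and run the whole job locally."""
    partitions = defaultdict(list)
    for split in splits:  # one simulated map task per input split
        mapped = [pair for line in split for pair in mapper(line)]
        for key, values in group_by_key(mapped):
            combined = combiner(key, values)  # local pre-aggregation
            partitions[partitioner(key, num_reducers)].append(combined)
    result = {}
    for pairs in partitions.values():  # one simulated reduce task each
        for key, values in group_by_key(pairs):
            word, total = reducer(key, values)
            result[word] = total
    return result


if __name__ == "__main__":
    print(driver([["the elephant", "respect the elephant"], ["the end"]]))
```

Because the combiner only pre-aggregates counts on the map side, removing it changes how much data crosses the simulated shuffle, not the final totals.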
  

  
Hadoop Is / Is Not

•  A high-bandwidth, high-latency system
•  Not a substitute for a DBMS, at least not on its own
•  HDFS is not yet a highly available file system: the NameNode is a single point of failure (SPOF)
•  A “shared nothing” architecture
         –  Mappers do not talk to each other, and neither do Reducers
  



  
Getting started yourself

Requirements:
•  Java SE SDK (download JDK 6 or higher)
•  Download and install
         Hadoop Common: 0.20.203.X - current stable version
         Hadoop HDFS: 0.21 - stable version
         Hadoop MapReduce: 0.21 - stable version
•  Subscribe to the mailing lists for the Hadoop subprojects, depending on your role
•  Additionally or alternatively, one can set up VMs from Cloudera / Yahoo
•  Details:
         •  http://wiki.apache.org/hadoop/GettingStartedWithHadoop
         •  http://developer.yahoo.com/hadoop/tutorial/module7.html#basic
  
                   	
  


  
Simple Demo

•  Using
         –  Pig
         –  Map/Reduce
  




  
Streaming Jobs

•  Works with any language that can read from stdin and write to stdout
•  hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
     -input myInputDirs \
     -output myOutputDir \
     -mapper myMapScript.py \
     -reducer myReduceScript.py \
     -file myMapScript.py \
     -file myReduceScript.py
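
The slide names myMapScript.py and myReduceScript.py without showing them. The following is a minimal word-count sketch of what such a pair could contain, assuming Python 3 and tab-separated key/value records; the function names map_lines and reduce_lines are hypothetical. In a real streaming job each function would live in its own script, reading sys.stdin and printing each record, and the framework would sort the mapper output by key before the reducer sees it. Here that pipeline is simulated locally with sorted().

```python
from itertools import groupby


def map_lines(lines):
    """Mapper side: turn raw text lines into 'word<TAB>1' records."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(lines):
    """Reducer side: sum the counts of adjacent records with the same key.

    This relies on the input being sorted by key, so that all records
    for one word arrive consecutively.
    """
    records = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(records, key=lambda rec: rec[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    # Local stand-in for: cat input | mapper | sort | reducer
    sample = ["Respect for the elephant\n", "the elephant is Hadoop\n"]
    for record in reduce_lines(sorted(map_lines(sample))):
        print(record)
```

Keeping the logic in generator functions that accept any iterable of lines makes the scripts easy to check on a laptop before submitting them to a cluster.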
  
	
  




  
Companies involved

•  Yahoo - 4500-node cluster (2*4 cores, 4*1 TB disk, 16 GB RAM per node) - (AdServer, Search)
•  Hortonworks, Cloudera
•  Facebook
•  A9 (Amazon Product Search)
•  eBay - 532-node cluster (8 * 532 cores, 5.3 PB)
•  Last.fm, Twitter, ...
•  Many more are listed at: http://wiki.apache.org/hadoop/PoweredBy
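
To put the cluster figures above in perspective, the per-node numbers can be multiplied out. This is a back-of-the-envelope sketch: it assumes "2*4 cores" means two quad-core CPUs per node, uses decimal units, and applies a replication factor of 3 (the HDFS default, which the slide does not state) to estimate replicated capacity. The helper name cluster_totals is made up for the example.

```python
def cluster_totals(nodes, cores_per_node, disk_tb_per_node,
                   ram_gb_per_node, replication=3):
    """Aggregate per-node hardware figures into cluster-wide totals."""
    raw_tb = nodes * disk_tb_per_node
    return {
        "cores": nodes * cores_per_node,
        "raw_disk_tb": raw_tb,
        "ram_gb": nodes * ram_gb_per_node,
        # Each block is stored `replication` times (assumed, not from the slide).
        "replicated_capacity_tb": raw_tb / replication,
    }


# Yahoo figures from the slide: 4500 nodes, 2*4 cores, 4*1 TB disk, 16 GB RAM.
yahoo = cluster_totals(nodes=4500, cores_per_node=2 * 4,
                       disk_tb_per_node=4 * 1, ram_gb_per_node=16)

# eBay figure from the slide: 8 cores on each of 532 nodes.
ebay_cores = 8 * 532

if __name__ == "__main__":
    print(yahoo)
    print(ebay_cores)
```

Under these assumptions the Yahoo cluster comes to 36,000 cores and 18,000 TB of raw disk, and the eBay cluster to 4,256 cores.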
  


  
Useful Links

•  http://wiki.apache.org/hadoop/GettingStartedWithHadoop - Getting Started
•  http://hadoop.apache.org/common/docs/current/cluster_setup.html - Cluster Setup
•  http://developer.yahoo.com/hadoop/tutorial/module4.html - MapReduce
•  http://developer.yahoo.com/hadoop/tutorial/pigtutorial.html - Pig
•  http://hadoop.apache.org/common/docs/current/api/index.html - APIs
•  http://developer.yahoo.com/hadoop/tutorial/ - YDN resource on Hadoop
  




  
Q&C

Contact Information:

Aditya Sakhuja
aditya@sakhuja.us
http://twitter.com/sakhuja
http://linkedin.com/in/adityasakhuja
  
	
  
	
  
  
