Apache Bigtop Working Group
7/14/2013
Basic Skills
Hadoop Pipelines (Roman's/Ron's Idea)
Career positioning
Basic Skills
● This is a working group: you set your own goals. Structure: do a demo in
front of the class. Focus on skills employers are looking for.
● Cluster skills using AWS: create instances with ec2-api, then extend
this using scripts or your own code. You have to demo some
skill.
– Goal: manage multiple instances. You can do this manually, but the number
of keystrokes grows quickly as you add new components. You need
some automation or code.
– Bash scripts are good because they are used in Bigtop init.d files and Roman's
code, e.g. copy the mkdir commands into a script and run them.
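
The automation goal above could be sketched in Python as follows. This is a minimal sketch, not Bigtop's or Roman's scripts: the host names, the `ec2-user` login, and the directory paths are hypothetical placeholders, and it assumes key-based `ssh` access to every instance is already set up.

```python
import shlex
import subprocess


def build_ssh_commands(hosts, remote_commands, user="ec2-user"):
    """Return one ssh argv list per host that runs all remote commands in order."""
    # Join with && so a failing command stops the remaining ones on that host.
    remote = " && ".join(remote_commands)
    return [["ssh", f"{user}@{host}", remote] for host in hosts]


def run_on_hosts(hosts, remote_commands, user="ec2-user", dry_run=True):
    """Run the commands on every host; with dry_run=True only print them."""
    results = {}
    for host, argv in zip(hosts, build_ssh_commands(hosts, remote_commands, user)):
        if dry_run:
            print(shlex.join(argv))
            results[host] = None
        else:
            # The exit code of ssh is the exit code of the remote command chain.
            results[host] = subprocess.run(argv).returncode
    return results


if __name__ == "__main__":
    hosts = ["node1.example.com", "node2.example.com"]  # hypothetical hosts
    cmds = ["mkdir -p /tmp/demo/data", "mkdir -p /tmp/demo/logs"]  # hypothetical paths
    run_on_hosts(hosts, cmds)  # dry run: prints the ssh commands only
```

Adding a host or a command is then one more list entry rather than another round of typing on every instance.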
Basic Skills
● Hadoop*: all the features of 2.0.0. No training
course can give this to you. You will have to
work through them yourself.
– Use the 2.0.X unit test code as a base
Hadoop 2.0.0
● Basic FS Review:
– Copy On Write
– Write Through/Write Back, FSCK
– Inodes/BTrees, NN/DN
Working Group
● Not a class which gives you answers. The
answers classes give you are too simple to be
valuable.
● E.g.: does YARN/Hadoop 2.0.X support
multitenancy? That means multiple users/companies can't
see each other's data and, if they run a query,
they can't crash the cluster for other users. This
isn't the case now.
Hadoop 2.0.X
● Zookeeper in HDFS requires some
administration. Do you need to roll back the
Zookeeper logs when a ZK cluster fails?
Bigtop Basic Skills
● Run Bigtop in AWS in distributed mode, start
w/HDFS
● Create Hadoop* pipelines (Roman's/Ron's idea)
– Ron: book. Great idea!!!!!
● Run mvn verify; learn to debug and write tests here
● Will take months, demo driven. People do demos.
Career positioning
● Choose where to spend time.
● Bigdata =
– 1) Devops
– 2) App development (Astyanax)
– 3) Internals
● Don't get distracted into 3), Internals. There is not enough time to do all three well. Let
Cloudera people help you.
● Do something new that people care about
– Don't try to be better than people with the same job skill
– Learn efficiently: practice, practice, practice. You can't learn by watching.
Big Company vs. Small
● Big:
– Interpolate Cloudera's strategy: Hadoop 2.0.X runs in the cloud and users access it
from the desktop via a browser. They can run Hive/Pig on YOUR data; if you need to ingest
data, e.g. with Flume, a sysadmin has to set this up. So don't spend time getting
Flume to work in Hue, but make sure you know the 2.0.x security models/LDAP,
pipeline debugging when things get stuck, failover, and application development.
– HUE != Ambari. Why?
– Value in building apps in HUE or with HUE. The approach for webapps is changing away
from HUE to something like Ambari, which is a simpler user-defined MVC pattern.
– User-defined MVC is better. Why? Think like a manager: what happens as
Django adds more complicated features?
– e.g. Jetty/J2EE example
Small
● Do everything, use Bigtop, get to a working app as
fast as possible. 1) Devops and 2) App development are very important. You have
to do things quickly.
● You decide how to spend your own time
Structure
● Schedule 3 more meetings after this one, every 2 weeks
● Individual demos
● Install Bigtop, demo WC (word count) and PI, demo components
and pipelines.
● Turn pipeline demos into integration tests
● Test on pseudo distributed mode and cluster
● Listen to Roman: Hue....
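
The step of turning a demo into a test can be sketched with the word count (WC) demo. This is only an in-process illustration under stated assumptions: it runs no Hadoop at all, the function names are hypothetical, and a real integration test would submit the job to the pseudo-distributed node or cluster and check its output there.

```python
from collections import Counter


def map_words(line):
    """Map step: emit (word, 1) for every whitespace-separated word."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines):
    """Tiny pipeline: map every line, then reduce all emitted pairs."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return reduce_counts(pairs)


def test_word_count_pipeline():
    """The demo expressed as a test: fixed input, asserted output."""
    result = word_count(["big top", "Big data"])
    assert result == {"big": 2, "top": 1, "data": 1}
```

The point of the pattern is that the demo input and the expected output are written down once, so the same check can be rerun after every change.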
HBase/Hadoop
● HBase requirements: R/S (RegionServer) 48GB, 8-12 cores/node
● Memory: M/R 1-2GB+, R/S 32GB+, OS 4-8GB, HDFS
● Disk: 25% for shuffle files for HDFS, <50% full, JBOD, no RAID
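
One way to turn the memory and disk figures above into arithmetic is sketched below. The formulas are an assumed reading of the slide, not a sizing rule from HBase or Hadoop: memory is taken as the sum of M/R tasks, RegionServer, and OS, and usable HDFS space as the raw disk minus the 25% shuffle reserve, filled to at most 50%. HDFS daemon memory is left out because the slide gives no figure for it, and the function names and the sample inputs (4 tasks, 8 TB) are hypothetical.

```python
def node_memory_budget_gb(mr_tasks, mr_gb_per_task=2, regionserver_gb=32, os_gb=8):
    """Sum the per-node memory: M/R tasks + RegionServer + OS (assumed model)."""
    return mr_tasks * mr_gb_per_task + regionserver_gb + os_gb


def usable_hdfs_tb(raw_tb, shuffle_fraction=0.25, max_fill=0.5):
    """Raw disk minus the shuffle reserve, kept below the maximum fill level."""
    return raw_tb * (1 - shuffle_fraction) * max_fill


if __name__ == "__main__":
    # Hypothetical node: 4 concurrent M/R tasks and 8 TB of raw JBOD disk.
    print("memory needed (GB):", node_memory_budget_gb(4))
    print("usable HDFS (TB):", usable_hdfs_tb(8))
```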
Starting Hadoop, M/R
● Look at the logs in /var/log/hadoop-hdfs
● Cluster ID: stored in the VERSION file under ~/cache/.../data;
change the text there. DEMO
● No connection: check ping, check core-site.xml and
/etc/hosts
● M/R/Yarn: mapred-site.xml. NOTE: M/R uses
port 8021 and so does NAMENODE. Keep this
port, run on a different server; open port 8031
● Telnet to jt:8021, turn off iptables, disable SELinux
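
The telnet check above can also be scripted. This is a minimal sketch using only the Python standard library; the helper names are hypothetical, and the host `jt` and ports 8021 and 8031 are simply the ones named on the slide.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


def check_ports(targets, timeout=3.0):
    """Map each (host, port) pair to True/False reachability."""
    return {(host, port): port_open(host, port, timeout) for host, port in targets}
```

For example, `check_ports([("jt", 8021), ("jt", 8031)])` reports both ports in one call. A `False` result does not say whether the cause is iptables, SELinux, a wrong /etc/hosts entry, or a daemon that is not running, so the log files still need to be read.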
M/R Setup
● 1 NodeManager
– WRONG_REDUCE=0
– File Input Format Counters
– Bytes Read=1180
– File Output Format Counters
– Bytes Written=97
– Job Finished in 92.72 seconds
– Estimated value of Pi is 3.14080000000000000000
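
The idea behind the Pi job whose output is shown above can be sketched in a single process: sample points in the unit square and count how many fall inside the quarter circle. This sketch uses plain pseudo-random sampling, which is a simplification and not the Hadoop example's code; the function name and the seed are hypothetical.

```python
import random


def estimate_pi(samples, seed=0):
    """Monte Carlo estimate of pi from points sampled in the unit square."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point lies inside the quarter circle
            inside += 1
    # Area ratio quarter circle / square is pi/4.
    return 4.0 * inside / samples


if __name__ == "__main__":
    print("Estimated value of Pi is", estimate_pi(100_000))
```

In a MapReduce version each map task would count its own share of the samples and a single reducer would add the counts, which is why the estimate does not change with the number of nodes while the run time can.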
M/R AWS
● 3 NodeManagers
● File Output Format Counters
● Bytes Written=97
● Job Finished in 86.762 seconds
● Estimated value of Pi is 3.14080000000000000000
Zookeeper Administration
Many options for projects
● Integration code testing when Roman gets here
in 2 weeks
● Work w/Ron or Victor on projects
● Update the wiki w/ AWS cluster setup,
automate w/whirr? + chef/puppet?
● Add HBase, Zookeeper management for
Hadoop (monit/supervisord)
