Apache Bigtop Working Group
7/14/2013
Basic Skills
Hadoop Pipelines (Roman's/Ron's Idea)
Career positioning
Basic Skills
● This is a working group: you set your own goals. Structure: do a demo
in front of the class. Focus on skills employers are looking for.
● Cluster skills using AWS: create instances with ec2-api, then
extend this using scripts or your own code. You have to demo some
skill.
– Goal: manage multiple instances. You can do this manually, but the number
of keystrokes grows quickly as you add new components. You need
some automation or code.
– Bash scripts are good because they are used in Bigtop init.d files and Roman's
code, e.g. copy the mkdir commands into a script and run them.
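The automation goal above can be sketched in a few lines. This is a minimal dry-run sketch, not Bigtop's or Roman's code: the host names, the `ec2-user` login, the key file path and the `mkdir` targets are all hypothetical placeholders. It only builds the `ssh` command lines so you can inspect them before running anything.

```python
import shlex


def build_ssh_commands(hosts, remote_commands,
                       user="ec2-user",                    # assumption: default login name
                       key_file="~/.ssh/bigtop-demo.pem"):  # hypothetical key path
    """Build one ssh argument list per host that runs all remote commands in order."""
    # Chain with && so a failed step stops the rest on that host.
    script = " && ".join(remote_commands)
    return [["ssh", "-i", key_file, f"{user}@{host}", script] for host in hosts]


if __name__ == "__main__":
    # Hypothetical instance names and directories, for illustration only.
    hosts = ["node1.example.com", "node2.example.com", "node3.example.com"]
    setup = ["sudo mkdir -p /data/1/dfs/nn", "sudo mkdir -p /data/1/dfs/dn"]
    for cmd in build_ssh_commands(hosts, setup):
        # Dry run: print the command instead of executing it.
        print(shlex.join(cmd))
```

Once the printed commands look right, each argument list could be passed to `subprocess.run` to execute it. The number of keystrokes then stays constant no matter how many instances are in the list.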
Basic Skills
● Hadoop*: learn all the features of 2.0.0. No training
course can give this to you. You will have to
work through them by hand.
– Use 2.0.X unit test code as a base
Hadoop 2.0.0
● Basic FS Review:
– Copy On Write
– Write Through/Write Back, FSCK
– Inodes/BTrees, NN/DN
Working Group
● Not a class which gives you answers. The
answers classes give you are too simple to be
valuable.
● E.g., does YARN/Hadoop 2.0.X support
multitenancy? That is, multiple users/companies can't
see each other's data, and a query run by one of them
can't crash the cluster for the other users. This
isn't the case now.
Hadoop 2.0.X
● ZooKeeper in HDFS requires some
administration. Do you need to roll back
ZooKeeper logs when a ZK cluster fails?
Bigtop Basic Skills
● Run Bigtop in AWS in distributed mode, start
w/HDFS
● Create Hadoop* pipelines (Roman's/Ron's idea)
– Ron: book. Great idea!!!!!
● Run mvn verify; learn to debug and write tests here
● Will take months, demo driven. People do demos.
Career positioning
● Choose where to spend time.
● Bigdata =
– 1) Devops
– 2) App development (Astyanax)
– 3) Internals
● Don't get distracted by 3). There is not enough time to do all three well. Let
Cloudera people help you.
● Do something new that people care about
– Don't try to be better than people w/the same job skill
– Learn efficiently: practice, practice, practice. You can't learn by watching.
Big Company vs. Small
● Big:
– Interpolate Cloudera's strategy: Hadoop 2.0.X runs in the cloud, users access it
from the desktop via a browser and can run Hive/Pig on YOUR data. If you need to ingest
data, e.g. w/Flume, a sysadmin has to set this up, so don't spend time getting
Flume to work in Hue. But make sure you know 2.0.X security models/LDAP,
pipeline debugging when things get stuck, failover, and application development.
– HUE != Ambari. Why?
– There is value in building apps in HUE or w/HUE. The approach for webapps is changing away
from HUE to something like Ambari, which is a simpler user-defined MVC pattern.
– User-defined MVC is better. Why? Think like a manager: what happens as
Django adds more complicated features?
– e.g. Jetty/J2EE example
Small
● Do everything, use Bigtop (BT), get to a working app as
fast as possible. 1) Devops and 2) app development are very
important. You have to do things quickly.
● You decide how to spend your own time
Structure
● Schedule 3 more meetings after this one, every 2 weeks
● Individual demos
● Install Bigtop, demo WordCount (WC) and Pi (PI), demo components
and pipelines.
● Turn pipeline demos into integration tests
● Test in pseudo-distributed mode and on a cluster
● Listen to Roman: Hue....
HBase/Hadoop
● HBase requirements: R/S 48GB, 8-12
cores/node
● Memory: M/R 1-2GB+, R/S 32GB+, OS 4-8GB,
HDFS
● Disk: 25% for shuffle files for HDFS, <50% full,
JBOD, no RAID
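The memory figures above can be checked with simple arithmetic. This sketch assumes one reading of the slide: that 48GB is the total RAM of a node, and that R/S, OS and M/R are carved out of it, with the remainder left for the HDFS daemons. The slide gives no number for HDFS, so that split is an assumption.

```python
def memory_headroom_gb(node_gb, region_server_gb, os_gb, mapreduce_gb):
    """Return the memory (GB) left on a node after the listed components.

    Assumption: the remainder is what is available to HDFS daemons and
    anything else on the node.
    """
    return node_gb - (region_server_gb + os_gb + mapreduce_gb)


if __name__ == "__main__":
    # Low and high ends of the ranges quoted on the slide.
    print("headroom, low-end budget :", memory_headroom_gb(48, 32, 4, 1), "GB")
    print("headroom, high-end budget:", memory_headroom_gb(48, 32, 8, 2), "GB")
```

With the slide's numbers the remainder is between 6GB and 11GB per node. Raising any figure past its "+" minimum has to come out of that margin.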
Starting Hadoop, M/R
● Look at the logs in /var/log/hadoop-hdfs
● Cluster ID: stored in the VERSION file under ~/cache/.../data;
change the text there. DEMO
● No connection: check ping, check core-site.xml,
/etc/hosts
● M/R/YARN: mapred-site.xml. NOTE: M/R uses
port 8021 and so does the NameNode. Keep this
port, run them on different servers; open port 8031
● Telnet to jt:8021, turn off iptables, disable SELinux
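The telnet check above can also be scripted, so that several hosts and ports are tested in one go. This is a minimal sketch using only Python's standard `socket` module. The helper names `port_open` and `check_ports` are made up for illustration and are not part of Hadoop or Bigtop.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


def check_ports(host, ports, timeout=2.0):
    """Map each port to whether it accepted a TCP connection on host."""
    return {port: port_open(host, port, timeout) for port in ports}
```

For the case on the slide, `check_ports("jt", [8021, 8031])` returns a dictionary with one True/False entry per port. A False here does not say why: it can mean the daemon is down, iptables is blocking the port, or the name in /etc/hosts does not resolve. The other checks on the slide are still needed to tell these apart.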
M/R Setup
● 1 node manager
– WRONG_REDUCE=0
– File Input Format Counters
– Bytes Read=1180
– File Output Format Counters
– Bytes Written=97
– Job Finished in 92.72 seconds
– Estimated value of Pi is 3.14080000000000000000
M/R AWS
● 3 node managers
– File Output Format Counters
– Bytes Written=97
– Job Finished in 86.762 seconds
– Estimated value of Pi is 3.14080000000000000000
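The two Pi runs can be compared by pulling the timing and the estimate out of the job output. This is a small sketch: the two sample strings contain only the lines quoted on the slides, and the function names are made up for illustration.

```python
import re

FINISHED_RE = re.compile(r"Job Finished in ([0-9.]+) seconds")
PI_RE = re.compile(r"Estimated value of Pi is ([0-9.]+)")


def parse_pi_job(output):
    """Return (elapsed_seconds, pi_estimate) parsed from Pi example output."""
    finished = FINISHED_RE.search(output)
    estimate = PI_RE.search(output)
    if finished is None or estimate is None:
        raise ValueError("output does not look like a finished Pi job")
    return float(finished.group(1)), float(estimate.group(1))


def speedup(baseline_seconds, new_seconds):
    """Ratio of the baseline run time to the new run time."""
    return baseline_seconds / new_seconds


# Output lines as quoted on the two slides.
ONE_NODE = """Job Finished in 92.72 seconds
Estimated value of Pi is 3.14080000000000000000"""
THREE_NODES = """Job Finished in 86.762 seconds
Estimated value of Pi is 3.14080000000000000000"""

if __name__ == "__main__":
    t1, pi1 = parse_pi_job(ONE_NODE)
    t3, pi3 = parse_pi_job(THREE_NODES)
    print(f"same estimate: {pi1 == pi3}, speedup: {speedup(t1, t3):.2f}x")
```

Both runs give the same estimate, and going from 1 to 3 node managers speeds this job up by only about 1.07x. The slides do not say why the gain is so small for this job.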
Zookeeper Administration
Many options for projects
● Integration code testing when Roman gets here
in 2 weeks
● Work w/Ron or Victor on projects
● Update the wiki w/AWS cluster setup,
automate w/Whirr? + Chef/Puppet?
● Add HBase, ZooKeeper management for
Hadoop (monit/supervisord)
