Dynamic Hadoop Clusters
Steve Loughran, Julio Guijarro
http://wiki.smartfrog.org/wiki/display/sf/Dynamic+Hadoop+Clusters
27 April 2009
Hadoop on a cluster
1 namenode, 1+ Job Tracker, many data nodes and task trackers
Cluster design
Management problems big applications
The hand-managed cluster
The locked-down cluster
How do you configure Hadoop 0.21?
cloudera.com/hadoop
Configuration in RPMs
+ Push out, rollback via kickstart.
- Extra build step, may need kickstart server
clusterssh: cssh
If all the machines start in the same initial state, they should end up in the same exit state.
CM-Managed Hadoop
Resource Manager keeps cluster live; talks to infrastructure
Persistent data store for input data and results
Configuration Management tools
CM tools are how to manage big clusters
                State Driven          Workflow
Centralized     Radia, ITIL, lcfg     Puppet
Decentralized   bcfg2, SmartFrog      Perl scripts, makefiles
SmartFrog - HPLabs' CM tool
Model the system in the SmartFrog language
TwoNodeHDFS extends OneNodeHDFS {
  localDataDir2 extends TempDirWithCleanup { }
  datanode2 extends datanode {
    dataDirectories [LAZY localDataDir2];
    dfs.datanode.https.address "https://0.0.0.0:8020";
  }
}
Inheritance, cross-referencing, templating:
- TwoNodeHDFS: extending an existing template
- localDataDir2: a temporary directory component
- datanode2: extend and override with new values, including a reference to the temporary directory
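The "extend and override, with a lazy reference" idea can be sketched in plain Java. This is only an illustration of the pattern, not SmartFrog's implementation or its exact resolution rules: the `Template` class, its `set`/`setLazy`/`resolve` methods, the `role` and `dir` attributes, and the `/hypothetical/...` paths are all assumptions made up for this example. Only the `dfs.datanode.https.address` value comes from the slide.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

class TemplateSketch {
    // Hypothetical model of a template: local attributes plus an optional parent.
    static class Template {
        private final Template parent;
        private final Map<String, Supplier<Object>> attributes = new HashMap<>();

        Template(Template parent) { this.parent = parent; }

        // Store a plain value under the given attribute name.
        Template set(String name, Object value) {
            attributes.put(name, () -> value);
            return this;
        }

        // Store a reference that is evaluated each time it is read,
        // loosely mirroring the LAZY keyword on the slide.
        Template setLazy(String name, Supplier<Object> reference) {
            attributes.put(name, reference);
            return this;
        }

        // Local attributes win; otherwise fall back to the parent template.
        Object resolve(String name) {
            Supplier<Object> local = attributes.get(name);
            if (local != null) return local.get();
            if (parent != null) return parent.resolve(name);
            throw new IllegalArgumentException("Unresolved attribute: " + name);
        }
    }

    public static void main(String[] args) {
        // Hypothetical base template and temporary-directory component.
        Template datanode = new Template(null)
                .set("role", "datanode")
                .set("dataDirectories", "/hypothetical/default");
        Template localDataDir2 = new Template(null).set("dir", "/hypothetical/tmp2");

        // Extend the base, override one value, and point at the temp dir lazily.
        Template datanode2 = new Template(datanode)
                .setLazy("dataDirectories", () -> localDataDir2.resolve("dir"))
                .set("dfs.datanode.https.address", "https://0.0.0.0:8020");

        System.out.println(datanode2.resolve("role"));            // inherited: datanode
        System.out.println(datanode2.resolve("dataDirectories")); // overridden: /hypothetical/tmp2
        System.out.println(datanode.resolve("dataDirectories"));  // base unchanged: /hypothetical/default
    }
}
```

Because the lazy reference is evaluated on read, changing `dir` on `localDataDir2` later is visible through `datanode2` without redefining it.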
The runtime deploys the model
Add better diagram
DEMO
HADOOP-3628: A lifecycle for services
Base Service class for all nodes
// Signatures only, as on the slide; method bodies are not shown.
public class Service extends Configured implements Closeable {
  public void start() throws IOException;
  public void innerPing(ServiceStatus status) throws IOException;
  public void close() throws IOException;  // must be public to implement Closeable
  State getLifecycleState();
  public enum State { UNDEFINED, CREATED, STARTED, LIVE, FAILED, CLOSED }
}
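A minimal, self-contained sketch of how such a lifecycle could be enforced, assuming a template-method design. It is not the HADOOP-3628 patch: only the state names are taken from the slide, while `LifecycleSketch`, `EchoService`, and the transition rules (start is legal only from CREATED, close is idempotent) are assumptions chosen for illustration. Unlike the slide's NameNode, where the subclass sets LIVE itself, this sketch lets the base class make the transition.

```java
import java.io.Closeable;
import java.io.IOException;

class LifecycleSketch {
    enum State { UNDEFINED, CREATED, STARTED, LIVE, FAILED, CLOSED }

    // Hypothetical base class: owns the state machine, delegates the real work.
    static abstract class Service implements Closeable {
        private State state = State.CREATED;

        synchronized State getLifecycleState() { return state; }

        // Subclasses do their startup work here.
        protected abstract void innerStart() throws IOException;

        // Subclasses release their resources here.
        protected abstract void innerClose() throws IOException;

        // Assumed rule: a service can only be started once, from CREATED.
        public final synchronized void start() throws IOException {
            if (state != State.CREATED) {
                throw new IllegalStateException("Cannot start from state " + state);
            }
            state = State.STARTED;
            try {
                innerStart();
                state = State.LIVE;
            } catch (IOException | RuntimeException e) {
                state = State.FAILED;
                throw e;
            }
        }

        // Assumed rule: close() is idempotent and always ends in CLOSED.
        @Override
        public final synchronized void close() throws IOException {
            if (state == State.CLOSED) {
                return;
            }
            try {
                innerClose();
            } finally {
                state = State.CLOSED;
            }
        }
    }

    // Hypothetical trivial service used to exercise the transitions.
    static class EchoService extends Service {
        boolean running;
        @Override protected void innerStart() { running = true; }
        @Override protected void innerClose() { running = false; }
    }

    public static void main(String[] args) throws IOException {
        EchoService service = new EchoService();
        System.out.println(service.getLifecycleState()); // CREATED
        service.start();
        System.out.println(service.getLifecycleState()); // LIVE
        service.close();
        System.out.println(service.getLifecycleState()); // CLOSED
    }
}
```

Keeping the transitions in final methods of the base class means every node type reports state the same way, which is what a management tool needs when it polls many different services.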
Subclasses implement transitions
public class NameNode extends Service
    implements ClientProtocol, NamenodeProtocol, ... {
  protected void innerStart() throws IOException {
    initialize(bindAddress, getConf());
    setServiceState(ServiceState.LIVE);
  }
  public void innerClose() throws IOException {
    if (server != null) {
      server.stop();
      server = null;
    }
    ...
  }
}
Health and Liveness: ping()
public class DataNode extends Service {
  ...
  // Report every liveness problem found, rather than stopping at the first.
  public void innerPing(ServiceStatus status) throws IOException {
    if (ipcServer == null) {
      status.addThrowable(new LivenessException("No IPC Server running"));
    }
    if (dnRegistration == null) {
      status.addThrowable(new LivenessException("Not bound to a namenode"));
    }
  }
}
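The ping pattern above collects problems into a status object instead of throwing on the first one. A runnable sketch of that idea follows; `Status`, `LivenessException`, and `Node` here are hypothetical stand-ins written for this example, not the Hadoop classes named on the slide. Only the two checks and their messages are taken from it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class PingSketch {
    // Hypothetical stand-in for the slide's ServiceStatus.
    static class Status {
        private final List<Throwable> problems = new ArrayList<>();

        void addThrowable(Throwable t) { problems.add(t); }

        boolean isHealthy() { return problems.isEmpty(); }

        List<Throwable> getProblems() { return Collections.unmodifiableList(problems); }
    }

    // Hypothetical stand-in for the slide's LivenessException.
    static class LivenessException extends Exception {
        LivenessException(String message) { super(message); }
    }

    // Hypothetical node holding the two fields the slide's ping inspects.
    static class Node {
        Object ipcServer;      // null until the IPC server is running
        Object registration;   // null until bound to a namenode

        // Run every check and record each failure in the status.
        Status ping() {
            Status status = new Status();
            if (ipcServer == null) {
                status.addThrowable(new LivenessException("No IPC Server running"));
            }
            if (registration == null) {
                status.addThrowable(new LivenessException("Not bound to a namenode"));
            }
            return status;
        }
    }

    public static void main(String[] args) {
        Node node = new Node();
        System.out.println(node.ping().getProblems().size()); // 2

        node.ipcServer = new Object();
        node.registration = new Object();
        System.out.println(node.ping().isHealthy()); // true
    }
}
```

Collecting all failures in one pass gives the caller a complete picture of a node in a single ping, which is useful when the report is aggregated across a cluster.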
Liveness issues
How unavailable should the nodes be before a cluster is "unhealthy"?
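One possible way to make that question concrete is a threshold policy: the cluster counts as healthy while at least a given percentage of the expected worker nodes are live. The sketch below is an assumption for illustration only; the slides do not prescribe a policy, and `ClusterHealthSketch`, `isHealthy`, and the 80% figure are invented for this example.

```java
class ClusterHealthSketch {
    // Hypothetical policy: healthy when liveWorkers is at least
    // minLivePercent percent of expectedWorkers.
    static boolean isHealthy(int liveWorkers, int expectedWorkers, int minLivePercent) {
        if (expectedWorkers <= 0) {
            throw new IllegalArgumentException("expectedWorkers must be positive");
        }
        if (liveWorkers < 0 || liveWorkers > expectedWorkers) {
            throw new IllegalArgumentException("liveWorkers out of range");
        }
        if (minLivePercent < 0 || minLivePercent > 100) {
            throw new IllegalArgumentException("minLivePercent must be 0..100");
        }
        // Integer arithmetic in long avoids rounding and overflow surprises.
        return (long) liveWorkers * 100 >= (long) minLivePercent * expectedWorkers;
    }

    public static void main(String[] args) {
        System.out.println(isHealthy(8, 10, 80)); // true
        System.out.println(isHealthy(7, 10, 80)); // false
    }
}
```

A single threshold is the simplest choice; a real policy might weigh the namenode and Job Tracker differently from worker nodes, since losing either of those affects the whole cluster.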
Replace hadoop-*.xml with .sf files
NameNode extends FileSystemNode {
  nameDirectories TBD;
  dataDirectories TBD;
  logDir TBD;
  dfs.http.address "http://0.0.0.0:8021";
  dfs.namenode.handler.count 10;
  dfs.namenode.decommission.interval (5 * 60);
  dfs.name.dir TBD;
  dfs.permissions.supergroup "supergroup";
  dfs.upgrade.permission "0777";
  dfs.replication 3;
  dfs.replication.interval 3;
  . . .
}
Hadoop Cluster under SmartFrog
Aggregated logs
17:39:08 [JobTracker] INFO mapred.ExtJobTracker : State change: JobTracker is now LIVE
17:39:08 [JobTracker] INFO mapred.JobTracker : Restoration complete
17:39:08 [JobTracker] INFO mapred.JobTracker : Starting interTrackerServer
17:39:08 [IPC Server Responder] INFO ipc.Server : IPC Server Responder: starting
17:39:08 [IPC Server listener on 8012] INFO ipc.Server : IPC Server listener on 8012: starting
17:39:08 [JobTracker] INFO mapred.JobTracker : Starting RUNNING
17:39:08 [Map-events fetcher for all reduce tasks on tracker_localhost:localhost/127.0.0.1:34072] INFO mapred.TaskTracker : Starting thread: Map-events fetcher for all reduce tasks on tracker_localhost:localhost/127.0.0.1:34072
17:39:08:960 GMT [INFO ][TaskTracker] HOST localhost:rootProcess:cluster - TaskTracker deployment complete: service is: tracker_localhost:localhost/127.0.0.1:34072 instance org.apache.hadoop.mapred.ExtTaskTracker@8775b3a in state STARTED; web port=50060
17:39:08 [TaskTracker] INFO mapred.ExtTaskTracker : Task Tracker Service is being offered: tracker_localhost:localhost/127.0.0.1:34072 instance org.apache.hadoop.mapred.ExtTaskTracker@8775b3a in state STARTED; web port=50060
17:39:09 [IPC Server handler 5 on 8012] INFO net.NetworkTopology : Adding a new node: /default-rack/localhost
17:39:09 [TaskTracker] INFO mapred.ExtTaskTracker : State change: TaskTracker is now LIVE
File and Job operations
DFS manipulation: DfsCreateDir, DfsDeleteDir, DfsListDir, DfsPathExists, DfsFormatFileSystem
DFS I/O: DfsCopyFileIn, DfsCopyFileOut
TestJob extends BlockingJobSubmitter {
  name "test-job";
  cluster LAZY PARENT:cluster;
  jobTracker LAZY PARENT:cluster;
  mapred.child.java.opts "-Xmx512m";
  mapred.tasktracker.map.tasks.maximum 5;
  mapred.tasktracker.reduce.tasks.maximum 1;
  mapred.map.max.attempts 1;
  mapred.reduce.max.attempts 1;
}
What does this let us do?
Status as of June 2009
Issue: Hadoop configuration
Issue: VM performance
Cluster availability is often more important than absolute performance
Issue: binding on a dynamic network
Call to action
21 June 2009
XML in SCM-managed filesystem
+ Push out, rollback.
- Need to restart cluster, SPOF?
Configuration-in-database
Configuration with LDAP
+ View, change settings; High Availability
- Not under SCM; rollback and preflight hard

Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 

Dynamic Hadoop Clusters

  • 1. Dynamic Hadoop Clusters Steve Loughran Julio Guijarro http://wiki.smartfrog.org/wiki/display/sf/Dynamic+Hadoop+Clusters
  • 3. Hadoop on a cluster 27 April 2009 1 namenode, 1+ Job Tracker, many data nodes and task trackers
  • 8. How do you configure Hadoop 0.21?
  • 11. Configuration in RPMs. + Push out, rollback via kickstart. - Extra build step, may need kickstart server
  • 12. clusterssh: cssh. If all the machines start in the same initial state, they should end up in the same exit state
  • 13. CM-Managed Hadoop. Resource Manager keeps cluster live; talks to infrastructure. Persistent data store for input data and results
  • 14. Configuration Management tools. CM tools are how to manage big clusters
        Centralized, state driven: Radia, ITIL, lcfg
        Centralized, workflow: Puppet
        Decentralized, state driven: bcfg2, SmartFrog
        Decentralized, workflow: Perl scripts, makefiles
  • 16. Model the system in the SmartFrog language: inheritance, cross-referencing, templating

        // extending an existing template
        TwoNodeHDFS extends OneNodeHDFS {

          // a temporary directory component
          localDataDir2 extends TempDirWithCleanup {
          }

          // extend and override with new values, including a
          // reference to the temporary directory
          datanode2 extends datanode {
            dataDirectories [LAZY localDataDir2];
            dfs.datanode.https.address "https://0.0.0.0:8020";
          }
        }
  • 17. The runtime deploys the model. Add better diagram
  • 19. HADOOP-3628: A lifecycle for services
  • 20. Base Service class for all nodes

        // Outline of the API as shown on the slide; method bodies are omitted.
        public class Service extends Configured implements Closeable {

          public void start() throws IOException;

          public void innerPing(ServiceStatus status) throws IOException;

          void close() throws IOException;

          State getLifecycleState();

          public enum State {
            UNDEFINED,
            CREATED,
            STARTED,
            LIVE,
            FAILED,
            CLOSED
          }
        }
  • 21. Subclasses implement transitions

        public class NameNode extends Service implements
            ClientProtocol, NamenodeProtocol, ... {

          // Called on start: bind the server, then declare the service live.
          protected void innerStart() throws IOException {
            initialize(bindAddress, getConf());
            setServiceState(ServiceState.LIVE);
          }

          // Called on close: stop the server if it was ever started.
          public void innerClose() throws IOException {
            if (server != null) {
              server.stop();
              server = null;
            }
            ...
          }
        }
  • 22. Health and Liveness: ping()

        public class DataNode extends Service {
          ...
          // Report every problem found, rather than stopping at the first.
          public void innerPing(ServiceStatus status) throws IOException {
            if (ipcServer == null) {
              status.addThrowable(
                  new LivenessException("No IPC Server running"));
            }
            if (dnRegistration == null) {
              status.addThrowable(
                  new LivenessException("Not bound to a namenode"));
            }
          }
        }
  • 24. Replace hadoop-*.xml with .sf files

        NameNode extends FileSystemNode {
          nameDirectories TBD;
          dataDirectories TBD;
          logDir TBD;
          dfs.http.address "http://0.0.0.0:8021";
          dfs.namenode.handler.count 10;
          dfs.namenode.decommission.interval (5 * 60);
          dfs.name.dir TBD;
          dfs.permissions.supergroup "supergroup";
          dfs.upgrade.permission "0777";
          dfs.replication 3;
          dfs.replication.interval 3;
          . . .
        }
  • 25. Hadoop Cluster under SmartFrog
  • 26. Aggregated logs

        17:39:08 [JobTracker] INFO mapred.ExtJobTracker : State change: JobTracker is now LIVE
        17:39:08 [JobTracker] INFO mapred.JobTracker : Restoration complete
        17:39:08 [JobTracker] INFO mapred.JobTracker : Starting interTrackerServer
        17:39:08 [IPC Server Responder] INFO ipc.Server : IPC Server Responder: starting
        17:39:08 [IPC Server listener on 8012] INFO ipc.Server : IPC Server listener on 8012: starting
        17:39:08 [JobTracker] INFO mapred.JobTracker : Starting RUNNING
        17:39:08 [Map-events fetcher for all reduce tasks on tracker_localhost:localhost/127.0.0.1:34072] INFO mapred.TaskTracker : Starting thread: Map-events fetcher for all reduce tasks on tracker_localhost:localhost/127.0.0.1:34072
        17:39:08:960 GMT [INFO ][TaskTracker] HOST localhost:rootProcess:cluster - TaskTracker deployment complete: service is: tracker_localhost:localhost/127.0.0.1:34072 instance org.apache.hadoop.mapred.ExtTaskTracker@8775b3a in state STARTED; web port=50060
        17:39:08 [TaskTracker] INFO mapred.ExtTaskTracker : Task Tracker Service is being offered: tracker_localhost:localhost/127.0.0.1:34072 instance org.apache.hadoop.mapred.ExtTaskTracker@8775b3a in state STARTED; web port=50060
        17:39:09 [IPC Server handler 5 on 8012] INFO net.NetworkTopology : Adding a new node: /default-rack/localhost
        17:39:09 [TaskTracker] INFO mapred.ExtTaskTracker : State change: TaskTracker is now LIVE
  • 27. File and Job operations
        DFS manipulation: DfsCreateDir, DfsDeleteDir, DfsListDir, DfsPathExists, DfsFormatFileSystem
        DFS I/O: DfsCopyFileIn, DfsCopyFileOut

        TestJob extends BlockingJobSubmitter {
          name "test-job";
          cluster LAZY PARENT:cluster;
          jobTracker LAZY PARENT:cluster;
          mapred.child.java.opts "-Xmx512m";
          mapred.tasktracker.map.tasks.maximum 5;
          mapred.tasktracker.reduce.tasks.maximum 1;
          mapred.map.max.attempts 1;
          mapred.reduce.max.attempts 1;
        }
  • 35. XML in SCM-managed filesystem. + Push out, rollback. - Need to restart cluster, SPOF?
  • 37. Configuration with LDAP. + View, change settings; High Availability. - Not under SCM; rollback and preflight hard
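
Slides 19 to 22 outline a service lifecycle with a ping() that collects problems rather than failing on the first one, and the notes on those slides suggest leaving the "is this fatal?" decision to the caller. The following is a minimal, self-contained sketch of that idea. It is not the HADOOP-3628 code: the names `LifecycleSketch`, `SketchService`, `Status`, `goLive`, `fail` and `decide`, and the three policy outcomes, are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in for the proposed lifecycle; not the HADOOP-3628 classes. */
public class LifecycleSketch {

    /** Same state names as the slide's enum. */
    enum State { UNDEFINED, CREATED, STARTED, LIVE, FAILED, CLOSED }

    /** Collects every problem found during a ping instead of stopping at the first. */
    static class Status {
        final List<Throwable> problems = new ArrayList<Throwable>();
        void addThrowable(Throwable t) { problems.add(t); }
        boolean isHealthy() { return problems.isEmpty(); }
    }

    /** Hypothetical service with one illustrative health condition. */
    static class SketchService {
        private State state = State.CREATED;
        private Throwable failureCause;   // retained for management tools
        boolean boundToNamenode = false;  // assumption: stands in for a real check

        State getState() { return state; }
        Throwable getFailureCause() { return failureCause; }

        void start() {
            if (state != State.CREATED) {
                throw new IllegalStateException("cannot start from " + state);
            }
            state = State.STARTED;
        }

        void goLive() {
            if (state != State.STARTED) {
                throw new IllegalStateException("cannot go live from " + state);
            }
            state = State.LIVE;
        }

        void fail(Throwable cause) {
            failureCause = cause;
            state = State.FAILED;
        }

        void close() { state = State.CLOSED; }

        /** Report problems; do not declare failure in-node. */
        Status ping() {
            Status status = new Status();
            if (state != State.STARTED && state != State.LIVE) {
                status.addThrowable(new IllegalStateException("service is " + state));
            }
            if (!boundToNamenode) {
                status.addThrowable(new IllegalStateException("Not bound to a namenode"));
            }
            return status;
        }
    }

    /** Caller-side policy: the timeout is site-specific, so it is a parameter. */
    static String decide(Status status, long secondsSinceStart, long timeoutSeconds) {
        if (status.isHealthy()) {
            return "OK";
        }
        return secondsSinceStart > timeoutSeconds ? "RESTART" : "WAIT";
    }

    public static void main(String[] args) {
        SketchService node = new SketchService();
        node.start();
        Status status = node.ping();
        for (Throwable t : status.problems) {
            System.out.println("problem: " + t.getMessage());
        }
        // One minute into a five minute limit: the policy chooses to keep waiting.
        System.out.println(decide(status, 60, 300));
        node.boundToNamenode = true;
        node.goLive();
        System.out.println(decide(node.ping(), 60, 300));
    }
}
```

Keeping `decide` outside the service is the point of the design: one site can pass a 300 second limit and restart a VM, another can pass a much longer one and page a person, without the node itself changing.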

Editor's notes

  1. Some people say "when your data is in the cloud, you don't need to know where it is". We say, if you have the pager that goes off when the data isn't reachable, you do, or you need to know the phone number of someone else who does.
  2. This is what we are talking about: Hadoop on a cluster. Most of the stuff is disposable hardware, but not your namenode. The job tracker is also something you care about. As for the rest: task trackers can come and go (assuming they have no side effects), but datanodes can only be disposed of if their data is replicated. The rate of failure will matter a lot.
  3. Name nodes are a point of failure. Look after them. Data nodes have to be viewed as transient. They come down, they come back up. If they aren't stable enough, you have problems, as you won't ever get work done. Some organisations track hard disks. They matter more than CPUs, when you think about it. Keep track of every disk's history, its batch number, the number of power cycles of every machine it is in, etc. Run the logs through Hadoop later to look for correlations. Trusted machines: until Hadoop has real security, you have to trust everything on the net that can talk to the machines. Either bring them all up on a VPN (performance!) or a separate physical network.
  4. We regularly see this discussed on the list -the problem of keeping /etc/hosts synchronised, pushing out site stuff. Don’t do it by hand. Do it once to know what the files are, then stop and look for a better way.
  5. This is how Yahoo! appear to run the largest Hadoop clusters in production. There's a good set of slides on this by Allen Wittenauer from ApacheCon 2008. LDAP would seem to add a SPOF, but you can have HA directory services. Also, if DNS is integrated with LDAP, the loss of DNS/LDAP is equally drastic, so availability is not made any worse than not using LDAP. You also need your own RPMs, as you will need to push out all binaries in RPM format. This often scares Java developers, but isn't that hard once you start on it. Hard to test though. Now, in this world, where do the hadoop XML files come from?
  6. This is my belief as to how Hadoop 0.21 gets configured. Belief. There may be some surprises.
      • hadoop-core.jar has XML files for Hadoop's own configuration, and properties files for Log4J and metrics.
      • core-default has the default values for Hadoop's inner architecture; hdfs-default is for the filesystem and mapred-default for the JobTracker and TaskTracker. This is a precursor to splitting the projects in SVN, and the JARs, and replaces hadoop-default.xml. Do not edit these files unless you know what you are doing, have a need to, and understand that your changes may be lost.
      • The -site files in conf/ provide overrides of these XML files, for all values not marked as final.
      • hadoop-policy is for group access policy; there is no -default in the JAR. You only need this if you have access control turned on.
      • There is a log4j.properties file. To change it, you would need to rebuild Hadoop. As you can point to a different resource in hadoop-env.sh, this should not be needed.
      • The hadoop-metrics file controls what metrics are collected. I'm not aware that it can be changed without a rebuild.
      • You may want a conf/slaves file to list allowed members of the network. This helps to secure an EC2 cluster, but stops you adding new nodes on demand.
      • There is the implicit configuration of the local machine, its classpath, the network, etc. These lead to support calls. Hadoop likes a stable network with DNS and reverse-DNS working, and machines that know their external IP address.
      • When a job is submitted, the client-side configuration is bundled up and pushed out with the submission. The client-side submission must contain enough information to connect to the job tracker and filesystem.
      • In the XML files, there are some ${} expressions evaluating JVM properties. They get interpreted on whichever machine reads the values from the Configuration object. As the configuration moves round the network, the values may change. This may or may not be what you want.
  7. Cloudera are pushing out binaries and config as RPMs, with a web UI for creating new RPMs from your config; some preflight validation takes place at this point. This lets you push out new configurations via your RPM-management tooling. I'm not over-happy with relying on remotely generated RPMs for your config, because it complicates things. It's easier than using <rpmbuild> and rolling your own, I know that, but I like to have the config files as text under SCM; this removes that option. What it does do is make it easy to bring up a 0.18.x release of Hadoop. Ultimately Hadoop will end up having its own RPMs, or debs, as this is a product that only gets installed en masse. Hopefully whatever RPMs people put together are compatible with this.
  8. This is a screenshot of Cloudera RPMs on CentOS 5.2. The VM is confused about its hostname, a fact which is going to break Hadoop, but what is key is that the RPM installed itself and you can run namenodes, datanodes and trackers as services. Other problem: the RPMs hard-code a Java dependency. I avoid that in my work RPMs, as you have to choose how to install Java, which is pretty tricky in RPMs. These RPMs demand the Sun one, not OpenJDK or anyone else's.
  9. This is the Cloudera strategy. Tools to create RPM.
      Pro
      + Ant, Maven have (some) RPM build support
      + SCM managed
      + quick rollback via alternatives
      + no
      Cons
      - still need to trigger a service restart and reload of settings
      - ClearCase is a source of much stress and should be avoided by anyone who doesn't consider animal and human sacrifice part of the day-to-day operations process
      - We'll have to watch what happens here; it really depends on the tooling to go around the RPM creation and serving up to the machines.
      A lot of the problems related to pushing out updates over a big cluster go away with virtualisation, when a single shared image is used by all the workers.
  10. If you are going to look after a set of machines with alternatives, learn to use tools like clusterssh. This will issue the same set of ssh commands to a set of machines, even a set listed in a text file. The big problem: the same command gets sent to every box. You had better have them all in sync, and have scripts that look at the state of the machine before doing things.
  11. This is how you should view a CM-managed cluster, including a virtual one. In a "cloud" infrastructure the services in pink are used to work with machines
  12. CM tools, of any kind, are the only sensible way to manage big clusters. Even then, there are issues about which design is best; here are some of the different options. What I'm going to talk about is decentralized and mostly state driven, though it has workflows too.
      • State Driven: observe system state, push it back into the desired state
      • Workflow: apply a sequence of operations to change a machine's state
      • Centralized: central DB in charge
      • Decentralized: machines look after themselves
  13. SmartFrog is a language for describing system configurations, a non-XML language with basic mathematics, cross references -including late-binding LAZY references to values not available until a system is deployed.
  14. Here is that model turned into an instantiation across machines and multiple JVMs (for better isolation of the Hadoop processes and static variables) and
  15. Here is a demo
  16. This is the currently proposed lifecycle for Hadoop services: namenode, datanodes, trackers, etc. Things you start and stop and wait for starting. You create something, start it, and it then chooses to jitter between started and live unless it fails. Service implements Closeable, so Closeable.close() shuts it down. When a service fails, the underlying exception is retained for the benefit of management tools.
  17. This is part of the base class. There's a lot more to it just to make it easy to subclass and move from state to state
  18. It's not enough to bring up a service; you want to see if it is alive. This is the current draft of a ping operation, one that adds health to the services. Services can add any number of exceptions to the status, so if something fails, you can see what's going on. This isn't ideal (exception construction costs are high due to stack traces); I think we need to work on this via some stuff on a ThrowableWriteable which can be constructed from an exception, or just a stack trace sent over the wire.
  19. This is where ping() gets complex. How do you define healthy in a cluster where the overall system availability is not 'all nodes are up' but 'we can get our work done', and where the failure of anything should be transient? I propose: push it up to the caller, report problems but don't declare failure in-node just because something remote is missing. That allows policy tools to handle it on a site-specific basis. For my clusters, if the namenode isn't up in 5 minutes, that is a problem: kill that VM and start again. For Yahoo!, the timeout is 45 minutes and the action involves pagers.
  20. This is how we are working on configuration: using our own language for describing things. This is good: it is powerful and has maths and string operations, but it does add more complexity to the system. All those cross references can hide problems, so you really have to use preflight checking to look at things. At the same time, I don't want to do lots of maintenance of the files, keeping up to date with the XML stuff, so we have the ability to read in a Hadoop XML file, have its values turned into SmartFrog attributes, and then let other components reference them. I'm still exploring what is the best way to work with Hadoop configurations. One real problem is staying in sync with remote servers. I'd really like every remote node to be able to serve up, on request, the configuration file it used, no matter how that remote node was brought up.
  21. This is the deployed data node
  22. Alongside the actual cluster deployment is the stuff I have initially put together for testing; the ability to work with the filesystem, to copy data in and out and to submit work -blocking or non-blocking.
  23. This is where I am right now. I make progress, I get beaten back. I can bring up clusters in different configurations, check they are healthy, run tests. My basic MR runs are working, but some bits of Java security are interfering with some of the JSP pages that I have added extra tests to.
  24. This is something I'd like to deal with, later: change how we configure Hadoop. It is getting complex. More importantly, the tactic of requiring end users to have the configuration only works when the end users are in sync with the cluster. As different options for configuring Hadoop emerge, can we come up with a proper interface for doing so?
  25. People ask about performance under VM tools. Here is the answer. Yes, it is slower, especially to virtual disks that go through an extra layer of indirection before going physical. But the people we are offering private Hadoop clusters to care more about having a cluster at all than about its absolute performance.
  26. This is one that is keeping me somewhat busy right now, especially on VMware. Everything in the cluster needs to know the hostname in the fs.default.name URL, else they do not bind properly. You don't know it until the namenode knows who it is, and in dynamic clusters that isn't predictable. In our networks, we've used Anubis, an HA tuple-space, to find these things out; we can do that here, but it doesn't boot up in networks without multicast. Machines in virtual networks don't always have rDNS or good hostname setups. They do not always know their own name.
  27. Another option to consider: keeping all the conf files under SCM, and pushing them out via every node either mounting the same bit of disk, or by having a startup script to check out the values from a chosen SCM.
      Pro
      + all under SCM: diffs, rollbacks, &c
      + integrate with staging and release process
      + if you use ClearCase, wonderful things are possible
      + low cost, low tooling requirements
      Cons
      - still need to trigger a service restart and reload of settings
      - ClearCase is a source of much stress and should be avoided by anyone who doesn't consider animal and human sacrifice part of the day-to-day operations process
  28. One option, especially in datacentres like Amazon's, is to use a key-value store of some form to do the name->value mapping; for AWS you could use SimpleDB as the store, or as a pointer to which s3: file contains the settings. The latter would be good for rollback. This is a variant of the LDAP design; you need an HA database. It is OK if your app needs this database for other things (i.e. in the Java EE architecture), but in Hadoop you've just added more complexity and points of failure. Unless you can be sure that the database doesn't go down, it is another risk.
  29. I don't yet know of anyone who managed Hadoop via LDAP. How would you do it? Push the site settings out as (name, value) pairs. Give machines the option of overriding them per machine or per job by pointing them at machine-specific bits of the tree.
      Pros
      + integrates with rest of infrastructure
      + LDAP has lots of client libraries for cross-language work
      + High Availability
      + GUI lets you see what the machines are seeing
      + GUI lets you experiment with configuration
      + No need to create new release artifacts to tune the system (ops like this)
      Cons
      - rollback?
      - not under SCM; can't keep conf in sync with binaries
      - you don't want to keep up to date with core-default.xml by hand
  30. Pull this, replace with more live code
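
Note 6 describes three behaviours that are easy to get wrong: -site values override -default values, values marked final cannot be overridden by later layers, and ${} expressions are expanded on whichever machine reads the value. The following is a minimal sketch of those rules using plain maps. It does not use or reproduce Hadoop's Configuration class: `LayeredConfigSketch`, `addLayer` and `get` are hypothetical names, and passing the local properties in as a map is an assumption made so the "same value, different machine" effect can be shown in one process.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative model of layered configuration; not Hadoop's Configuration class. */
public class LayeredConfigSketch {

    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    private final Map<String, String> values = new HashMap<String, String>();
    private final Set<String> finalKeys = new HashSet<String>();

    /**
     * Apply one layer (defaults first, then site, then job).
     * Keys already marked final by an earlier layer are left untouched.
     */
    public void addLayer(Map<String, String> layer, Set<String> finalsInLayer) {
        for (Map.Entry<String, String> e : layer.entrySet()) {
            if (!finalKeys.contains(e.getKey())) {
                values.put(e.getKey(), e.getValue());
            }
        }
        finalKeys.addAll(finalsInLayer);
    }

    /**
     * Read a value, expanding ${name} from the reader's own properties.
     * Unknown names are left as they are.
     */
    public String get(String key, Map<String, String> localProps) {
        String raw = values.get(key);
        if (raw == null) {
            return null;
        }
        Matcher m = VAR.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            String replacement = localProps.containsKey(name) ? localProps.get(name) : m.group(0);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        LayeredConfigSketch conf = new LayeredConfigSketch();

        Map<String, String> defaults = new HashMap<String, String>();
        defaults.put("dfs.replication", "3");
        defaults.put("hadoop.tmp.dir", "/tmp/hadoop-${user.name}");
        conf.addLayer(defaults, new HashSet<String>());

        // The site layer overrides the default and locks it.
        Map<String, String> site = new HashMap<String, String>();
        site.put("dfs.replication", "2");
        Set<String> siteFinals = new HashSet<String>();
        siteFinals.add("dfs.replication");
        conf.addLayer(site, siteFinals);

        // A job-level override of a final value is ignored.
        Map<String, String> job = new HashMap<String, String>();
        job.put("dfs.replication", "1");
        conf.addLayer(job, new HashSet<String>());

        // Two readers with different local properties see different paths.
        Map<String, String> client = new HashMap<String, String>();
        client.put("user.name", "alice");
        Map<String, String> worker = new HashMap<String, String>();
        worker.put("user.name", "hadoop");

        System.out.println(conf.get("dfs.replication", client));
        System.out.println(conf.get("hadoop.tmp.dir", client));
        System.out.println(conf.get("hadoop.tmp.dir", worker));
    }
}
```

The two reads of hadoop.tmp.dir are the case the note warns about: the stored value is identical, but what each machine sees depends on its own properties.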