PRODUCTIONIZING HADOOP
New Lessons Learned
Eric Sammer
General Announcements

• All lines are muted
• Ask questions any time using the “questions”
  pane on your GoToWebinar panel
• Recording of this webinar will be available on
  demand at www.cloudera.com
The Universe of Operations


System Operations
• Server and network
• Operating system
• Identity and access
• Resource management
• Maintenance
• Cluster monitoring
• Backup and DR

Architecture and App Ops
• Data architecture
• Data integration
• Data quality monitoring
• Resource management
• Pipeline maintenance
• Governance
Scope for Today

• A focus on common stumbling blocks
   • Workload-oriented planning and identification
   • Network architecture
   • Host management
   • Configuration management
   • Identity, Access, and Authorization
   • Cluster and resource sharing
• Time for questions
Proper Planning

• Develop an understanding of your use cases
   • What you (will) do defines what you need
   • Analogy: OLTP RDBMS versus OLAP
• Prototype if necessary
Understanding Cluster Usage
…by use case

[Figure: cluster usage by use case. Labeled workloads: Data Mining / IR, ETL, Report Generation, Analytics.]

Network utilization is a function of job size, its profile, and the number of concurrent jobs.
Network Architecture

• Your current architecture is probably fine
   • Typical: traditional L2 tree (fine for North/South)
   • Emerging: L3 spine/leaf (optimized for East/West)
• Minimize oversubscription (normal: 1:1.2)
• Deep port buffers (with fair allocation for shared memory)
• Do not collocate low-latency apps with MR
• Monitor, monitor, monitor
   • Bandwidth, buffer, packet count, and size deciles
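
The oversubscription and monitoring bullets above can be sketched as simple arithmetic. This is a minimal illustration, not a sizing tool: the function names and the example switch (48 x 1 GbE host ports, 4 x 10 GbE uplinks) are assumptions, chosen so that hosts can offer about 1.2 times what the uplinks carry, which is one reading of the 1:1.2 figure.

```python
from statistics import quantiles


def oversubscription_ratio(host_ports: int, host_gbps: float,
                           uplinks: int, uplink_gbps: float) -> float:
    """Return host-facing bandwidth divided by uplink bandwidth for one switch.

    1.0 means non-blocking; larger values mean more contention on the uplinks.
    """
    downstream = host_ports * host_gbps
    upstream = uplinks * uplink_gbps
    return downstream / upstream


def size_deciles(packet_sizes: list[int]) -> list[float]:
    """Return the nine cut points that split packet sizes into deciles."""
    return quantiles(packet_sizes, n=10)


if __name__ == "__main__":
    # Hypothetical top-of-rack switch: 48 x 1 GbE host ports, 4 x 10 GbE uplinks.
    ratio = oversubscription_ratio(48, 1.0, 4, 10.0)
    print(f"oversubscription {ratio:.2f}:1")  # prints "oversubscription 1.20:1"
```

Tracking size deciles rather than a single average makes it easier to see whether traffic is dominated by small control packets or large shuffle transfers.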
Host Configuration

• OS version and patches
• Java 6 (HotSpot VM)
• PAM limits (nofile, nproc)
• Naming (nsswitch.conf, resolv.conf, hosts, gethostname())
• OS filesystem selection and tuning
• Time service
• Users, groups, and identity management
• Machines should not be unique snowflakes
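
A few of the host settings above can be checked from a script. The sketch below is a hedged example for Linux-like hosts only: the thresholds (32768) and the function names are assumptions, not values from the talk, and the loopback test reflects the common pitfall of a hostname mapped to 127.x in the hosts file.

```python
import resource
import socket


def limit_ok(soft_limit: int, minimum: int) -> bool:
    """True if a soft limit is unlimited or at least `minimum`."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= minimum


def check_host(min_nofile: int = 32768, min_nproc: int = 32768) -> list[str]:
    """Return human-readable problems found on the local host.

    The default minimums are illustrative assumptions; use your own baseline.
    """
    problems = []
    for name, rlimit, minimum in (
        ("nofile", resource.RLIMIT_NOFILE, min_nofile),
        ("nproc", resource.RLIMIT_NPROC, min_nproc),
    ):
        soft, _hard = resource.getrlimit(rlimit)
        if not limit_ok(soft, minimum):
            problems.append(f"{name} soft limit {soft} is below {minimum}")

    # Naming: the local hostname should resolve, and not to a loopback address.
    hostname = socket.gethostname()
    try:
        address = socket.gethostbyname(hostname)
    except OSError:
        problems.append(f"hostname {hostname!r} does not resolve")
    else:
        if address.startswith("127."):
            problems.append(f"hostname {hostname!r} resolves to loopback {address}")
    return problems


if __name__ == "__main__":
    for problem in check_host():
        print(problem)
```

Running the same check on every machine, as the user that runs the Hadoop daemons, is one way to catch hosts that have drifted from the baseline.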
Configuration Management

• Puppet/Chef/<your favorite> for OS config
   • Package installation
   • Identity and authorization wiring
• Cloudera Manager for platform management
   • Deployment and configuration
   • Service lifecycle
   • Platform-specific service monitoring and diagnostics
   • Activity monitoring
• Complementary systems
   • Differentiating factors: centralized
     coordination, service awareness, orchestration
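
Whatever tool manages configuration, it helps to verify that hosts actually converge on the same files. The following is a minimal sketch of drift detection, assuming the config text has already been collected per host; `find_drift` and the host names are hypothetical and not part of Puppet, Chef, or Cloudera Manager.

```python
import hashlib
from collections import defaultdict


def find_drift(configs: dict[str, str]) -> dict[str, list[str]]:
    """Group hosts by the SHA-256 digest of their config text.

    Returns only the groups that differ from the most common digest.
    """
    by_digest = defaultdict(list)
    for host, text in sorted(configs.items()):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        by_digest[digest].append(host)

    # Treat the largest group as the baseline; ties go to the digest seen first.
    baseline = max(by_digest, key=lambda d: len(by_digest[d]))
    return {d: hosts for d, hosts in by_digest.items() if d != baseline}


if __name__ == "__main__":
    # Hypothetical hosts and file contents.
    collected = {
        "node1": "dfs.replication=3\n",
        "node2": "dfs.replication=3\n",
        "node3": "dfs.replication=2\n",
    }
    for digest, hosts in find_drift(collected).items():
        print(digest[:12], hosts)
```

Comparing digests rather than full files keeps the report small; the differing hosts can then be inspected individually.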
Identity, Access, and Authorization

• MapReduce is a code execution engine
• Identity management and access control are hard
  (in distributed systems like Hadoop)
• Hadoop uses the OS (or Kerberos) for identity
   • Lots of entry points
   • Comparatively low level
• Access control is a function of each service
   • HDFS: Unix-style octal permissions on objects
   • MapReduce: ACLs on job queues
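
The Unix-style octal model mentioned for HDFS can be illustrated with a small permission check. This is a simplified sketch of the general owner/group/other rule, not HDFS's implementation: it ignores the superuser, the sticky bit, and any extended ACLs, and `may_access` and the example names are hypothetical.

```python
def may_access(mode: int, owner: str, group: str,
               user: str, user_groups: set[str], want: str) -> bool:
    """Evaluate a Unix-style permission check.

    `mode` is an octal permission value such as 0o750 and `want` is one of
    "r", "w", or "x". Exactly one class (owner, group, other) is consulted.
    """
    bit = {"r": 4, "w": 2, "x": 1}[want]
    if user == owner:
        shift = 6  # owner bits
    elif group in user_groups:
        shift = 3  # group bits
    else:
        shift = 0  # other bits
    return bool((mode >> shift) & bit)


if __name__ == "__main__":
    # Hypothetical object owned by etl:analysts with mode 750.
    print(may_access(0o750, "etl", "analysts", "bob", {"analysts"}, "r"))  # True
    print(may_access(0o750, "etl", "analysts", "bob", {"analysts"}, "w"))  # False
```

Because only one class is consulted, an owner with fewer bits than the group is still limited to the owner bits, which is a common source of surprise.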
Resource Sharing

• One cluster, many groups
• Pros
   • Benefit from aggregate resources
   • Greater utilization
   • Reduced capex/opex
Resource Sharing
• Three dimensions of sharing a cluster
    • Collocation of services (e.g. MapReduce and HBase)
    • Collocation of groups of users
    • Collocation of workload profiles (ETL, analytics)
• In an ideal world, collocate all and enforce policy
    • Not currently possible
• Problems
    • System utilization varies wildly
    • Fair distribution of shared resources
    • Increased access control complexity
    • SLA of most sensitive group applies to all
    • …but nothing new
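
To make "fair distribution of shared resources" concrete, here is a sketch of max-min fairness, a textbook allocation rule. It is offered as an illustration only and is not claimed to be the algorithm of any Hadoop scheduler; the group names and slot counts are hypothetical.

```python
def max_min_fair_share(capacity: float,
                       demands: dict[str, float]) -> dict[str, float]:
    """Split `capacity` among groups by max-min fairness.

    No group gets more than it asked for; unused share is redistributed
    equally among groups that still want more.
    """
    allocation = {name: 0.0 for name in demands}
    unsatisfied = {name for name, demand in demands.items() if demand > 0}
    remaining = float(capacity)

    while unsatisfied and remaining > 1e-9:
        share = remaining / len(unsatisfied)
        satisfied_now = set()
        for name in unsatisfied:
            grant = min(share, demands[name] - allocation[name])
            allocation[name] += grant
            remaining -= grant
            if demands[name] - allocation[name] <= 1e-9:
                satisfied_now.add(name)
        if not satisfied_now:
            break  # everyone took a full equal share; capacity is exhausted
        unsatisfied -= satisfied_now
    return allocation


if __name__ == "__main__":
    # Hypothetical cluster with 100 task slots shared by three groups.
    shares = max_min_fair_share(100, {"etl": 20, "analytics": 50, "mining": 80})
    for group, slots in sorted(shares.items()):
        print(f"{group}: {slots:.1f}")
```

In this example the small ETL demand is met in full and the two larger groups split the rest evenly, which is the behavior that makes the rule attractive for mixed workloads.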
Resource Sharing
• Reasons to collocate groups / applications:
   • Similar system utilization profiles
   • Time-based utilization (e.g. daily ETL and office-hours
      analytics)
   • Maintain similar SLAs
   • Extensive data sharing
   • When it’s trivially easy with current control mechanisms
• Reasons to segregate groups / applications:
   • Compliance, regulation, or where security is paramount
   • Wildly dissimilar utilization profiles (notably HBase and
      MapReduce)
• A significant area of interest for Cloudera
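
The time-based argument for collocation can be checked with a rough calculation: add the hourly utilization of each workload and look at the combined peak. The sketch below assumes utilization is expressed as a fraction of cluster capacity per hour; the profiles are invented for illustration.

```python
def combined_peak(profiles: list[list[float]]) -> float:
    """Return the highest hourly sum across equally long utilization profiles."""
    return max(sum(hour) for hour in zip(*profiles))


if __name__ == "__main__":
    # Hypothetical 24-hour profiles, as fractions of total cluster capacity.
    nightly_etl = [0.8] * 6 + [0.1] * 18              # busy 00:00 to 06:00
    office_hours = [0.05] * 9 + [0.7] * 9 + [0.05] * 6  # busy 09:00 to 18:00

    peak = combined_peak([nightly_etl, office_hours])
    print("fits on one cluster" if peak <= 1.0 else "contends for capacity")
```

Two workloads that peak at different hours can share a cluster even when each would nearly fill it alone; two office-hours workloads with the same shape would not.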
Now What?

• There’s a lot (more) to think about
• We can help
   • Education
   • Services
   • Software
   • Support
• Strata + Hadoop World 2012
• Look for upcoming webinars
Questions?
Type them in the “Questions” panel.

Congratulations to the winners
of the book drawing!
• Vani Mahobia
• Ken Gayler
• Richard Zhang
• Anand Rajan
• Erica Muxlow

To learn more about “Hadoop Operations: A Guide for Developers and Administrators”, or about the spotted cavy, go to www.oreilly.com
THANK YOU!
Eric Sammer, Principal Solutions Architect
@esammer
For more information: www.cloudera.com
Sales: (888)789-1488
@cloudera
Hardware Planning

• CPU
• Disk capacity and configuration
• Spindle count
• Memory (amount and configuration)
• NIC configuration
• Hadoop’s hardware preferences tend to be
 controversial until the architecture is understood
Baseline Hardware

• Disk
   • SATA II 7200RPM (SAS controller)
   • JBOD (OS on RAID 1)
   • Option 1: 12x3.5” LFF 3TB
   • Option 2: 24x2.5” SFF 1TB
   • Option: MDL/NL SAS drives
• 2 x 2.2 GHz 6-core CPUs, 20 MB cache
• 48GB+ DDR3-1600 ECC
• 1GbE vs. 10GbE
   • Is there new info here?
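
The two disk options can be compared with a back-of-the-envelope calculation. The sketch below assumes the HDFS default replication factor of 3 and, as a labeled assumption, reserves 25% of raw disk for intermediate and temporary data; neither the reserve nor the function name comes from the slides.

```python
def usable_tb(drives: int, drive_tb: float,
              replication: int = 3, scratch_fraction: float = 0.25) -> float:
    """Estimate unique data (TB) one node can hold.

    `scratch_fraction` is an assumed share of raw disk kept for temporary data.
    """
    raw = drives * drive_tb
    return raw * (1.0 - scratch_fraction) / replication


if __name__ == "__main__":
    print(usable_tb(12, 3.0))  # Option 1: 36 TB raw, prints 9.0
    print(usable_tb(24, 1.0))  # Option 2: 24 TB raw, prints 6.0
```

Option 2 holds less data per node but has twice the spindle count, so the choice depends on whether the workload is bound by capacity or by disk throughput.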

Editor's Notes

  1. INTERNAL NOTES – DELETE BEFORE POSTING! Set expectation that this is targeted to a relatively beginner audience? What’s new? What are the NEW lessons learned? An example war story to start it off would help the audience get into it. Scope? Core Hadoop (MR & HDFS) vs. the entire CDH stack (Hive, ZK, HBase, etc.) and how do they co-locate deployment-wise, i.e. do I need separate HW to run other components? (MapR depositioning): Mention: HA, performance, DR, data integrity, federation, MR2,
  2. SCRIPT for Zoo/Moderator (go through this as quickly as you can): Before we get started I’d like to let you know that all lines are muted. Ask questions any time by typing them into the QUESTIONS pane on your GoToWebinar panel. This webinar is being recorded and will be available later at cloudera.com. Let me pass you to Eric Sammer, who is a Principal Solutions Architect at Cloudera and author of the recently published book “Hadoop Operations” by O’Reilly Media.
  3. Do I need a dedicated rack/network for Hadoop? Or can I run other apps and services on the same rack/network?
  4. Why not use Puppet/Chef for Hadoop config as well? Why is CM better? If I use Puppet/Chef for ALL my config mgmt (systems & apps), why use a point solution like CM for Hadoop?
  5. SCRIPT Zoo/moderator (speak fast): Thank you Eric. Let’s now move quickly into the Q&A portion of this webinar. Please type your questions into the QUESTIONS PANEL and we’ll get to as many questions as we have time for. While Eric is reviewing the questions I’d like to congratulate the winners of the book drawing. If you see your name listed here, your book will be mailed to you by the last week of October. It’s being printed now, so when you receive it it’ll be “hot off the press”. MOVE TO NEXT SLIDE – get winners’ names off the screen
  6. SCRIPT Zoo/moderator (speak fast): Eric, are you ready to answer some questions? MOVE TO THANK YOU SLIDE WHILE CLOSING