Plugging the Holes:
Security and Compatibility
       Owen O’Malley
    Yahoo! Hadoop Team
    owen@yahoo-inc.com
Who Am I?

  •  Software Architect working on Hadoop since Jan 2006
     –  Before Hadoop worked on Yahoo Search’s WebMap
     –  My first patch on Hadoop was Nutch-197
     –  First Yahoo Hadoop committer
     –  Most prolific contributor to Hadoop (by patch count)
     –  Won the 2008 1TB and 2009 Minute and 100TB Sort
        Benchmarks
  •  Apache VP of Hadoop
     –  Chair of the Hadoop Project Management Committee
     –  Quarterly reports on the state of Hadoop for Apache Board



Hadoop World NYC - 2009
What are the Problems?

  •  Our shared clusters increase:
     –  Developer and operations productivity
     –  Hardware utilization
     –  Access to data
  •  Yahoo! wants to put customer and financial data on our
     Hadoop clusters.
     –  Great for providing access to all of the parts of Yahoo!
     –  Need to make sure that only the authorized people have
        access.
  •  Rolling out new versions of Hadoop is painful
     –  Clients need to change and recompile their code

Hadoop Security

  •  Currently, the Hadoop servers trust the users to declare
     who they are.
     –  It is very easy to spoof, especially with open source.
     –  For private clusters, we will leave non-secure operation as an option.
  •  We need to ensure that users are who they claim to be.
  •  All access to HDFS (and therefore MapReduce) must
     be authenticated.
  •  The standard distributed authentication service is
     Kerberos (including ActiveDirectory).
  •  User code isn’t affected, since the security happens in
     the RPC layer.
HDFS Security

  •  Hadoop security is grounded in HDFS security.
     –  Other services such as MapReduce store their state in HDFS.
  •  Use of Kerberos allows a single sign on where the
     Hadoop commands pick up and use the user’s tickets.
  •  The framework authenticates the user to the Name
     Node using Kerberos before any operations.
  •  The Name Node is also authenticated to the user.
  •  Client can request an HDFS Access Token to get
     access later without going through Kerberos again.
     –  Prevents authentication storms as MapReduce jobs launch!


Accessing a File

  •  User uses Kerberos (or an HDFS Access Token) to
     authenticate to the Name Node.
  •  They request to open a file X.
  •  If they have permission to file X, the Name Node
     returns a token for reading the blocks of X.
  •  The user uses these tokens when communicating with
     the Data Nodes to show they have access.
  •  There are also tokens for writing blocks when the file is
     being created.
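
The token exchange above can be sketched as follows. This is a minimal illustration, not Hadoop's implementation. It assumes that the Name Node and the Data Nodes share a secret key and that a token is an HMAC over the user, block ID, and access mode. The names `BlockTokenSketch`, `issue`, and `verify`, and the choice of HmacSHA256, are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch: none of these names come from Hadoop.
class BlockTokenSketch {
    // Assumption: a secret shared by the Name Node and the Data Nodes.
    private final SecretKeySpec key;

    BlockTokenSketch(byte[] sharedSecret) {
        this.key = new SecretKeySpec(sharedSecret, "HmacSHA256");
    }

    // Name Node side: after the permission check, sign (user, block, mode).
    byte[] issue(String user, long blockId, String mode) {
        return hmac(user + "|" + blockId + "|" + mode);
    }

    // Data Node side: recompute the signature and compare in constant time.
    boolean verify(String user, long blockId, String mode, byte[] token) {
        return MessageDigest.isEqual(hmac(user + "|" + blockId + "|" + mode), token);
    }

    private byte[] hmac(String message) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(key);
            return mac.doFinal(message.getBytes(StandardCharsets.UTF_8));
        } catch (java.security.GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        BlockTokenSketch s = new BlockTokenSketch("demo-secret".getBytes(StandardCharsets.UTF_8));
        byte[] token = s.issue("alice", 42L, "READ");
        System.out.println(s.verify("alice", 42L, "READ", token));   // true
        System.out.println(s.verify("alice", 42L, "WRITE", token));  // false
    }
}
```

Under this assumption the Data Node never has to contact the Name Node to check a request: possession of a correctly signed token is the proof of access.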



MapReduce Security

  •  Framework authenticates user to Job Tracker before
     they can submit, modify, or kill jobs.
  •  The Job Tracker authenticates itself to the user.
  •  Job’s logs (including stdout) are only visible to the user.
  •  Map and Reduce tasks actually run as the user.
  •  Tasks’ working directories are protected from others.
  •  The Job Tracker’s system directory is no longer
     readable and writable by everyone.
  •  Only the reduce tasks can get the map outputs.


Interactions with HDFS

  •  MapReduce jobs need to read and write HDFS files as
     the user.
  •  Currently, we store the user name in the job.
  •  With security enabled, we will store HDFS Access
     Tokens in the job.
  •  The job needs a token for each HDFS cluster.
  •  The tokens will be renewed by the Job Tracker so they
     don’t expire for long running jobs.
  •  When the job completes, the tokens will be cancelled.
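
The issue, renew, and cancel steps above can be sketched as a small in-memory registry. This is only an illustration of the lifecycle, not Hadoop code. The class `TokenLifecycleSketch`, its method names, and the fixed lifetime are hypothetical, and the current time is passed in explicitly to keep the example deterministic.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the issue / renew / cancel lifecycle; not Hadoop code.
class TokenLifecycleSketch {
    private final long lifetimeMillis;
    // token id -> expiry time in milliseconds
    private final Map<String, Long> expiry = new HashMap<>();

    TokenLifecycleSketch(long lifetimeMillis) {
        this.lifetimeMillis = lifetimeMillis;
    }

    // Issued once, after the user has authenticated with Kerberos.
    void issue(String tokenId, long now) {
        expiry.put(tokenId, now + lifetimeMillis);
    }

    // Called periodically by the renewer (the Job Tracker in the slides).
    // Returns false if the token is unknown, cancelled, or already expired.
    boolean renew(String tokenId, long now) {
        if (!isValid(tokenId, now)) {
            return false;
        }
        expiry.put(tokenId, now + lifetimeMillis);
        return true;
    }

    // Called when the job completes.
    void cancel(String tokenId) {
        expiry.remove(tokenId);
    }

    boolean isValid(String tokenId, long now) {
        Long deadline = expiry.get(tokenId);
        return deadline != null && now < deadline;
    }

    public static void main(String[] args) {
        TokenLifecycleSketch tokens = new TokenLifecycleSketch(1000);
        tokens.issue("job_1", 0);
        System.out.println(tokens.renew("job_1", 900));    // true, expiry moves to 1900
        System.out.println(tokens.isValid("job_1", 1500)); // true
        tokens.cancel("job_1");
        System.out.println(tokens.isValid("job_1", 1500)); // false
    }
}
```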



Interactions with Higher Layers

  •  Yahoo uses a workflow manager named Oozie to
     submit MapReduce jobs on behalf of the user.
  •  We could store the user’s credentials with a modifier
     (oom/oozie) in Oozie to access Hadoop as the user.
  •  Or we could create token-granting tokens for HDFS
     and MapReduce and store those in Oozie.
  •  In either case, such proxies are a potential source of
     security problems, since they store a large number of
     users’ access credentials.



Web UIs

  •  Hadoop and especially MapReduce make heavy use of
     the Web UIs.
  •  These need to be authenticated also…
  •  Fortunately, there is a standard solution for Kerberos
     and HTTP, named SPNEGO.
  •  SPNEGO is supported by all of the major browsers.
  •  All of the servlets will use SPNEGO to authenticate the
     user and enforce permissions appropriately.
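
On the HTTP side, SPNEGO (RFC 4559) works by having the server answer an unauthenticated request with a 401 and a `WWW-Authenticate: Negotiate` header. The browser then retries with `Authorization: Negotiate <base64 token>`. A servlet's first step could look roughly like the sketch below. The class and method names are hypothetical, and validating the token against Kerberos is deliberately left out.

```java
import java.util.Base64;
import java.util.Optional;

// Sketch of the HTTP side of SPNEGO (RFC 4559). Validating the token against
// Kerberos (normally through GSS-API) is out of scope and not shown.
class NegotiateHeaderSketch {
    // Value of the WWW-Authenticate header sent with a 401 response.
    static final String CHALLENGE = "Negotiate";

    // Extract the raw token from "Negotiate <base64>".
    // Returns empty if the header is missing, uses another scheme, or is malformed.
    static Optional<byte[]> extractToken(String authorizationHeader) {
        if (authorizationHeader == null) {
            return Optional.empty();
        }
        String[] parts = authorizationHeader.trim().split("\\s+", 2);
        if (parts.length != 2 || !parts[0].equalsIgnoreCase("Negotiate")) {
            return Optional.empty();
        }
        try {
            return Optional.of(Base64.getDecoder().decode(parts[1].trim()));
        } catch (IllegalArgumentException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String header = "Negotiate " + Base64.getEncoder().encodeToString(new byte[] {1, 2, 3});
        System.out.println(extractToken(header).isPresent());           // true
        System.out.println(extractToken("Basic dXNlcg==").isPresent()); // false
    }
}
```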




Remaining Security Issues

  •  We are not encrypting on the wire.
     –  It will be possible within the framework, but not in 0.22.
  •  We are not encrypting on disk.
     –  For either HDFS or MapReduce.
  •  Encryption is expensive in terms of CPU and IO speed.
  •  Our current threat model is that the attacker has access
     to a user account, but not root.
     –  They can’t sniff the packets on the network.




Backwards Compatibility

  •  API
  •  Protocols
  •  File Formats
  •  Configuration




API Compatibility

  •  Need to mark APIs with
     –  Audience: Public, Limited Private, Private
     –  Stability: Stable, Evolving, Unstable
     @InterfaceAudience.Public
     @InterfaceStability.Stable
     public class Xxxx {…}
     –  Developers need to ensure that 0.22 is backwards compatible
        with 0.21
  •  Defined new APIs designed to be future-proof:
     –  MapReduce – Context objects in org.apache.hadoop.mapreduce
     –  HDFS – FileContext in org.apache.hadoop.fs
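
One way such markers could be used is sketched below. The annotation types here are local stand-ins that only mirror the names on the slide, not Hadoop's own declarations. Runtime retention is an assumption made so that the markers can be read by reflection, and `ApiAuditSketch`, `StableApi`, and `InternalHelper` are hypothetical names.

```java
import java.lang.annotation.Documented;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

class ApiAuditSketch {
    // A class is safe for downstream users only if it is both Public and Stable.
    static boolean isSafeToDependOn(Class<?> c) {
        return c.isAnnotationPresent(InterfaceAudience.Public.class)
            && c.isAnnotationPresent(InterfaceStability.Stable.class);
    }

    public static void main(String[] args) {
        System.out.println(isSafeToDependOn(StableApi.class));      // true
        System.out.println(isSafeToDependOn(InternalHelper.class)); // false
    }
}

// Local stand-ins that mirror the annotation names on the slide.
// Assumption: runtime retention, so the markers can be read by reflection.
class InterfaceAudience {
    @Documented @Retention(RetentionPolicy.RUNTIME) @interface Public {}
    @Documented @Retention(RetentionPolicy.RUNTIME) @interface LimitedPrivate {}
    @Documented @Retention(RetentionPolicy.RUNTIME) @interface Private {}
}

class InterfaceStability {
    @Documented @Retention(RetentionPolicy.RUNTIME) @interface Stable {}
    @Documented @Retention(RetentionPolicy.RUNTIME) @interface Evolving {}
    @Documented @Retention(RetentionPolicy.RUNTIME) @interface Unstable {}
}

// Hypothetical example classes.
@InterfaceAudience.Public
@InterfaceStability.Stable
class StableApi {}

@InterfaceAudience.Private
@InterfaceStability.Unstable
class InternalHelper {}
```

A check like this could run in a build step to flag client code that depends on anything not marked both Public and Stable.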

Protocol Compatibility

  •  Currently all clients of a server must be the same
     version (0.18, 0.19, 0.20, 0.21).
  •  Want to enable forward and backward compatibility
  •  Started work on Avro
     –  Includes the schema of the information as well as the data
     –  Can support different schemas on the client and server
     –  Still need to make the code tolerant of version differences
     –  Avro provides the mechanisms
  •  Avro will be used for file version compatibility too
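
The idea of supporting different schemas on the client and server can be illustrated without the Avro library. The sketch below is not the Avro API. It models a record as a map and resolves fields by name: values the writer sent are kept, fields the reader expects but did not receive take the reader's default, and fields the reader does not know are dropped. All names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of schema resolution by field name; this is not the Avro API.
class SchemaResolutionSketch {
    // readerDefaults: every field the reader expects, mapped to its default value.
    // written: the fields the writer actually sent.
    static Map<String, Object> resolve(Map<String, Object> readerDefaults,
                                       Map<String, Object> written) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : readerDefaults.entrySet()) {
            String name = field.getKey();
            // Use the written value when present, else fall back to the reader's default.
            result.put(name, written.containsKey(name) ? written.get(name) : field.getValue());
        }
        // Fields the reader does not know about are dropped.
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> oldClient = new LinkedHashMap<>();
        oldClient.put("path", "/data/x");
        oldClient.put("legacyFlag", true);   // unknown to the newer server

        Map<String, Object> newServer = new LinkedHashMap<>();
        newServer.put("path", "");
        newServer.put("replication", 3);     // added in the newer version

        System.out.println(resolve(newServer, oldClient)); // {path=/data/x, replication=3}
    }
}
```

As the slide notes, a mechanism like this only carries the data across versions. The code on each side still has to behave sensibly when it sees a default instead of a real value.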



Configuration

  •  Configuration in Hadoop is a string to string map.
  •  Maintaining backwards compatibility of configuration
     knobs was done case by case.
  •  Now we have standard infrastructure for declaring old
     knobs deprecated.
  •  We have also cleaned up many of the names in 0.21.
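
A string-to-string configuration with deprecated knobs could be sketched as below. This is an illustration of the idea rather than Hadoop's Configuration class. `ConfigSketch`, its methods, and the key names are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a string-to-string configuration with deprecated keys.
class ConfigSketch {
    private final Map<String, String> values = new HashMap<>();
    // old key -> new key
    private final Map<String, String> deprecated = new HashMap<>();

    void deprecate(String oldKey, String newKey) {
        deprecated.put(oldKey, newKey);
    }

    // Translate a deprecated key to its replacement; other keys pass through.
    private String canonical(String key) {
        String replacement = deprecated.get(key);
        if (replacement != null) {
            System.err.println(key + " is deprecated, use " + replacement);
            return replacement;
        }
        return key;
    }

    void set(String key, String value) {
        values.put(canonical(key), value);
    }

    String get(String key, String defaultValue) {
        return values.getOrDefault(canonical(key), defaultValue);
    }

    public static void main(String[] args) {
        ConfigSketch conf = new ConfigSketch();
        conf.deprecate("old.sort.mb", "new.task.sort.mb"); // made-up key names
        conf.set("old.sort.mb", "200");
        System.out.println(conf.get("new.task.sort.mb", "100")); // 200
        System.out.println(conf.get("old.sort.mb", "100"));      // 200
    }
}
```

Because both reads and writes go through the same translation, old job configurations keep working while users are warned to move to the new names.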




Questions?

  •  Thanks for coming!
  •  Mailing lists:
     –  common-user@hadoop.apache.org
     –  hdfs-user@hadoop.apache.org
     –  mapreduce-user@hadoop.apache.org
  •  Slides posted on the Hadoop wiki page
     –  http://wiki.apache.org/hadoop/HadoopPresentations




