Configuring a secure, multitenant cluster for the enterprise 
James Kinley // Principal Solutions Architect
© 2014 Cloudera, Inc. All rights reserved. 2 
About me 
• James Kinley 
• Principal Solutions Architect, EMEA 
• Hadoop user since 2010 
• Clouderan since 2012 
• Background in UK defence industry and cyber security 
• github.com/jrkinley 
• jameskinley.tumblr.com 
• @jrkinley 
• uk.linkedin.com/in/jameskinley
Introduction: Data Hub Objectives 
• Sharing Data: better insight 
• Sharing Compute: better utilisation and performance 
• Consolidated Operations: reduced cost and complexity
Multitenancy in Hadoop refers to a set of 
features that enable multiple groups from 
within the same organisation to share the 
common set of resources in a cluster without 
negatively impacting service-levels, violating 
security constraints, or even revealing the 
existence of each other, all via policy rather 
than physical separation. 
Multitenant Cluster Architecture 
• Security & Governance 
• HDFS Information Architecture (IA) 
• Authentication 
• Authorisation 
• Auditing 
• Quota management 
• Resource Isolation & Management 
• Static partitioning 
• Dynamic partitioning 
• Impala admission control 
Security & Governance 
• HDFS Information Architecture: file and directory structure 
• Authentication: proves users are who they say they are 
[Kerberos, Identity Management (LDAP)] 
• Authorisation: determines what users can see and do 
[HDFS Permissions, RBAC (Apache Sentry), Encryption] 
• Auditing: determines who did what, and when 
[Cloudera Navigator]
Security & Governance 
• HDFS Information Architecture (IA) 
drwxr-x---+ tadmin tgroup /users/{tenantId} 
drwxr-x--- tadmin tgroup /users/{tenantId}/archive 
drwxrwx---+ tadmin hive /users/{tenantId}/warehouse 
drwxrwx---+ tadmin hive /users/{tenantId}/warehouse/{db}/{table}/{partition} 
drwxr-x---+ tadmin tgroup /users/{tenantId}/landing 
drwxrwx--- tadmin tgroup /users/{tenantId}/processing 
drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId} 
drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId}/input 
drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId}/output
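A layout like the one above could be provisioned per tenant with a small script. The following is a minimal sketch, assuming hypothetical helpers `to_octal` and `layout_commands`; it only builds `hdfs dfs` command strings (mkdir, chown, chmod) and does not run them, and it leaves out the extended ACLs ("+") and the per-job directories.

```python
# Hypothetical helper: turn the tenant layout into hdfs dfs command strings.
# Nothing here talks to a cluster.

LAYOUT = [
    # (symbolic mode, owner, group, path relative to the tenant root)
    ("rwxr-x---", "tadmin", "tgroup", ""),
    ("rwxr-x---", "tadmin", "tgroup", "archive"),
    ("rwxrwx---", "tadmin", "hive", "warehouse"),
    ("rwxr-x---", "tadmin", "tgroup", "landing"),
    ("rwxrwx---", "tadmin", "tgroup", "processing"),
]


def to_octal(symbolic):
    """Convert a 9-character mode such as 'rwxr-x---' to '750'."""
    if len(symbolic) != 9:
        raise ValueError("expected 9 permission characters")
    digits = []
    for i in range(0, 9, 3):
        triplet = symbolic[i:i + 3]
        value = sum(bit for ch, bit in zip(triplet, (4, 2, 1)) if ch != "-")
        digits.append(str(value))
    return "".join(digits)


def layout_commands(tenant_id):
    """Return the mkdir/chown/chmod commands for one tenant, in order."""
    commands = []
    for mode, owner, group, rel in LAYOUT:
        path = f"/users/{tenant_id}" + (f"/{rel}" if rel else "")
        commands.append(f"hdfs dfs -mkdir -p {path}")
        commands.append(f"hdfs dfs -chown {owner}:{group} {path}")
        commands.append(f"hdfs dfs -chmod {to_octal(mode)} {path}")
    return commands
```

Calling `layout_commands("acme")` (a made-up tenant id) yields three commands per directory, starting with the tenant root.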
Security & Governance 
• Authentication: Kerberos & LDAP 
Security & Governance 
• Authorisation: HDFS permissions 
Security & Governance 
• Authorisation: HDFS extended ACLs 
drwxr-x---+ tadmin tgroup /users/{tenantId} 
drwxr-x--- tadmin tgroup /users/{tenantId}/archive 
drwxrwx---+ tadmin hive /users/{tenantId}/warehouse 
drwxrwx---+ tadmin hive /users/{tenantId}/warehouse/{db}/{table}/{partition} 
drwxr-x---+ tadmin tgroup /users/{tenantId}/landing 
drwxrwx--- tadmin tgroup /users/{tenantId}/processing 
drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId} 
drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId}/input 
drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId}/output 
Give “tingest” user permission over the landing directory: 
$ hdfs dfs -setfacl -m user:tingest:rwx /users/{tenantId}/landing 
Give “hive” group permission over the landing directory: 
$ hdfs dfs -setfacl -m group:hive:rwx /users/{tenantId}/landing
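To see what the two ACL entries change, it can help to model how an extended ACL is evaluated. The following is a simplified sketch of POSIX-style ACL evaluation, which HDFS ACLs are modelled on; it is not HDFS's implementation. The `Acl` class, the `allowed` function, the `rwx` mask and the users `alice`, `hive_svc` and `mallory` are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Acl:
    owner: str
    group: str
    owner_perms: str          # e.g. "rwx"
    group_perms: str          # entry for the owning group
    other_perms: str
    mask: str = "rwx"         # caps named users and all group entries (assumed)
    named_users: dict = field(default_factory=dict)
    named_groups: dict = field(default_factory=dict)


def allowed(acl, user, groups, want):
    """Return True if `user` (a member of `groups`) has permission `want` ('r', 'w' or 'x')."""
    if user == acl.owner:
        return want in acl.owner_perms
    if user in acl.named_users:
        return want in acl.named_users[user] and want in acl.mask
    # Collect every group entry that applies to the user.
    matching = []
    if acl.group in groups:
        matching.append(acl.group_perms)
    matching += [p for g, p in acl.named_groups.items() if g in groups]
    if matching:
        return want in acl.mask and any(want in p for p in matching)
    return want in acl.other_perms


# The landing directory after the two setfacl commands (mask assumed to be rwx).
landing = Acl(
    owner="tadmin", group="tgroup",
    owner_perms="rwx", group_perms="r-x", other_perms="---",
    named_users={"tingest": "rwx"},
    named_groups={"hive": "rwx"},
)
```

In this model `tingest` and members of `hive` can write to the landing directory, ordinary members of `tgroup` can only read and list it, and everyone else is denied.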
Security & Governance 
• Authorisation: Apache Sentry (incubating) 
• Fine-grained, role-based access control (RBAC) 
• Users can see only the data and metadata for which they have been granted privileges 
• Currently works with Apache Hive, Cloudera Impala, and Cloudera Search 
• File or Service (GRANT/REVOKE) based policy providers 
• Role-based privilege model 
• {user} > {groups} > {roles} > object > privilege 
• object = {server, database, table, URI} 
• privilege = {select, insert, all} 
• Supports grant permission delegation for multitenant clusters
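The chain {user} > {groups} > {roles} > object > privilege can be illustrated with a few dictionaries. This is a toy model of the lookup, not Sentry's API: every name in it (`alice`, `analysts`, `sales_read`, `has_privilege`) is made up, and it ignores the object hierarchy, for example a database-level grant implying access to its tables.

```python
# user -> groups (in practice this mapping comes from the identity provider)
USER_GROUPS = {"alice": {"analysts"}, "bob": {"tenant_admins"}}

# group -> roles
GROUP_ROLES = {"analysts": {"sales_read"}, "tenant_admins": {"sales_admin"}}

# role -> {object: privileges}
ROLE_PRIVILEGES = {
    "sales_read": {("table", "sales.orders"): {"select"}},
    "sales_admin": {("database", "sales"): {"all"}},
}


def has_privilege(user, obj, action):
    """Walk user -> groups -> roles and look for a matching grant on obj."""
    for group in USER_GROUPS.get(user, set()):
        for role in GROUP_ROLES.get(group, set()):
            granted = ROLE_PRIVILEGES.get(role, {}).get(obj, set())
            if action in granted or "all" in granted:
                return True
    return False
```

Privileges are never attached to users directly in this model: removing `alice` from `analysts` is enough to revoke her access.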
Security & Governance 
• Authorisation: Apache Sentry (incubating) 
drwxr-x---+ tadmin tgroup /users/{tenantId} 
drwxr-x--- tadmin tgroup /users/{tenantId}/archive 
drwxrwx---+ tadmin hive /users/{tenantId}/warehouse 
drwxrwx---+ tadmin hive /users/{tenantId}/warehouse/{db}/{table}/{partition} 
drwxr-x---+ tadmin tgroup /users/{tenantId}/landing 
drwxrwx--- tadmin tgroup /users/{tenantId}/processing 
drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId} 
drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId}/input 
drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId}/output 
Delegate grant and revoke privilege to tenant’s admin role: 
> GRANT ALL ON DATABASE {db} TO ROLE {tadmin} WITH GRANT OPTION;
Security & Governance 
• Authorisation: Encryption 
• Network encryption (HDFS and MR) 
• At-rest encryption for HDFS 
• Cloudera Navigator Encrypt & KeyTrustee (Gazzang) 
• Project Rhino (Cloudera + Intel) 
• HDFS-level encryption (HDFS-6134 + HADOOP-10150) 
• Encryption zones (HDFS-6386) 
• Hardware-accelerated (HADOOP-10693)
Security & Governance 
• Authorisation: HDFS encryption zone 
Security & Governance 
• Governance: HDFS disk quota management 
• Restrict tenants' use of storage 
• Prevents misuse of the shared filesystem 
• HDFS supports two quota mechanisms 
• Disk space quotas 
• Name quotas
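The two mechanisms count different things: a name quota limits the number of file and directory names under a directory, while a space quota limits the bytes consumed, with every replica counting towards the total. The following is a simplified sketch of that accounting, assuming a hypothetical `Quota` class and a replication factor of 3; on a real cluster the limits are set by an administrator with `hdfs dfsadmin -setQuota` and `hdfs dfsadmin -setSpaceQuota`.

```python
class QuotaExceeded(Exception):
    pass


class Quota:
    """Toy accounting for one directory's name and space quotas."""

    def __init__(self, name_quota, space_quota_bytes):
        self.name_quota = name_quota
        self.space_quota = space_quota_bytes
        self.names_used = 1      # the directory counts against its own name quota
        self.space_used = 0

    def add_file(self, size_bytes, replication=3):
        """Account for one new file, or raise QuotaExceeded and change nothing."""
        if self.names_used + 1 > self.name_quota:
            raise QuotaExceeded("name quota exceeded")
        raw = size_bytes * replication   # every replica counts
        if self.space_used + raw > self.space_quota:
            raise QuotaExceeded("space quota exceeded")
        self.names_used += 1
        self.space_used += raw
```

Note that a 400-byte file at replication 3 needs 1,200 bytes of space quota, so space quotas should be sized with the replication factor in mind.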
Security & Governance 
• Governance: HDFS disk quota management 
Resource Isolation & Management 
• Dividing up finite cluster resource to ensure predictable behaviour 
• Goals: 
• Guarantee service levels for critical workflows 
• Support fair allocation of resources between different groups of users 
• Prevent users from depriving other users of access to the cluster
Resource Isolation & Management 
• Static partitioning 
• Static service pools 
• Statically partition resource for HBase, HDFS, Impala, Search, and YARN 
• Enforced by Linux cgroups
Resource Isolation & Management 
• Dynamic partitioning 
• Dynamic resource pools 
• Dynamically apportion resource [statically] allocated to Impala and YARN 
• Named pool of resource + scheduling policy 
• Resource allocation based on weight 
• User to pool placement policy 
• ACLs 
• SLOs (use of pre-emption)
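Weight-based allocation means each pool's share is proportional to its weight relative to the other pools. A minimal sketch, assuming made-up pool names and that every pool has enough demand to use its full share; real schedulers also redistribute unused capacity and apply minimum and maximum limits.

```python
def weighted_shares(total, weights):
    """Split `total` units of a resource in proportion to each pool's weight."""
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("at least one pool needs a positive weight")
    return {pool: total * w / weight_sum for pool, w in weights.items()}


# Hypothetical pools sharing 120 units (for example, GB of memory).
shares = weighted_shares(120, {"etl": 3, "adhoc": 2, "reporting": 1})
```

With weights 3:2:1, the pools receive one half, one third and one sixth of the resource respectively.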
Resource Isolation & Management 
• Impala admission control 
• Limits concurrent queries and memory usage 
• Additional queries are queued 
• Configured per pool 
• max_requests 
• mem_limit 
• max_queued 
• Avoids resource oversubscription (OOM) during heavy usage 
• Pool placement policy mechanism same as YARN RM 
• Use with static partitioning (independently from YARN) 
• Or integrate with YARN for resource management via Llama
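The decision for each incoming query can be pictured as admit, queue, or reject. The sketch below is a simplified single-pool model that reuses the setting names above as attributes; the `Pool` class, the per-query memory estimate and the order of the checks are assumptions for illustration, not Impala's implementation.

```python
class Pool:
    """Toy model of one admission-control pool."""

    def __init__(self, max_requests, mem_limit, max_queued):
        self.max_requests = max_requests   # concurrent queries allowed
        self.mem_limit = mem_limit         # memory available to the pool
        self.max_queued = max_queued       # queue length before rejecting
        self.running = 0
        self.mem_in_use = 0
        self.queued = 0

    def submit(self, mem_estimate):
        """Return 'admitted', 'queued' or 'rejected' for one incoming query."""
        fits = (self.running < self.max_requests
                and self.mem_in_use + mem_estimate <= self.mem_limit)
        # Do not let a new query jump ahead of ones already waiting.
        if fits and self.queued == 0:
            self.running += 1
            self.mem_in_use += mem_estimate
            return "admitted"
        if self.queued < self.max_queued:
            self.queued += 1
            return "queued"
        return "rejected"
```

Queuing instead of running everything at once is what keeps the pool from oversubscribing memory under heavy load.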
Resource Isolation & Management 
• Classification 
• User to pool placement rules 
• Based on user, group, or specified tag: 
MR: mapreduce.job.queuename 
Impala: REQUEST_POOL
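Placement rules can be pictured as an ordered list in which the first match decides the pool. A minimal sketch, assuming a hypothetical rule order (the explicitly requested pool, then a pool named after the user, then one named after a group, then `default`) and made-up pool names; a request naming an undeclared pool falls through to the next rule rather than creating that pool.

```python
# Pools the administrator has declared (hypothetical names).
DECLARED_POOLS = {"default", "etl", "analysts"}


def place(user, groups, requested=None):
    """Return the first declared pool matched by the ordered rules."""
    candidates = [requested, user, *groups, "default"]
    for pool in candidates:
        if pool in DECLARED_POOLS:
            return pool
    return None  # no rule matched: the request would be rejected
```

Here `requested` stands in for the tag a job or query supplies, such as `mapreduce.job.queuename` or `REQUEST_POOL`.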
Resource Isolation & Management 
• Queues 
• YARN 
• Max running apps 
• Max memory 
• Max vcores 
• Impala admission control 
• Max running queries 
• Max memory 
• Max queue size
Resource Isolation & Management 
• Dynamic resource pools 
• Scheduling policy 
• Dominant Resource Fairness (DRF) 
• Fair Scheduler (FAIR) 
• First-in, First-out (FIFO) 
• Recommendations: 
• Disable undeclared pools 
• Enable the default pool
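Of the scheduling policies listed, DRF is the one that considers more than one resource: it compares pools by their dominant share, the largest fraction of any single cluster resource that a pool holds. A minimal sketch of that calculation with made-up capacities and usage figures; the function names are illustrative, not scheduler APIs.

```python
def dominant_share(usage, capacity):
    """Largest fraction of any single resource consumed by this pool."""
    return max(usage[r] / capacity[r] for r in capacity)


def next_pool(usages, capacity):
    """DRF offers the next allocation to the pool with the smallest dominant share."""
    return min(usages, key=lambda pool: dominant_share(usages[pool], capacity))


capacity = {"memory_gb": 100, "vcores": 50}
usages = {
    "etl": {"memory_gb": 40, "vcores": 10},     # dominant resource: memory
    "adhoc": {"memory_gb": 10, "vcores": 15},   # dominant resource: vcores
}
```

In this example `etl` is memory-bound and `adhoc` is CPU-bound, and the pool with the lower dominant share is served next.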
Thank you.


Configuring a Secure, Multitenant Cluster for the Enterprise

  • 1. Configuring a secure, multitenant cluster for the enterprise James Kinley // Principal Solutions Architect
  • 2. © 2014 Cloudera, Inc. All rights reserved. 2 About me • James Kinley • Principal Solutions Architect, EMEA • Hadoop user since 2010 • Clouderan since 2012 • Background in UK defence industry and cyber security • github.com/jrkinley • jameskinley.tumblr.com • @jrkinley • uk.linkedin.com/in/jameskinley
  • 3. © 2014 Cloudera, Inc. All rights reserved. 3 Introduction: Data Hub Objectives • Sharing Data better insight • Sharing Compute better utilisation and performance • Consolidated Operations reduced cost and complexity
  • 4. Multitenancy in Hadoop refers to a set of features that enable multiple groups from within the same organisation to share the common set of resources in a cluster without negatively impacting service-levels, violating security constraints, or even revealing the existence of each other, all via policy rather than physical separation. © 2014 Cloudera and/or its affiliates. All rights reserved. 4
  • 5. © 2014 Cloudera, Inc. All rights reserved. 5 Multitenant Cluster Architecture • Security & Governance • HDFS Information Architecture (IA) • Authentication • Authorisation • Auditing • Quota management • Resource Isolation & Management • Static partitioning • Dynamic partitioning • Impala admission control PARTNER LOGO
  • 6. © 2014 Cloudera, Inc. All rights reserved. 6 Security & Governance • HDFS Information Architecture: file and directory structure • Authentication: proves users are who they say they are [Kerberos, Identity Management (LDAP)] • Authorisation: determines what users can see and do [HDFS Permissions, RBAC (Apache Sentry), Encryption] • Auditing: determines who did what, and when [Cloudera Navigator]
  • 7. © 2014 Cloudera, Inc. All rights reserved. 7 Security & Governance • HDFS Information Architecture (IA) drwxr-x---+ tadmin tgroup /users/{tenantId} drwxr-x--- tadmin tgroup /users/{tenantId}/archive drwxrwx---+ tadmin hive /users/{tenantId}/warehouse drwxrwx---+ tadmin hive /users/{tenantId}/warehouse/{db}/{table}/{partition} drwxr-x---+ tadmin tgroup /users/{tenantId}/landing drwxrwx--- tadmin tgroup /users/{tenantId}/processing drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId} drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId}/input drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId}/output
  • 8. © 2014 Cloudera, Inc. All rights reserved. 8 Security & Governance • Authentication: Kerberos & LDAP drwxr-x---+ tadmin tgroup /users/{tenantId} drwxr-x--- tadmin tgroup /users/{tenantId}/archive drwxrwx---+ tadmin hive /users/{tenantId}/warehouse drwxrwx---+ tadmin hive /users/{tenantId}/warehouse/{db}/{table}/{partition} drwxr-x---+ tadmin tgroup /users/{tenantId}/landing drwxrwx--- tadmin tgroup /users/{tenantId}/processing drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId} drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId}/input drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId}/output
  • 9. © 2014 Cloudera, Inc. All rights reserved. 9 Security & Governance • Authorisation: HDFS permissions drwxr-x---+ tadmin tgroup /users/{tenantId} drwxr-x--- tadmin tgroup /users/{tenantId}/archive drwxrwx---+ tadmin hive /users/{tenantId}/warehouse drwxrwx---+ tadmin hive /users/{tenantId}/warehouse/{db}/{table}/{partition} drwxr-x---+ tadmin tgroup /users/{tenantId}/landing drwxrwx--- tadmin tgroup /users/{tenantId}/processing drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId} drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId}/input drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId}/output
  • 10. © 2014 Cloudera, Inc. All rights reserved. 10 Security & Governance • Authorisation: HDFS extended ACLs drwxr-x---+ tadmin tgroup /users/{tenantId} drwxr-x--- tadmin tgroup /users/{tenantId}/archive drwxrwx---+ tadmin hive /users/{tenantId}/warehouse drwxrwx---+ tadmin hive /users/{tenantId}/warehouse/{db}/{table}/{partition} drwxr-x---+ tadmin tgroup /users/{tenantId}/landing drwxrwx--- tadmin tgroup /users/{tenantId}/processing drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId} drwxr-x--- {tuser} tgroup Give /users/{“tingest” tenantId}/user permission processing/{over jobId}/the landing input directory: drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId}/output $ hdfs dfs -setfacl -m user:tingest:rwx /users/{tenantId}/landing Give “hive” group permission over the landing directory: $ hdfs dfs -setfacl –m group:hive:rwx /users/{tenantId}/landing
  • 11. © 2014 Cloudera, Inc. All rights reserved. 11 Security & Governance • Authorisation: Apache Sentry (incubating) • Fine-grained, role-based access control (RBAC) • Users can see only the data and metadata to which they have been granted the privilege • Currently works with Apache Hive, Cloudera Impala, and Cloudera Search • File or Service (GRANT/REVOKE) based policy providers • Role-based privilege model • {user} > {groups} > {roles} > object > privilege • object = {server, database, table, URI} • privilege = {select, insert, all} • Supports grant permission delegation for multitenant clusters
  • 12. Security & Governance • Authorisation: Apache Sentry (incubating) drwxr-x---+ tadmin tgroup /users/{tenantId} drwxr-x--- tadmin tgroup /users/{tenantId}/archive drwxrwx---+ tadmin hive /users/{tenantId}/warehouse drwxrwx---+ tadmin hive /users/{tenantId}/warehouse/{db}/{table}/{partition} drwxr-x---+ tadmin tgroup /users/{tenantId}/landing drwxrwx--- tadmin tgroup /users/{tenantId}/processing drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId} drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId}/input drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId}/output Delegate grant and revoke privilege to tenant’s admin role: > GRANT ALL ON DATABASE {db} TO ROLE {tadmin} WITH GRANT OPTION;
  • 13. Security & Governance • Authorisation: Encryption • Network encryption (HDFS and MR) • At-rest encryption for HDFS • Cloudera Navigator Encrypt & KeyTrustee (Gazzang) • Project Rhino (Cloudera + Intel) • HDFS-level encryption (HDFS-6134 + HADOOP-10150) • Encryption zones (HDFS-6386) • Hardware-accelerated (HADOOP-10693)
  • 14. Security & Governance • Authorisation: HDFS encryption zone drwxr-x---+ tadmin tgroup /users/{tenantId} drwxr-x--- tadmin tgroup /users/{tenantId}/archive drwxrwx---+ tadmin hive /users/{tenantId}/warehouse drwxrwx---+ tadmin hive /users/{tenantId}/warehouse/{db}/{table}/{partition} drwxr-x---+ tadmin tgroup /users/{tenantId}/landing drwxrwx--- tadmin tgroup /users/{tenantId}/processing drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId} drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId}/input drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId}/output
  • 15. Security & Governance • Governance: HDFS disk quota management • Restrict tenants’ use of storage • Prevents misuse of the shared filesystem • HDFS supports two quota mechanisms • Disk space quotas • Name quotas
  • 16. Security & Governance • Governance: HDFS disk quota management drwxr-x---+ tadmin tgroup /users/{tenantId} drwxr-x--- tadmin tgroup /users/{tenantId}/archive drwxrwx---+ tadmin hive /users/{tenantId}/warehouse drwxrwx---+ tadmin hive /users/{tenantId}/warehouse/{db}/{table}/{partition} drwxr-x---+ tadmin tgroup /users/{tenantId}/landing drwxrwx--- tadmin tgroup /users/{tenantId}/processing drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId} drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId}/input drwxr-x--- {tuser} tgroup /users/{tenantId}/processing/{jobId}/output
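As a rough illustration of the two quota mechanisms, the sketch below builds the `hdfs dfsadmin -setSpaceQuota` and `hdfs dfsadmin -setQuota` command lines for a tenant's root directory. The helper names, the tenant id, and the figures in the usage note are assumptions made for the example. Space quotas in HDFS are charged per replica, so the sketch sizes the quota on replicated bytes.

```python
from __future__ import annotations


def space_quota_bytes(raw_data_bytes: int, replication: int = 3) -> int:
    """Space quotas count every replica, so multiply by the replication factor."""
    return raw_data_bytes * replication


def quota_commands(tenant_id: str, raw_data_bytes: int, max_names: int,
                   replication: int = 3) -> list[list[str]]:
    """argv lists for a disk space quota and a name quota on /users/<tenant_id>."""
    path = f"/users/{tenant_id}"
    space = space_quota_bytes(raw_data_bytes, replication)
    return [
        ["hdfs", "dfsadmin", "-setSpaceQuota", str(space), path],  # bytes on disk
        ["hdfs", "dfsadmin", "-setQuota", str(max_names), path],   # files + directories
    ]
```

For instance, allowing a tenant 1 TiB of raw data at replication 3 means a space quota of 3 TiB. The name quota caps the number of files and directories, which is what protects the NameNode's metadata.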
  • 17. Resource Isolation & Management • Dividing up finite cluster resource to ensure predictable behaviour • Goals: • Guarantee service levels for critical workflows • Support fair allocation of resources between different groups of users • Prevent users from depriving other users of access to the cluster
  • 18. Resource Isolation & Management • Static partitioning • Static service pools • Statically partition resource for HBase, HDFS, Impala, Search, and YARN • Enforced by Linux cgroups
  • 19. Resource Isolation & Management • Dynamic partitioning • Dynamic resource pools • Dynamically apportion resource [statically] allocated to Impala and YARN • Named pool of resource + scheduling policy • Resource allocation based on weight • User to pool placement policy • ACLs • SLOs (use of pre-emption)
  • 20. Resource Isolation & Management • Impala admission control • Limits concurrent queries and memory usage • Additional queries are queued • Configured per pool • max_requests • mem_limit • max_queued • Avoids resource oversubscription (OOM) during heavy usage • Pool placement policy mechanism same as YARN RM • Use with static partitioning (independently from YARN) • Or integrate with YARN for resource management via Llama
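The execute/queue/reject behaviour implied by the three per-pool settings can be sketched as a small decision function. This is a simplified model, not Impala's implementation: the `Pool` class and `admit` function are hypothetical, memory is tracked as a single number, and the sketch does not model queries leaving the queue or finishing.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    max_requests: int    # concurrent query limit
    mem_limit: int       # memory limit for the pool, in bytes
    max_queued: int      # queue length limit
    running: int = 0
    mem_in_use: int = 0
    queued: int = 0


def admit(pool: Pool, query_mem: int) -> str:
    """Return 'execute', 'queue', or 'reject' for an incoming query."""
    fits = (pool.running < pool.max_requests
            and pool.mem_in_use + query_mem <= pool.mem_limit)
    if fits:
        pool.running += 1
        pool.mem_in_use += query_mem
        return "execute"
    if pool.queued < pool.max_queued:
        pool.queued += 1
        return "queue"
    return "reject"
```

With a pool allowing two concurrent queries, 100 units of memory, and a queue of one, three successive 60-unit queries would be executed, queued, and rejected in that order.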
  • 21. Resource Isolation & Management • Classification • User to pool placement rules • Based on user, group, or specified tag: MR: mapreduce.job.queuename Impala: REQUEST_POOL
  • 22. Resource Isolation & Management • Queues • YARN • Max running apps • Max memory • Max vcores • Impala admission control • Max running queries • Max memory • Max queue size
  • 23. Resource Isolation & Management • Dynamic resource pools • Scheduling policy • Dominant Resource Fairness (DRF) • Fair Scheduler (FAIR) • First-in, First-out (FIFO) • Recommendations: • Disable undeclared pools • Enable the default pool
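Of the three scheduling policies, Dominant Resource Fairness is the least self-explanatory, so here is a minimal sketch of its core idea: a pool's dominant share is its largest fractional use of any single resource, and the pool with the smallest dominant share is served next. The function names and the figures in the usage note are illustrative assumptions, and pool weights are ignored.

```python
from __future__ import annotations


def dominant_share(usage: dict[str, float], capacity: dict[str, float]) -> float:
    """Largest fraction of any one cluster resource that this pool is using."""
    return max(usage[resource] / capacity[resource] for resource in capacity)


def next_pool(usages: dict[str, dict[str, float]],
              capacity: dict[str, float]) -> str:
    """Pick the pool with the lowest dominant share to receive resources next."""
    return min(usages, key=lambda pool: dominant_share(usages[pool], capacity))
```

On a cluster with 100 GB of memory and 50 vcores, a memory-heavy pool using 40 GB and 5 vcores has a dominant share of 0.4, while a CPU-heavy pool using 10 GB and 15 vcores has 0.3, so the CPU-heavy pool would be offered resources first.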

Editor's notes

  1. Sharing Data: single repository for all data; organisation-wide view of data gives better insight; effective sharing of datasets when permitted, isolation of datasets when not. Sharing Compute: allocation of resource is dynamic, optimised, and just-in-time, leading to better utilisation of cluster resources and better performance for individual requests (bursting) across workloads (batch processing, interactive SQL, enterprise search, and advanced analytics). Consolidated Operations: amortise (repay) administrative overhead; reduce cost and complexity.
  2. Multiple groups (departments, projects, users); common set of resources (storage and compute); security constraints (e.g. data protection policy).
  3. Identity Management (user account propagation). Cloudera Navigator (end-to-end governance): auditing, metadata, lineage, lifecycle management (i.e. data retention).
  4. LDAP Integration: typically done at OS level, and HDFS uses shell-based group mapping. User accounts are propagated to all hosts: PAM_LDAP, SSSD, Centrify, VAS/QAS.
  5. POSIX Access Control Lists (Hadoop 2.4): an ACL provides a way to set different permissions for specific named users or named groups, not only the file's owner and the file's group.
  6. Encryption. Compliance regulations: EU Data Protection Directive. Gazzang: OS-level encryption; enterprise-grade key management (Navigator Key Trustee). Navigator Encrypt: encrypts the DN directory in the Linux file system; provided by a kernel module; process-based ACLs (i.e. only the DN can access the encrypted directory). Project Rhino: HDFS-level encryption (Encryption Zones); integrated with Navigator Key Trustee; better for a multitenant type of environment. Hardware-accelerated: encryption is unusable if there is a significant performance penalty; uses the AES instruction set available on Intel processors; HDFS will be able to provide access to encrypted data with minimal performance impact.
  7. Restrict tenants’ disk usage: prevent users from accidentally or maliciously consuming too much disk space within the cluster. Disk space quotas: disk space limits on a per-directory basis. Name quotas: limit the number of files and subdirectories within a particular directory, which helps administrators control the NN metadata.
  8. Analogy: Messi, Neymar, and Suarez pop out for pizza after training. They do not have enough money to buy a large pizza each, so they put their money together to buy one. Once they have the pizza they agree on a policy to share it. Because the pizza has 10 slices, they agree that Messi can eat 4 slices, and Neymar and Suarez can eat 3 each. They can eat in parallel, but can only eat one slice at a time.
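The pizza analogy maps directly onto weight-based allocation in dynamic resource pools. A minimal sketch, assuming each pool's share is simply its weight divided by the sum of all weights (the function name `weighted_shares` is hypothetical):

```python
from __future__ import annotations


def weighted_shares(weights: dict[str, int], total: float) -> dict[str, float]:
    """Split `total` units of a resource between pools in proportion to weight."""
    weight_sum = sum(weights.values())
    return {name: total * weight / weight_sum for name, weight in weights.items()}
```

With weights of 4, 3, and 3 and a 10-slice pizza, this gives 4, 3, and 3 slices. The same weights applied to 100 GB of memory would give 40, 30, and 30 GB.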
  9. I.e. route a tenant’s users based on their AD group membership
  10. Impala: incoming queries are executed, queued, or rejected. Queue if there are too many queries running or not enough memory; reject if the queue is full.
  11. Disabling undeclared pools: applies when the user does not specify a pool. When enabled, a pool is created on-the-fly with the name of the user that submitted the request; when disabled, the default pool is used instead. Enabling the default pool: applies when the user specifies a pool that doesn’t exist. When disabled, the pool is created on-the-fly with the default settings; when enabled, the default pool is used instead.
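The interaction of the two settings in this note can be written out as a small placement function. This is a sketch of the behaviour as described here, not the scheduler's own code; the function name, its parameters, and the literal pool name `default` are assumptions.

```python
from __future__ import annotations


def place(user: str, requested: str | None, declared: set[str],
          undeclared_pools_enabled: bool, default_pool_enabled: bool) -> str:
    """Return the name of the pool a request ends up in."""
    if requested is None:
        # No pool specified: either a per-user pool is created on the fly,
        # or the request falls back to the default pool.
        return user if undeclared_pools_enabled else "default"
    if requested in declared:
        return requested
    # The requested pool does not exist: either redirect to the default pool,
    # or create the requested pool on the fly with default settings.
    return "default" if default_pool_enabled else requested
```

With the recommended settings (undeclared pools disabled, default pool enabled), every request that does not name a declared pool lands in the default pool, so no pools appear that the administrator did not define.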