Aaron T. Myers (ATM), Software Engineer, Cloudera, Inc.
The era of “Big Data for the masses” is upon us. Despite the mindshare Big Data has been receiving, driven by the development and distribution of Apache Hadoop, the first commercialized release came only in December 2011, from Cloudera, Inc. Cloudera remains the leading Hadoop platform provider in the market today. Now, with a diverse list of enterprise and government early-adopter customers, Cloudera has a bird’s-eye view of the leading authentication issues emerging as these companies head out of the sandbox and into full production.
Speaker Aaron T. Myers (ATM) was one of Cloudera’s earliest engineers and focuses on the Apache Hadoop core, specifically HDFS and Hadoop’s security features. ATM is an Apache Hadoop PMC Member and Committer.
Securing the Hadoop Ecosystem
1. Securing the Hadoop Ecosystem
Aaron T. Myers (ATM) @ Cloudera
Cloud Identity Summit, July 2013
2. Who am I?
• Software Engineer at Cloudera
• Hadoop Committer and PMC Member at Apache Software Foundation
• Primarily work on Hadoop Security and HDFS
• Master’s thesis focused on systems security
3. Agenda
• What is Hadoop?
• Hadoop Ecosystem Interactions
• Hadoop Authentication
• Hadoop Authorization
• IT Infrastructure Integration
• The Future: Where Hadoop Security is Headed
4. Hadoop Is…
• A distributed system
  • Designed for massive scaling of storage and compute across many (10s–1000s) nodes
• An ecosystem
  • Hadoop is the kernel; apps on top are user-level programs
  • e.g. Impala, Hive, Oozie, HBase, etc.
• A security pain
  • Designed to run arbitrary code submitted by users
  • Another place where many users interact with the system
  • Many orgs provide “Hadoop as a service”
5. Hadoop Is…
• Not secure by default
  • No authentication whatsoever
• Usually behind a corporate firewall
• Often accessed by common BI tools
  • Tableau, SAS, MicroStrategy, etc.
• Expected to be integrated into corporate IT infra
  • SSO, etc.
6. Hadoop on its Own
[Architecture diagram: a Hadoop cluster with NameNode (NN), Secondary NameNode (SNN), JobTracker (JT), and DataNodes (DN) each running a TaskTracker (TT) hosting Map and Reduce tasks; HDFS, WebHDFS, HttpFS, and MR clients connect as end users; the daemons run as the hdfs, httpfs & mapred users; protocols: RPC / data transfer / HTTP]
9. Authentication Details
• Hadoop authentication based on Kerberos
  • Usually MIT, also Active Directory
• End users to services, as a user
  • CLI & libraries: Kerberos (kinit or keytab)
  • Web UIs: Kerberos SPNEGO & pluggable HTTP auth
• Services to services, as a service
  • Credentials: Kerberos (keytab)
• Services to services, on behalf of a user
  • Proxy-user (after Kerberos for service)
• Job tasks to services, on behalf of a user
  • Job delegation token
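The authentication and proxy-user behaviors above are driven by cluster configuration. A minimal sketch of the relevant core-site.xml properties (the host name and group values are illustrative, and the proxy-user entries here assume an Oozie server acting on behalf of end users):

```xml
<!-- core-site.xml: minimal sketch; values are illustrative -->
<property>
  <name>hadoop.security.authentication</name>
  <!-- default is "simple", i.e. no authentication at all -->
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
<!-- proxy-user: allow the oozie service principal to act
     on behalf of end users, from specific hosts/groups only -->
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>oozie-host.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>analysts,eng</value>
</property>
```

Restricting the proxy-user grant to particular hosts and groups is what keeps “on behalf of a user” from becoming “as any user from anywhere.”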
10. Authorization Details
• HDFS data
  • File system permissions (Unix-like user/group permissions)
• HBase data
  • Read/write Access Control Lists (ACLs) at table level
• Hive Metastore (Hive, Impala)
  • Leverages/proxies HDFS permissions for tables & partitions
• Hive Server (Hive, Impala) (coming)
  • More advanced GRANT/REVOKE with ACLs for tables
• Jobs (Hadoop, Oozie)
  • Job ACLs for Hadoop scheduler queues, manage & view jobs
• ZooKeeper
  • ACLs at znodes, authenticated & read/write
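The Unix-like check HDFS applies can be summarized in a few lines. This is a toy model for illustration, not Hadoop’s actual code: pick the owner, group, or “other” permission class for the requesting user, then test the relevant rwx bit.

```python
def is_allowed(user, user_groups, action, owner, group, mode):
    """Toy model of a Unix-style permission check, as HDFS applies
    to each path component. `action` is 'r', 'w', or 'x'; `mode` is
    an int like 0o750. Not Hadoop's actual implementation."""
    shift = {"r": 2, "w": 1, "x": 0}[action]
    if user == owner:
        bits = (mode >> 6) & 0o7      # owner class
    elif group in user_groups:
        bits = (mode >> 3) & 0o7      # group class
    else:
        bits = mode & 0o7             # other class
    return bool(bits & (1 << shift))
```

For example, with mode 0o640 owned by atm:analysts, user ‘atm’ may read and write, members of ‘analysts’ may only read, and everyone else is denied. The slide’s point about Hive follows directly: a single owner/group/other triple cannot express “these three groups may read, that one may also write.”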
11. IT Integration: Kerberos
• Users don’t want Yet Another Credential
• Corp IT doesn’t want to provision thousands of service principals
• Solution: local KDC + one-way trust
  • Run a KDC (usually MIT Kerberos) in the cluster
  • Put all service principals here
  • Set up one-way trust of central corporate realm by local KDC
  • Normal user credentials can be used to access Hadoop
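On the cluster hosts, this layout shows up in krb5.conf. A sketch, using example realm names (the trust itself also requires a shared cross-realm krbtgt principal, e.g. krbtgt/HADOOP.EXAMPLE.COM@EXAMPLE.COM, created on both KDCs with the same key):

```
# krb5.conf on cluster hosts (sketch; realm and host names are examples)
[libdefaults]
  default_realm = HADOOP.EXAMPLE.COM

[realms]
  HADOOP.EXAMPLE.COM = { kdc = kdc.hadoop.example.com }
  EXAMPLE.COM        = { kdc = ad.example.com }

[capaths]
  # direct path: corporate-realm tickets are accepted locally
  EXAMPLE.COM = { HADOOP.EXAMPLE.COM = . }
```

Hadoop’s `hadoop.security.auth_to_local` rules then map incoming corporate principals (user@EXAMPLE.COM) down to short local user names, so the rest of the stack sees plain ‘atm’ rather than a full principal.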
12. IT Integration: Groups
• Much of Hadoop authorization uses “groups”
  • User ‘atm’ might belong to groups ‘analysts’, ‘eng’, etc.
• Users’ groups are not stored in Hadoop anywhere
  • Refers to external system to determine group membership
• NN/JT/Oozie/Hive servers all must perform group mapping
• Default plugins for user/group mapping:
  • ShellBasedUnixGroupsMapping – forks/runs `/bin/id`
  • JniBasedUnixGroupsMapping – makes a system call
  • LdapGroupsMapping – talks directly to an LDAP server
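What ShellBasedUnixGroupsMapping does is roughly the following, shown here as a simplified Python sketch (the real plugin is Java and adds error handling and caching not covered on the slide):

```python
import subprocess

def get_groups(user):
    # Roughly what ShellBasedUnixGroupsMapping does: fork the `id`
    # command for the user and parse its whitespace-separated output.
    # Simplified sketch; the real plugin is Java.
    out = subprocess.run(["id", "-Gn", user],
                         capture_output=True, text=True, check=True)
    return out.stdout.split()
```

Because each server resolves groups against the host OS (or LDAP) independently, the slide’s caveat matters operationally: the NN, JT, Oozie, and Hive hosts must all see a consistent view of group membership, or authorization decisions will differ by daemon.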
13. IT Integration: Kerberos + LDAP
[Diagram: a Hadoop cluster containing a local KDC that holds the service principals (hdfs/host1@HADOOP.EXAMPLE.COM, yarn/host2@HADOOP.EXAMPLE.COM, …), joined by a cross-realm trust to a central Active Directory holding the user principals (tucu@EXAMPLE.COM, atm@EXAMPLE.COM, …); the NN and JT perform LDAP group mapping]
14. IT Integration: Web Interfaces
• Most web interfaces authenticate using SPNEGO
  • Standard HTTP authentication protocol
  • Used internally by services which communicate over HTTP
  • Most browsers support Kerberos SPNEGO authentication
• Hadoop components which use servlets for web interfaces can plug in custom filter
  • Integrate with intranet SSO HTTP solution
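The servlet-filter hook means any HTTP auth scheme can sit in front of the web UIs. Conceptually, the SPNEGO/Negotiate handshake the default filter performs looks like this (a sketch in Python for brevity; Hadoop’s real filter is a Java servlet filter and validates the token via GSSAPI/Kerberos rather than accepting anything non-empty):

```python
def spnego_filter(request_headers):
    # Conceptual sketch of a SPNEGO/Negotiate HTTP filter, not
    # Hadoop's actual Java filter. Returns (status, response_headers).
    auth = request_headers.get("Authorization", "")
    if not auth.startswith("Negotiate "):
        # No token yet: challenge the browser to start the handshake.
        return 401, {"WWW-Authenticate": "Negotiate"}
    token = auth[len("Negotiate "):]
    # A real filter would hand `token` to GSSAPI for validation and
    # extract the authenticated principal; here we only check presence.
    if token:
        return 200, {}
    return 401, {"WWW-Authenticate": "Negotiate"}
```

Swapping this filter for one that checks an intranet SSO cookie or header is exactly the customization point the slide describes.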
16. Issues with Hadoop Security
• SSO support is poor and not universal
  • Only supported for the web interfaces, little used, etc.
  • Kerberos is the only option
• Not all orgs are comfortable administering a net-new Kerberos realm
• Not well-suited for cloud deployments
  • Need properly working reverse DNS
  • Pain to provision KDC, distribute keytabs
• Kerberos is tough for management tools
  • No Kerberos administrative API/protocol
17. Issues with Hadoop Security (cont.)
• Isolation of user tasks currently requires separate local Unix accounts on all boxes
  • Need to integrate with LDAP using PAM or something like it
• HDFS authorization only supports Unix-style permissions
  • Not expressive enough for some applications, e.g. Hive
18. Future Development
• Full SSO support
  • OAuth the most commonly requested; first goal
• Decouple Hadoop RPC implementation from Kerberos
  • Make authentication system fully pluggable for custom implementations
  • Any service which can provide bidirectional authentication
• Improve management tools
  • Cloudera Manager can manage more of the security infrastructure
19. Future Development (cont.)
• Use better isolation methods for user tasks
  • Linux containers
  • Solaris “zones”
  • Etc.
• Better authorization capabilities
  • Talk of adding ACL support to HDFS
  • Hive Server 2 will provide rich authorization capabilities