Access to Hadoop clusters is typically provided through dedicated portal nodes, which sit behind firewalls and perform user authentication and authorization. This approach has several drawbacks -- as shared multi-tenant resources, portal nodes create contention among users and increase the maintenance overhead for cluster administrators. This session discusses the Gateway system, a cluster virtualization framework that provides multiple benefits: seamless access from users' workplace computers through corporate firewalls; failover to active clusters during scheduled or unscheduled downtime, as well as the ability to redirect traffic to other clusters during upgrades; and user access to clusters running different versions of Hadoop.
3. Cluster Access via Portal Nodes
• Users access Hadoop clusters via dedicated portal nodes located behind
corporate firewalls
– Login (ssh) to the portal node: authentication and authorization
– Access clusters: run HDFS commands, submit jobs
[Diagram: user machines connect through the firewall to the portal node(s), which in turn access the cluster: the NameNode and JobTracker masters, and the TaskTracker/DataNode workers.]
3 eBay Inc. confidential
4. Use Case #1: Development of New Applications
• Developers of new applications fall into a cycle of moving programs and
input/output data between their dev boxes, the portal, and the Hadoop clusters.
develop an application;
while ( my manager is unsatisfied ) {
    build the application on your desktop;
    scp myapp.jar or in.mydata to the portal node;
    run the application on the cluster (data in HDFS);
    verify job results with the manager;
    fix application bugs or develop more;
}
offload output data from the cluster;
5. Use Case #2: Access to Public Datasets
• Scientific data:
– Genomics datasets
– Fundamental physics experiments (LHC in Nebraska)
– Astronomical images
• Data is public, but not the servers used to store and process data
• Geographically separated datacenters
• Users should be able to access and analyze data via the Internet
• Implies direct login to the clusters for everybody
– Complex security issues
6. Problem: Portal Nodes as Shared Resources
• Developers hate transferring programs to portal nodes
• Input data must first be transferred to the portal, then to HDFS
• Developers tend to use portals as their dev nodes
– Set up development environments
– Connect to git repositories
• Portals are shared multi-tenant resources
– Community property
• Portal nodes become yet another cluster component
– Maintenance overhead for cluster administrators
• Public datasets: need access without direct login to cluster portals
7. Gateway Project: Main Objective
• Gateway is a cluster virtualization framework that
provides unified and seamless access to Hadoop clusters
from users' workplace computers through corporate firewalls.
[Diagram: user machines connect through the firewall to the Gateway server(s), which access the cluster: the NameNode and JobTracker masters, and the TaskTracker/DataNode workers.]
8. Gateway Project: Principal Benefits
1. Unified access to multiple Hadoop clusters through corporate firewalls
– Multiple clusters within the same datacenter
("HDFS Scalability: The Limits to Growth," USENIX ;login:, 2010)
Parallels HDFS Federation in both implementation (ViewFS) and purpose
– Clusters in different datacenters
2. Service availability:
failover to active clusters when one has scheduled/unscheduled downtime
3. Flexible cluster upgrades:
redirect traffic to other clusters when one is upgrading
4. Versioning:
access to clusters running different versions of Hadoop
5. Load balancing:
smart job submission based on cluster workloads
9. Network Requirements
• Gateway Servers are positioned on the boundary between the corporate
and “public” networks
• Gateway Servers can
– communicate with user desktops/laptops residing on the public network and with
Hadoop clusters running in different datacenters within the corporate network.
• Due to firewalls, there is no direct connectivity between the public network and
the Hadoop clusters other than via the Gateway Servers.
• Gateway plays the role of a proxy between users and Hadoop clusters
– Users delegate execution of their jobs and HDFS commands to the Gateway
servers.
– The servers talk to the actual clusters and return the replies back to the users.
10. Functional Requirements
• The cluster virtualization framework needs to support
– current Java and command-line user-facing Hadoop APIs
– existing Hadoop applications and jobs, which should continue to run from user boxes
the same way they used to from portal nodes
• Transparent use of client side libraries:
– Pig, Hive, Cascading, Hadoop shell commands
• Authorization and Authentication
– As a replacement for existing portal nodes, Gateway should provide adequate
user authentication and authorization
• Unified web UI combining the UIs of the serviced clusters
11. Gateway Architecture
Gateway Virtualization Framework has two main components:
• Job Submission system, represented by
– Gateway MapReduce Server (GWMRServer) on the server side, and
– regular Hadoop job submission and status tracking tools
contacting GWMRServer via the standard Hadoop JobClient.
• Virtualization of File System Access is represented by
– GatewayFileSystem on the client side and
– Gateway File System Server (GWFSServer) on the server side.
[Diagram: the standard Hadoop JobClient talks to the GWMRServer, and the GatewayFileSystem client (gwfs://) talks to the GWFSServer; the servers in turn talk to the Hadoop clusters.]
12. Job Submission
• Hadoop uses JobClient to submit jobs
– A job is defined by its configuration file and the job jar
– JobClient uploads these two files, along with other user-specified files required by
the job, to HDFS and submits the job to the JobTracker
– The job is then scheduled for execution
• GWMRServer is the only component needed to virtualize job submission.
No specialized gateway client is required
• Regular Hadoop JobClients are configured to send submissions to
GWMRServer instead of a JobTracker
• Job Submission Virtualization allows submitting jobs to multiple MR clusters
via GWMRServer
• GWMRServer selects one of the clusters and further submits the job to the
respective JobTracker
13. HDFS Access
• File System Access virtualized via GatewayFileSystem and GWFSServer
• GatewayFileSystem is a new specialized client for accessing HDFS clusters
via GWFSServer
– The client is instantiated automatically based on configuration parameters set up
to access the gateway server instead of HDFS
– GatewayFileSystem passes the client request to GWFSServer
– The gateway server instantiates a traditional HDFS client (DistributedFileSystem)
pointing to the requested cluster
– Executes the request on the cluster and returns the result back to the gateway
client
• Unlike job submission, virtualized file system access is always cluster-aware
– Users accessing a file must explicitly specify which HDFS cluster the file
belongs to
14. GWMR: Implementation
• GWMRServer is an implementation of mapred.JobSubmissionProtocol (Hadoop 0.20),
or mapreduce.ClientProtocol in later versions
• GWMRServer can be accessed via regular Hadoop command-line-interface
and Java interface
• MR clients communicate (submit jobs and obtain job information) directly
with GWMRServer, as if talking to a real JobTracker via hadoop.RPC
• GWMRServer redirects the job to one of the clusters, based on
– Data location
– Cluster workload
– User group information
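A selection policy combining those three signals could be sketched as follows; the scoring weights and cluster descriptors are invented for illustration and are not the actual GWMRServer logic:

```python
def select_cluster(clusters, job):
    """Pick a target cluster for a submitted job.
    clusters: list of dicts with 'name', 'load' (0..1),
    'datasets' (set of path prefixes), 'groups' (allowed user groups).
    job: dict with 'input' path and 'group'."""
    best, best_score = None, float("-inf")
    for c in clusters:
        if job["group"] not in c["groups"]:
            continue  # user group not permitted on this cluster
        score = 0.0
        if any(job["input"].startswith(d) for d in c["datasets"]):
            score += 2.0          # prefer clusters already holding the input data
        score -= c["load"]        # prefer lightly loaded clusters
        if score > best_score:
            best, best_score = c["name"], score
    return best

clusters = [
    {"name": "A", "load": 0.9, "datasets": {"/data/logs"}, "groups": {"dev"}},
    {"name": "B", "load": 0.2, "datasets": set(), "groups": {"dev"}},
]
print(select_cluster(clusters, {"input": "/data/logs/2010", "group": "dev"}))
```

Here data locality outweighs load, so cluster A wins despite being busier; a real policy would tune such trade-offs.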
15. GWMR: Implementation Continued
• GWMRServer is stateless (or keeps a very lightweight state)
– this allows setting up pools of Gateway servers to avoid a single point of failure
• On startup GWMRServer reads configuration from “gateway-site.xml”,
which determines the Hadoop MR clusters it must serve
• GWMRServer has a web UI, similar to the JobTracker UI, which aggregates
data from available JobTrackers
• GWMRServer supports job sequencing, so that chained MR jobs initiated
by a single Pig or Hive job are scheduled to the same cluster
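The contents of gateway-site.xml are not shown in the slides; a fragment listing the serviced clusters might look like the hypothetical sketch below. The property names and addresses are invented for illustration:

```xml
<!-- gateway-site.xml: property names and addresses are hypothetical -->
<configuration>
  <property>
    <name>gateway.clusters</name>
    <value>clusterA,clusterB</value>
  </property>
  <property>
    <name>gateway.clusterA.jobtracker</name>
    <value>jt-a.example.com:9001</value>
  </property>
  <property>
    <name>gateway.clusterB.jobtracker</name>
    <value>jt-b.example.com:9001</value>
  </property>
</configuration>
```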
16. GatewayFileSystem: Implementation
• GatewayFileSystem is a subclass of the FileSystem abstract class
Similar to LocalFileSystem, HFTPFileSystem, S3FileSystem
– gwfs:// - GatewayFileSystem
– file:// - LocalFileSystem
– hdfs:// - DistributedFileSystem
– har:// - HarFileSystem
– hftp:// - HFTPFileSystem
– s3:// - S3FileSystem
– kfs:// - KFSFileSystem
• GatewayFileSystem is instantiated based on the URI scheme listed in
fs.default.name (fs.defaultFS) field of core-site.xml
– fs.default.name = gwfs://<GWFSServer-address>
– fs.gwfs.impl = org.apache.hadoop.gateway.fs.GatewayFileSystem
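In XML form, the two client-side settings from the slide would look like this in core-site.xml (the server address is illustrative):

```xml
<!-- core-site.xml on the client (server address is illustrative) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>gwfs://gwfs.example.com:8020</value>
  </property>
  <property>
    <name>fs.gwfs.impl</name>
    <value>org.apache.hadoop.gateway.fs.GatewayFileSystem</value>
  </property>
</configuration>
```

The scheme in fs.default.name selects GatewayFileSystem via the fs.gwfs.impl mapping, so unmodified Hadoop clients pick it up automatically.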
17. GWFSServer: Implementation
• GatewayFileSystem passes client requests to GWFSServer using
– a new RPC protocol – GWFSProtocol, and
– a new binary data transfer protocol – DataTProtocol
• The Gateway server processes GWFSProtocol requests
– It instantiates a real DistributedFileSystem pointing to the required cluster,
– executes the request and returns results back to the gateway client
• DataTProtocol transfers data between the Gateway clients and the server
– The data transfer is a direct pipeline between a gateway client and HDFS
– GWFSServer reads data from HDFS and pipelines it to gateway client via
DataTProtocol, and vice versa for write
• GWFSServer is stateless. This allows setting up pools of servers to avoid
a single point of failure and to provide load balancing
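The read path described above is a chunked pipeline: the server forwards each block to the client as it arrives rather than buffering whole files. A minimal generator-based model (chunk size and the in-memory "HDFS" are illustrative):

```python
def hdfs_read(data, chunk_size=4):
    """Stand-in for GWFSServer reading a file from HDFS in chunks."""
    for i in range(0, len(data), chunk_size):
        yield data[i:i + chunk_size]

def gateway_pipeline(data):
    """Model of the DataTProtocol forwarding: each chunk is passed on
    to the client as soon as it is read, so the stateless server holds
    at most one chunk at a time."""
    for chunk in hdfs_read(data):
        yield chunk  # in the real server this is written to the client socket

received = b"".join(gateway_pipeline(b"hello, gateway"))
print(received)
```

Writes would run the same pipeline in the opposite direction, client to HDFS.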
18. Versioning
• GWMRServer can serve JobClients of a specific version only. Incompatible
versions of Hadoop require different implementations of GWMRServer
– The service will run multiple versions of GWMRServer so that client requests
can be redirected to a server of the compatible version
• Same instance of GWMRServer can submit jobs and query map-reduce
clusters running different versions of Hadoop
– GWMRServer discovers the Hadoop version of a particular cluster, and uses the
respective Hadoop jars to instantiate an appropriate version of the JobClient
• GatewayFileSystem-to-GWFSServer communication is independent of
HDFS
– No need to implement a new GWFSServer for every new Hadoop release.
• Same instance of GWFSServer can access clusters running different
versions of HDFS
– GWFSServer discovers the HDFS cluster version and uses the respective jars to
instantiate an appropriate version of the DistributedFileSystem
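On both sides, version dispatch amounts to a registry of per-version client factories keyed by the discovered cluster version. A sketch of the pattern (the version strings and class names are illustrative, and the registry stands in for loading version-specific Hadoop jars):

```python
class ClientV020:
    """Stand-in for a Hadoop 0.20 client implementation."""
    version = "0.20"

class ClientV022:
    """Stand-in for a Hadoop 0.22 client implementation."""
    version = "0.22"

CLIENT_FACTORIES = {"0.20": ClientV020, "0.22": ClientV022}

def client_for(cluster_version):
    """Instantiate the client matching the discovered cluster version."""
    try:
        return CLIENT_FACTORIES[cluster_version]()
    except KeyError:
        raise ValueError(f"no client for Hadoop {cluster_version}")

print(client_for("0.20").version)
```

New Hadoop releases then require only registering a new factory, not a new server.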
19. Project Status
• Support for Hadoop 0.20.xxx
Plan for 0.22
• Authorization & Authentication
• Job Chaining
• Packaging
– It is convenient for users to have
the entire Hadoop client suite
installed, configured,
and packaged as a VM
• Plan to open-source soon
• Developers wanted