Access to Hadoop clusters is typically provided through dedicated portal nodes, which sit behind firewalls and perform user authentication and authorization. This approach has several drawbacks -- as shared multi-tenant resources, portal nodes create contention among users and increase the maintenance overhead for cluster administrators. This session discusses the Gateway system, a cluster virtualization framework that provides multiple benefits: seamless access from users' workplace computers through corporate firewalls; failover to active clusters during scheduled or unscheduled downtime, as well as the ability to redirect traffic to other clusters during upgrades; and user access to clusters running different versions of Hadoop.
3. Cluster Access via Portal Nodes
• Users access Hadoop clusters via dedicated portal nodes located behind
corporate firewalls
– Login (ssh) to the portal node: authentication and authorization
– Access clusters: run HDFS commands, submit jobs
[Diagram: user machines connect through the firewall to the portal node(s), which in turn access the cluster: the NameNode and JobTracker masters, and the TaskTracker/DataNode workers.]
3 eBay Inc. confidential
4. Use Case #1: Development of New Applications
• Developers of new applications fall into a cycle of moving programs and
input/output data between their dev boxes, the portal, and the Hadoop clusters.
develop an application;
while ( my manager is unsatisfied ) {
    build the application on your desktop;
    scp myapp.jar or in.mydata to the portal node;
    run the application on the cluster (data in HDFS);
    verify job results with the manager;
    fix application bugs or develop more;
}
offload output data from the cluster;
5. Use Case #2: Access to Public Datasets
• Scientific data:
– Genomics datasets
– Fundamental physics experiments (LHC in Nebraska)
– Astronomical images
• Data is public, but not the servers used to store and process data
• Geographically separated datacenters
• Users should be able to access and analyze data via the Internet
• Implies direct login to the clusters for everybody
– Complex security issues
6. Problem: Portal Nodes as Shared Resources
• Developers hate transferring programs to portal nodes
• Input data must first be transferred to the portal, then to HDFS
• Developers tend to use portals as their dev nodes
– Set up development environments
– Connect to git repositories
• Portals are shared multi-tenant resources
– Community property
• Portal nodes become yet another cluster component
– Maintenance overhead for cluster administrators
• Public datasets: need access without direct login to cluster portals
7. Gateway Project: Main Objective
• Gateway is a cluster virtualization framework that
provides unified and seamless access to Hadoop clusters
from users' workplace computers through corporate firewalls.
[Diagram: user machines connect through the firewall to the Gateway server(s), which access the cluster: the NameNode and JobTracker masters, and the TaskTracker/DataNode workers.]
8. Gateway Project: Principal Benefits
1. Unified access to multiple Hadoop clusters through corporate firewalls
– Multiple clusters within the same datacenter
("HDFS Scalability: The Limits to Growth," USENIX ;login:, 2010)
Parallels HDFS Federation in both implementation (ViewFS) and purpose
– Clusters in different datacenters
2. Service availability:
failover to active clusters when one has scheduled/unscheduled downtime
3. Flexible cluster upgrades:
redirect traffic to other clusters when one is upgrading
4. Versioning:
access to clusters running different versions of Hadoop
5. Load balancing:
smart job submission based on cluster workloads
9. Network Requirements
• Gateway Servers are positioned on the boundary between the corporate
and “public” networks
• Gateway Servers can
– communicate with user desktops/laptops residing on the public network and with
Hadoop clusters running in different datacenters within the corporate network.
• Due to firewalls, there is no direct connectivity between the public network and
the Hadoop clusters other than via the Gateway Servers.
• Gateway plays the role of a proxy between users and Hadoop clusters
– Users delegate execution of their jobs and HDFS commands to the Gateway
servers.
– The servers talk to the actual clusters and return the replies back to the users.
10. Functional Requirements
• The cluster virtualization framework needs to support
– current Java and command-line user-facing Hadoop APIs
– existing Hadoop applications and jobs, which should continue to run from user boxes
the same way they used to from portal nodes
• Transparent use of client side libraries:
– Pig, Hive, Cascading, Hadoop shell commands
• Authorization and Authentication
– As a replacement for existing portal nodes, Gateway should provide adequate
user authentication and authorization
• Unified web UI combining the UIs of the serviced clusters
11. Gateway Architecture
Gateway Virtualization Framework has two main components:
• Job Submission system, represented by
– Gateway MapReduce Server (GWMRServer) on the server side, and
– regular Hadoop job submission and status tracking tools
contacting GWMRServer via the standard Hadoop JobClient.
• Virtualization of File System Access is represented by
– GatewayFileSystem on the client side and
– Gateway File System Server (GWFSServer) on the server side.
[Diagram: the standard Hadoop JobClient talks to the GWMRServer, and the GatewayFileSystem client (gwfs://) talks to the GWFSServer; the servers in turn talk to the Hadoop clusters.]
12. Job Submission
• Hadoop uses JobClient to submit jobs
– A job is defined by its configuration file and the job jar
– JobClient uploads these two files, along with other user-specified files required by
the job, to HDFS and submits the job to the JobTracker
– The job is then scheduled for execution
• GWMRServer is the only component needed to virtualize job submission.
No specialized gateway client is required
• Regular Hadoop JobClients are configured to send submissions to
GWMRServer instead of a JobTracker
• Job Submission Virtualization allows submitting jobs to multiple MR clusters
via GWMRServer
• GWMRServer selects one of the clusters and further submits the job to the
respective JobTracker
13. HDFS Access
• File System Access virtualized via GatewayFileSystem and GWFSServer
• GatewayFileSystem is a new specialized client for accessing HDFS clusters
via GWFSServer
– The client is instantiated automatically based on configuration parameters set up
to access the gateway server instead of HDFS
– GatewayFileSystem passes the client request to GWFSServer
– The gateway server instantiates a traditional HDFS client (DistributedFileSystem)
pointing to the requested cluster
– Executes the request on the cluster and returns the result back to the gateway
client
• Unlike job submission, virtualized file system access is always cluster-aware
– Users accessing a file must explicitly specify which HDFS cluster the file
belongs to
14. GWMR: Implementation
• GWMRServer is an implementation of mapred.JobSubmissionProtocol (Hadoop 0.20),
or mapreduce.ClientProtocol in later versions
• GWMRServer can be accessed via regular Hadoop command-line-interface
and Java interface
• MR clients communicate (submit jobs and obtain job information) directly
with GWMRServer, as if talking to a real JobTracker via hadoop.RPC
• GWMRServer redirects the job to one of the clusters, based on
– Data location
– Cluster workload
– User group information
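A selection policy combining those three signals could be sketched as follows; the scoring weights and cluster descriptors are invented for illustration and are not the actual GWMRServer logic:

```python
def select_cluster(clusters, job):
    """Pick a target cluster for a submitted job.
    clusters: list of dicts with 'name', 'load' (0..1),
    'datasets' (set of path prefixes), 'groups' (allowed user groups).
    job: dict with 'input' path and 'group'."""
    best, best_score = None, float("-inf")
    for c in clusters:
        if job["group"] not in c["groups"]:
            continue  # user group not permitted on this cluster
        score = 0.0
        if any(job["input"].startswith(d) for d in c["datasets"]):
            score += 2.0          # prefer clusters already holding the input data
        score -= c["load"]        # prefer lightly loaded clusters
        if score > best_score:
            best, best_score = c["name"], score
    return best

clusters = [
    {"name": "A", "load": 0.9, "datasets": {"/data/logs"}, "groups": {"dev"}},
    {"name": "B", "load": 0.2, "datasets": set(), "groups": {"dev"}},
]
print(select_cluster(clusters, {"input": "/data/logs/2010", "group": "dev"}))
```

Here data locality outweighs load, so cluster A wins despite being busier; a real policy would tune such trade-offs.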
15. GWMR: Implementation Continued
• GWMRServer is stateless (or keeps a very lightweight state)
– this allows setting up pools of Gateway servers to avoid a single point of failure
• On startup GWMRServer reads configuration from “gateway-site.xml”,
which determines the Hadoop MR clusters it must serve
• GWMRServer has a web UI, similar to the JobTracker UI, which aggregates
data from available JobTrackers
• GWMRServer supports job sequencing, so that chained MR jobs initiated
by a single Pig or Hive job are scheduled to the same cluster
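The contents of gateway-site.xml are not shown in the slides; a fragment listing the serviced clusters might look like the hypothetical sketch below. The property names and addresses are invented for illustration:

```xml
<!-- gateway-site.xml: property names and addresses are hypothetical -->
<configuration>
  <property>
    <name>gateway.clusters</name>
    <value>clusterA,clusterB</value>
  </property>
  <property>
    <name>gateway.clusterA.jobtracker</name>
    <value>jt-a.example.com:9001</value>
  </property>
  <property>
    <name>gateway.clusterB.jobtracker</name>
    <value>jt-b.example.com:9001</value>
  </property>
</configuration>
```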
16. GatewayFileSystem: Implementation
• GatewayFileSystem is a subclass of the FileSystem abstract class
Similar to LocalFileSystem, HFTPFileSystem, S3FileSystem
– gwfs:// - GatewayFileSystem
– file:// - LocalFileSystem
– hdfs:// - DistributedFileSystem
– har:// - HarFileSystem
– hftp:// - HFTPFileSystem
– s3:// - S3FileSystem
– kfs:// - KFSFileSystem
• GatewayFileSystem is instantiated based on the URI scheme listed in
fs.default.name (fs.defaultFS) field of core-site.xml
– fs.default.name = gwfs://<GWFSServer-address>
– fs.gwfs.impl = org.apache.hadoop.gateway.fs.GatewayFileSystem
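In XML form, the two client-side settings from the slide would look like this in core-site.xml (the server address is illustrative):

```xml
<!-- core-site.xml on the client (server address is illustrative) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>gwfs://gwfs.example.com:8020</value>
  </property>
  <property>
    <name>fs.gwfs.impl</name>
    <value>org.apache.hadoop.gateway.fs.GatewayFileSystem</value>
  </property>
</configuration>
```

The scheme in fs.default.name selects GatewayFileSystem via the fs.gwfs.impl mapping, so unmodified Hadoop clients pick it up automatically.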
17. GWFSServer: Implementation
• GatewayFileSystem passes client requests to GWFSServer using
– a new RPC protocol – GWFSProtocol, and
– a new binary data transfer protocol – DataTProtocol
• The Gateway server processes GWFSProtocol requests
– It instantiates a real DistributedFileSystem pointing to the required cluster,
– executes the request and returns results back to the gateway client
• DataTProtocol transfers data between the Gateway clients and the server
– The data transfer is a direct pipeline between a gateway client and HDFS
– GWFSServer reads data from HDFS and pipelines it to gateway client via
DataTProtocol, and vice versa for write
• GWFSServer is stateless. This allows setting up pools of servers to avoid
a single point of failure and to provide load balancing
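The read path described above is a chunked pipeline: the server forwards each block to the client as it arrives rather than buffering whole files. A minimal generator-based model (chunk size and the in-memory "HDFS" are illustrative):

```python
def hdfs_read(data, chunk_size=4):
    """Stand-in for GWFSServer reading a file from HDFS in chunks."""
    for i in range(0, len(data), chunk_size):
        yield data[i:i + chunk_size]

def gateway_pipeline(data):
    """Model of the DataTProtocol forwarding: each chunk is passed on
    to the client as soon as it is read, so the stateless server holds
    at most one chunk at a time."""
    for chunk in hdfs_read(data):
        yield chunk  # in the real server this is written to the client socket

received = b"".join(gateway_pipeline(b"hello, gateway"))
print(received)
```

Writes would run the same pipeline in the opposite direction, client to HDFS.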
18. Versioning
• GWMRServer can serve JobClients of a specific version only. Incompatible
versions of Hadoop require different implementations of GWMRServer
– The service will run multiple versions of GWMRServer so that client requests
can be redirected to a server of the compatible version
• Same instance of GWMRServer can submit jobs and query map-reduce
clusters running different versions of Hadoop
– GWMRServer discovers the Hadoop version of a particular cluster, and uses the
respective Hadoop jars to instantiate an appropriate version of the JobClient
• GatewayFileSystem-to-GWFSServer communication is independent of
HDFS
– No need to implement a new GWFSServer for every new Hadoop release.
• Same instance of GWFSServer can access clusters running different
versions of HDFS
– GWFSServer discovers the HDFS cluster version and uses the respective jars to
instantiate an appropriate version of the DistributedFileSystem
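On both sides, version dispatch amounts to a registry of per-version client factories keyed by the discovered cluster version. A sketch of the pattern (the version strings and class names are illustrative, and the registry stands in for loading version-specific Hadoop jars):

```python
class ClientV020:
    """Stand-in for a Hadoop 0.20 client implementation."""
    version = "0.20"

class ClientV022:
    """Stand-in for a Hadoop 0.22 client implementation."""
    version = "0.22"

CLIENT_FACTORIES = {"0.20": ClientV020, "0.22": ClientV022}

def client_for(cluster_version):
    """Instantiate the client matching the discovered cluster version."""
    try:
        return CLIENT_FACTORIES[cluster_version]()
    except KeyError:
        raise ValueError(f"no client for Hadoop {cluster_version}")

print(client_for("0.20").version)
```

New Hadoop releases then require only registering a new factory, not a new server.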
19. Project Status
• Support for Hadoop 0.20.xxx
Plan for 0.22
• Authorization & Authentication
• Job Chaining
• Packaging
– It is convenient for users to have
the entire Hadoop client suite
installed, configured,
and packaged as a VM
• Plan to open-source soon
• Developers wanted