Presentation at the 1st Workshop on Science of Cyberinfrastructure: Research, Experience, Applications and Models.
Topic: A case study by the Big Data Systems Lab group at Clemson University on setting up non-root dynamic provisioning of two big data infrastructures, Hadoop and HPCC Systems, on a shared research computing resource.
Dynamic Provisioning of Data Intensive Computing Middleware Frameworks
1. Dynamic Provisioning of Data Intensive Computing Middleware Frameworks: A Case Study
Linh B. Ngo1, Michael E. Payne1, Flavio Villanustre2, Richard Taylor2, Amy W. Apon1
1School of Computing, Clemson University
2LexisNexis® Risk Solutions
2. Contents
1. Overview of Clemson University's Cyberinfrastructure Resource
2. Demand for Dynamic Data-Intensive Computing Middleware Frameworks
3. Dynamic Provisioning of Data-Intensive Computing Frameworks
4. Deploying Hadoop Ecosystem vs. Deploying HPCC Systems®
5. Lessons Learned
3. Cyberinfrastructure Resource at Clemson University
• Condominium model
• 2,007 compute nodes (21,400 cores), including 276 GPU nodes
• Sustained 551 TFLOPS (benchmarked on GPU nodes only)
• 1,289 active users, 12 academic departments across 36 fields of research
4. Cyberinfrastructure Resource at Clemson University
• Interconnects: 1G/10G Ethernet, Myrinet-10G, InfiniBand-40G, InfiniBand-56G
• Local storage between 100-200 GB (majority) and 400-900 GB (nodes added since 2013)
• Shared 233 TB OrangeFS scratch space and more than 3 PB archival space
5. Demand for Dynamic Data-Intensive Computing Middleware Frameworks
• Genome Sequencing (Hadoop MapReduce/GPGPU)
• Molecular Dynamics Forward Flux Sampling (Hadoop Streaming/LAMMPS)
• Streaming Data Infrastructure for Connected Vehicle System (Hadoop Distributed File System/Spark/Kafka)
• Big Scholarly Data (HPCC Systems)
• CS Course in Distributed and Cluster Computing (MPI/MapReduce, Hadoop/Spark/HPCC Systems® …)
6. Demand for Dynamic Data-Intensive Computing Middleware Frameworks
• Changes in the cyberinfrastructure support model for data infrastructure:
– Beyond a traditional remote distributed file system model
– From static, dedicated resources to dynamic resources
– Data management processes co-locate with computing processes
• Challenges for system administrators:
– Accommodating different frameworks for different research
– Complying with existing administrative policy and scheduling priority
• What can users do?
– Deploy dynamic data-intensive computing frameworks within the limits of user privilege and without the intervention of administrators
7. Dynamic Provisioning of Data-Intensive Computing Frameworks: Installation
• Where to install:
1. Home directory: persistent, limited storage
2. Shared distributed storage: fast, semi-persistent, "unlimited" storage
3. Local storage on compute nodes: fast, non-persistent, requires reinstallation
• How to handle dependencies:
1. Ideally in home or shared distributed storage (persistence)
2. Dynamic loading mechanisms via environment paths
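The environment-path mechanism above can be sketched as a few lines of shell. The `$HOME/software` prefix is a hypothetical choice for illustration; any user-writable directory in home or shared storage works the same way.

```shell
# Hypothetical user-writable install prefix; no root privileges required.
PREFIX="$HOME/software"
mkdir -p "$PREFIX/bin" "$PREFIX/lib" "$PREFIX/include"

# Dynamic loading via environment paths: binaries, shared libraries, and
# headers installed under $PREFIX are found at run time and build time.
export PATH="$PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$PREFIX/lib:${LD_LIBRARY_PATH:-}"
export CPATH="$PREFIX/include:${CPATH:-}"

echo "$PATH" | grep -q "$PREFIX/bin" && echo "user-local paths active"
```

Placing these exports in a job script (or `~/.bashrc`) makes user-installed dependencies visible to every framework component without administrator help.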
8. Dynamic Provisioning of Data-Intensive Computing Frameworks: Deployment
[Figure: deployment/configuration scripts launched from user.palmetto.clemson.edu read PBS_NODEFILE and set up target deployment directories on the local disks of the allocated nodes]
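The workflow in the figure can be sketched in shell. PBS writes `$PBS_NODEFILE` with one line per allocated core, so node names repeat and must be deduplicated; the node names and the `/local_scratch` path below are hypothetical stand-ins for the real Palmetto values.

```shell
# Simulate $PBS_NODEFILE when not inside a PBS job (one line per
# allocated core, so node names repeat). Names are hypothetical.
if [ -z "${PBS_NODEFILE:-}" ]; then
    PBS_NODEFILE=$(mktemp)
    printf 'node0001\nnode0001\nnode0002\nnode0002\n' > "$PBS_NODEFILE"
fi

# Deduplicate while preserving first-seen order to get the node list.
NODES=$(awk '!seen[$0]++' "$PBS_NODEFILE")
echo "allocated nodes:"
echo "$NODES"

# Target deployment directories on each node's local disk
# (path is illustrative; a real script would execute these via ssh).
for n in $NODES; do
    echo "ssh $n mkdir -p /local_scratch/$USER/deploy/{log,pid,storage}"
done
```

The same node list then drives all later placement and configuration decisions.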
9. Deploying Hadoop Ecosystem vs. Deploying HPCC Systems®: Overview
• Hadoop Ecosystem: open-source alternatives based on the conceptual architecture of a data-intensive computing infrastructure developed by Google
• HPCC Systems: comprehensive data-intensive computing system targeting enterprise users, developed in the early 2000s, open source since 2011
10. Deploying Hadoop Ecosystem vs. Deploying HPCC Systems®: Installation: Hadoop
• Self-contained, pre-compiled jar files
• No installation step is needed; relies on shell scripts to launch component daemons
• Dependencies: JDK
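Because Hadoop ships as pre-compiled jars, "installation" reduces to untarring and invoking the bundled daemon scripts in the right order. A minimal dry-run sketch, assuming a hypothetical unpack location and the Hadoop 2.x `sbin` script names:

```shell
# Hypothetical unpack location; no install step beyond extracting the tarball.
HADOOP_HOME="$HOME/software/hadoop-2.7.3"

# Master daemons must start before workers. Printed as a dry run here;
# a real deployment executes each line (worker lines via ssh to each node).
CMDS="$HADOOP_HOME/sbin/hadoop-daemon.sh start namenode
$HADOOP_HOME/sbin/yarn-daemon.sh start resourcemanager
$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode
$HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager"
echo "$CMDS"
```

Everything runs under the user's own account; configuration directories and JAVA_HOME are supplied through environment variables rather than system paths.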
11. Deploying Hadoop Ecosystem vs. Deploying HPCC Systems®: Installation: HPCC Systems
• Standard configure/make/make install
– Assumes an industrial production environment (with administrative privileges)
– Modification to avoid hard-coded system installation paths
– Modification of template XML configuration files to avoid default HPCC Systems-specific user creation and administrative checks
• Dependencies:
– Not on Palmetto: ICU, Xalan, Xerces, APR …
– On Palmetto but wrong version: Binutils
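The core of the non-root build is redirecting every install path away from system locations such as `/opt/HPCCSystems` and `/etc/HPCCSystems` into the user's home directory. A dry-run sketch (the HPCC Systems platform builds via CMake; the prefix and source directory below are illustrative, not exact):

```shell
# Dry-run sketch of a non-root build: point the install prefix at a
# user-writable directory instead of the default system locations.
# The directory names and cmake invocation are illustrative assumptions.
PREFIX="$HOME/software/hpccsystems"
BUILD="cmake -DCMAKE_INSTALL_PREFIX=$PREFIX ../HPCC-Platform && make && make install"
echo "$BUILD"
```

User-built copies of the missing dependencies (ICU, Xalan, Xerces, APR, a matching Binutils) are installed under the same prefix and picked up through the environment paths set earlier.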
12. Deploying Hadoop Ecosystem vs. Deploying HPCC Systems: Deployment: Hadoop
• Component placement determination
• Clean up target directories from previous deployments
• Create target directories (log, storage, pid …)
• Synchronize order of component start-up
[Figure: the 1st node in PBS_NODEFILE hosts the NameNode, ResourceManager, and Spark Master; the 2nd through nth nodes each host a DataNode, NodeManager, and Spark executor]
• Additional components (HBase, Hive, Kafka …) can be added to this deployment model
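The component placement in the figure follows directly from the node order in `$PBS_NODEFILE`: first unique node becomes the master, the rest become workers. A sketch with a simulated node file (node names hypothetical):

```shell
# Simulated $PBS_NODEFILE (one line per core; node names hypothetical).
NODEFILE=$(mktemp)
printf 'node0001\nnode0001\nnode0002\nnode0003\n' > "$NODEFILE"

# Unique nodes in first-seen order.
NODES=$(awk '!seen[$0]++' "$NODEFILE")

# First node hosts the master daemons; all remaining nodes are workers.
HEAD=$(echo "$NODES" | head -n1)
WORKERS=$(echo "$NODES" | tail -n +2)

echo "$HEAD: NameNode ResourceManager SparkMaster"
for w in $WORKERS; do
    echo "$w: DataNode NodeManager SparkExecutor"
done
```

Extra components such as HBase or Kafka slot into the same scheme by appending their daemons to the head-node or worker-node role lists.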
13. Deploying Hadoop Ecosystem vs. Deploying HPCC Systems: Deployment: HPCC Systems
• Determine node allocation and internal IP addresses
• HPCC Systems is configured via its own deployment programs (configmgr, configgen, hpcc-init)
[Figure: HPCC Systems components mapped onto the nodes listed in PBS_NODEFILE]
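Unlike Hadoop, HPCC Systems' configuration tools expect internal IP addresses rather than hostnames, so the node list has to be resolved first. A sketch, using `localhost` as a stand-in for real node names:

```shell
# Resolve each allocated node name to an internal IPv4 address so the
# HPCC Systems configuration tools can be fed an IP list.
# 'localhost' stands in for real cluster node names here.
NODES="localhost"
IPFILE=$(mktemp)
for n in $NODES; do
    getent ahostsv4 "$n" | awk '{print $1; exit}'
done > "$IPFILE"
cat "$IPFILE"
```

The resulting IP file is then handed to the HPCC Systems deployment programs when generating the environment configuration.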
14. Deploying Hadoop Ecosystem vs. Deploying HPCC Systems: Deployment: HPCC Systems
• Node memory constraints:
– HPCC Systems reserves 75% of available memory for Thor by default
– Palmetto does not allow unlimited memory reservation
– As a result, thor_master cannot launch new jobs via fork()
– Resolved by lowering the memory reservation
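The fix can be sketched as computing a reservation below the system limit instead of taking the 75% default. The 40% fraction and the `totalMemoryLimit` attribute name below are illustrative assumptions about the generated environment.xml, not the exact HPCC Systems setting:

```shell
# Compute a memory reservation that stays under the per-process limit
# instead of HPCC Systems' default 75% of physical memory.
# The 40% fraction is an illustrative, deliberately conservative choice.
TOTAL_KB=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
RESERVE_MB=$(( TOTAL_KB * 40 / 100 / 1024 ))
echo "reserving ${RESERVE_MB} MB for thor"

# A real deployment would then patch the generated environment.xml,
# e.g. (attribute name is a hypothetical placeholder, printed dry-run):
echo "sed -i 's/totalMemoryLimit=\"[0-9]*\"/totalMemoryLimit=\"${RESERVE_MB}\"/' environment.xml"
```

With the reservation lowered, thor_master's fork() calls succeed within the memory limits Palmetto imposes on user jobs.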
15. Lessons Learned
• A common approach can be adapted for both the Hadoop Ecosystem and HPCC Systems
• Limitations on non-administrative accounts can impact deployment and performance via system resource constraints
– Unable to utilize all available memory on allocated nodes (HPCC Systems)
• Dynamic deployment via non-administrative accounts gives users the initiative to experiment with and utilize new large-scale frameworks without additional burden on administrators
16. Lessons Learned
• Experience in deploying as users is, in turn, highly applicable to deployment with administrative privileges.
– E.g., the CloudLab cloud computing experimental testbed offers non-persistent, ephemeral, short-term (15-hour) allocations
– Script-based installation and deployment are needed, even with administrative rights, to automate the deployment of experiments
• Experience in deploying as administrators helps in debugging user-based deployments:
– The memory allocation issue in HPCC Systems was identified and resolved by changing system limits using administrative commands.
17. QUESTIONS?
Linh B. Ngo1, Michael E. Payne1, Flavio Villanustre2, Richard Taylor2, Amy W. Apon1
{lngo,mpayne3,aapon}@clemson.edu
1School of Computing, Clemson University
{flavio.villanustre,richard.taylor}@lexisnexis.com
2LexisNexis Risk Solutions
More information about HPCC Systems can be found at http://hpccsystems.com