PEARC 17: Spark On the ARC
1. Spark on the ARC
Big data analytics frameworks on HPC Cluster
Mark E. DeYoung
Himanshu Bedi
Mohammed Salman
Dr. David Raymond
Dr. Joseph G. Tront
2. Agenda
• MOTIVATION
• HPC vs HADOOP
• Spark on the ARC
• VT ARC
• Current Work
• Implementation Details
• Results
• Future Work
3. Log Archiving and Analysis (LAARG)
• Ingestion of logs into Elasticsearch
• Setting up infrastructure for big data analytics
• Machine learning algorithms on network logs
4. Network Security Data Analytics
• Application of data science approaches to the problem domain of network security
• Solution domains:
• Data Mining (DM)
• Machine Learning (ML)
• Natural Language Processing (NLP)
• ML requires an underlying analytic infrastructure capable of efficiently applying ML algorithms (w/ fine-grained parallelism) to big data (w/ coarse-grained parallelism)
5. Parallelism & Data Separability
• Parallelism – doing lots of things at the same time
• Granularity:
• Fine – large number of small tasks; requires low-latency communication and frequent coordination
• Coarse – small number of large tasks; requires high-throughput communication and less frequent coordination
• Levels: Instruction (ILP), Task (TLP), Data (DLP), Transaction
• Ability to achieve parallelism is shaped by properties of the data (and the algorithms used to process it)
• Data separability – data co-dependency determines the ability to segment data:
• Uniform – identically sized segments
• Modular – a priori segmentation according to some extrinsic knowledge about the data
• Arbitrary – arbitrarily sized segments
• Generalizations:
• Machine Learning – fine grained; data has low temporal and/or spatial separability
• Big Data – coarse grained; data has high temporal and/or spatial separability
• Machine Learning from Big Data – we need both fine- and coarse-grained parallelism to process uniform and modular data
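The uniform and modular segmentation strategies above can be sketched in plain Python (a toy illustration; `uniform_segments` and `modular_segments` are hypothetical names, not part of any framework):

```python
def uniform_segments(data, k):
    """Uniform separability: split data into k identically sized segments
    (the last segment may be shorter)."""
    size = -(-len(data) // k)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def modular_segments(records, key):
    """Modular separability: a priori segmentation using extrinsic
    knowledge about the data, expressed here as a key function."""
    groups = {}
    for r in records:
        groups.setdefault(key(r), []).append(r)
    return groups
```

For network logs, a modular split by source host would keep co-dependent records in the same segment, while a uniform split maximizes load balance at the cost of separating related records.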
6. HPC vs Hadoop

Point of Interest              HPC                              Hadoop
Data storage & processing      Centralized storage              Data distributed across nodes
Hardware infrastructure        High-end computing components    Mostly commodity hardware
File system                    Lustre                           HDFS
Cluster resource management    SLURM/TORQUE                     YARN/Mesos
Programming languages          Compiled languages (C/C++)       JVM and scripting languages (Java/Scala/Python)
Business applications          Primarily scientific research    Commercial analytics use cases
7. What is Spark?
• In-memory data processing engine
• Interface for programming clusters with data parallelism
• Facilitates iterative algorithms (good for ML) and interactive exploratory data analysis (EDA)
• Source – www.databricks.com
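Spark's programming model can be illustrated with a pure-Python analogue of the classic RDD word count. This is not PySpark, just the flatMap → map → reduceByKey pattern that Spark exposes:

```python
from itertools import chain

def word_count(lines):
    """Pure-Python analogue of Spark's flatMap -> map -> reduceByKey."""
    words = chain.from_iterable(line.split() for line in lines)  # flatMap
    pairs = ((w, 1) for w in words)                              # map
    counts = {}
    for w, n in pairs:                                           # reduceByKey
        counts[w] = counts.get(w, 0) + n
    return counts
```

In Spark the same pipeline runs partitioned across executors, with intermediate results cached in memory rather than recomputed or spilled to disk, which is why iterative ML algorithms and interactive EDA benefit.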
8. Spark on the ARC
• The ARC environment can support both the fine- and coarse-grained parallelism necessary for machine learning from big data.
• Spark provides a framework to orchestrate algorithm execution and distributed data management (programming model).
• Pros:
• No need for dedicated hardware, and no need to sysadmin the hardware
• Deployment is fairly straightforward – script based at the moment
• Spark (unlike MPI) provides a distributed data model – Resilient Distributed Datasets (RDDs), Datasets, and DataFrames
• Cons:
• ARC is batch oriented – not appropriate for long-running services, interactive, or incremental/streaming tasks
• Shared resource – might have to wait for a specific compute node type
• Loss of control – uptime and maintenance actions are controlled by ARC
9. VT ARC
• Set of HPC clusters: NewRiver, Cascades, DragonsTooth
• Several compute node configurations:
• Processing: multi-core, multi-CPU, some with GPUs
• Storage:
• Node local – HDD, SSD, NVMe, memory
• Centralized storage on IBM General Parallel File System (GPFS)
• Interconnect: low-latency 100 Gbps (MPI), throughput-oriented 10 Gbps (data movement)
• Resource management / scheduling:
• TORQUE Resource Manager – modified open-source Portable Batch System (PBS)
• MOAB Cluster Workload Management – billing for allocations
• Environment configuration: Lmod environment modules system
10. Current Work
• Developed and evaluated deployment models of the Apache Hadoop and Spark frameworks on existing batch-oriented HPC clusters.
• Created a framework that automates the creation of deployment variations, monitors the execution of evaluation iterations, and accommodates dynamic resource allocations.
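Enumerating deployment variations can be as simple as a Cartesian product over the tunable dimensions. A minimal sketch (the parameter names and values are hypothetical, not the ones used in the study):

```python
from itertools import product

node_counts = [2, 4, 8]              # horizontal scaling dimension
executor_mem_gb = [16, 32]           # vertical scaling dimension
schedulers = ["standalone", "yarn"]  # Spark scheduling mode

# One dict per deployment variation to instantiate and benchmark.
variations = [
    {"nodes": n, "mem_gb": m, "scheduler": s}
    for n, m, s in product(node_counts, executor_mem_gb, schedulers)
]
```

Each variation would then be rendered into a batch-job submission, with the monitoring side tracking which iterations have completed.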
11. Batch Job with 3 Compute Nodes
• Figure 1 shows the Hadoop NameNode (NN) and YARN ResourceManager (RM) service daemons running on the head node.
• Figure 2 shows the Hadoop DataNode (DN) and YARN NodeManager (NM) running on each of the worker compute nodes allocated to the job.
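Under TORQUE, the hosts allocated to a job are listed in the file named by $PBS_NODEFILE, one line per allocated core, so hostnames repeat. A hedged sketch of deriving the head-node/worker split described above (the function name and role labels are ours, not part of TORQUE or Hadoop):

```python
def assign_roles(nodefile_lines):
    """Derive Hadoop/YARN roles from TORQUE node-file contents:
    the first unique host becomes the head node (NN + RM daemons),
    the remaining hosts become workers (DN + NM daemons)."""
    hosts = []
    for line in nodefile_lines:
        h = line.strip()
        if h and h not in hosts:   # deduplicate, preserving order
            hosts.append(h)
    return {"head": hosts[0], "workers": hosts[1:]}
```

A deployment script would read the node file, write the resulting host lists into the Hadoop/YARN configuration, and start the appropriate daemons on each host.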
13. Evaluation
• Evaluation was carried out on two clusters maintained by VT's ARC: Cascades and NewRiver.
• A dynamic Spark and Hadoop cluster is instantiated, and scheduling is carried out both in standalone mode and with YARN.
• Two benchmarks – Spark-Bench and HiBench – were run to test the Spark and Hadoop configurations.
• Collected telemetry data from the telemetry framework provided by VT ARC as part of the TORQUE/Moab installation. This data includes queuing delay, time to completion, CPU utilization, and memory consumption.
• Investigated the effects of horizontal versus vertical scaling by comparing resource utilization in each case.
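One common way to compare horizontal-scaling runs is through speedup and parallel efficiency relative to a baseline allocation. A sketch of these standard metrics (the study's own comparisons draw on the TORQUE/Moab telemetry rather than these exact formulas):

```python
def speedup(t_base, t_n):
    """Speedup of a run with time t_n relative to the baseline time t_base."""
    return t_base / t_n

def efficiency(t_base, t_n, n_base, n):
    """Parallel efficiency: speedup normalized by the growth in node count.
    Values near 1.0 mean the extra nodes are being used effectively."""
    return speedup(t_base, t_n) * n_base / n
```

For example, going from 1 node to 8 nodes while cutting the runtime from 100 s to 25 s gives a speedup of 4 but an efficiency of only 0.5, suggesting coordination or data-movement overhead.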
15. Future Work
• Examine the overhead incurred in allocating resources from the HPC scheduler.
• Evaluate the impact of user contention when compute nodes are shared between users.
• Run realistic, real-world workloads, primarily machine learning algorithms on network logs collected from access points distributed across the campus.
• Analyze the performance of this framework for streaming data.