Spark on the ARC
Big data analytics frameworks on HPC Cluster
Mark E. DeYoung
Himanshu Bedi
Mohammed Salman
Dr. David Raymond
Dr. Joseph G. Tront
Agenda
• MOTIVATION
• HPC vs HADOOP
• Spark on the ARC
• VT ARC
• Current Work
• Implementation Details
• Results
• Future Work
Log Archiving and Analysis (LAARG)
• Ingestion of Logs into Elasticsearch
• Setting up of Infrastructure for Big Data Analytics
• Machine Learning Algorithms on Network Logs
Network Security Data Analytics
• Application of data science
approaches to the problem
domain of network security
• Solution domains:
• Data Mining (DM)
• Machine Learning (ML)
• Natural Language Processing
(NLP)
• ML - Requires underlying
analytic infrastructure
capable of efficiently applying
ML algorithms (w/ fine grained parallelism)
to big data (w/ coarse grained parallelism)
Parallelism & Data Separability
• Parallelism – doing lots of things at the same time:
• Granularity:
• Fine – large number of small tasks. Requires low-latency communication and frequent
coordination
• Coarse – small number of large tasks. Requires high throughput communication and less
frequent coordination
• Levels: Instruction (ILP), Task (TLP), Data (DLP), Transaction
• Ability to achieve parallelism is shaped by properties of data (and the
algorithms used to process the data).
• Data separability – data co-dependency determines ability to segment
data
• Uniform - identically sized segments
• Modular - a priori segmentation according to some extrinsic knowledge about the data
• Arbitrary - arbitrarily sized segments
• Generalizations:
• Machine Learning – fine grained, data has low temporal and/or spatial separability.
• Big Data – coarse grained, data has high temporal and/or spatial separability
• Machine Learning from Big Data – we need both fine and coarse grained
parallelism to process uniform and modular data.
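The uniform and modular segmentation styles listed above can be illustrated with a minimal sketch. The record layout (source host, message) and the host names are hypothetical, chosen only for illustration; they are not taken from the slides.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

Record = Tuple[str, str]  # hypothetical layout: (source_host, message)


def uniform_segments(records: Sequence[Record], size: int) -> List[List[Record]]:
    """Split records into identically sized segments (the last one may be shorter)."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(records[i:i + size]) for i in range(0, len(records), size)]


def modular_segments(
    records: Sequence[Record], key: Callable[[Record], Hashable]
) -> Dict[Hashable, List[Record]]:
    """Group records by a key that encodes extrinsic knowledge about the data."""
    segments: Dict[Hashable, List[Record]] = {}
    for record in records:
        segments.setdefault(key(record), []).append(record)
    return segments


# Hypothetical network log records.
logs: List[Record] = [
    ("ap-1", "assoc"),
    ("ap-2", "assoc"),
    ("ap-1", "auth"),
    ("ap-3", "assoc"),
    ("ap-2", "deauth"),
]

uniform = uniform_segments(logs, 2)                     # fixed-size chunks
modular = modular_segments(logs, key=lambda r: r[0])    # one segment per source host
```

Uniform segments are easy to balance across workers, while modular segments keep related records together, which matters when an algorithm needs all records for one source in the same place.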
HPC vs HADOOP
Point of Interest | HPC | Hadoop
Data Storage & Processing | Centralized storage | Data is distributed across nodes
Hardware Infrastructure | Requires high-end computing components | Runs mostly on commodity hardware
File System Storage | Lustre file system | HDFS
Cluster Resource Management | SLURM/TORQUE | YARN/Mesos
Programming Languages | Uses compiled, lower-level languages like C/C++ | Uses newer languages like Java/Scala/Python
Business Applications | Primarily scientific research | Commercial analytics use cases
What is Spark?
• In-memory data
processing
engine
• Interface for
programming
clusters with
data parallelism
• Facilitates
iterative
algorithms (good for ML)
& interactive
exploratory data
analysis (EDA)
• Source – www.databricks.com
Spark on the ARC
• ARC environment can support both fine and coarse grained
parallelism necessary for machine learning from big data.
• Spark provides a framework to orchestrate algorithm execution and
distributed data management (programming model).
• Pros:
• Don’t need dedicated hardware, don’t need to sys admin the
hardware
• Deployment is fairly straightforward – script based at the moment.
• Spark (unlike MPI) provides distributed data model – Resilient
Distributed Datasets (RDDs); Datasets and Data Frames
• Cons:
• ARC is batch oriented - not appropriate for long running services,
interactive, or incremental/streaming tasks
• Shared resource – might have to wait for a specific compute node type
• Loss of control – uptime and maintenance actions controlled by
ARC
VT ARC
• Set of HPC Clusters: NewRiver, Cascades, DragonsTooth
• Several compute node configurations:
• Processing: multi-core, multi-CPU, some with GPU
• Storage:
• Node local – HDD, SSD, NVMe, memory
• Centralized Storage on IBM General Parallel File System (GPFS)
• Interconnect: low-latency 100 Gbps (MPI), throughput-oriented 10 Gbps
(data movement)
• Resource Management / Scheduling:
• TORQUE Resource Manager - modified open source Portable
Batch System (PBS)
• Moab Cluster Workload Management – billing for allocations
• Environment configuration: Lmod environment modules
system
Current Work
• Developed and evaluated deployment models of the Apache Hadoop
and Spark frameworks on existing batch oriented HPC clusters.
• Created a framework to automate the creation of deployment
variations and monitor the execution of evaluation iterations that
accommodates dynamic resource allocations.
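As a rough illustration of automating deployment variations, the sketch below renders TORQUE/PBS job scripts for different node and core counts. It is not the authors' framework: the module names (`jdk`, `spark`), the job name, and the launch script `./start_cluster.sh` are hypothetical placeholders. Only the `#PBS` directives, `module load`, and `$PBS_O_WORKDIR` are standard TORQUE/Lmod features.

```python
from typing import List, Sequence, Tuple


def render_pbs_script(
    job_name: str,
    nodes: int,
    ppn: int,
    walltime: str,
    modules: Sequence[str],
    command: str,
) -> str:
    """Render a TORQUE/PBS batch script as text."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",
        f"#PBS -l nodes={nodes}:ppn={ppn}",   # node count and cores per node
        f"#PBS -l walltime={walltime}",
    ]
    # Lmod: load the software environment (module names are site specific).
    lines += [f"module load {name}" for name in modules]
    # Run from the directory the job was submitted from.
    lines += ["cd $PBS_O_WORKDIR", command]
    return "\n".join(lines) + "\n"


def deployment_variations(
    node_counts: Sequence[int], ppn_values: Sequence[int]
) -> List[Tuple[int, int]]:
    """Enumerate (nodes, ppn) pairs: more nodes scales out, more ppn scales up."""
    return [(n, p) for n in node_counts for p in ppn_values]


# Hypothetical example: one script per variation.
script = render_pbs_script(
    "spark-eval", 3, 24, "01:00:00", ["jdk", "spark"], "./start_cluster.sh"
)
variations = deployment_variations([1, 2], [8, 16])
```

Each rendered script could then be submitted with `qsub`; the submission step is left out here because it depends on the site's allocation and queue setup.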
Batch Job with 3 compute nodes
• Figure 1 shows the Hadoop
NameNode (NN) and YARN
ResourceManager (RM)
service daemons running
on the head node.
• Figure 2 shows the Hadoop
DataNode (DN) and YARN
NodeManager (NM) running
on each of the worker
compute nodes allocated to
the job.
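The daemon layout described above can be sketched as a role assignment over the hosts allocated to a batch job. This is a minimal sketch under two assumptions not stated in the slides: that the first host in the allocation acts as the head node, and that the head node runs only the master daemons. The host names are hypothetical; the input mimics the contents of TORQUE's `$PBS_NODEFILE`, which lists a host once per allocated core.

```python
from typing import Dict, List


def unique_hosts(nodefile_text: str) -> List[str]:
    """Return hosts in first-seen order, dropping the per-core repeats."""
    hosts: List[str] = []
    for line in nodefile_text.splitlines():
        host = line.strip()
        if host and host not in hosts:
            hosts.append(host)
    return hosts


def assign_roles(hosts: List[str]) -> Dict[str, List[str]]:
    """Map each host to the Hadoop/YARN daemons it should run."""
    if not hosts:
        raise ValueError("allocation contains no hosts")
    head, workers = hosts[0], hosts[1:]
    # Assumption: the first allocated host is the head node.
    roles = {head: ["NameNode", "ResourceManager"]}
    for worker in workers:
        roles[worker] = ["DataNode", "NodeManager"]
    return roles


# Hypothetical nodefile for a 3-node job with 2 cores per node.
nodefile_text = "nr001\nnr001\nnr002\nnr002\nnr003\nnr003\n"
hosts = unique_hosts(nodefile_text)
roles = assign_roles(hosts)
```

Because the scheduler decides which hosts a job receives, the roles have to be derived at job start rather than fixed in configuration files ahead of time.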
Implementation Workflow
Evaluation
• Evaluation was carried out on two clusters maintained by VT’s ARC, namely
Cascades and NewRiver.
• A dynamic Spark and Hadoop cluster is instantiated and the scheduling is
carried out in both the standalone mode and with YARN.
• Two benchmarks, Spark Bench and HiBench, were run to test the
Spark and Hadoop configurations.
• Collected telemetry data from the telemetry framework provided by VT ARC
as part of the TORQUE/Moab installation. This data includes queuing delay, time
to completion, CPU utilization and memory consumption.
• Investigated the effects of horizontal scaling versus vertical scaling by
comparing the resource utilization in either case.
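The telemetry metrics named above can be derived from per-job timestamps, as in the following minimal sketch. The record fields, the definition of time to completion (submission to end), and all numbers are assumptions made for illustration; they are not measurements from this evaluation and do not reflect the actual TORQUE/Moab telemetry format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRecord:
    """Hypothetical per-job telemetry record (times in epoch seconds)."""
    queued_at: int       # job entered the scheduler queue
    started_at: int      # job began executing
    ended_at: int        # job finished
    cores: int           # total cores allocated across all nodes
    cpu_seconds: float   # CPU time consumed, summed over all cores

    @property
    def queuing_delay(self) -> int:
        return self.started_at - self.queued_at

    @property
    def runtime(self) -> int:
        return self.ended_at - self.started_at

    @property
    def time_to_completion(self) -> int:
        # Assumption: measured from submission, so it includes queuing delay.
        return self.ended_at - self.queued_at

    @property
    def cpu_utilization(self) -> float:
        """Fraction of the allocated core-seconds actually used."""
        available = self.runtime * self.cores
        return self.cpu_seconds / available if available > 0 else 0.0


# Made-up records for a scale-out and a scale-up run of the same workload.
horizontal = JobRecord(queued_at=0, started_at=120, ended_at=720,
                       cores=48, cpu_seconds=14400.0)
vertical = JobRecord(queued_at=0, started_at=300, ended_at=1100,
                     cores=24, cpu_seconds=15360.0)
```

Comparing `cpu_utilization` alongside `queuing_delay` for the two records shows the kind of trade-off a horizontal versus vertical scaling comparison would look at.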
Results
Future Work
• Examine the overhead incurred in allocating resources
from the HPC scheduler.
• Evaluate the impact of user contention when the compute
nodes are shared between users.
• Run realistic, real-world workloads, primarily
Machine Learning algorithms on network logs collected
from the access points distributed across the campus.
• Analyze the performance of this framework for
streaming data.
THANK YOU

PEARC 17: Spark On the ARC
