July 2018 talk to SW Data Meetup by Rob Vesse, Software Engineer, Cray Inc, discussing open source technologies for data science on high performance systems (Spark, Hadoop, PyData ecosystem, containers, etc), focusing on some of the implementation and scaling challenges they face.
1. Leveraging Open Source for Large Scale Analytics on HPC Systems
Rob Vesse, Software Engineer, Cray Inc
2. Overview
● Background
● Challenges
● Packaging and Deployment
● Input/Output
● Scaling Analytics
● Python Data Science
● Machine Learning
Slides: https://cray.box.com/v/sw-data-july-2018
3. Legal Disclaimer
Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to
any intellectual property rights is granted by this document.
Cray Inc. may make changes to specifications and product descriptions at any time, without notice.
All products, dates and figures specified are preliminary based on current expectations, and are subject to
change without notice.
Cray hardware and software products may contain design defects or errors known as errata, which may cause
the product to deviate from published specifications. Current characterized errata are available on request.
Cray uses codenames internally to identify products that are in development and not yet publicly announced
for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising,
promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user.
Performance tests and ratings are measured using specific systems and/or components and reflect the
approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware
or software design or configuration may affect actual performance.
The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and
design, SONEXION, and URIKA. The following are trademarks of Cray Inc.: APPRENTICE2, CHAPEL,
CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, REVEAL,
THREADSTORM. The following system family marks, and associated model number marks, are trademarks of
Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a
sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other
trademarks used in this document are the property of their respective owners.
4. Background
● About Me
● Software Engineer in the Analytics R&D Group
● Develop hardware and software solutions across Cray's product portfolio
● Primarily focused on integrating open source software into a coherent, user-friendly product
● Involved in open source for ~15 years, committer at Apache Software Foundation
since 2012, and member since 2015
● Definition - High Performance Computing (HPC)
● Any sufficiently large high performance computer
● Typically $500,000 plus
● As small as 10s of nodes up to 10,000s of nodes
● Creates some interesting scaling and implementation challenges for analytics
● Why analytics on HPC Systems?
● Scale
● Productivity
● Utilization
5. Packaging and Deployment
● Challenges
● HPC Systems are highly
controlled environments
● Users are granted the
minimum permissions
possible
● Many open source packages
have extensive dependencies
or expect users to bring in
their own
6. Solution - Containers
● An easy solution right?
● HPC Sysadmins are really paranoid
● Docker still considered insecure by many
● NERSC Shifter
● An HPC-centric containerizer, used on our top-end systems
● Designed to scale out massively
● Forces the containerized process to run as the launching user's UID
● Can consume Docker images but has its own image gateway and format
● Docker
● Currently used for our cluster systems
● Eventually will be used on our next generation supercomputers
7. Containers - Shifter vs Docker
● Both are open source so why choose Docker?
● https://github.com/NERSC/shifter
● https://github.com/docker
● Docker has a far more vibrant community
● Many of its shortcomings for HPC have been or are being addressed
● E.g. Container access to hardware devices like GPUs
● NVidia Docker - https://github.com/NVIDIA/nvidia-docker
● It's Open Container Initiative (OCI) compliant
● Docker can be used with other key technologies e.g.
Kubernetes
8. Orchestration
● For distributed applications we need something to tie the
containers together
● Also want to support multi-tenant isolation
● Kubernetes
● Fastest growing container orchestrator out there
● Open APIs and highly extensible
● Declaratively specify complex applications and self-service configuration via APIs (sketched below)
● E.g. Deploying Apache Spark on Kubernetes using Bloomberg's
Kerberos support mods
● Biggest problem for us is networking!
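As a rough illustration of the self-service API access mentioned above, the sketch below uses the official kubernetes Python client to create a per-tenant namespace. The tenant name and the kubeconfig-based authentication are assumptions for the example, not details from the talk; real tooling would also attach quotas, RBAC and network policies.

from kubernetes import client, config

# Authenticate from a kubeconfig file (use load_incluster_config() when running inside a pod)
config.load_kube_config()
core = client.CoreV1Api()

# Create a namespace for a hypothetical tenant
ns = client.V1Namespace(metadata=client.V1ObjectMeta(name="tenant-a"))
core.create_namespace(ns)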
9. Kubernetes Cluster Networking
● Kubernetes has a networking model that supports
customizable network providers
● Differing capabilities, from bare networking through to network traffic policy management
● E.g. isolate Tenant A from Tenant B (sketched below)
● Different providers use different approaches e.g.
● Flannel and Weave use VXLAN
● Cilium uses eBPF
● Calico and Romana use static routing
● Our Aries network doesn't support VLANs and our kernel
doesn't support eBPF!
● Therefore we chose Romana
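To make the tenant isolation point concrete, here is a minimal sketch using the kubernetes Python client to apply a policy that only allows ingress to Tenant A's pods from within Tenant A's own namespace. The namespace and policy names are hypothetical, and this assumes the chosen network provider (e.g. Romana) enforces Kubernetes NetworkPolicy objects.

from kubernetes import client, config

config.load_kube_config()
net = client.NetworkingV1Api()

# Allow ingress to every pod in the tenant-a namespace only from pods in that same
# namespace; traffic from other tenants' namespaces is blocked by the network provider
policy = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="isolate-tenant-a", namespace="tenant-a"),
    spec=client.V1NetworkPolicySpec(
        pod_selector=client.V1LabelSelector(),  # empty selector matches all pods in the namespace
        policy_types=["Ingress"],
        ingress=[client.V1NetworkPolicyIngressRule(
            _from=[client.V1NetworkPolicyPeer(pod_selector=client.V1LabelSelector())]
        )],
    ),
)
net.create_namespaced_network_policy(namespace="tenant-a", body=policy)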
10. Input/Output Challenges
● Lots of analytics
frameworks e.g. Apache
Hadoop Map/Reduce,
Apache Spark rely on local
storage
● E.g. temporary scratch space
● BUT many HPC systems
have no local storage
[Diagram: Spark shuffle data flow. Map task threads perform shuffle writes through the block manager to local disk; reduce task threads send requests over TCP to perform shuffle reads, using metadata from the Spark scheduler.]
11. Virtual Local Storage
● tmpfs/ramfs
● Standard temporary file system for *nix OSes
● Stored in RAM
● tmpfs is preferred as it can be given a maximum size
● BUT competes with your analytics frameworks for memory
● Use the system's parallel file system, e.g. Lustre
● Unfortunately these aren't designed for small file IO
● Deadlocks the metadata servers causing significant slowdown for
everyone!
● Using Linux loopback mounts to solve this (sketched below)
● Short-lived files never leave the OS disk cache, i.e. they stay in memory
● The OS can flush its disk cache as needed
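The loopback idea boils down to backing a node-local file system with a single large file on the parallel file system, so small-file traffic hits the page cache and the loop device rather than the Lustre metadata servers. A rough sketch of the setup follows; the paths and size are made up for illustration, it needs root, and it is not Cray's actual implementation.

import os
import subprocess

backing_file = "/lustre/scratch/user/spark-local.img"  # one large file on the parallel file system
mount_point = "/tmp/spark-local"                       # node-local path handed to the framework

# Create a sparse backing file, format it with a local file system and loop-mount it
subprocess.run(["truncate", "-s", "64G", backing_file], check=True)
subprocess.run(["mkfs.ext4", "-q", "-F", backing_file], check=True)
os.makedirs(mount_point, exist_ok=True)
subprocess.run(["mount", "-o", "loop", backing_file, mount_point], check=True)

# Frameworks are then pointed at the mount, e.g. Spark's spark.local.dir=/tmp/spark-local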
12. Python Data Science
● Challenges
● Managing dependencies
● Compute nodes typically have
no external network
connectivity
● Distributed computation
● Maximising hardware
utilization for performance
13. Dependency Management
● Using Anaconda to solve this
● Have to resolve the environments up front
● Compute nodes can't access external network
● Also need to project environments onto compute nodes
as needed
● For containers use volume mounts and environment variable
injection into the container
● For standard jobs need to store environments on a file system
visible to compute nodes
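A hedged sketch of that workflow, with hypothetical paths, image and variable names, and using plain Docker rather than any Cray-specific tooling: resolve the environment once on a node with external connectivity, store it on a shared file system, then project it into the container with a volume mount and an environment variable.

import subprocess

env_prefix = "/shared/conda-envs/analytics"  # on a file system visible to the compute nodes

# Resolve the environment up front, while external package repositories are reachable
subprocess.run(["conda", "env", "create", "-f", "environment.yml",
                "--prefix", env_prefix], check=True)

# Later, on a compute node: mount the environment read-only into the container and tell
# the entry point where it lives so it can be activated before the user's code runs
subprocess.run(["docker", "run",
                "-v", env_prefix + ":" + env_prefix + ":ro",
                "-e", "CONDA_ENV_PREFIX=" + env_prefix,
                "analytics-image"], check=True)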
14. Distributed Computation - Dask
● Distributed work
scheduling library for
Python
● Integrates with
common data science
libraries
● Numpy, Pandas,
SciKit-Learn
● Familiar Pythonic API
for scaling out
workloads
● Can be installed as part
of the Conda
environment
>>> from dask.distributed import Client
>>> client = Client(scheduler_file='/path/to/scheduler.json')
>>> def square(x):
...     return x ** 2
>>> def neg(x):
...     return -x
>>> A = client.map(square, range(10))
>>> B = client.map(neg, A)
>>> total = client.submit(sum, B)
>>> total  # Function hasn't yet completed
<Future: status: waiting, key: sum-58999c52e0fa35c7d7346c098f5085c7>
>>> total.result()
-285
>>> client.gather(A)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
15. Dask - Scheduler & Environment Setup
● Using Dask requires running scheduler and worker
processes on our compute resources
● We don't necessarily know the set of physical nodes we will get
ahead of time
● Dask provides a scheduler file mechanism for this
● Need to start a scheduler and worker on each physical
node
● We use the entry point scripts of our container images to do this (sketched below)
● Also need to integrate with the user's Conda environment
● MUST activate the volume-mounted environments prior to starting Dask
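A simplified sketch of what such an entry point might do is below. It relies on the dask-scheduler and dask-worker command line tools and their --scheduler-file option; the rank environment variable and paths are assumptions for illustration.

import os
import subprocess

scheduler_file = "/shared/dask/scheduler.json"     # on a file system every node can see
node_rank = int(os.environ.get("NODE_RANK", "0"))  # hypothetical rank provided by the launcher

# (Run after activating the volume-mounted Conda environment.)
# One node runs the scheduler, which writes its address into the scheduler file...
if node_rank == 0:
    subprocess.Popen(["dask-scheduler", "--scheduler-file", scheduler_file])

# ...and every node runs a worker that reads the scheduler's address from the same file
subprocess.Popen(["dask-worker", "--scheduler-file", scheduler_file]).wait()

Client code then connects with Client(scheduler_file='/shared/dask/scheduler.json') as in the earlier example.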
16. Maximising Performance
● To fully take advantage of HPC hardware need to use
appropriately optimized libraries
● Option 1 - Custom Anaconda Channels
● E.g. Intel Distribution for Python
● Uses Intel AVX and MKL (Math Kernel Library) underneath popular
libraries
● Option 2 - ABI Injection
● Where a library uses a defined ABI, e.g. mpi4py, ensure it is compiled against the generic ABI
● At runtime, use volume mounts to mount the platform-specific ABI implementation at the appropriate location
● E.g. Cray MPICH, Open MPI, Intel MPI
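From the user's perspective nothing changes: a standard mpi4py program like the one below runs unmodified, and the MPI implementation it actually links against is whichever library was volume-mounted into the container, provided that library matches the ABI mpi4py was compiled against.

from mpi4py import MPI

# Which libmpi the import resolves to is decided at runtime by the volume mount, not at build time
comm = MPI.COMM_WORLD
print("Hello from rank %d of %d" % (comm.Get_rank(), comm.Get_size()))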
17. Machine Learning
● Challenges
● How do we take advantage of
both GPUs and CPUs?
● Efficiently scale out onto
distributed systems
18. GPUs vs CPUs
● GPUs typically best suited
to training models
● More time and resource
intensive
● CPUs typically best suited
to inference
● i.e. Make predictions using a
trained model
● Need different hardware optimisations for each
● Don't necessarily know where our code will run ahead of time
● Therefore compile separately for each environment and
select desired build via container entry point script
● This requires a container runtime that supports GPUs e.g. Shifter or
NVidia Docker
● NB - We're trading off image size for performance
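A toy sketch of such an entry point follows; the build locations and the device check are assumptions for illustration, not the actual scripts.

import os
import sys
import subprocess

# Hypothetical locations of the two separately compiled builds baked into the image
GPU_BUILD = "/opt/builds/framework-gpu"
CPU_BUILD = "/opt/builds/framework-cpu"

# Crude GPU detection; a GPU-aware runtime (Shifter, NVidia Docker) exposes the device files
has_gpu = os.path.exists("/dev/nvidia0")
build = GPU_BUILD if has_gpu else CPU_BUILD

# Put the chosen build first on the import path, then run the user's command unchanged
env = dict(os.environ)
env["PYTHONPATH"] = build + os.pathsep + env.get("PYTHONPATH", "")
subprocess.run([sys.executable] + sys.argv[1:], env=env, check=True)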
19. Distributed Training
● Framework support for
distributed training is not
well optimized
● Typically TCP/IP based
protocols e.g. gRPC
● Esoteric to configure
● Want to utilize full
capabilities of the network
● Uber's Horovod
● https://github.com/uber/horovod
● Uses MPI to better leverage the network (InfiniBand/RoCE)
● Minor changes needed to your ML scripts (sketched below)
● Interleaves computation and
communication
● Uses more efficient MPI
collectives where possible
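The "minor changes" are roughly the following, based on Horovod's documented TensorFlow usage at the time (details vary by framework and version); the job is then launched with mpirun, e.g. mpirun -np 16 python train.py.

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                          # 1. initialise Horovod (MPI under the hood)

opt = tf.train.AdagradOptimizer(0.01 * hvd.size())  # 2. scale the learning rate by the number of workers
opt = hvd.DistributedOptimizer(opt)                 # 3. wrap the optimizer so gradients are averaged via MPI collectives

hooks = [hvd.BroadcastGlobalVariablesHook(0)]       # 4. broadcast the initial model state from rank 0
# ...the rest of the model and training loop is unchanged, passing `hooks` to the training session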
20. Horovod vs gRPC Performance
https://www.slideshare.net/AlexanderSergeev4/horovod-distributed-tensorflow-made-easy#slide15
21. Conclusions
● Scaling open source analytics has some non-obvious
gotchas
● Often assumes a traditional cluster environment
● Most challenges revolve around IO and Networking
● There are some promising open source efforts to solve these more thoroughly
● Our Roadmap
● Looking to have stock Docker running on next generation
systems
● Leverage more of Kubernetes' features to provide a cloud-like, self-service HPC model
22. Questions?
rvesse@cray.com
https://cray.box.com/v/sw-data-july-2018
23. References - Containers
Tool Project Homepage/Repository
NERSC Shifter https://github.com/NERSC/shifter
Docker https://docker.com
NVidia Docker https://github.com/NVIDIA/nvidia-docker
Kubernetes https://kubernetes.io
Flannel https://coreos.com/flannel
Weave https://www.weave.works
Cilium https://cilium.io
Calico https://www.projectcalico.org
Romana https://romana.io
24. References - Analytics & Data Science
Tool Project Homepage/Repository
Apache Hadoop https://hadoop.apache.org
Anaconda https://conda.io/docs/
Dask http://dask.pydata.org/en/latest/
NumPy http://www.numpy.org
xarray http://xarray.pydata.org/en/stable/
SciPy https://www.scipy.org
Pandas https://pandas.pydata.org
mpi4py http://mpi4py.scipy.org/docs/
Intel Distribution of Python https://software.intel.com/en-us/distribution-for-python
25. References - Machine Learning
Tool Project Homepage/Repository
TensorFlow https://www.tensorflow.org
gRPC https://grpc.io
Horovod https://github.com/uber/horovod