Matrix factorization (MF) is widely used in recommendation systems. We present cuMF, a highly optimized matrix factorization tool that achieves excellent performance on graphics processing units (GPUs) by fully utilizing GPU compute power and minimizing the overhead of data movement. First, we introduce a memory-optimized alternating least squares (ALS) method that reduces discontiguous memory access and aggressively uses registers to hide memory latency. Second, we combine data parallelism with model parallelism to scale to multiple GPUs.
Results show that with up to four GPUs on one machine, cuMF can be up to ten times as fast on large-scale problems as distributed solutions running on sizable clusters, and performs impressively well when solving the largest matrix factorization problem ever reported.
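The ALS scheme the abstract refers to can be sketched in a few lines. The following is a hypothetical NumPy toy, not cuMF's CUDA implementation; for simplicity it assumes a fully observed rating matrix, whereas real ALS solves per-row normal equations over observed entries only:

```python
import numpy as np

def als(R, f=8, lamb=0.01, iters=20):
    """Alternately solve regularized least squares for user and item factors
    so that R is approximated by X @ Theta.T."""
    m, n = R.shape
    rng = np.random.default_rng(0)
    X = 0.1 * rng.standard_normal((m, f))      # user factors, m x f
    Theta = 0.1 * rng.standard_normal((n, f))  # item factors, n x f
    I = np.eye(f)
    for _ in range(iters):
        # Fix Theta, solve the regularized normal equations for X.
        X = np.linalg.solve(Theta.T @ Theta + lamb * I, Theta.T @ R.T).T
        # Fix X, solve for Theta.
        Theta = np.linalg.solve(X.T @ X + lamb * I, X.T @ R).T
    return X, Theta

# Toy rank-1 "rating" matrix: R[u, v] = (u + 1) * (v + 1)
R = np.outer(np.arange(1.0, 5.0), np.arange(1.0, 4.0))
X, Theta = als(R)
print(np.abs(R - X @ Theta.T).max())  # small reconstruction error
```

The alternation is what makes ALS attractive on GPUs: with one factor matrix fixed, every row of the other can be solved independently, which is exactly the parallelism the rest of the deck exploits.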
2. Please Note:
• IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.
• Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
• The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.
• The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
3. Background: Apache Spark and MLlib
• Apache Spark
– An in-memory engine for large-scale data processing
– Used in database, streaming, machine learning, and graph processing
[Figure: iterative in-memory processing — input flows through iter. 1, iter. 2, …]
4. Background: Apache Spark and MLlib
[Figure: MLlib algorithm families — Classification (LR, SVM…), Trees, Recommendation, Clustering, …]
5. Background: GPU computing
Xeon E5-2687 CPU vs. Tesla K40 GPU — the GPU comes with:
• Slower clock and smaller caches: not optimized for latency
• More transistors devoted to compute
• Higher flops and memory bandwidth
• A design optimized for data-parallel, high-throughput workloads
6. Background: Apache Spark and MLlib
[Figure: MLlib algorithm families — Classification (LR, SVM…), Trees, Recommendation, Clustering, …]
+ (GPU) connectors and libs?
7. Problem: large-scale matrix factorization
• Why
– Recommendation is important in cognitive applications
– Digital ads market in the US: $37.3 billion*: Spark/Facebook/IBM Commerce
– We need a fast and scalable solution
8. Problem: large-scale matrix factorization
• Why
– Factorize the word co-occurrence matrix as a rating matrix
– Obtain word features that embed semantics, e.g.:
man – woman = king – queen = brother – sister …
9. MF: the state of the art
• Many systems are optimized for medium-sized problems; very few target huge problems.
• Distributed solutions are slow.
– They do not push CPU performance to its roofline
– They do not optimize communication
• Distributed solutions require a lot of resources and cost.
10. MF: what we want to achieve
• Scale to problems of any size.
• Fast.
• Cost-efficient.
11. Solution: cuMF - ALS on a machine with GPUs
• On one GPU
– GPU (Nvidia K40): memory BW 288 GB/s, compute 5 Tflops
– Memory is slower than compute → need to optimize memory access!
• The roofline model
– Higher Gflops requires higher operational intensity (more flops per byte) → caching!
[Figure: roofline plot — attainable Gflops/s vs. operational intensity (flops/byte); memory-bandwidth roof at 288 GB/s, compute roof at 5 Tflops, ridge point ≈ 17 flops/byte]
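The ridge point on the slide's roofline follows directly from the two K40 numbers quoted above; a back-of-envelope check (hypothetical helper, using only the slide's figures):

```python
# Roofline back-of-envelope for the Nvidia K40 numbers on the slide.
peak_flops = 5e12   # 5 Tflops
mem_bw = 288e9      # 288 GB/s

# Ridge point: operational intensity where the kernel stops being
# memory-bound and becomes compute-bound.
ridge = peak_flops / mem_bw
print(round(ridge, 1))  # 17.4 flops/byte

def attainable_gflops(op_intensity):
    """Roofline model: attainable Gflops/s for a given flops/byte ratio."""
    return min(peak_flops, mem_bw * op_intensity) / 1e9

print(attainable_gflops(4))   # memory-bound: 288 * 4 = 1152.0 Gflops
print(attainable_gflops(40))  # compute-bound: capped at 5000.0 Gflops
```

Any kernel below ~17 flops/byte is limited by the 288 GB/s memory system rather than the 5 Tflops ALUs, which is why the next slides focus on memory access rather than arithmetic.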
12. Solution: cuMF - ALS on a machine with GPUs
• MO-ALS on one GPU: Memory-Optimized ALS
– Access many θv columns: irregular due to R’s sparseness
– Aggregate many θvθvᵀ terms: memory intensive
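The per-user update behind these two bullets can be sketched as follows (hypothetical Python, not the CUDA kernel): for each user u, gather the θv rows of the items u rated — an irregular access pattern dictated by R's sparsity — and aggregate Σ θvθvᵀ, the memory-intensive step cuMF optimizes.

```python
import numpy as np

def update_user(u, indptr, indices, data, Theta, lamb):
    """Solve (sum_v theta_v theta_v^T + lamb*I) x_u = sum_v r_uv * theta_v
    over the items v that user u actually rated (sparse R in CSR form)."""
    f = Theta.shape[1]
    A = lamb * np.eye(f)
    b = np.zeros(f)
    for k in range(indptr[u], indptr[u + 1]):
        v, r_uv = indices[k], data[k]
        theta_v = Theta[v]                  # irregular gather from Theta
        A += np.outer(theta_v, theta_v)     # aggregate theta_v theta_v^T
        b += r_uv * theta_v
    return np.linalg.solve(A, b)

# Tiny consistency check: ratings generated from x_u = (2, 3) exactly.
Theta = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
indptr, indices = [0, 3], [0, 1, 2]
data = [2.0, 3.0, 5.0]
x_u = update_user(0, indptr, indices, data, Theta, 1e-9)
print(x_u)  # approximately [2. 3.]
```

Each A += outer(...) touches f² values per rated item, so the aggregation, not the final f×f solve, dominates memory traffic — hence the texture- and register-level optimizations on the next slide.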
13. Solution: cuMF - ALS on a machine with GPUs
• Texture memory to smooth discontiguous, irregular memory access
• Register memory to hold hotspot variables
14. Solution: cuMF - ALS on a machine with GPUs
• On multiple GPUs
• Exploit data & model parallelism
– Data parallelism: each GPU solves using a portion of the training data
– Model parallelism: each GPU solves for a portion of the model
• Exploit connection topology to minimize communication overhead
[Figure: rating matrix partitioned data-parallel across GPUs; factor matrix partitioned model-parallel]
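A minimal sketch of the two partitioning schemes above (hypothetical helper functions, assuming simple contiguous block splits; cuMF's actual placement also accounts for the GPU interconnect topology):

```python
import numpy as np

def data_parallel_shards(R, n_gpus):
    """Data parallelism: each GPU gets a block of R's rows (training data)."""
    return np.array_split(R, n_gpus, axis=0)

def model_parallel_shards(Theta, n_gpus):
    """Model parallelism: each GPU owns and updates a block of item factors."""
    return np.array_split(Theta, n_gpus, axis=0)

R = np.arange(24, dtype=float).reshape(6, 4)
shards = data_parallel_shards(R, 4)
print([s.shape for s in shards])  # [(2, 4), (2, 4), (1, 4), (1, 4)]
```

Under data parallelism each GPU computes partial θvθvᵀ aggregates from its rows and the partial results are reduced; under model parallelism each GPU solves only its slice of the factors, so only slice boundaries need to be exchanged.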
16. CuMF Performance
• cuMF: ALS on a single machine with 2× Nvidia K80 (4 GPUs)
– Compared with state-of-the-art distributed solutions:
• 6-10x as fast
• 33-100x as cost-efficient (cuMF costs $2.5 per hour on Softlayer)
– Able to factorize the largest matrix ever reported
17. CuMF Performance
• cuMF: ALS on a machine with one GPU
– 4x speedup over Spark MLlib ALS when used as a Spark ALS accelerator
[Figure: integration stack — Spark ALS on the Spark runtime/MLlib vs. cuMF (in C) invoked from Spark]
18. Roadmap
• Current work
– Impressive acceleration of MF with GPUs on one machine
– GPU acceleration techniques with model and data parallelism
– Illustrated applicability of GPU acceleration to Spark/MLlib
– Performance evaluations on K40 and K80 GPUs, Intel and Power
• Future work
– GPU acceleration of other ML algorithms in MLlib or elsewhere
– Acceleration of algorithms for multiple GPUs on single and across machines, with and without RDMA across machines
– Performance evaluation on other hardware, including:
• Other GPUs such as Nvidia Maxwell
• Forthcoming NVLink connectivity across GPUs within a single machine
20. Notices and Disclaimers (cont’d)
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly
available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance,
compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the
suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to
interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights,
trademarks or other intellectual property right.
• IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document
Management System™, FASP®, FileNet®, Global Business Services ®, Global Technology Services ®, IBM ExperienceOne™, IBM
SmartCloud®, IBM Social Business®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON,
OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®,
pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®, StoredIQ,
Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of
International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at:
www.ibm.com/legal/copytrade.shtml.