Molecular Shape Searching on GPUs: A Brave New World
1. FastROCS: What does it mean to be “fast”?
OpenEye Scientific Software
Brian Cole
March 26, 2013
© 2013 OpenEye Scientific Software
2. FastROCS and the “Chasm”
3. ROCS: Rapid Overlay of Chemical Structures
5. And then you wait…
6. What is FastROCS?
[Bar chart: shape overlays per second (higher is better), CPU vs. GPU]
7. What is FastROCS?
[Bar chart: shape overlays per second (higher is better), CPU vs. GPU; y-axis ticks from 1 to 1,000,000 in powers of ten]
8. What is FastROCS?
[Bar chart: shape overlays per second (higher is better), CPU vs. GPU; y-axis from 0 to 600,000]
9. But I want it now!
[Plot: log(elapsed time in seconds), lower is better, vs. log(cores/GPUs); series: ROCS and FastROCS]
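As a rough illustration of how throughput translates into waiting time, the sketch below estimates elapsed time from a database size and an overlay rate. The function name, the database size, and both rates are hypothetical placeholders, not values read from the charts, and the estimate assumes ideal linear scaling across devices, which real hardware may not reach.

```python
def estimated_search_seconds(num_conformers, overlays_per_second, num_devices=1):
    """Estimate elapsed seconds for one query, assuming ideal linear scaling."""
    if overlays_per_second <= 0 or num_devices < 1:
        raise ValueError("rate must be positive and num_devices at least 1")
    return num_conformers / (overlays_per_second * num_devices)


# Hypothetical inputs chosen only for illustration.
database_conformers = 10_000_000
slow_rate = 1_000      # overlays per second on a slow device (assumed)
fast_rate = 500_000    # overlays per second on a fast device (assumed)

slow_seconds = estimated_search_seconds(database_conformers, slow_rate)
fast_seconds = estimated_search_seconds(database_conformers, fast_rate)
print(f"slow: {slow_seconds:.0f} s, fast: {fast_seconds:.0f} s")
```

Under these assumed numbers the same query drops from hours to seconds, which is the kind of gap the log-scale plot is meant to convey.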
10. Riding Moore’s Law
[Bar chart: shape overlays per second (higher is better) by GPU model: C1060, C2050, C2075, C2090, K10, K20; y-axis from 0 to 2,000,000]
11. ROCS user base
• Every Pharma R&D
• Many BioTechs
• Many Universities
• National Labs and Research Centers
• Other software companies
12. Licenses by Year
[Chart: ROCS and FastROCS licenses by year, 2009 to 2012 (higher is better)]
13. Licenses by Year (Linear Scale)
[Chart: ROCS and FastROCS licenses by year, 2009 to 2012, linear scale; annotations: “Pharmageddon”, 15%]
14. All ROCS users (linear scale)
[Chart: ROCS and FastROCS users by year, 2009 to 2012, linear scale; annotations: “Academics”, 3%]
16. What’s in the “chasm”?
• “ROCS is already fast enough”
• “The results aren’t bitwise comparable”
• “There’s nothing else to run on the GPU”
• “GPUs are different”
[Slide annotations: “Some other time…” and “GTC!”]
17. FastROCS Quick Start
• ctrl-alt-F1 (to switch to a non X-server terminal)
• login as root
• /sbin/init 3 (to turn off the X-server)
• ./NVIDIA-Linux-x86_64-285.05.09.run
• reboot
• ./cuda.sh to give /dev/nvidia* correct permissions
• tar -xzf fastrocs-1.3.1-RHEL5-x64-OpenCL-1.1-CUDA-4.1.tar.gz
• openeye/bin/ShapeDatabaseServer.py database.oeb.gz
• openeye/bin/ShapeDatabaseClient.py localhost:8080 query.sdf out.sdf
18. ROCS Quick Start
• tar -xzf ROCS-3.1.1-RHEL5-x64.tar.gz
• openeye/bin/rocs query.sdf database.oeb.gz
[Slide annotation: “Still a barrier to entry to work around!”]
19. This is even worse!
fastrocs-1.3.1-RHEL5-x64-OpenCL-1.1-CUDA-4.1.tar.gz
NVidia OpenCL binaries are tightly locked to a particular driver version
20. Worthwhile to upgrade
[Bar chart: conformers per second (higher is better), C2050 with the 260 driver vs. C2050 with the 295 driver; y-axis from 0 to 800,000; annotation: 11%]
21. Needed for new hardware
[Bar chart: conformers per second (higher is better), C2050 (295 driver) vs. M2090 (295 driver); y-axis from 0 to 1,200,000]
22. Scalability between drivers (4x C2050)
[Plot: speedup (single-GPU time / multi-GPU time, higher is better) vs. number of GPUs, 1 to 4; series: ideal, 260 driver, 295 driver]
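The speedup on the y-axis is the single-GPU time divided by the multi-GPU time. A minimal sketch of that calculation follows, with parallel efficiency (speedup divided by GPU count) added as a common companion metric that the slides do not mention. The timings are hypothetical and are not measurements from the talk.

```python
def speedup(single_gpu_seconds, multi_gpu_seconds):
    """Speedup as defined on the axis: single-GPU time / multi-GPU time."""
    return single_gpu_seconds / multi_gpu_seconds


def parallel_efficiency(single_gpu_seconds, multi_gpu_seconds, num_gpus):
    """Fraction of ideal (linear) scaling achieved; 1.0 means ideal."""
    return speedup(single_gpu_seconds, multi_gpu_seconds) / num_gpus


# Hypothetical elapsed times in seconds, keyed by number of GPUs.
timings = {1: 100.0, 2: 55.0, 4: 40.0}

for num_gpus, seconds in timings.items():
    s = speedup(timings[1], seconds)
    e = parallel_efficiency(timings[1], seconds, num_gpus)
    print(f"{num_gpus} GPU(s): speedup {s:.2f}, efficiency {e:.2f}")
```

An "ideal" line on such a plot corresponds to an efficiency of 1.0 at every GPU count.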
23. Really bad for 8x M2090
[Plot: speedup (single-GPU time / multi-GPU time, higher is better) vs. number of GPUs, 1 to 8]
24. Ways to transfer to device
• CL_MEM_USE_HOST_PTR
– kernelBuf = clCreateBuffer(CL_MEM_USE_HOST_PTR)
• CL_MEM_ALLOC_HOST_PTR|CL_MEM_COPY_HOST_PTR
– kernelBuf = clCreateBuffer(CL_MEM_ALLOC_HOST_PTR|CL_MEM_COPY_HOST_PTR)
• clEnqueueWriteBuffer
– kernelBuf = clCreateBuffer() - cacheable
– clEnqueueWriteBuffer(kernelBuf, data)
• clEnqueueMapBuffer
– kernelBuf = clCreateBuffer() - cacheable
– ptr = clEnqueueMapBuffer(kernelBuf, CL_MAP_WRITE)
– memcpy(ptr, data)
– clEnqueueUnmapMemObject(ptr)
• CL_MEM_ALLOC_HOST_PTR
– kernelBuf = clCreateBuffer(CL_MEM_ALLOC_HOST_PTR) - cacheable
– ptr = clEnqueueMapBuffer(kernelBuf, CL_MAP_WRITE)
– memcpy(ptr, data)
– clEnqueueUnmapMemObject(ptr)
• oclCopyCompute
– pinnedBuf = clCreateBuffer(CL_MEM_ALLOC_HOST_PTR|CL_MEM_READ_WRITE) - cacheable
– pinnedPtr = clEnqueueMapBuffer(pinnedBuf, CL_MAP_WRITE) - cacheable
– memcpy(pinnedPtr, data)
– kernelBuf = clCreateBuffer() - cacheable
– clEnqueueWriteBuffer(kernelBuf, pinnedPtr)
25. Ways to transfer from device
• CL_MEM_ALLOC_HOST_PTR
– kernelBuf = clCreateBuffer(CL_MEM_ALLOC_HOST_PTR) - cacheable
– ptr = clEnqueueMapBuffer(kernelBuf, CL_MAP_READ)
– memcpy(data, ptr)
– clEnqueueUnmapMemObject(ptr)
• clEnqueueMapBuffer
– kernelBuf = clCreateBuffer() - cacheable
– ptr = clEnqueueMapBuffer(kernelBuf, CL_MAP_READ)
– memcpy(data, ptr)
– clEnqueueUnmapMemObject(ptr)
• clEnqueueReadBuffer
– kernelBuf = clCreateBuffer() - cacheable
– clEnqueueReadBuffer(kernelBuf, data)
• oclCopyCompute
– pinnedBuf = clCreateBuffer(CL_MEM_ALLOC_HOST_PTR|CL_MEM_READ_WRITE) - cacheable
– pinnedPtr = clEnqueueMapBuffer(pinnedBuf, CL_MAP_READ) - cacheable
– kernelBuf = clCreateBuffer() - cacheable
– clEnqueueReadBuffer(kernelBuf, pinnedPtr)
– memcpy(data, pinnedPtr)
26. FastROCS scalability across 8x M2070
[Plot: speedup (time sequential / time parallel) vs. number of GPUs utilized, 1 to 8; y-axis from 0 to 9]
27. Lessons from the mess
• clEnqueueWriteBuffer > clEnqueueMapBuffer
• clEnqueueMapBuffer >> clEnqueueReadBuffer
• CL_MEM_* constants aren’t worth the effort
28. CUDA?
• Serious customers will only use NVidia cards
• Pinned memory
• Better support for binaries and compatibility
• CUDA support >> OpenCL support
29. FastROCS CUDA port
[Bar chart: conformers per second (higher is better) for OpenCL, CUDA, and CUDA-pinned; series: 2xC2075, 2xC2090, 2xK20; y-axis from 0 to 3,000,000]
30. CUDA Scaling?
[Plot: conformers per second (higher is better) vs. number of individual K10 GPUs, 1 to 8; series: CUDA, OpenCL, ideal; y-axis from 0 to 8,000,000. Note: each K10 has 2 physical GPUs on the board.]
31. CUDA vs OpenCL: Ding Ding!
• Portability vs Innovation
• NVidia vs Intel and AMD
• Open vs Proprietary
• Customers don’t care…
32. ROCS Implementations
• We only care a little…
• Fortran code (1995)
• C code (1999)
• C++ wrapper code (2003)
• OpenCL code (2009)
• CUDA code (2012)
• C++ thread-safe code (2013)
33. OpenEye Software
• Lots of Software
– 14 products
– 13 software libraries
• C++ (no SIMD)
– 2.5 million lines
• Python
– 416 thousand lines
• Java
– 63 thousand lines
• C#
– 38 thousand lines
34. The People
[Pie chart: Programmers, Hardcore Scripter, Other stuff; values shown: 10, 20, 12]
• GPGPU = ½ of a developer
– Only 2.5% of development effort
38. I Believe…
• GPGPU computing can become ubiquitous…
• By expressing parallelism everywhere…
• We can make it easy for our customers…
– Pre-installed in every operating system
– Integrated seamlessly into every language
– Then eventually becoming the CPU
40. Father of “ROCS”
Andrew Grant
April 28th 1963
December 29th 2012
42. Dude, where’s my color?
[Bar chart: DUD average AUC (y-axis from 0 to 0.9), shape only vs. with color; series: ROCS and FastROCS]
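For readers unfamiliar with the metric, here is a minimal sketch of how an average AUC over several targets could be computed, assuming AUC means the area under the ROC curve for ranking actives above decoys. The function names and all scores are hypothetical; this is not OpenEye's evaluation code, and the numbers are not from DUD.

```python
def roc_auc(active_scores, decoy_scores):
    """ROC AUC as the chance that a random active outscores a random decoy.

    Ties count as half a win. Higher scores are assumed to mean better matches.
    """
    if not active_scores or not decoy_scores:
        raise ValueError("need at least one active and one decoy")
    wins = 0.0
    for active in active_scores:
        for decoy in decoy_scores:
            if active > decoy:
                wins += 1.0
            elif active == decoy:
                wins += 0.5
    return wins / (len(active_scores) * len(decoy_scores))


def average_auc(per_target_scores):
    """Mean AUC over (active_scores, decoy_scores) pairs, one pair per target."""
    aucs = [roc_auc(actives, decoys) for actives, decoys in per_target_scores]
    return sum(aucs) / len(aucs)


# Hypothetical similarity scores for two made-up targets.
targets = [
    ([0.9, 0.8], [0.7, 0.85, 0.3]),
    ([0.6], [0.5, 0.6]),
]
print(f"average AUC: {average_auc(targets):.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to every active ranked above every decoy.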
43. ROCS vs FastROCS Histogram
[Histogram: number of targets (y-axis from 0 to 12) vs. Kendall tau correlation coefficient (x-axis from 0.10 to 1.00 in steps of 0.05)]
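The Kendall tau coefficient measures how similarly two methods rank the same items. The sketch below implements the simple tau-a variant in plain Python and applies it to made-up scores for the same four molecules from two methods. Which tau variant the histogram uses, and how scores were paired per target, are not stated on the slide, so both are assumptions here.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical scores for four molecules; the middle two swap rank order.
rocs_scores = [1.8, 1.5, 1.2, 0.9]
fastrocs_scores = [1.79, 1.2, 1.5, 0.91]

tau = kendall_tau_a(rocs_scores, fastrocs_scores)
print(f"Kendall tau-a: {tau:.3f}")
```

A value of 1.0 means the two rankings agree on every pair even when the raw scores differ, which is why rank correlation is a useful check when results are not bitwise comparable.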