Final apu13 phil-rogers-keynote-21
1. THE PROGRAMMER’S GUIDE TO REACHING FOR THE CLOUD
PHIL ROGERS, CORPORATE FELLOW, AMD
NOV. 11, 2013
2. MODERN CLOUD WORKLOADS ARE HETEROGENEOUS
SCALAR CONTENT WITH A GROWING MIX OF PARALLEL CONTENT
Video is expected to represent two thirds of mobile data traffic by 2017
‒ Video is continuously being captured, uploaded, transcoded and streamed
‒ Video processing is inherently parallel … and can be accelerated
Big data growing exponentially with Exabytes of data crawled monthly
‒ Indexing the web and extracting high definition information
‒ Map reduce is a heterogeneous workload
Natural User Interfaces are still in their infancy
‒ Accurate extraction of meaning from gesture and voice
‒ Getting to the fingertips and voice inflections
NEED TO SIMULTANEOUSLY
INCREASE PERFORMANCE AND
REDUCE POWER
2 | APU-13 KEYNOTE | NOVEMBER 11, 2013 | PUBLIC
3. FUTURE TECHNOLOGY GROWTH WILL ACCELERATE THE TREND
Rapid growth of Sensor Networks
‒ Drives exponential increase in data
Internet of Everything (IoE) results in explosion of data sources
‒ Another exponential growth in data at local and cloud level
Context Aware Computing is a Huge Big-Data Problem
‒ Both local and cloud compute must get faster/lower power
RAPID GROWTH OF THE NUMBER OF THINGS CONNECTED TO THE INTERNET
[Chart, 1995 to 2020, rising to 50B: “Fixed” Computing (you go to the device), Mobility / BYOD (the device goes with you), Internet of Things (age of devices), Internet of Everything (people, process, data, things)]
HOW MUCH VALUE IS AT STAKE IN THE IOE ECONOMY?
$14.4 trillion: $9.5 trillion from industry-specific use cases, $4.9 trillion from cross-industry use cases
DRIVING FUTURE DEMAND FOR LOCAL AND CLOUD PARALLEL EFFICIENCY
Source: Cisco IBSG, 2013
4. HSA APU PROCESSORS OPERATE HARMONIOUSLY AT LOW POWER
EXAMPLE: VIDEO ENHANCEMENT
Techniques include:
‒ Image Stabilization, Super Resolution, Deblur, Deinterlace, Lighting & Contrast
Enhancements examine pixels from a large number of video frames
‒ Super-resolution based on information from surrounding frames
Algorithms can be run on multiple processors in the APU
‒ CPU, GPU, DSPs, Fixed Function Accelerators
‒ Convolutions, motion estimation, histograms,
format conversions, etc.
‒ Processing flows freely between processors
for best efficiency
5. HETEROGENEOUS PROCESSORS - EVERYWHERE
SMARTPHONES TO SUPER-COMPUTERS
Super computer
Dense Server
Tablet
Phone
Workstation
Notebook
A SINGLE SCALABLE ARCHITECTURE
FOR THE WORLD’S PROGRAMMERS
IS DEMANDED AT THIS POINT
6. HOW DOES HSA MAKE THIS ALL WORK?
Enables acceleration of languages like Java, C++ AMP and Python
All processors use the same addresses, and can share data structures in place
Heterogeneous computing can use all of virtual and physical memory
Extends multicore coherency to the GPU and other processors
Pass work quickly between the processors
Enables quality of service
HSA FOUNDATION – BUILDING
THE ECOSYSTEM
8. HSA FOUNDATION AT LAUNCH
BORN IN JUNE 2012
Founders
9. HSA FOUNDATION TODAY – NOVEMBER 2013
A GROWING AND POWERFUL FAMILY
Founders
Promoters
Supporters
Contributors
TBA at APU-13
Universities
NTHU Programming
Language Lab
NTHU System
Software Lab
COMPUTER SCIENCE
10. HSA FOUNDATION PROGRESS
WHAT AN AMAZING FIRST YEAR
Membership growing rapidly
‒ 2-3 new members per month
‒ Universities enrolling
Four working groups generating specifications
‒ HSA Programmers Reference Manual published
‒ HSA System Architecture spec going to ratification by the
end of the year
‒ Runtime WG and Tools WG will publish early next year
HSA Development platforms to ship in early 2014
11. PROGRAMMING LANGUAGES PROLIFERATING ON HSA
[Software stack diagram]
Applications: OpenCL™ App, Java App, C++ AMP App, Python App
Language runtimes: OpenCL Runtime, Java JVM (Sumatra), Various Runtimes, Fabric Engine RT
HSAIL
HSA Helper Libraries, HSA Core Runtime, HSA Finalizer
Kernel Fusion Driver (KFD)
13. HIGH EFFICIENCY VIDEO CODEC – HEVC (H.265)
VALUE PROPOSITION
HEVC VISUAL QUALITY IS SIGNIFICANTLY BETTER THAN H.264 AT ANY GIVEN BIT RATE
[Image comparison: H.265 @ 500 kbps versus H.264 @ 500 kbps]
30% TO 50% MORE EFFICIENT THAN H.264 AT 1080P RESOLUTION
4K VIDEO BENEFITS ARE EVEN MORE SIGNIFICANT WITH HEVC
4K Ultra HDTV: Sony XBR, $4999
4K Video Cameras: GoPro, $399
14. HIGH EFFICIENCY VIDEO CODEC – HEVC (H.265)
WHY HEVC WILL PROLIFERATE
The next generation MPEG video encoding standard
Significantly higher efficiency (up to 50% lower bit
rates at given quality) than AVC (H.264)
Highly beneficial for HD video (1080p or below)
Especially beneficial for 4K video
Scales to 8K Ultra High Definition video (up to
8192×4320)
Computationally complex, but by design easier to
parallelize than H.264
[Chart: mobile data traffic, Exabytes Per Month (axis 0 to 12), 2012 to 2017. Traffic Share: Mobile Video 66.5%, Mobile Web/Data 24.9%, Mobile M2M 5.1%, Mobile File Sharing 3.5%]
CLOUD VIDEO PROVIDERS NEED THE HIGHER COMPRESSION FOR QUALITY OF SERVICE
Source: Cisco VNI Mobile Forecast, 2013
15. HEVC (H.265) ACCELERATION
EFFICIENT CLOUD DEPLOYMENT
ALL STAGES OF HEVC ARE ACCELERATED ON THE APU
Decrypt → Decode and decompress → Scaling and Enhancement → Encode and compress → Encrypt
ENCODE IS THE HEAVIEST STAGE
‒ Leverage point for compression
‒ Highly parallel
‒ Algorithms improve monthly
‒ Must stay programmable
H.265 ENCODING IS 5 – 10X MORE COMPUTATIONALLY COMPLEX THAN H.264
‒ Picture can be divided into Macroblock regions with a much wider range of sizes and shapes
‒ Intra prediction has 33 directional modes compared to 8 for H.264
16. OVERVIEW OF B+ TREES
B+ Trees are a special case of B Trees
A B+ Tree …
‒ is a dynamic, multi-level index
‒ is efficient for retrieval of data stored in a block-oriented context
Fundamental data structure used in several
popular database management systems
‒ SQLite
‒ CouchDB
Order (b) of a B+ Tree measures the capacity of its nodes
[Diagram: example B+ Tree indexing keys 1 to 8, with leaf entries pointing to data records d1 to d8]
17. APPLICATIONS THAT USE B/B+ TREES
primary data store on the clientside
multi-data center key-value store
Mail, Safari, iPhone, iPod, iTunes
market-data framework
Firefox and Thunderbird
large hadron collider
Android, Chrome
http://www.sqlite.org/famous.html
http://wiki.apache.org/couchdb/CouchDB_in_the_wild
18. HOW WE ACCELERATE
Utilize coarse-grained parallelism in B+ Tree searches
‒ Perform many queries in parallel
‒ Increase memory bandwidth utilization with parallel reads
‒ Increase throughput (transactions per second for OLTP)
B+ Tree searches on an HSA enabled APU
‒ Allows much larger B+ Trees to be searched, than traditional GPU compute
‒ Eliminates data-copies since CPU and GPU cores can access the same memory
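The coarse-grained parallelism described above can be sketched in Java, the language used elsewhere in this deck. This is an illustration only, not the OpenCL implementation behind the measured results: the class `BPlusTreeBatchSearch`, its bulk-loading `build` method, and the `int` keys and values are all hypothetical. The batch of queries is spread across CPU cores with a parallel stream, and nothing here dispatches work to a GPU.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BPlusTreeBatchSearch {
    /** Read-only node: leaves hold keys and values, internal nodes hold children. */
    static final class Node {
        final int[] keys;      // leaf: stored keys; internal: smallest key under each child
        final int[] values;    // leaf only, null for internal nodes
        final Node[] children; // internal only, null for leaves
        Node(int[] keys, int[] values, Node[] children) {
            this.keys = keys; this.values = values; this.children = children;
        }
        boolean isLeaf() { return children == null; }
    }

    /** Bulk-loads a tree from non-empty keys sorted ascending; order = max entries per node. */
    static Node build(int[] sortedKeys, int[] values, int order) {
        if (sortedKeys.length == 0 || order < 2) throw new IllegalArgumentException();
        List<Node> level = new ArrayList<>();
        for (int i = 0; i < sortedKeys.length; i += order) {
            int end = Math.min(i + order, sortedKeys.length);
            level.add(new Node(Arrays.copyOfRange(sortedKeys, i, end),
                               Arrays.copyOfRange(values, i, end), null));
        }
        while (level.size() > 1) {            // stack internal levels until one root remains
            List<Node> parents = new ArrayList<>();
            for (int i = 0; i < level.size(); i += order) {
                int end = Math.min(i + order, level.size());
                Node[] kids = level.subList(i, end).toArray(new Node[0]);
                int[] mins = new int[kids.length];
                for (int k = 0; k < kids.length; k++) mins[k] = kids[k].keys[0];
                parents.add(new Node(mins, null, kids));
            }
            level = parents;
        }
        return level.get(0);
    }

    /** Single lookup: descend to a leaf, then binary-search it. */
    static int search(Node root, int key, int missing) {
        Node n = root;
        while (!n.isLeaf()) {
            int pos = Arrays.binarySearch(n.keys, key);
            // Pick the last child whose smallest key is <= key (clamped to the first child).
            int child = pos >= 0 ? pos : Math.max(0, -pos - 2);
            n = n.children[child];
        }
        int pos = Arrays.binarySearch(n.keys, key);
        return pos >= 0 ? n.values[pos] : missing;
    }

    /** Many independent queries at once; results keep the order of the queries. */
    static int[] searchAll(Node root, int[] queries, int missing) {
        return Arrays.stream(queries).parallel()
                     .map(q -> search(root, q, missing))
                     .toArray();
    }

    public static void main(String[] args) {
        int[] keys = {1, 2, 3, 4, 5, 6, 7, 8};
        int[] vals = {10, 20, 30, 40, 50, 60, 70, 80};
        Node root = build(keys, vals, 3);
        System.out.println(Arrays.toString(searchAll(root, new int[] {8, 1, 9, 4}, -1)));
    }
}
```

Because the tree is never modified during the batch, the queries need no locking; that independence is what makes the workload coarse-grained parallel.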
19. RESULTS
1M search queries in parallel
Input B+ Tree contains 112 million keys and uses 6GB of memory
Software: OpenCL on HSA
Hardware: AMD “Kaveri” APU with Quad Core CPU and 8 GCN Compute Units at 35W TDP
Baseline: 4-core OpenMP + hand-tuned SSE CPU implementation
[Chart: Speedup (axis 0 to 7) versus Order of B+ Tree (8, 16, 32, 64, 128)]
Results measured in AMD Labs on “Kaveri” APU, 35W TDP, 16GB DRAM
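To make the "Order of B+ Tree" axis concrete, here is a small worked calculation of how many levels a lookup might traverse for the 112 million keys quoted above. It assumes, purely for illustration, that every node is completely full and holds exactly `order` entries; real trees are only partly filled, so actual heights can be larger. The class and method names are hypothetical.

```java
public class BPlusTreeHeight {
    /** Smallest number of levels such that order^levels >= keyCount (full nodes assumed). */
    static int levels(long keyCount, int order) {
        int levels = 1;
        long capacity = order;          // keys reachable through a single level
        while (capacity < keyCount) {
            capacity *= order;          // each extra level multiplies the reach by the order
            levels++;
        }
        return levels;
    }

    public static void main(String[] args) {
        long keys = 112_000_000L;       // key count quoted on the results slide
        for (int order : new int[] {8, 16, 32, 64, 128}) {
            System.out.println("order " + order + " -> " + levels(keys, order) + " levels");
        }
    }
}
```

Under that assumption a higher order means fewer dependent node visits per query, at the cost of a wider search inside each node.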
20. REVERSE TIME MIGRATION (RTM)
A technique for creating images based on sensor data to improve seismic interpretations done by geophysicists
A memory-intensive and highly parallel algorithm
RTM is run on massive data sets
A natural scale out algorithm
Often run today on 100K node CPU systems
Bringing this to HSA and APU based supercomputing will increase performance for current sensor arrays, and allow more sensors and accuracy in the future.
[Images: Land crews, Marine crews]
HOWEVER, SPEED OF PROCESSING AND INTERPRETATION IS A CRITICAL BOTTLENECK IN MAKING FULL USE OF ACQUISITION ASSETS
21. TEXT ANALYTICS – HADOOP TERASORT AND BIG DATA SEARCH
MINING BIG DATA
Multi-stage pipeline or parallel processing stages
Traditional GPU Compute is challenged by copies
APU with HSA accelerates each stage in place
‒ Sort
‒ Compression
‒ Regular expression parsing
‒ CRC generation
Acceleration of large data search scales out across the cluster of APU nodes
[Diagram: MapReduce data flow. Input HDFS (split 0, split 1, split 2), map, sort, copy, merge, reduce, Output HDFS (part 0 and part 1, each with HDFS Replication)]
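A toy version of such a multi-stage pipeline can be written with standard Java 8 APIs. The sketch below covers only three of the stages named on this slide (regular expression parsing, sort, CRC generation); it is not Hadoop code and runs on CPU cores only. The class `MiniSortPipeline` and its method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.zip.CRC32;

public class MiniSortPipeline {
    private static final Pattern NON_WORD = Pattern.compile("\\W+");

    /** Map stage: regex-tokenize one input split and sort its tokens locally. */
    static List<String> mapSplit(String split) {
        return NON_WORD.splitAsStream(split)
                .filter(w -> !w.isEmpty())      // a leading delimiter yields an empty token
                .map(String::toLowerCase)
                .sorted()
                .collect(Collectors.toList());
    }

    /** Reduce stage: process splits in parallel, then produce one globally sorted list.
     *  For brevity this re-sorts the concatenated runs instead of merging them. */
    static List<String> run(List<String> splits) {
        return splits.parallelStream()
                .map(MiniSortPipeline::mapSplit)
                .flatMap(List::stream)
                .sorted()
                .collect(Collectors.toList());
    }

    /** CRC stage: checksum of the output records, in order. */
    static long crc(List<String> records) {
        CRC32 crc = new CRC32();
        for (String r : records) {
            crc.update(r.getBytes(StandardCharsets.UTF_8));
        }
        return crc.getValue();
    }

    public static void main(String[] args) {
        List<String> out = run(Arrays.asList("delta alpha", "Charlie, bravo"));
        System.out.println(out + " crc=" + Long.toHexString(crc(out)));
    }
}
```

Each stage consumes the previous stage's output in the same address space, which is the property the slide attributes to HSA for data shared between CPU and GPU.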
23. PROGRAMMING MODELS EMBRACING HSAIL AND HSA
THE RIGHT LEVEL OF ABSTRACTION
UNDER DEVELOPMENT
Java: Project Sumatra OpenJDK 9
OpenMP from SuSE
C++ AMP, based on CLANG/LLVM
Python and KL from Fabric Engine
NEXT
DSLs: Halide, Julia, Rust
Fortran
JavaScript
Open Shading Language
R
24. HSA ENABLES DEVELOPERS TO LEVERAGE HC … EASILY & NATURALLY
PREFERRED PROGRAMMING LANGUAGES
Java, C++, OpenMP, Python *
SVM, Coherence, GPU Enqueue
OpenJDK/Sumatra, Fabric Engine
TRANSPARENT CALLS TO POPULAR LIBRARIES
OpenCV, SciPy, NumPy, ImageMagick, Bolt, …
Arbitrary data structures, SVM, Coherence, User mode queueing
OpenCV API, Bolt STL library
USING CONVENTIONAL METHODS
Arbitrary data structures, malloc, function pointers, callbacks, recursion, semaphores, atomics
SVM, Coherence, User-mode queueing, GPU Enqueue, HSAIL
Linked-list/tree traversal + other complex shared host data structures
* Java 8, C++ AMP, OpenMP 4.0 next generation standards and extensions
25. C++ AMP ACCELERATION GOES MULTI-PLATFORM
Herb Sutter Announced C++ AMP for the Windows® Platform at ADS 2011
We very much liked the single source model of development, and decided to extend it
to be multi-platform
Today we are announcing C++ AMP is moving beyond Microsoft® Windows to embrace
Linux. We will offer this acceleration on both our APUs and our discrete GPUs
We are also bringing Bolt STL Library support to C++ AMP
AVAILABLE IN OPEN SOURCE 1H-2014
[Compiler flow diagram: C++ AMP, CLANG Front-end, LLVM-IR or SPIR 1.2, LLVM Compiler, then either HSAIL for Any HSA Implementation or SPIR 1.2 for Any OpenCL™+SPIR Implementation]
26. HSA ENABLEMENT OF JAVA
JAVA 7 – OpenCL ENABLED APARAPI
AMD initiated Open Source project
APIs for data parallel algorithms
‒ GPU accelerate Java applications
‒ No need to learn OpenCL™
Active community captured mindshare
‒ ~20 contributors
‒ >7000 downloads
‒ ~150 visits per day
[Stack diagram: Java Application, APARAPI API, OpenCL™, OpenCL™ Compiler & Runtime, JVM, CPU ISA and GPU ISA, CPU and GPU]
JAVA 8 – HSA ENABLED APARAPI
Java 8 brings Stream + Lambda API.
‒ More natural way of expressing data parallel algorithms
‒ Initially targeted at multi-core.
APARAPI will:
‒ Support Java 8 Lambdas
‒ Dispatch code to HSA enabled devices at runtime via HSAIL
[Stack diagram: Java Application, APARAPI + Lambda API, HSAIL, HSA Finalizer & Runtime, JVM, CPU ISA and GPU ISA, CPU and GPU]
JAVA 9 – HSA ENABLED JAVA (SUMATRA)
Adds native GPU acceleration to Java Virtual Machine (JVM)
Developer uses JDK Lambda, Stream API
JVM uses GRAAL compiler to generate HSAIL
JVM decides at runtime to execute on either CPU or GPU depending on workload characteristics.
[Stack diagram: Java Application, Java JDK Stream + Lambda API, JVM with Java GRAAL JIT backend, HSAIL, HSA Finalizer & Runtime, CPU ISA and GPU ISA, CPU and GPU]
We will provide HSA Enabled Aparapi on Java 8 to bridge between Aparapi on Java 7 and HSA/Sumatra on Java 9
27. JAVA DEMO
WELCOME GARY FROST TO THE STAGE
28. NBODY REVISITED
NBody problem:
‒ Calculate the position of ‘N’ bodies in 3D space by computing the gravitational effect each has on all of the others and updating its position.
A Java sequential NBody implementation would start with an Object for each Body.
public class Body {
    // State of object
    private float x, y, z, m, vx, vy, vz;
    // Method to update position relative to other bodies
    void updatePosition(Body[] bodies) { /* code omitted */ }
}
Then we would iterate over all bodies updating the position of each
for (Body b : bodies) {
    b.updatePosition(bodies);
}
A pre Java 8 Java ‘parallel’ version would not fit so nicely on this slide ;)
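For comparison, here is roughly what a pre-Java 8 parallel version might look like using `java.util.concurrent`. It is a sketch under stated assumptions, not the code from the talk: the class `NBodyExecutor`, the split of the update into `updateVelocity` and `move`, the softening term, and a gravitational constant of 1 are all choices made for this example. The two phases exist so that no thread reads a position while another thread is writing it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class NBodyExecutor {
    static final class Body {
        float x, y, z, m, vx, vy, vz;

        Body(float x, float y, float z, float m) {
            this.x = x; this.y = y; this.z = z; this.m = m;
        }

        // Phase 1: add the pull of every other body to this body's velocity.
        // Only positions and masses are read, so bodies can be processed concurrently.
        void updateVelocity(Body[] bodies, float dt) {
            float ax = 0f, ay = 0f, az = 0f;
            for (Body o : bodies) {
                if (o == this) continue;
                float dx = o.x - x, dy = o.y - y, dz = o.z - z;
                float d2 = dx * dx + dy * dy + dz * dz + 1e-3f; // softening term (assumed)
                float s = o.m / (d2 * (float) Math.sqrt(d2));   // G taken as 1 (assumed)
                ax += dx * s; ay += dy * s; az += dz * s;
            }
            vx += ax * dt; vy += ay * dt; vz += az * dt;
        }

        // Phase 2: advance the position; run only after every velocity is updated.
        void move(float dt) { x += vx * dt; y += vy * dt; z += vz * dt; }
    }

    static void step(final Body[] bodies, final float dt, ExecutorService pool, int tasks) {
        List<Callable<Void>> work = new ArrayList<Callable<Void>>();
        int chunk = (bodies.length + tasks - 1) / tasks;
        for (int t = 0; t < tasks; t++) {
            final int from = t * chunk;
            final int to = Math.min(bodies.length, from + chunk);
            work.add(new Callable<Void>() {
                public Void call() {
                    for (int i = from; i < to; i++) bodies[i].updateVelocity(bodies, dt);
                    return null;
                }
            });
        }
        try {
            for (Future<Void> f : pool.invokeAll(work)) f.get(); // wait and surface failures
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
        for (Body b : bodies) b.move(dt);
    }

    static void simulate(Body[] bodies, float dt, int steps, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (int i = 0; i < steps; i++) step(bodies, dt, pool, threads);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Body[] bodies = { new Body(-1f, 0f, 0f, 1f), new Body(1f, 0f, 0f, 1f) };
        simulate(bodies, 0.01f, 10, 2);
        System.out.println(bodies[0].x + " " + bodies[1].x);
    }
}
```

Most of the listing is thread-pool plumbing (chunking, anonymous `Callable` classes, exception handling) rather than physics.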
29. JAVA 8’S ‘PROJECT LAMBDA’ SIMPLIFIES PARALLEL PROGRAMMING
Offers an alternate syntax for processing arrays/collections of data
for (Body b : bodies)
    b.updatePosition(bodies);
Arrays.stream(bodies) // wrap array in a stream
    .forEach(b -> b.updatePosition(bodies));
To process a stream in parallel we just tag the stream with the parallel() modifier
Arrays.stream(bodies) // wrap an array in a stream
    .parallel() // tag the stream as parallel
    .forEach(b -> b.updatePosition(bodies));
In Java 8 a parallel stream executes across all CPU cores.
In Java 9 (Sumatra) a parallel stream executes across all CPU and GPU cores
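A self-contained Java 8 sketch of the parallel-stream form follows. It runs on CPU cores only. The class `NBodyStreams`, the `sample` helper, the softening term, and the split of the slide's single `updatePosition` into `updateVelocity` and `move` are all choices made for this example; the split keeps threads from reading positions that another thread is still writing.

```java
import java.util.Arrays;
import java.util.stream.Stream;

public class NBodyStreams {
    static final class Body {
        float x, y, z, m, vx, vy, vz;

        Body(float x, float y, float z, float m) {
            this.x = x; this.y = y; this.z = z; this.m = m;
        }

        // Reads other bodies' positions, writes only this body's velocity.
        void updateVelocity(Body[] bodies, float dt) {
            float ax = 0f, ay = 0f, az = 0f;
            for (Body o : bodies) {
                if (o == this) continue;
                float dx = o.x - x, dy = o.y - y, dz = o.z - z;
                float d2 = dx * dx + dy * dy + dz * dz + 1e-3f; // softening term (assumed)
                float s = o.m / (d2 * (float) Math.sqrt(d2));   // G taken as 1 (assumed)
                ax += dx * s; ay += dy * s; az += dz * s;
            }
            vx += ax * dt; vy += ay * dt; vz += az * dt;
        }

        void move(float dt) { x += vx * dt; y += vy * dt; z += vz * dt; }
    }

    /** One time step; the only difference between the two modes is the parallel() tag. */
    static void step(Body[] bodies, float dt, boolean parallel) {
        Stream<Body> stream = Arrays.stream(bodies);
        if (parallel) {
            stream = stream.parallel();
        }
        stream.forEach(b -> b.updateVelocity(bodies, dt)); // phase 1: velocities
        Arrays.stream(bodies).forEach(b -> b.move(dt));    // phase 2: positions
    }

    /** Deterministic test data (hypothetical layout). */
    static Body[] sample(int n) {
        Body[] bodies = new Body[n];
        for (int i = 0; i < n; i++) {
            bodies[i] = new Body(i, (i * 7) % 5, (i * 3) % 4, 1 + i % 3);
        }
        return bodies;
    }

    public static void main(String[] args) {
        Body[] bodies = sample(1000);
        for (int i = 0; i < 10; i++) step(bodies, 0.01f, true);
        System.out.println("body 0 is now at x=" + bodies[0].x);
    }
}
```

Since each body's velocity is accumulated in the same order whether or not the stream is parallel, the sequential and parallel runs should produce matching positions.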
30. JAVA DEMO
31. JAVA AND THE CLOUD
THE RIGHT LANGUAGE WITH ACCELERATION ON CLOUD APUS
Java 8 and Java 9 provide parallel acceleration
Parallel workloads are proliferating in the cloud
Hadoop framework for scale out
HSA APUs provide workload acceleration
DON’T MISS THE KEYNOTE
TOMORROW FROM ORACLE’S
NANDINI RAMANI
“THE ROLE OF JAVA™ IN HETEROGENEOUS
COMPUTING, AND HOW YOU CAN HELP”
33. ANNOUNCING AMD’S UNIFIED SDK
Access to AMD APU and GPU programmable
components
Component installer - choose just what you need
Initial release includes:
‒ APP SDK v2.9
‒ Media SDK 1.0 Beta
APP SDK 2.9
‒ Web-based sample browser
‒ Supports programming standards: OpenCL™, C++ AMP
‒ Code samples for accelerated open source libraries: OpenCV, OpenNI, Bolt, Aparapi
‒ OpenCL™ source editing plug-in for Visual Studio
‒ Now supports CMake
MEDIA SDK 1.0 BETA
‒ GPU accelerated video pre/post processing library
‒ Leverage AMD's media encode/decode acceleration blocks
‒ Library for low latency video encoding
‒ Supports both Windows Store and Classic desktop
34. ANNOUNCING AMD
V1.3
AMD’s comprehensive heterogeneous
developer tool suite including:
‒ CPU and GPU Profiling
‒ GPU kernel Debugging
‒ GPU kernel analysis
New features in version 1.3:
‒ Supports Java
‒ Integrated static kernel analysis
‒ Remote debugging/profiling
‒ Supports latest AMD APU and GPU products
CPU PROFILER
‒ Time-based profiling
‒ Analyze call-chain relationships
‒ Java profiling with inline function support
‒ Cache-line utilization profiling
‒ Supports latest AMD processors
GPU PROFILER
‒ OpenCL™ Application Trace
‒ Profile OpenCL kernels
‒ Timeline visualization of GPU counter data
‒ Kernel Occupancy Viewer
‒ Remote GPU Profiling
GPU DEBUGGER
‒ Real-time OpenCL kernel debugging with stepping and variable display
‒ OpenCL and OpenGL API Statistics
‒ Object visualization
‒ Remote GPU debugging
STATIC KERNEL ANALYZER
‒ Compile, analyze and disassemble OpenCL Kernels
‒ View kernel compilation errors/warnings
‒ Estimate kernel performance
‒ View generated ISA code
‒ View registers
35. OPEN SOURCE LIBRARIES ACCELERATED BY AMD
OpenCV
‒ Most popular computer vision library
‒ Now with many OpenCL™ accelerated functions
Bolt
‒ C++ template library
‒ Provides GPU off-load for common data-parallel algorithms
‒ Now with cross-OS support and improved performance/functionality
clMath
‒ AMD released APPML as open source to create clMath
‒ Accelerated BLAS and FFT libraries
‒ Accessible from Fortran, C and C++
Aparapi
‒ OpenCL™ accelerated Java 7
‒ Java APIs for data parallel algorithms (no need to learn OpenCL™)
36. AMD APUS, HSA – CLIENT TO THE CLOUD
A CONVERGENCE AT THE RIGHT TIME
Parallel workloads are booming
‒ Acceleration where the data is
‒ On the client for a snappy user experience
‒ In the cloud for scalable services
HSA enabled APUs in the cloud
‒ Big data analytics
‒ Video processing
‒ Science, imaging, genomics
‒ Unleashing the Java development community
Acceleration at all tiers of the cloud
‒ Data centers, media hubs, cloud periphery
37. A SPECIAL GUEST
Gary Campbell
Infrastructure Technology Strategy CTO
HP