Slide 1: HPAT.jl - Easy and Fast Big Data Analytics
Ehsan Totoni, Todd Anderson, Wajih Ul Hassan*, Tatiana Shpeisman
Parallel Computing Lab, Intel Labs
*Intern from UIUC
JuliaCon 2016
Slide 2: HPAT overview
High Performance Analytics Toolkit (HPAT): a compiler-based framework for big data analytics and machine learning
• Goal: efficient large-scale analytics without sacrificing programmer productivity
• Array-style programming
• High performance
• Built on ParallelAccelerator
• Domain-specific compiler heuristics for parallelization
• Use efficient HPC stack (e.g. MPI)
• Bridge the enormous productivity-performance gap
HPAT.jl
https://github.com/IntelLabs/HPAT.jl
Slide 3: Let's do Data Science
Logistic regression example:
function logistic_regression(iterations)
points = …some small data…
responses = …
D = size(points,1)
N = size(points,2)
labels = reshape(responses,1,N)
w = reshape(2*rand(D)-1,1,D)
for i in 1:iterations
w -= ((1./(1+exp(-labels.*(w*points)))-1).*labels)*points'
end
return w
end
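The slide targets Julia 0.4-era elementwise operators. A self-contained variant for current Julia, with synthetic stand-in data (the sizes and random labels are my additions), would run as:

# Runnable sketch of the slide's example on current Julia: dotted operators
# replace the 0.4-era elementwise ops; sizes and data are synthetic.
using Random

function logistic_regression_demo(iterations)
    D, N = 10, 1000
    points = rand(D, N)                  # D features x N samples
    responses = rand([-1.0, 1.0], N)     # random +/-1 labels
    labels = reshape(responses, 1, N)
    w = reshape(2 .* rand(D) .- 1, 1, D)
    for _ in 1:iterations
        w .-= ((1 ./ (1 .+ exp.(-labels .* (w * points))) .- 1) .* labels) * points'
    end
    return w
end

weights = logistic_regression_demo(100)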
Slide 4: What about large data?
• Challenges:
• Long execution time
• Data doesn’t fit in memory
• Solution:
• Parallelism: cluster or cloud
• How:
• MPI/C++ ("gold standard"), MPI/Julia, Spark (library)
• HPAT (“smart compiler”)
Slide 5: MPI/C++ ("gold standard")
• Message Passing Interface (MPI)
• Pros:
• Best performance!
• Cons:
• Need to understand parallelism, MPI, parallel I/O, C++
• High effort, tedious, error-prone, not readable…
Example code (not readable):
herr_t ret;
// set up file access property list with parallel I/O access
hid_t plist_id = H5Pcreate(H5P_FILE_ACCESS);
assert(plist_id != -1);
// set parallel access with communicator
ret = H5Pset_fapl_mpio(plist_id, comm, info);
assert(ret != -1);
// open file
file_id = H5Fopen("/lsf/lsf09/sptprice.hdf5", H5F_ACC_RDONLY, plist_id);
assert(file_id != -1);
ret = H5Pclose(plist_id);
assert(ret != -1);
// open dataset
dataset_id = H5Dopen2(file_id, "/sptprice", H5P_DEFAULT);
assert(dataset_id != -1);
hid_t space_id = H5Dget_space(dataset_id);
assert(space_id != -1);
int num_dims = 0;
num_dims = H5Sget_simple_extent_ndims(space_id);
assert(num_dims == 1);
/* get data dimension info */
hsize_t sptprice_size;
H5Sget_simple_extent_dims(space_id, &sptprice_size, NULL);
hsize_t my_sptprice_size = sptprice_size / mpi_nprocs;
hsize_t my_start = my_sptprice_size * mpi_rank;
my_sptprice = (double*)malloc(my_sptprice_size * sizeof(double));
// create a file dataspace independently
hid_t my_dataspace = H5Dget_space(dataset_id);
assert(my_dataspace != -1);
// stride and block are NULL for contiguous hyperslab
ret = H5Sselect_hyperslab(my_dataspace, H5S_SELECT_SET, &my_start, NULL, &my_sptprice_size, NULL);
assert(ret != -1);
...
Slide 6: MPI/Julia
• MPI.jl
• Pros:
• Less effort than C++
• Cons:
• Still need to understand parallelism, MPI
• Parallel I/O
• Needs high performance Julia
• Infrastructure challenges
Example code:
function main()
MPI.Init()
…
MPI.Allreduce!(send_arr, recv_arr, MPI.SUM,
MPI.COMM_WORLD)
…
MPI.Finalize()
end
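For completeness, a minimal self-contained MPI.jl program in the same Allreduce pattern; this is a sketch using the modern MPI.jl API (which postdates this talk), and the scalar partial sum stands in for the elided work:

# Minimal MPI.jl sketch: each rank computes a partial sum over its own
# block, then Allreduce combines the partials on every rank.
using MPI

function main()
    MPI.Init()
    comm = MPI.COMM_WORLD
    rank = MPI.Comm_rank(comm)
    nprocs = MPI.Comm_size(comm)
    chunk = div(1000, nprocs)
    lo = rank * chunk + 1
    hi = rank == nprocs - 1 ? 1000 : (rank + 1) * chunk
    local_sum = sum(lo:hi)                       # stand-in for real work
    total = MPI.Allreduce(local_sum, MPI.SUM, comm)
    rank == 0 && println("sum(1:1000) = ", total)
    MPI.Finalize()
end

main()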
Slide 7: Apache Spark
• Map/reduce library
• Master-executor model
• Pros:
• Easier than MPI/C++
• Lots of “system” features
• Cons:
• Development effort
• Very slow
• 100x slower than MPI/C++
• Host language overheads, loses locality, high library overheads, etc.
Example code (Python):
import numpy as np
from pyspark import SparkContext

if __name__ == "__main__":
sc = SparkContext(appName="PythonLR")
points = sc.textFile(file).mapPartitions(readPointBatch).cache()
w = 2 * np.random.ranf(size=D) - 1
def gradient(matrix, w):
Y = matrix[:, 0]
X = matrix[:, 1:]
return ((1.0 / (1.0 + np.exp(-Y * X.dot(w))) - 1.0) * Y * X.T).sum(1)
def add(x, y):
x += y
return x
for i in range(iterations):
w -= points.map(lambda m: gradient(m, w)).reduce(add)
Slide 8: HPAT (“smart compiler”)
• Julia code → MPI/C++ “gold standard”
• “Smart parallelizing compiler” doesn’t exist, but…
• Observations:
• Array code is implicitly parallel
• Parallelism is simple for data analytics
• map/reduce pattern
• 1D decomposition, allreduce communication
HPAT.jl
https://github.com/IntelLabs/HPAT.jl
[Figure: 1D decomposition: points and labels are split column-wise across ranks; w is replicated on every rank]
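A quick plain-Julia check (no MPI; sizes arbitrary) of why 1D decomposition plus allreduce works for this pattern: the gradient is a sum over sample columns, so per-block partial gradients summed together equal the serial result.

# Demonstration that column-block partial gradients sum to the full gradient,
# which is exactly what MPI_Allreduce exploits. All names here are toy.
D, N, nprocs = 4, 12, 3
points = rand(D, N)
labels = rand([-1.0, 1.0], 1, N)
w = rand(1, D)
grad(pts, lbl) = ((1 ./ (1 .+ exp.(-lbl .* (w * pts))) .- 1) .* lbl) * pts'
blocks = [(r * div(N, nprocs) + 1):((r + 1) * div(N, nprocs)) for r in 0:nprocs-1]
parts = [grad(points[:, b], labels[:, b]) for b in blocks]
@assert sum(parts) ≈ grad(points, labels)  # allreduce(partials) == serial gradient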
Slide 9: Logistic Regression (HPAT)
using HPAT
@acc hpat function logistic_regression(iterations, file)
points = DataSource(Matrix{Float64},HDF5,"/points", file)
responses = DataSource(Vector{Float64},HDF5,"/responses",file)
D = size(points,1)
N = size(points,2)
labels = reshape(responses,1,N)
w = reshape(2*rand(D)-1,1,D)
for i in 1:iterations
w -= ((1./(1+exp(-labels.*(w*points)))-1).*labels)*points'
end
return w
end
weights = logistic_regression(100, "mydata.hdf5")
$ mpirun -np 64 julia logistic_regression.jl
https://github.com/IntelLabs/HPAT.jl/blob/master/examples/logistic_regression.jl
Callouts: 95x speedup over Spark; the DataSource lines perform parallel I/O.
Slide 10: Logistic Regression (generated)
double* logistic_regression(int64_t iterations, char* file)
{
int mpi_rank , mpi_nprocs;
MPI_Comm_size(MPI_COMM_WORLD,&mpi_nprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&mpi_rank);
int mystart = mpi_rank * (N / mpi_nprocs);
int myend = (mpi_rank + 1) * (N / mpi_nprocs);
double *points = (double*)malloc((myend-mystart)*D*sizeof(double));
…
ret = H5Sselect_hyperslab(my_dataspace, H5S_SELECT_SET, …);
ret = H5Dread(dataset_id, H5T_NATIVE_FLOAT, …);
for(i=mystart ; i<myend ; i++) {
…
w_local = …
}
MPI_Allreduce(w_local, w, D, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
return w;
}
Annotations, top to bottom: initialization/partitioning/allocation; parallel I/O; computation; parameter synchronization.
“Bare-metal” MPI/C++ code with near-zero overhead!
Slide 11: HPAT usage
• Dependencies: ParallelAccelerator, Parallel HDF5, MPI
• “@acc hpat” function annotation
• Use matrix/vector operations, comprehensions
• ParallelAccelerator operations
• No “bad” for loops
• Column-major matrices
• All of program inside HPAT
• I/O using DataSource
HPAT.jl
https://github.com/IntelLabs/HPAT.jl
[Figure: 1D decomposition: points and labels split across ranks; w replicated]
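To make the style rule concrete, two hypothetical snippets: the first shape is what the pattern matching expects, the second defeats it (outer iteration loops such as `for i in 1:iterations` remain fine):

points = rand(4, 8)  # toy data

# Parallelizable style: whole-array ops and comprehensions over the data axis.
col_means = [sum(points[:, j]) / size(points, 1) for j in 1:size(points, 2)]
total = sum(points .^ 2)

# Style the analysis can miss: scalar accumulation with explicit indexing.
function manual_sum_of_squares(points)
    acc = 0.0
    for j in 1:size(points, 2), i in 1:size(points, 1)
        acc += points[i, j]^2
    end
    return acc
end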
Slide 12: HPAT limitations
• HPAT can fail to parallelize!
• Limitations in compiler analysis
• Needs good coding style
• Fallback: explicit map/reduce, @parfor code
• Only map/reduce parallel pattern supported
• Data analytics, machine learning, optimization etc.
• Others like stencils (PDEs) not supported yet
• No sparse matrices yet
• HDF5 and text file format
HPAT.jl
https://github.com/IntelLabs/HPAT.jl
Slide 13: Array Distribution Inference
Init(state) # assume all arrays are partitioned
while isChanged(state)
  for node in nodes(state)
    inferArrayDistribution(state, node)
  end
end
function inferArrayDistribution(state, node)
if isAssignment(node)
# lhs and rhs are sequential if either is sequential
seq = isSeq(state, lhs) || isSeq(state, rhs)
isSeq(state, lhs) = isSeq(state, rhs) = seq
elseif isGEMM(node)
# e.g. w = labels*points' - shared parameter synchronization heuristic
isSeq(lhs) = !isSeq(in1) && !isSeq(in2) && !isTransposed(in1) && isTransposed(in2)
elseif isHPATcall(node)
handleHPATcall(node)
else
# unknown call, assume all arrays are sequential
isSeq(state, nodeArrs) = true
end
end
[Figure: inference result for the example: points and labels partitioned; w replicated]
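As a sanity check on the inference loop above, a toy runnable rendering of the fixed point; the Dict-of-symbols "IR" is hypothetical, real HPAT walks the typed Julia AST:

# Toy distribution inference: every array starts partitioned (seq = false);
# assignments force lhs/rhs to agree, and seeds (e.g. results of unknown
# calls) start out sequential. Iterate until nothing changes.
function infer_distribution(assigns::Vector{Tuple{Symbol,Symbol}}, seq_seeds::Vector{Symbol})
    isseq = Dict{Symbol,Bool}()
    for (a, b) in assigns
        get!(isseq, a, false); get!(isseq, b, false)
    end
    for s in seq_seeds
        isseq[s] = true
    end
    changed = true
    while changed
        changed = false
        for (lhs, rhs) in assigns
            seq = isseq[lhs] || isseq[rhs]
            if isseq[lhs] != seq || isseq[rhs] != seq
                isseq[lhs] = seq; isseq[rhs] = seq
                changed = true
            end
        end
    end
    return isseq
end

# :c is sequential, so sequential-ness propagates through :b to :a:
infer_distribution([(:a, :b), (:b, :c)], [:c])  # Dict(:a=>true, :b=>true, :c=>true)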
Slide 14: HPAT Compiler Pipeline
Julia source → Macro-Pass → Domain-Pass → Julia Compiler (type inference) → Domain-IR → Parallel-IR → Distributed-Pass → HPAT Code Generation (MPI) → CGen → MPI/C++ source → Backend Compiler (ICC/GCC)
Slide 15: HPAT Compiler Pipeline (passes)
• MacroPass
• “desugar” extensions like DataSource
• DomainPass
• Generate variables, allocations, and function calls
• Enable ParallelAccelerator pipeline
• DistributedPass
• Infer partitioned vs. sequential arrays
• Divide allocations, parallelize I/O, and computation
• “hook” distributed-memory libraries
• Backend code generation
• MPI code extension for CGen
Slide 16: HPAT vs. Spark
Execution time in seconds (log-scale chart in the original; Cori at NERSC/LBL, 64 nodes, 2048 cores):

Benchmark           | Spark (s) | HPAT (s) | Speedup
1D Sum              |        47 |     1.55 |     30x
1D Sum Filter       |        43 |     0.69 |     62x
Monte Carlo Pi      |        84 |     0.05 |   1680x
Logistic Regression |      1061 |    11.13 |     95x
K-Means             |       182 |     7.95 |     23x
Slide 17: Strong Scaling - Logistic Regression
Execution time in seconds, Spark vs. HPAT:

Nodes (Cores) | Spark (s) | HPAT (s) | Speedup
64 (2k)       |      1061 |    11.13 |     95x
128 (4k)      |       800 |     5.89 |   ~136x
256 (8k)      |       634 |     4.11 |    154x
Slide 18: Machine Learning Libraries - Intel® DAAL as backend
Generated DAAL driver code (not readable):
...
kmeans::init::Distributed<step1Local,double,kmeans::init::randomDense> localInit(nClusters, nBlocks * nVectorsInBlock, rankId * nVectorsInBlock);
localInit.input.set(kmeans::init::data, dataSource.getNumericTable());
localInit.compute();
services::SharedPtr<byte> serializedData;
InputDataArchive dataArch;
localInit.getPartialResult()->serialize( dataArch );
size_t perNodeArchLength = dataArch.getSizeOfArchive();
if (rankId == mpi_root) {
serializedData = services::SharedPtr<byte>( new byte[ perNodeArchLength * nBlocks ] );
}
byte *nodeResults = new byte[ perNodeArchLength ];
dataArch.copyArchiveToArray( nodeResults, perNodeArchLength );
MPI_Gather( nodeResults, perNodeArchLength, MPI_CHAR, serializedData.get(), perNodeArchLength, MPI_CHAR, mpi_root, MPI_COMM_WORLD);
….
HPAT equivalent:
using HPAT
@acc hpat function calcKmeans(k, file)
points = DataSource(Matrix{Float64},HDF5,"/points", file)
clusters = HPAT.Kmeans(points, k)
return clusters
end
Slide 19: Building HPAT in Julia
• Compiler development experience
• The good:
• Built-in type inference
• Introspection/full control
• The bad:
• No detailed AST definition
• Julia compiler surprises
• Long HPAT/ParallelAccelerator compilation time!
• Type inference every time
Slide 20: Ongoing HPAT development
• Structured data processing without SQL!
• Complex analytics all in array syntax
• Inspired by TPCx-BB examples
• Array syntax for table operations
• Join, filter, aggregate
• Similar to DataFrames.jl
• Interesting compiler challenges
• Optimize general AST instead of SQL trees
• Other use cases
• 2D decomposition
Example:
customer_i_class =
aggregate(sale_items, :ss_customer_sk,
:ss_item_count = length(:ss_item_sk),
:id1 = sum(:i_class_id==1),
:id2 = sum(:i_class_id==2),
:id3 = sum(:i_class_id==3),
:id4 = sum(:i_class_id==4))
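If the HPAT aggregate syntax above is unfamiliar, roughly the same aggregation in DataFrames.jl's current API (which postdates this talk) looks as follows; column names are taken from the slide, and `sale_items` is assumed to be a DataFrame:

# Approximate DataFrames.jl equivalent of the aggregate(...) above.
using DataFrames

customer_i_class = combine(groupby(sale_items, :ss_customer_sk),
    :ss_item_sk => length               => :ss_item_count,
    :i_class_id => (c -> sum(c .== 1))  => :id1,
    :i_class_id => (c -> sum(c .== 2))  => :id2,
    :i_class_id => (c -> sum(c .== 3))  => :id3,
    :i_class_id => (c -> sum(c .== 4))  => :id4)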
Slide 21: Summary
• High Performance Analytics Toolkit (HPAT) provides scripting abstractions and “bare-metal” performance
• Matrix/vector operations, extension for Parallel I/O
• Domain-specific compiler techniques
• Generates efficient MPI/C++
• Uses existing HPC libraries
• Much easier and faster than alternatives
• Get involved!
• Any contributions welcome
• Need more and more use cases
HPAT.jl
https://github.com/IntelLabs/HPAT.jl
Slide 22: Backup
Slide 23: Big Data Analytics is Slow
• Goal: gain new insight from large datasets by domain experts
• Productivity is 1st priority
• Scripting languages most common, fast development
• MPI/C++ not acceptable
• Apache Hadoop and Spark dominant
• Intuitive user interface (MapReduce, Python)
• Master-executer library approach
• Library approach is slow
• Loses locality, high overheads
• Orders of magnitude slower than handwritten MPI/C++
http://hadoop.apache.org/
F. McSherry et al., "Scalability! But at what COST?", HotOS 2015.
K. Brown et al., "Have Abstraction and Eat Performance, Too: Optimized Heterogeneous Computing with Parallel Patterns", CGO 2016.
Slide 25: DistributedPass: parallelization
• 1D partitioning of data across nodes
• Domain-specific heuristic for machine learning, analytics
• Not valid for other domains!
• Column partitioning for matrices
• Julia is column major
• Needs good coding style by user
• 1D partitioning of parfor iterations per node
• Handle distributed-memory libraries
• Generate necessary input/output transformations
• Intel® Data Analytics Acceleration Library (Intel® DAAL)
• Machine learning algorithms similar to Spark’s MLlib
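A hedged sketch of the parfor iteration split. Note that the generated code shown earlier computes mystart/myend as rank*(N/nprocs), which silently drops trailing iterations when N is not divisible by the rank count; a robust block split gives the remainder to the last rank. The function name is my own:

# 1D block split of parfor iterations across ranks; the last rank absorbs
# the remainder so every iteration is covered.
function parfor_range(N::Int, nprocs::Int, rank::Int)  # rank is 0-based
    chunk = div(N, nprocs)
    lo = rank * chunk + 1
    hi = rank == nprocs - 1 ? N : (rank + 1) * chunk
    return lo:hi
end

@assert sum(length(parfor_range(10, 3, r)) for r in 0:2) == 10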
Slide 26: Example: Distributed Data Source Transformation Flow
Julia source:
points = DataSource(Matrix{Float64},HDF5,"/points","data.hdf5")
After Macro-Pass (enables Julia type inference):
points::Matrix{Float64} = HPAT_h5_source("/points","data.hdf5")
After Domain-Pass (enables ParallelAccelerator):
h5_size1, h5_size2 = HPAT_h5_sizes("/points","data.hdf5")
points = alloc(h5_size1, h5_size2)
HPAT_h5_read(points, "/points","data.hdf5")
After Distributed-Pass (parallelizes):
points = alloc(h5_size1, h5size_2/nprocs)
HPAT_h5_read(points,"/points","data.hdf5",h5_size1,h5size_2/nprocs)
HPAT-CGen backend code:
H5Sselect_hyperslab(h5_size1, h5size_2/nprocs,…)
H5Dread(points,"/points","data.hdf5",…)
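For a flavor of what the Macro-Pass stage does mechanically, a hedged sketch using MacroTools (an illustrative choice, not necessarily what HPAT uses); `HPAT_h5_source` is the name from the flow above:

# Illustrative desugaring: rewrite DataSource(Matrix{T}, HDF5, path, file)
# into a typed call that Julia's type inference can propagate.
using MacroTools: postwalk, @capture

function desugar_datasource(ex::Expr)
    postwalk(ex) do node
        if @capture(node, DataSource(Matrix{T_}, HDF5, path_, file_))
            return :(HPAT_h5_source($path, $file)::Matrix{$T})
        end
        return node
    end
end

ex = :(points = DataSource(Matrix{Float64}, HDF5, "/points", "data.hdf5"))
desugar_datasource(ex)  # :(points = HPAT_h5_source("/points", "data.hdf5")::Matrix{Float64})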
Slide 27: Backend Code Complexity
• Parallel I/O example (HDF5, MPI/C++):
herr_t ret;
// set up file access property list with parallel I/O access
hid_t plist_id = H5Pcreate(H5P_FILE_ACCESS);
assert(plist_id != -1);
// set parallel access with communicator
ret = H5Pset_fapl_mpio(plist_id, comm, info);
assert(ret != -1);
// open file
file_id = H5Fopen("/lsf/lsf09/sptprice.hdf5", H5F_ACC_RDONLY, plist_id);
assert(file_id != -1);
ret=H5Pclose(plist_id);
assert(ret != -1);
// open dataset
dataset_id = H5Dopen2(file_id, "/sptprice", H5P_DEFAULT);
assert(dataset_id != -1);
hid_t space_id = H5Dget_space(dataset_id);
assert(space_id != -1);
int num_dims = 0;
num_dims = H5Sget_simple_extent_ndims(space_id);
assert(num_dims==1);
/* get data dimension info */
hsize_t sptprice_size;
H5Sget_simple_extent_dims(space_id, &sptprice_size, NULL);
hsize_t my_sptprice_size = sptprice_size/mpi_nprocs;
hsize_t my_start = my_sptprice_size*mpi_rank;
my_sptprice = (double*)malloc(my_sptprice_size*sizeof(double));
// create a file dataspace independently
hid_t my_dataspace = H5Dget_space(dataset_id);
assert(my_dataspace != -1);
// stride and block are NULL for contiguous hyperslab
ret=H5Sselect_hyperslab(my_dataspace, H5S_SELECT_SET, &my_start, NULL, &my_sptprice_size, NULL);
assert(ret != -1);
/* create a memory dataspace independently */
hid_t mem_dataspace = H5Screate_simple (1, &my_sptprice_size, NULL);
assert (mem_dataspace != -1);
/* set up the collective transfer properties list */
hid_t xfer_plist = H5Pcreate (H5P_DATASET_XFER);
assert(xfer_plist != -1);
ret = H5Pset_dxpl_mpio(xfer_plist, H5FD_MPIO_COLLECTIVE);
assert(ret != -1);
/* read data collectively */
ret = H5Dread(dataset_id, H5T_NATIVE_FLOAT, mem_dataspace, my_dataspace, xfer_plist, my_sptprice);
assert(ret != -1);
...
Hard to write manually!
Slide 28: Libraries
Execution time in seconds, Spark MLlib vs. HPAT with the Intel DAAL backend:

Benchmark               | Spark-MLlib (s) | HPAT-DAAL (s) | Speedup
K-Means (lib)           |              59 |            21 |    2.8x
Linear Regression (lib) |             176 |            26 |    6.7x
Naïve Bayes (lib)       |              43 |            23 |    1.9x
Slide 29: Benchmarks
• Cori at LBL/NERSC
• Dual Haswell nodes
• Cray Aries (Dragonfly) network
• 64 nodes (2048 cores) used
• Spark 1.6.0 (default Cori installation)
• Benchmarks
• 1D_sum: sums 8.5 billion element vector from file
• Pi: 1 billion random points
• Logistic regression: 2 billion samples, 10 features, single precision
• K-Means: 320 million samples, 20 features, double precision; 10 iterations, 5 centers
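Back-of-envelope input sizes for these runs (my arithmetic from the stated shapes; the 1D-sum element type is not stated, so Float64 is assumed):

# Rough dataset sizes, to show why these inputs call for distribution.
for (name, bytes) in [("Logistic regression (2e9 x 10, Float32)", 2e9 * 10 * 4),
                      ("K-Means (320e6 x 20, Float64)",           320e6 * 20 * 8),
                      ("1D sum (8.5e9, assuming Float64)",        8.5e9 * 8)]
    println(rpad(name, 42), "~", round(bytes / 2^30; digits = 1), " GiB")
end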
Slide 30: Logistic Regression (Spark)
import numpy as np
from pyspark import SparkContext

D = 10 # Number of dimensions
if __name__ == "__main__":
sc = SparkContext(appName="PythonLR")
points = sc.textFile(file).mapPartitions(readPointBatch).cache()
w = 2 * np.random.ranf(size=D) - 1
def gradient(matrix, w):
Y = matrix[:, 0] # point labels (first column of input file)
X = matrix[:, 1:] # point coordinates
return ((1.0 / (1.0 + np.exp(-Y * X.dot(w))) - 1.0) * Y * X.T).sum(1)
def add(x, y):
x += y
return x
for i in range(iterations):
w -= points.map(lambda m: gradient(m, w)).reduce(add)
https://github.com/apache/spark/blob/master/examples/src/main/python/logistic_regression.py
Callouts: scheduling, TCP/IP overheads; Python overheads.
Slide 31: Logistic Regression (HPAT)
using HPAT
@acc hpat function logistic_regression(iterations, file)
points = DataSource(Matrix{Float64},HDF5,"/points", file)
responses = DataSource(Vector{Float64},HDF5,"/responses",file)
D = size(points,1)
N = size(points,2)
labels = reshape(responses,1,N)
w = reshape(2*rand(D)-1,1,D)
for i in 1:iterations
w -= ((1./(1+exp(-labels.*(w*points)))-1).*labels)*points'
end
return w
end
weights = logistic_regression(100, "mydata.hdf5")
$ mpirun -np 64 julia logistic_regression.jl
https://github.com/IntelLabs/HPAT.jl/blob/master/examples/logistic_regression.jl
95x speedup!
Slide 32: Monte Carlo Pi (Spark)
from random import random
from pyspark import SparkContext
if __name__ == "__main__":
sc = SparkContext(appName="PythonPi")
n = 100000 * partitions
def f(_):
x = random() * 2 - 1
y = random() * 2 - 1
return 1 if x ** 2 + y ** 2 < 1 else 0
def add(x, y):
x += y
return x
count = sc.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
sc.stop()
https://github.com/apache/spark/blob/master/examples/src/main/python/pi.py
Callouts: scheduling overheads; the parallelized range is an extra array; map and reduce stages.
Slide 33: Monte Carlo Pi (HPAT)
Generated MPI/C++:
double calcPi(int64_t N) {
int mpi_rank , mpi_nprocs;
MPI_Comm_size(MPI_COMM_WORLD,&mpi_nprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&mpi_rank);
int mystart = mpi_rank * (N / mpi_nprocs);
int myend = (mpi_rank + 1) * (N / mpi_nprocs);
for(i=mystart ; i<myend ; i++) {
x = rand(..);
y = rand(..);
sum_local += …
}
MPI_Reduce(…);
return out;
}
HPAT source:
using HPAT
@acc hpat function calcPi(n)
x = rand(n) .* 2.0 .- 1.0
y = rand(n) .* 2.0 .- 1.0
return 4.0*sum(x.^2 .+ y.^2 .< 1.0)/n
end
myPi = calcPi(10^9)
$ mpirun -np 64 julia pi.jl
https://github.com/IntelLabs/HPAT.jl/blob/master/examples/pi.jl
Callouts: 1600x speedup! Computation is done in registers.
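What "computation is done in registers" means here, as a scalar Julia sketch (my reconstruction of the fusion, not HPAT output): the map over x and y and the sum-reduction fuse into one loop, so no length-n arrays are ever allocated.

# Scalar form of the fused map+reduce each rank runs: samples are generated
# and consumed one at a time, so x, y, and the test stay in registers.
function calc_pi_fused(n::Int)
    hits = 0
    for _ in 1:n
        x = rand() * 2 - 1
        y = rand() * 2 - 1
        hits += Int(x^2 + y^2 < 1)
    end
    return 4 * hits / n
end

calc_pi_fused(10^7)  # ≈ 3.1415…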
Editor's Notes (per slide):
  1. ParallelAccelerator acknowledgement
  2. Data analytics and similar domains
  3. Readable code, close to math formula
  4. National lab supercomputers
  5. Decades of parallel compiler research
  6. DataSource for parallel-io
  7. Partitioning data and computation
  8. Make sure dependencies are installed properly. Iteration loops are OK.
  9. Lots of effort to make compiler robust
  10. Secret sauce
  11. Same as ParallelAccelerator First class arrays
  12. LBL use case