Slide 1: HPAT.jl - Easy and Fast Big Data Analytics
Ehsan Totoni, Todd Anderson, Wajih Ul Hassan*, Tatiana Shpeisman
Parallel Computing Lab, Intel Labs
*Intern from UIUC
JuliaCon 2016
Slide 2: HPAT overview
High Performance Analytics Toolkit (HPAT): a compiler-based framework for big data analytics and machine learning
• Goal: efficient large-scale analytics without sacrificing programmer productivity
• Array-style programming
• High performance
• Built on ParallelAccelerator
• Domain-specific compiler heuristics for parallelization
• Use efficient HPC stack (e.g. MPI)
• Bridge the enormous productivity-performance gap
HPAT.jl
https://github.com/IntelLabs/HPAT.jl
Slide 3: Let's do Data Science
Logistic regression example:
function logistic_regression(iterations)
points = …some small data…
responses = …
D = size(points,1)
N = size(points,2)
labels = reshape(responses,1,N)
w = reshape(2*rand(D)-1,1,D)
for i in 1:iterations
w -= ((1./(1+exp(-labels.*(w*points)))-1).*labels)*points'
end
return w
end
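The slide targets Julia 0.4-era elementwise operators. A self-contained variant for current Julia, with synthetic stand-in data (the sizes and random labels are my additions), would run as:

# Runnable sketch of the slide's example on current Julia: dotted operators
# replace the 0.4-era elementwise ops; sizes and data are synthetic.
using Random

function logistic_regression_demo(iterations)
    D, N = 10, 1000
    points = rand(D, N)                  # D features x N samples
    responses = rand([-1.0, 1.0], N)     # random +/-1 labels
    labels = reshape(responses, 1, N)
    w = reshape(2 .* rand(D) .- 1, 1, D)
    for _ in 1:iterations
        w .-= ((1 ./ (1 .+ exp.(-labels .* (w * points))) .- 1) .* labels) * points'
    end
    return w
end

weights = logistic_regression_demo(100)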
Slide 4: What about large data?
• Challenges:
• Long execution time
• Data doesn’t fit in memory
• Solution:
• Parallelism: cluster or cloud
• How:
• MPI/C++ ("gold standard"), MPI/Julia, Spark (library)
• HPAT (“smart compiler”)
Slide 5: MPI/C++ ("gold standard")
• Message Passing Interface (MPI)
• Pros:
• Best performance!
• Cons:
• Need to understand parallelism, MPI, parallel I/O, C++
• High effort, tedious, error-prone, not readable…
Example code (not readable):
herr_t ret;
// set up file access property list with parallel I/O access
hid_t plist_id = H5Pcreate(H5P_FILE_ACCESS);
assert(plist_id != -1);
// set parallel access with communicator
ret = H5Pset_fapl_mpio(plist_id, comm, info);
assert(ret != -1);
// open file
file_id = H5Fopen("/lsf/lsf09/sptprice.hdf5", H5F_ACC_RDONLY, plist_id);
assert(file_id != -1);
ret = H5Pclose(plist_id);
assert(ret != -1);
// open dataset
dataset_id = H5Dopen2(file_id, "/sptprice", H5P_DEFAULT);
assert(dataset_id != -1);
hid_t space_id = H5Dget_space(dataset_id);
assert(space_id != -1);
int num_dims = 0;
num_dims = H5Sget_simple_extent_ndims(space_id);
assert(num_dims == 1);
/* get data dimension info */
hsize_t sptprice_size;
H5Sget_simple_extent_dims(space_id, &sptprice_size, NULL);
hsize_t my_sptprice_size = sptprice_size / mpi_nprocs;
hsize_t my_start = my_sptprice_size * mpi_rank;
my_sptprice = (double*)malloc(my_sptprice_size * sizeof(double));
// create a file dataspace independently
hid_t my_dataspace = H5Dget_space(dataset_id);
assert(my_dataspace != -1);
// stride and block are NULL for contiguous hyperslab
ret = H5Sselect_hyperslab(my_dataspace, H5S_SELECT_SET, &my_start, NULL, &my_sptprice_size, NULL);
assert(ret != -1);
...
Slide 6: MPI/Julia
• MPI.jl
• Pros:
• Less effort than C++
• Cons:
• Still need to understand parallelism, MPI
• Parallel I/O
• Needs high performance Julia
• Infrastructure challenges
Example code:
function main()
MPI.Init()
…
MPI.Allreduce!(send_arr, recv_arr, MPI.SUM,
MPI.COMM_WORLD)
…
MPI.Finalize()
end
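For completeness, a minimal self-contained MPI.jl program in the same Allreduce pattern; this is a sketch using the modern MPI.jl API (which postdates this talk), and the scalar partial sum stands in for the elided work:

# Minimal MPI.jl sketch: each rank computes a partial sum over its own
# block, then Allreduce combines the partials on every rank.
using MPI

function main()
    MPI.Init()
    comm = MPI.COMM_WORLD
    rank = MPI.Comm_rank(comm)
    nprocs = MPI.Comm_size(comm)
    chunk = div(1000, nprocs)
    lo = rank * chunk + 1
    hi = rank == nprocs - 1 ? 1000 : (rank + 1) * chunk
    local_sum = sum(lo:hi)                       # stand-in for real work
    total = MPI.Allreduce(local_sum, MPI.SUM, comm)
    rank == 0 && println("sum(1:1000) = ", total)
    MPI.Finalize()
end

main()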
Slide 7: Apache Spark
• Map/reduce library
• Master-executor model
• Pros:
• Easier than MPI/C++
• Lots of “system” features
• Cons:
• Development effort
• Very slow
• 100x slower than MPI/C++
• Host language overheads, loses locality, high library overheads, etc.
Example code (Python):
import numpy as np
from pyspark import SparkContext

if __name__ == "__main__":
sc = SparkContext(appName="PythonLR")
points = sc.textFile(file).mapPartitions(readPointBatch).cache()
w = 2 * np.random.ranf(size=D) - 1
def gradient(matrix, w):
Y = matrix[:, 0]
X = matrix[:, 1:]
return ((1.0 / (1.0 + np.exp(-Y * X.dot(w))) - 1.0) * Y * X.T).sum(1)
def add(x, y):
x += y
return x
for i in range(iterations):
w -= points.map(lambda m: gradient(m, w)).reduce(add)
Slide 8: HPAT (“smart compiler”)
• Julia code → MPI/C++ “gold standard”
• “Smart parallelizing compiler” doesn’t exist, but…
• Observations:
• Array code is implicitly parallel
• Parallelism is simple for data analytics
• map/reduce pattern
• 1D decomposition, allreduce communication
HPAT.jl
https://github.com/IntelLabs/HPAT.jl
[Figure: 1D decomposition: points and labels are split column-wise across ranks; w is replicated on every rank]
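A quick plain-Julia check (no MPI; sizes arbitrary) of why 1D decomposition plus allreduce works for this pattern: the gradient is a sum over sample columns, so per-block partial gradients summed together equal the serial result.

# Demonstration that column-block partial gradients sum to the full gradient,
# which is exactly what MPI_Allreduce exploits. All names here are toy.
D, N, nprocs = 4, 12, 3
points = rand(D, N)
labels = rand([-1.0, 1.0], 1, N)
w = rand(1, D)
grad(pts, lbl) = ((1 ./ (1 .+ exp.(-lbl .* (w * pts))) .- 1) .* lbl) * pts'
blocks = [(r * div(N, nprocs) + 1):((r + 1) * div(N, nprocs)) for r in 0:nprocs-1]
parts = [grad(points[:, b], labels[:, b]) for b in blocks]
@assert sum(parts) ≈ grad(points, labels)  # allreduce(partials) == serial gradient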
Slide 9: Logistic Regression (HPAT)
using HPAT
@acc hpat function logistic_regression(iterations, file)
points = DataSource(Matrix{Float64},HDF5,"/points", file)
responses = DataSource(Vector{Float64},HDF5,"/responses",file)
D = size(points,1)
N = size(points,2)
labels = reshape(responses,1,N)
w = reshape(2*rand(D)-1,1,D)
for i in 1:iterations
w -= ((1./(1+exp(-labels.*(w*points)))-1).*labels)*points'
end
return w
end
weights = logistic_regression(100, "mydata.hdf5")
$ mpirun -np 64 julia logistic_regression.jl
https://github.com/IntelLabs/HPAT.jl/blob/master/examples/logistic_regression.jl
Callouts: 95x speedup over Spark; the DataSource lines perform parallel I/O.
Slide 10: Logistic Regression (generated)
double* logistic_regression(int64_t iterations, char* file)
{
int mpi_rank , mpi_nprocs;
MPI_Comm_size(MPI_COMM_WORLD,&mpi_nprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&mpi_rank);
int mystart = mpi_rank * (N / mpi_nprocs);
int myend = (mpi_rank + 1) * (N / mpi_nprocs);
double *points = (double*)malloc((myend-mystart)*D*sizeof(double));
…
ret = H5Sselect_hyperslab(my_dataspace, H5S_SELECT_SET, …);
ret = H5Dread(dataset_id, H5T_NATIVE_FLOAT, …);
for(i=mystart ; i<myend ; i++) {
…
w_local = …
}
MPI_Allreduce(w_local, w, D, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
return w;
}
Annotations, top to bottom: initialization/partitioning/allocation; parallel I/O; computation; parameter synchronization.
“Bare-metal” MPI/C++ code with near-zero overhead!
Slide 11: HPAT usage
• Dependencies: ParallelAccelerator, Parallel HDF5, MPI
• “@acc hpat” function annotation
• Use matrix/vector operations, comprehensions
• ParallelAccelerator operations
• No “bad” for loops
• Column-major matrices
• All of program inside HPAT
• I/O using DataSource
HPAT.jl
https://github.com/IntelLabs/HPAT.jl
[Figure: 1D decomposition: points and labels split across ranks; w replicated]
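To make the style rule concrete, two hypothetical snippets: the first shape is what the pattern matching expects, the second defeats it (outer iteration loops such as `for i in 1:iterations` remain fine):

points = rand(4, 8)  # toy data

# Parallelizable style: whole-array ops and comprehensions over the data axis.
col_means = [sum(points[:, j]) / size(points, 1) for j in 1:size(points, 2)]
total = sum(points .^ 2)

# Style the analysis can miss: scalar accumulation with explicit indexing.
function manual_sum_of_squares(points)
    acc = 0.0
    for j in 1:size(points, 2), i in 1:size(points, 1)
        acc += points[i, j]^2
    end
    return acc
end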
Slide 12: HPAT limitations
• HPAT can fail to parallelize!
• Limitations in compiler analysis
• Needs good coding style
• Fallback: explicit map/reduce, @parfor code
• Only map/reduce parallel pattern supported
• Data analytics, machine learning, optimization etc.
• Others like stencils (PDEs) not supported yet
• No sparse matrices yet
• HDF5 and text file format
HPAT.jl
https://github.com/IntelLabs/HPAT.jl
Slide 13: Array Distribution Inference
Init(state) # assume all arrays are partitioned
while isChanged(state)
  for node in nodes(state)
    inferArrayDistribution(state, node)
  end
end
function inferArrayDistribution(state, node)
if isAssignment(node)
# lhs and rhs are sequential if either is sequential
seq = isSeq(state, lhs) || isSeq(state, rhs)
isSeq(state, lhs) = isSeq(state, rhs) = seq
elseif isGEMM(node)
# e.g. w = labels*points' - shared parameter synchronization heuristic
isSeq(lhs) = !isSeq(in1) && !isSeq(in2) && !isTransposed(in1) && isTransposed(in2)
elseif isHPATcall(node)
handleHPATcall(node)
else
# unknown call, assume all arrays are sequential
isSeq(state, nodeArrs) = true
end
end
[Figure: inference result for the example: points and labels partitioned; w replicated]
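As a sanity check on the inference loop above, a toy runnable rendering of the fixed point; the Dict-of-symbols "IR" is hypothetical, real HPAT walks the typed Julia AST:

# Toy distribution inference: every array starts partitioned (seq = false);
# assignments force lhs/rhs to agree, and seeds (e.g. results of unknown
# calls) start out sequential. Iterate until nothing changes.
function infer_distribution(assigns::Vector{Tuple{Symbol,Symbol}}, seq_seeds::Vector{Symbol})
    isseq = Dict{Symbol,Bool}()
    for (a, b) in assigns
        get!(isseq, a, false); get!(isseq, b, false)
    end
    for s in seq_seeds
        isseq[s] = true
    end
    changed = true
    while changed
        changed = false
        for (lhs, rhs) in assigns
            seq = isseq[lhs] || isseq[rhs]
            if isseq[lhs] != seq || isseq[rhs] != seq
                isseq[lhs] = seq; isseq[rhs] = seq
                changed = true
            end
        end
    end
    return isseq
end

# :c is sequential, so sequential-ness propagates through :b to :a:
infer_distribution([(:a, :b), (:b, :c)], [:c])  # Dict(:a=>true, :b=>true, :c=>true)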
Slide 14: HPAT Compiler Pipeline
Julia source → Macro-Pass → Domain-Pass → Julia Compiler (type inference) → Domain-IR → Parallel-IR → Distributed-Pass → HPAT Code Generation (MPI) → CGen → MPI/C++ source → Backend Compiler (ICC/GCC)
Slide 15: HPAT Compiler Pipeline (passes)
• MacroPass
• “desugar” extensions like DataSource
• DomainPass
• Generate variables, allocations, and function calls
• Enable ParallelAccelerator pipeline
• DistributedPass
• Infer partitioned vs. sequential arrays
• Divide allocations, parallelize I/O, and computation
• “hook” distributed-memory libraries
• Backend code generation
• MPI code extension for CGen
Slide 16: HPAT vs. Spark
Execution time in seconds (log-scale chart in the original; Cori at NERSC/LBL, 64 nodes, 2048 cores):

Benchmark           | Spark (s) | HPAT (s) | Speedup
1D Sum              |        47 |     1.55 |     30x
1D Sum Filter       |        43 |     0.69 |     62x
Monte Carlo Pi      |        84 |     0.05 |   1680x
Logistic Regression |      1061 |    11.13 |     95x
K-Means             |       182 |     7.95 |     23x
Slide 17: Strong Scaling - Logistic Regression
Execution time in seconds, Spark vs. HPAT:

Nodes (Cores) | Spark (s) | HPAT (s) | Speedup
64 (2k)       |      1061 |    11.13 |     95x
128 (4k)      |       800 |     5.89 |   ~136x
256 (8k)      |       634 |     4.11 |    154x
Slide 18: Machine Learning Libraries - Intel® DAAL as backend
Generated DAAL driver code (not readable):
...
kmeans::init::Distributed<step1Local,double,kmeans::init::randomDense> localInit(nClusters, nBlocks * nVectorsInBlock, rankId * nVectorsInBlock);
localInit.input.set(kmeans::init::data, dataSource.getNumericTable());
localInit.compute();
services::SharedPtr<byte> serializedData;
InputDataArchive dataArch;
localInit.getPartialResult()->serialize( dataArch );
size_t perNodeArchLength = dataArch.getSizeOfArchive();
if (rankId == mpi_root) {
serializedData = services::SharedPtr<byte>( new byte[ perNodeArchLength * nBlocks ] );
}
byte *nodeResults = new byte[ perNodeArchLength ];
dataArch.copyArchiveToArray( nodeResults, perNodeArchLength );
MPI_Gather( nodeResults, perNodeArchLength, MPI_CHAR, serializedData.get(), perNodeArchLength, MPI_CHAR, mpi_root, MPI_COMM_WORLD);
….
HPAT equivalent:
using HPAT
@acc hpat function calcKmeans(k, file)
points = DataSource(Matrix{Float64},HDF5,"/points", file)
clusters = HPAT.Kmeans(points, k)
return clusters
end
Slide 19: Building HPAT in Julia
• Compiler development experience
• The good:
• Built-in type inference
• Introspection/full control
• The bad:
• No detailed AST definition
• Julia compiler surprises
• Long HPAT/ParallelAccelerator compilation time!
• Type inference every time
Slide 20: Ongoing HPAT development
• Structured data processing without SQL!
• Complex analytics all in array syntax
• Inspired by TPCx-BB examples
• Array syntax for table operations
• Join, filter, aggregate
• Similar to DataFrames.jl
• Interesting compiler challenges
• Optimize general AST instead of SQL trees
• Other use cases
• 2D decomposition
Example:
customer_i_class =
aggregate(sale_items, :ss_customer_sk,
:ss_item_count = length(:ss_item_sk),
:id1 = sum(:i_class_id==1),
:id2 = sum(:i_class_id==2),
:id3 = sum(:i_class_id==3),
:id4 = sum(:i_class_id==4))
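If the HPAT aggregate syntax above is unfamiliar, roughly the same aggregation in DataFrames.jl's current API (which postdates this talk) looks as follows; column names are taken from the slide, and `sale_items` is assumed to be a DataFrame:

# Approximate DataFrames.jl equivalent of the aggregate(...) above.
using DataFrames

customer_i_class = combine(groupby(sale_items, :ss_customer_sk),
    :ss_item_sk => length               => :ss_item_count,
    :i_class_id => (c -> sum(c .== 1))  => :id1,
    :i_class_id => (c -> sum(c .== 2))  => :id2,
    :i_class_id => (c -> sum(c .== 3))  => :id3,
    :i_class_id => (c -> sum(c .== 4))  => :id4)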
Slide 21: Summary
• High Performance Analytics Toolkit (HPAT) provides scripting abstractions and “bare-metal” performance
• Matrix/vector operations, extension for Parallel I/O
• Domain-specific compiler techniques
• Generates efficient MPI/C++
• Uses existing HPC libraries
• Much easier and faster than alternatives
• Get involved!
• Any contributions welcome
• Need more and more use cases
HPAT.jl
https://github.com/IntelLabs/HPAT.jl
Slide 22: Backup
Slide 23: Big Data Analytics is Slow
• Goal: gain new insight from large datasets by domain experts
• Productivity is 1st priority
• Scripting languages most common, fast development
• MPI/C++ not acceptable
• Apache Hadoop and Spark dominant
• Intuitive user interface (MapReduce, Python)
• Master-executer library approach
• Library approach is slow
• Loses locality, high overheads
• Orders of magnitude slower than handwritten MPI/C++
http://hadoop.apache.org/
F. McSherry et al., "Scalability! But at what COST?", HotOS 2015.
K. Brown et al., "Have Abstraction and Eat Performance, Too: Optimized Heterogeneous Computing with Parallel Patterns", CGO 2016.
Slide 25: DistributedPass: parallelization
• 1D partitioning of data across nodes
• Domain-specific heuristic for machine learning, analytics
• Not valid for other domains!
• Column partitioning for matrices
• Julia is column major
• Needs good coding style by user
• 1D partitioning of parfor iterations per node
• Handle distributed-memory libraries
• Generate necessary input/output transformations
• Intel® Data Analytics Acceleration Library (Intel® DAAL)
• Machine learning algorithms similar to Spark’s MLlib
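A hedged sketch of the parfor iteration split. Note that the generated code shown earlier computes mystart/myend as rank*(N/nprocs), which silently drops trailing iterations when N is not divisible by the rank count; a robust block split gives the remainder to the last rank. The function name is my own:

# 1D block split of parfor iterations across ranks; the last rank absorbs
# the remainder so every iteration is covered.
function parfor_range(N::Int, nprocs::Int, rank::Int)  # rank is 0-based
    chunk = div(N, nprocs)
    lo = rank * chunk + 1
    hi = rank == nprocs - 1 ? N : (rank + 1) * chunk
    return lo:hi
end

@assert sum(length(parfor_range(10, 3, r)) for r in 0:2) == 10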
Slide 26: Example: Distributed Data Source Transformation Flow
Julia source:
points = DataSource(Matrix{Float64},HDF5,"/points","data.hdf5")
After Macro-Pass (enables Julia type inference):
points::Matrix{Float64} = HPAT_h5_source("/points","data.hdf5")
After Domain-Pass (enables ParallelAccelerator):
h5_size1, h5_size2 = HPAT_h5_sizes("/points","data.hdf5")
points = alloc(h5_size1, h5_size2)
HPAT_h5_read(points, "/points","data.hdf5")
After Distributed-Pass (parallelizes):
points = alloc(h5_size1, h5size_2/nprocs)
HPAT_h5_read(points,"/points","data.hdf5",h5_size1,h5size_2/nprocs)
HPAT-CGen backend code:
H5Sselect_hyperslab(h5_size1, h5size_2/nprocs,…)
H5Dread(points,"/points","data.hdf5",…)
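For a flavor of what the Macro-Pass stage does mechanically, a hedged sketch using MacroTools (an illustrative choice, not necessarily what HPAT uses); `HPAT_h5_source` is the name from the flow above:

# Illustrative desugaring: rewrite DataSource(Matrix{T}, HDF5, path, file)
# into a typed call that Julia's type inference can propagate.
using MacroTools: postwalk, @capture

function desugar_datasource(ex::Expr)
    postwalk(ex) do node
        if @capture(node, DataSource(Matrix{T_}, HDF5, path_, file_))
            return :(HPAT_h5_source($path, $file)::Matrix{$T})
        end
        return node
    end
end

ex = :(points = DataSource(Matrix{Float64}, HDF5, "/points", "data.hdf5"))
desugar_datasource(ex)  # :(points = HPAT_h5_source("/points", "data.hdf5")::Matrix{Float64})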
Slide 27: Backend Code Complexity
• Parallel I/O example (HDF5, MPI/C++):
herr_t ret;
// set up file access property list with parallel I/O access
hid_t plist_id = H5Pcreate(H5P_FILE_ACCESS);
assert(plist_id != -1);
// set parallel access with communicator
ret = H5Pset_fapl_mpio(plist_id, comm, info);
assert(ret != -1);
// open file
file_id = H5Fopen("/lsf/lsf09/sptprice.hdf5", H5F_ACC_RDONLY, plist_id);
assert(file_id != -1);
ret=H5Pclose(plist_id);
assert(ret != -1);
// open dataset
dataset_id = H5Dopen2(file_id, "/sptprice", H5P_DEFAULT);
assert(dataset_id != -1);
hid_t space_id = H5Dget_space(dataset_id);
assert(space_id != -1);
int num_dims = 0;
num_dims = H5Sget_simple_extent_ndims(space_id);
assert(num_dims==1);
/* get data dimension info */
hsize_t sptprice_size;
H5Sget_simple_extent_dims(space_id, &sptprice_size, NULL);
hsize_t my_sptprice_size = sptprice_size/mpi_nprocs;
hsize_t my_start = my_sptprice_size*mpi_rank;
my_sptprice = (double*)malloc(my_sptprice_size*sizeof(double));
// create a file dataspace independently
hid_t my_dataspace = H5Dget_space(dataset_id);
assert(my_dataspace != -1);
// stride and block are NULL for contiguous hyperslab
ret=H5Sselect_hyperslab(my_dataspace, H5S_SELECT_SET, &my_start, NULL, &my_sptprice_size, NULL);
assert(ret != -1);
/* create a memory dataspace independently */
hid_t mem_dataspace = H5Screate_simple (1, &my_sptprice_size, NULL);
assert (mem_dataspace != -1);
/* set up the collective transfer properties list */
hid_t xfer_plist = H5Pcreate (H5P_DATASET_XFER);
assert(xfer_plist != -1);
ret = H5Pset_dxpl_mpio(xfer_plist, H5FD_MPIO_COLLECTIVE);
assert(ret != -1);
/* read data collectively */
ret = H5Dread(dataset_id, H5T_NATIVE_FLOAT, mem_dataspace, my_dataspace, xfer_plist, my_sptprice);
assert(ret != -1);
...
Hard to write manually!
Slide 28: Libraries
Execution time in seconds, Spark MLlib vs. HPAT with the Intel DAAL backend:

Benchmark               | Spark-MLlib (s) | HPAT-DAAL (s) | Speedup
K-Means (lib)           |              59 |            21 |    2.8x
Linear Regression (lib) |             176 |            26 |    6.7x
Naïve Bayes (lib)       |              43 |            23 |    1.9x
Slide 29: Benchmarks
• Cori at LBL/NERSC
• Dual Haswell nodes
• Cray Aries (Dragonfly) network
• 64 nodes (2048 cores) used
• Spark 1.6.0 (default Cori installation)
• Benchmarks
• 1D_sum: sums 8.5 billion element vector from file
• Pi: 1 billion random points
• Logistic regression: 2 billion samples, 10 features, single precision
• K-Means: 320 million samples, 20 features, double precision; 10 iterations, 5 centers
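Back-of-envelope input sizes for these runs (my arithmetic from the stated shapes; the 1D-sum element type is not stated, so Float64 is assumed):

# Rough dataset sizes, to show why these inputs call for distribution.
for (name, bytes) in [("Logistic regression (2e9 x 10, Float32)", 2e9 * 10 * 4),
                      ("K-Means (320e6 x 20, Float64)",           320e6 * 20 * 8),
                      ("1D sum (8.5e9, assuming Float64)",        8.5e9 * 8)]
    println(rpad(name, 42), "~", round(bytes / 2^30; digits = 1), " GiB")
end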
Slide 30: Logistic Regression (Spark)
import numpy as np
from pyspark import SparkContext

D = 10 # Number of dimensions
if __name__ == "__main__":
sc = SparkContext(appName="PythonLR")
points = sc.textFile(file).mapPartitions(readPointBatch).cache()
w = 2 * np.random.ranf(size=D) - 1
def gradient(matrix, w):
Y = matrix[:, 0] # point labels (first column of input file)
X = matrix[:, 1:] # point coordinates
return ((1.0 / (1.0 + np.exp(-Y * X.dot(w))) - 1.0) * Y * X.T).sum(1)
def add(x, y):
x += y
return x
for i in range(iterations):
w -= points.map(lambda m: gradient(m, w)).reduce(add)
https://github.com/apache/spark/blob/master/examples/src/main/python/logistic_regression.py
Callouts: scheduling, TCP/IP overheads; Python overheads.
Slide 31: Logistic Regression (HPAT)
using HPAT
@acc hpat function logistic_regression(iterations, file)
points = DataSource(Matrix{Float64},HDF5,"/points", file)
responses = DataSource(Vector{Float64},HDF5,"/responses",file)
D = size(points,1)
N = size(points,2)
labels = reshape(responses,1,N)
w = reshape(2*rand(D)-1,1,D)
for i in 1:iterations
w -= ((1./(1+exp(-labels.*(w*points)))-1).*labels)*points'
end
return w
end
weights = logistic_regression(100, "mydata.hdf5")
$ mpirun -np 64 julia logistic_regression.jl
https://github.com/IntelLabs/HPAT.jl/blob/master/examples/logistic_regression.jl
95x speedup!
Slide 32: Monte Carlo Pi (Spark)
from random import random
from pyspark import SparkContext
if __name__ == "__main__":
sc = SparkContext(appName="PythonPi")
n = 100000 * partitions
def f(_):
x = random() * 2 - 1
y = random() * 2 - 1
return 1 if x ** 2 + y ** 2 < 1 else 0
def add(x, y):
x += y
return x
count = sc.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
sc.stop()
https://github.com/apache/spark/blob/master/examples/src/main/python/pi.py
Callouts: scheduling overheads; the parallelized range is an extra array; map and reduce stages.
Slide 33: Monte Carlo Pi (HPAT)
Generated MPI/C++:
double calcPi(int64_t N) {
int mpi_rank , mpi_nprocs;
MPI_Comm_size(MPI_COMM_WORLD,&mpi_nprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&mpi_rank);
int mystart = mpi_rank * (N / mpi_nprocs);
int myend = (mpi_rank + 1) * (N / mpi_nprocs);
for(i=mystart ; i<myend ; i++) {
x = rand(..);
y = rand(..);
sum_local += …
}
MPI_Reduce(…);
return out;
}
HPAT source:
using HPAT
@acc hpat function calcPi(n)
x = rand(n) .* 2.0 .- 1.0
y = rand(n) .* 2.0 .- 1.0
return 4.0*sum(x.^2 .+ y.^2 .< 1.0)/n
end
myPi = calcPi(10^9)
$ mpirun -np 64 julia pi.jl
https://github.com/IntelLabs/HPAT.jl/blob/master/examples/pi.jl
Callouts: 1600x speedup! Computation is done in registers.
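What "computation is done in registers" means here, as a scalar Julia sketch (my reconstruction of the fusion, not HPAT output): the map over x and y and the sum-reduction fuse into one loop, so no length-n arrays are ever allocated.

# Scalar form of the fused map+reduce each rank runs: samples are generated
# and consumed one at a time, so x, y, and the test stay in registers.
function calc_pi_fused(n::Int)
    hits = 0
    for _ in 1:n
        x = rand() * 2 - 1
        y = rand() * 2 - 1
        hits += Int(x^2 + y^2 < 1)
    end
    return 4 * hits / n
end

calc_pi_fused(10^7)  # ≈ 3.1415…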
Editor's Notes (per slide):
  1. ParallelAccelerator acknowledgement
  2. Data analytics and similar domains
  3. Readable code, close to math formula
  4. National lab supercomputers
  5. Decades of parallel compiler research
  6. DataSource for parallel-io
  7. Partitioning data and computation
  8. Make sure dependencies are installed properly. Iteration loops are OK.
  9. Lots of effort to make compiler robust
  10. Secret sauce
  11. Same as ParallelAccelerator First class arrays
  12. LBL use case