High Performance Analytics Toolkit (HPAT) is a Julia-based framework for big data analytics on clusters that is both easy to use and extremely fast; it is orders of magnitude faster than alternatives like Apache Spark.
HPAT automatically parallelizes analytics tasks written in Julia and generates efficient MPI/C++ code.
1. HPAT.jl - Easy and Fast Big Data Analytics
Ehsan Totoni, Todd Anderson, Wajih Ul Hassan*, Tatiana Shpeisman
Parallel Computing Lab, Intel Labs
*Intern from UIUC
JuliaCon 2016
2. High Performance Analytics Toolkit (HPAT): a compiler-based
framework for big data analytics and machine learning
• Goal: efficient large-scale analytics without sacrificing programmer productivity
• Array-style programming
• High performance
• Built on ParallelAccelerator
• Domain-specific compiler heuristics for parallelization
• Use efficient HPC stack (e.g. MPI)
• Bridge the enormous productivity-performance gap
HPAT overview
HPAT.jl
https://github.com/IntelLabs/HPAT.jl
3. Logistic regression example:
Let’s do Data Science
function logistic_regression(iterations)
    points = …some small data…
    responses = …
    D = size(points,1)
    N = size(points,2)
    labels = reshape(responses,1,N)
    w = reshape(2*rand(D)-1,1,D)
    for i in 1:iterations
        w -= ((1./(1+exp(-labels.*(w*points)))-1).*labels)*points'
    end
    return w
end
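The kernel above is plain array code. As a rough NumPy transcription (the original elides its data, so the inputs here are synthetic random arrays with made-up sizes):

```python
import numpy as np

# Rough NumPy transcription of the Julia kernel above. The data is
# synthetic (the original elides it); sizes are made up for illustration.
def logistic_regression(iterations, D=10, N=1000, seed=0):
    rng = np.random.default_rng(seed)
    points = rng.random((D, N))                              # D features x N samples
    labels = np.where(rng.random((1, N)) < 0.5, -1.0, 1.0)   # 1 x N labels in {-1, +1}
    w = 2 * rng.random((1, D)) - 1                           # 1 x D weight row vector
    for _ in range(iterations):
        # gradient of the logistic loss, summed over all samples
        w -= ((1.0 / (1.0 + np.exp(-labels * (w @ points))) - 1.0) * labels) @ points.T
    return w

w = logistic_regression(10)
print(w.shape)  # (1, 10)
```

Note how the whole update is one line of matrix/vector operations — exactly the style HPAT can parallelize.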
4. • Challenges:
• Long execution time
• Data doesn’t fit in memory
• Solution:
• Parallelism: cluster or cloud
• How:
• MPI/C++ (“gold standard”), MPI/Julia, Spark (library)
• HPAT (“smart compiler”)
What about large data?
5. MPI/C++ (“gold standard”)
herr_t ret;
// set up file access property list with parallel I/O access
hid_t plist_id = H5Pcreate(H5P_FILE_ACCESS);
assert(plist_id != -1);
// set parallel access with communicator
ret = H5Pset_fapl_mpio(plist_id, comm, info);
assert(ret != -1);
// open file
file_id = H5Fopen("/lsf/lsf09/sptprice.hdf5", H5F_ACC_RDONLY, plist_id);
assert(file_id != -1);
ret=H5Pclose(plist_id);
assert(ret != -1);
// open dataset
dataset_id = H5Dopen2(file_id, "/sptprice", H5P_DEFAULT);
assert(dataset_id != -1);
hid_t space_id = H5Dget_space(dataset_id);
assert(space_id != -1);
int num_dims = 0;
num_dims = H5Sget_simple_extent_ndims(space_id);
assert(num_dims==1);
/* get data dimension info */
hsize_t sptprice_size;
H5Sget_simple_extent_dims(space_id, &sptprice_size, NULL);
hsize_t my_sptprice_size = sptprice_size/mpi_nprocs;
hsize_t my_start = my_sptprice_size*mpi_rank;
my_sptprice = (double*)malloc(my_sptprice_size*sizeof(double));
// create a file dataspace independently
hid_t my_dataspace = H5Dget_space(dataset_id);
assert(my_dataspace != -1);
// stride and block are NULL for contiguous hyperslab
ret=H5Sselect_hyperslab(my_dataspace, H5S_SELECT_SET, &my_start, NULL,
&my_sptprice_size, NULL);
assert(ret != -1);
...
• Message Passing Interface (MPI)
• Pros:
• Best performance!
• Cons:
• Need to understand parallelism, MPI, parallel I/O, C++
• High effort, tedious, error-prone, not readable…
Example code (not readable):
6. MPI/Julia
• MPI.jl
• Pros:
• Less effort than C++
• Cons:
• Still need to understand parallelism, MPI
• Parallel I/O
• Needs high performance Julia
• Infrastructure challenges
Example code:
function main()
    MPI.Init()
    …
    MPI.Allreduce!(send_arr, recv_arr, MPI.SUM, MPI.COMM_WORLD)
    …
    MPI.Finalize()
end
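The Allreduce! call is where ranks combine partial results. A single-process Python sketch of its SUM semantics (a hypothetical helper simulating four ranks, not the MPI.jl API):

```python
import numpy as np

# Single-process sketch of MPI Allreduce(SUM) semantics: every "rank"
# contributes a local array; afterwards every rank holds the elementwise
# sum of all contributions. Hypothetical helper, not the MPI.jl API.
def allreduce_sum(local_arrays):
    total = np.sum(local_arrays, axis=0)          # the reduction step
    return [total.copy() for _ in local_arrays]   # broadcast result to all ranks

locals_ = [np.full(3, float(rank)) for rank in range(4)]  # rank r holds [r, r, r]
results = allreduce_sum(locals_)
print(results[0])  # [6. 6. 6.] — the same on every rank (0+1+2+3)
```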
7. Apache Spark
• Map/reduce library
• Master-executor model
• Pros:
• Easier than MPI/C++
• Lots of “system” features
• Cons:
• Development effort
• Very slow
• 100x slower than MPI/C++
• Host language overheads, loses locality, high library overheads, etc.
Example code (Python):
if __name__ == "__main__":
    sc = SparkContext(appName="PythonLR")
    points = sc.textFile(file).mapPartitions(readPointBatch).cache()
    w = 2 * np.random.ranf(size=D) - 1
    def gradient(matrix, w):
        Y = matrix[:, 0]
        X = matrix[:, 1:]
        return ((1.0 / (1.0 + np.exp(-Y * X.dot(w))) - 1.0) * Y * X.T).sum(1)
    def add(x, y):
        x += y
        return x
    for i in range(iterations):
        w -= points.map(lambda m: gradient(m, w)).reduce(add)
8. HPAT (“smart compiler”)
• Julia code → MPI/C++ “gold standard”
• “Smart parallelizing compiler” doesn’t exist, but…
• Observations:
• Array code is implicitly parallel
• Parallelism is simple for data analytics
• map/reduce pattern
• 1D decomposition, allreduce communication
[Figure: 1D decomposition — points and labels split column-wise across nodes; w replicated on every node]
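These observations can be checked directly: a Python/NumPy sketch (made-up sizes) splits the samples into contiguous column blocks, computes rank-local gradients, and sums them — reproducing the serial gradient, which is all the allreduce has to do:

```python
import numpy as np

# 1D decomposition check: splitting the samples (columns) into contiguous
# blocks and summing rank-local gradients reproduces the serial gradient,
# so an allreduce of local gradients is the only communication needed.
def local_gradient(points, labels, w):
    return ((1.0 / (1.0 + np.exp(-labels * (w @ points))) - 1.0) * labels) @ points.T

rng = np.random.default_rng(0)
D, N, nranks = 5, 100, 4
points = rng.random((D, N))
labels = np.where(rng.random((1, N)) < 0.5, -1.0, 1.0)
w = 2 * rng.random((1, D)) - 1                 # w is replicated on every rank

chunks = np.array_split(np.arange(N), nranks)  # contiguous column blocks
grads = [local_gradient(points[:, c], labels[:, c], w) for c in chunks]
dist_grad = np.sum(grads, axis=0)              # what MPI_Allreduce(SUM) computes

serial_grad = local_gradient(points, labels, w)
print(np.allclose(dist_grad, serial_grad))  # True
```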
9. using HPAT
@acc hpat function logistic_regression(iterations, file)
    points = DataSource(Matrix{Float64},HDF5,"/points", file)
    responses = DataSource(Vector{Float64},HDF5,"/responses",file)
    D = size(points,1)
    N = size(points,2)
    labels = reshape(responses,1,N)
    w = reshape(2*rand(D)-1,1,D)
    for i in 1:iterations
        w -= ((1./(1+exp(-labels.*(w*points)))-1).*labels)*points'
    end
    return w
end
weights = logistic_regression(100,"mydata.hdf5")
$ mpirun -np 64 julia logistic_regression.jl
Logistic Regression (HPAT)
https://github.com/IntelLabs/HPAT.jl/blob/master/examples/logistic_regression.jl
95x speedup over Spark
Parallel I/O
11. HPAT usage
• Dependencies:
• ParallelAccelerator, Parallel HDF5, MPI
• “@acc hpat” function annotation
• Use matrix/vector operations, comprehensions
• ParallelAccelerator operations
• No “bad” for loops
• Column-major matrices
• All of program inside HPAT
• I/O using DataSource
[Figure: 1D decomposition — points and labels split column-wise across nodes; w replicated on every node]
12. HPAT limitations
• HPAT can fail to parallelize!
• Limitations in compiler analysis
• Needs good coding style
• Fallback: explicit map/reduce, @parfor code
• Only map/reduce parallel pattern supported
• Data analytics, machine learning, optimization etc.
• Others like stencils (PDEs) not supported yet
• No sparse matrices yet
• HDF5 and text file format
13. Init(state) # assume all arrays are partitioned
while isChanged(state)
    inferArrayDistribution(state, node)
end

function inferArrayDistribution(state, node)
    if isAssignment(node)
        # lhs and rhs are sequential if either is sequential
        seq = isSeq(state, lhs) || isSeq(state, rhs)
        isSeq(state, lhs) = isSeq(state, rhs) = seq
    elseif isGEMM(node)
        # e.g. w = labels*points' - shared parameter synchronization heuristic
        isSeq(lhs) = !isSeq(in1) && !isSeq(in2) && !isTransposed(in1) && isTransposed(in2)
    elseif isHPATcall(node)
        handleHPATcall(node)
    else
        # unknown call, assume all arrays are sequential
        isSeq(state, nodeArrs) = true
    end
end
Array Distribution Inference
[Figure: inferred distributions for points, labels, and w]
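The pass above can be sketched as a small fixed-point loop. This Python toy (hypothetical IR encoding, covering only the assignment and unknown-call rules from the slide) shows how one sequential array infects everything assigned to or from it:

```python
# Toy fixed-point inference of array distribution: is each array sequential
# (replicated) or 1D-partitioned? Hypothetical IR: (kind, outputs, inputs).
def infer_distribution(nodes, arrays):
    is_seq = {a: False for a in arrays}   # optimistic: everything partitioned
    changed = True
    while changed:                        # iterate until a fixed point
        changed = False
        for kind, outs, ins in nodes:
            if kind == "assign":
                # lhs and rhs are sequential if either is sequential
                s = any(is_seq[a] for a in outs + ins)
            elif kind == "unknown":
                s = True                  # unknown call: conservatively sequential
            else:
                continue
            for a in outs + ins:
                if is_seq[a] != s:
                    is_seq[a] = s
                    changed = True
    return is_seq

nodes = [("assign", ["labels"], ["responses"]),
         ("assign", ["t"], ["points"]),
         ("unknown", ["u"], ["t"])]       # t escapes into an unknown call
dist = infer_distribution(nodes, ["points", "labels", "responses", "t", "u"])
print(dist["points"], dist["labels"])  # True False
```

Because `t` escapes into an unknown call, the fixed point forces `points` sequential too, while `labels` and `responses` stay partitioned.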
19. • Compiler development experience
• The good:
• Built-in type inference
• Introspection/full control
• The bad:
• No detailed AST definition
• Julia compiler surprises
• Long HPAT/ParallelAccelerator compilation time!
• Type inference every time
Building HPAT in Julia
20. • Structured data processing without SQL!
• Complex analytics all in array syntax
• Inspired by TPCx-BB examples
• Array syntax for table operations
• Join, filter, aggregate
• Similar to DataFrames.jl
• Interesting compiler challenges
• Optimize general AST instead of SQL trees
• Other use cases
• 2D decomposition
Ongoing HPAT development
customer_i_class =
    aggregate(sale_items, :ss_customer_sk,
              :ss_item_count = length(:ss_item_sk),
              :id1 = sum(:i_class_id==1),
              :id2 = sum(:i_class_id==2),
              :id3 = sum(:i_class_id==3),
              :id4 = sum(:i_class_id==4))
Example:
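A plain-Python sketch of what the aggregate() call computes (rows are made up; only the first two id columns shown for brevity):

```python
from collections import defaultdict

# Plain-Python sketch of the aggregate() call above: group rows by customer,
# then compute one count and per-class indicator sums per group. Column
# names follow the slide; the rows themselves are made up.
sale_items = [
    {"ss_customer_sk": 1, "ss_item_sk": 10, "i_class_id": 1},
    {"ss_customer_sk": 1, "ss_item_sk": 11, "i_class_id": 2},
    {"ss_customer_sk": 2, "ss_item_sk": 12, "i_class_id": 1},
]

groups = defaultdict(list)
for row in sale_items:
    groups[row["ss_customer_sk"]].append(row)

customer_i_class = {
    cust: {
        "ss_item_count": len(rows),
        "id1": sum(r["i_class_id"] == 1 for r in rows),
        "id2": sum(r["i_class_id"] == 2 for r in rows),
    }
    for cust, rows in groups.items()
}
print(customer_i_class[1])  # {'ss_item_count': 2, 'id1': 1, 'id2': 1}
```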
21. Summary
• High Performance Analytics Toolkit (HPAT) provides scripting abstractions and “bare-metal” performance
• Matrix/vector operations, extension for Parallel I/O
• Domain-specific compiler techniques
• Generates efficient MPI/C++
• Uses existing HPC libraries
• Much easier and faster than alternatives
• Get involved!
• Any contributions welcome
• Need more and more use cases
HPAT.jl
https://github.com/IntelLabs/HPAT.jl
23. • Goal: gain new insight from large datasets by domain experts
• Productivity is 1st priority
• Scripting languages most common, fast development
• MPI/C++ not acceptable
• Apache Hadoop and Spark dominant
• Intuitive user interface (MapReduce, Python)
• Master-executor library approach
• Library approach is slow
• Loses locality, high overheads
• Orders of magnitude slower than handwritten MPI/C++
23
Big Data Analytics is Slow
F. McSherry et al., “Scalability! But at what COST?”, HotOS 2015.
K. Brown et al., “Have Abstraction and Eat Performance, Too: Optimized Heterogeneous Computing with Parallel Patterns”, CGO 2016.
25. • 1D partitioning of data across nodes
• Domain-specific heuristic for machine learning, analytics
• Not valid for other domains!
• Column partitioning for matrices
• Julia is column major
• Needs good coding style by user
• 1D partitioning of parfor iterations per node
• Handle distributed-memory libraries
• Generate necessary input/output transformations
• Intel® Data Analytics Acceleration Library (Intel® DAAL)
• Machine learning algorithms similar to Spark’s MLlib
DistributedPass: parallelization
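The 1D block partitioning of data and parfor iterations can be sketched in a few lines. This is the common MPI idiom (contiguous chunks whose sizes differ by at most one); HPAT's exact remainder handling may differ:

```python
# Sketch of 1D block partitioning of n iterations (or array elements) over
# nprocs ranks: contiguous chunks, remainder spread over the first ranks so
# chunk sizes differ by at most one. Common MPI idiom; HPAT's exact scheme
# may differ in details.
def block_range(n, nprocs, rank):
    base, rem = divmod(n, nprocs)
    start = rank * base + min(rank, rem)
    size = base + (1 if rank < rem else 0)
    return start, start + size  # half-open [start, stop)

n, nprocs = 10, 4
ranges = [block_range(n, nprocs, r) for r in range(nprocs)]
print(ranges)  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Every iteration lands in exactly one rank's range, so a parfor body can run on each chunk independently.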
29. Benchmarks
• Cori at LBL/NERSC
• Dual Haswell nodes
• Cray Aries (Dragonfly) network
• 64 nodes (2048 cores) used
• Spark 1.6.0 (default Cori installation)
• Benchmarks
• 1D_sum: sums an 8.5-billion-element vector read from file
• Pi: 1 billion random points
• Logistic regression: 2 billion samples, 10 features, single precision
• K-Means: 320 million samples, 20 features, double precision, 10 iterations, 5 centers
30. D = 10 # Number of dimensions
if __name__ == "__main__":
    sc = SparkContext(appName="PythonLR")
    points = sc.textFile(file).mapPartitions(readPointBatch).cache()
    w = 2 * np.random.ranf(size=D) - 1
    def gradient(matrix, w):
        Y = matrix[:, 0] # point labels (first column of input file)
        X = matrix[:, 1:] # point coordinates
        return ((1.0 / (1.0 + np.exp(-Y * X.dot(w))) - 1.0) * Y * X.T).sum(1)
    def add(x, y):
        x += y
        return x
    for i in range(iterations):
        w -= points.map(lambda m: gradient(m, w)).reduce(add)
Logistic Regression (Spark)
https://github.com/apache/spark/blob/master/examples/src/main/python/logistic_regression.py
Scheduling, TCP/IP overheads
Python overheads
31. using HPAT
@acc hpat function logistic_regression(iterations, file)
    points = DataSource(Matrix{Float64},HDF5,"/points", file)
    responses = DataSource(Vector{Float64},HDF5,"/responses",file)
    D = size(points,1)
    N = size(points,2)
    labels = reshape(responses,1,N)
    w = reshape(2*rand(D)-1,1,D)
    for i in 1:iterations
        w -= ((1./(1+exp(-labels.*(w*points)))-1).*labels)*points'
    end
    return w
end
weights = logistic_regression(100,"mydata.hdf5")
$ mpirun -np 64 julia logistic_regression.jl
Logistic Regression (HPAT)
https://github.com/IntelLabs/HPAT.jl/blob/master/examples/logistic_regression.jl
95x speedup!
32. from pyspark import SparkContext
if __name__ == "__main__":
    sc = SparkContext(appName="PythonPi")
    n = 100000 * partitions
    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 < 1 else 0
    def add(x, y):
        x += y
        return x
    count = sc.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))
    sc.stop()
Monte Carlo Pi (Spark)
https://github.com/apache/spark/blob/master/examples/src/main/python/pi.py
Scheduling overheads; extra array; separate map and reduce stages
33. double calcPi(int64_t N) {
    int mpi_rank, mpi_nprocs;
    MPI_Comm_size(MPI_COMM_WORLD, &mpi_nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
    int64_t mystart = mpi_rank*(N/mpi_nprocs);
    int64_t myend = (mpi_rank+1)*(N/mpi_nprocs);
    for (int64_t i = mystart; i < myend; i++) {
        x = rand(..);
        y = rand(..);
        sum_local += …
    }
    MPI_Reduce(…);
    return out;
}

using HPAT
@acc hpat function calcPi(n)
    x = rand(n) .* 2.0 .- 1.0
    y = rand(n) .* 2.0 .- 1.0
    return 4.0*sum(x.^2 .+ y.^2 .< 1.0)/n
end
myPi = calcPi(10^9)
$ mpirun -np 64 julia pi.jl
Monte Carlo Pi (HPAT)
https://github.com/IntelLabs/HPAT.jl/blob/master/examples/pi.jl
1600x speedup!
Computation done in registers
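Both versions above estimate π by counting random points inside the unit circle. A serial Python sketch of the kernel (fixed seed for reproducibility; the fused HPAT/MPI loop keeps only this running count per rank):

```python
import random

# Serial Python sketch of the Monte Carlo Pi kernel: count random points in
# [-1, 1]^2 that fall inside the unit circle, then scale by 4. The fused
# HPAT/MPI versions keep only this running count per rank.
def calc_pi(n, seed=42):
    random.seed(seed)
    count = 0
    for _ in range(n):
        x = random.random() * 2.0 - 1.0
        y = random.random() * 2.0 - 1.0
        count += (x * x + y * y < 1.0)
    return 4.0 * count / n

pi_est = calc_pi(100_000)
print(round(pi_est, 1))  # 3.1
```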
Editor's notes
ParallelAccelerator acknowledgement
Data analytics and similar domains
Readable code, close to math formula
National lab supercomputers
Decades of parallel compiler research
DataSource for parallel-io
Partitioning data and computation
Make sure dependencies are installed properly
Iteration loops are OK