CS 240A - Parallel Implementation of K Means Clustering on
CUDA
Lan Liu, Pritha D N
December 9, 2016
Abstract
K-Means clustering is a popular clustering algorithm in data mining. Clustering large data sets can be
time consuming, and in an attempt to minimize this time, our project is a parallel implementation of the
K-Means clustering algorithm on CUDA using C. We present the implementation and performance analysis
of our approach to parallelizing K-Means clustering.
1 K-Means Clustering
In this section, we provide an overview of K-Means clustering, present its mathematical description and the
sequential algorithm, and analyze the complexity of the sequential code.
1.1 Description
K-Means clustering is one of the most widely used clustering methods in data mining. It aims to partition
N given data points into K clusters so that the points within each cluster are as similar as possible; namely,
each data point belongs to the cluster with the nearest mean.
Assume there are N data points in d dimensions, (x_1, x_2, ..., x_N) ⊂ R^d, and we need to classify them into
K clusters S = (S_1, S_2, ..., S_K), where K is generally fixed a priori. Define the center of each cluster S_i as the
mean of all data points x ∈ S_i, i.e.

    µ_i = (Σ_{x∈S_i} x) / |S_i|,  where |S_i| denotes the size of S_i
The goal of clustering is to minimize the total squared Euclidean distance from each data point to its cluster
center, i.e., to find the clustering S that minimizes the objective function:

    cost(S) = Σ_{i=1}^{K} Σ_{x∈S_i} ||x − µ_i||²
Finding the global minimum of the objective function is computationally challenging (NP-hard). The
commonly used algorithm is a heuristic that finds a local minimum rather than the global minimum; the
trade-off is a much cheaper computational cost. The commonly used k-means algorithm is as follows:
Step 1: Initialize cluster centroids µ_1, µ_2, ..., µ_K ∈ R^d randomly.
Step 2: Repeat until convergence: {
Assignment step: Assign each data point x_i to the nearest cluster S_i; define c_i as the membership
index of x_i,

    c_i := argmin_{1≤j≤K} ||x_i − µ_j||²
Update centroids step: For each j, update

    µ_j := (Σ_{i=1}^{N} 1_{c_i=j} · x_i) / (Σ_{i=1}^{N} 1_{c_i=j})
}
Since the clustering problem is generally not convex, there may be many local minima. For any
fixed initial condition, it can easily be proved that cost(S) decreases at every iteration step, so the
algorithm converges to a local minimum determined by the initial condition. For example,
set x = (2, 4, 5, 6). If the initial centers are µ_1 = 2, µ_2 = 5, then S_1 = (2), S_2 = (4, 5, 6); if the initial centers are
µ_1 = 4, µ_2 = 5, then S_1 = (2, 4), S_2 = (5, 6). The first is the global minimum with cost = (4−5)² + (6−5)² = 2,
and the second is a suboptimal stationary point with cost = (2−3)² + (4−3)² + (5−5.5)² + (6−5.5)² = 2.5.
Considering this, one might generate several results using random initial centers and choose the best one,
with the smallest objective function. This also drives us to apply parallel computing to save running time.
1.2 Algorithm
Based on this description, we apply the following heuristic K-Means clustering sequential algorithm, which
is based on Professor Wei-keng Liao's k-means clustering code [1]. We have modified how the new cluster
centers are recomputed. The original code sums over all data points in each new cluster to compute the
centers in every iteration step; considering that only a portion of the data changes membership (and fewer
and fewer points do as the iteration proceeds), we instead handle only the points that change membership,
adding each such point to its new cluster and removing it from its old one. This is more efficient, and our
results verify it.
Step 1: Pick the first K data points as the initial cluster centers.
Step 2: Assign each data point to the nearest cluster.
Step 3: For each reassigned data point, increase the membership-change counter by 1.
Step 4: Set each cluster center to the mean of all data points belonging to that cluster.
Step 5: Repeat steps 2-4 until convergence.
The pseudo code is as follows:
Let N be the number of data points, K be the number of clusters.
data[N]: the array of data objects
center[K]: the array of cluster centers
membership[N]: the array of data point memberships
clustersum[K]: the sum of the data points in the kth cluster
clustersize[K]: the size of the kth cluster
δ: counts the number of membership changes
threshold: critical value for the stop condition; we set it to 0.001
for i from 0 to K−1
    center[i] ← data[i]
do {
    δ ← 0
    for i from 0 to N−1
        index ← 0
        mindis ← ||data[i] − center[0]||
        for j from 1 to K−1
            distance ← ||data[i] − center[j]||
            if distance < mindis
                mindis ← distance
                index ← j
        if first iteration
            δ ← N
            membership[i] ← index
            clustersize[index] ← clustersize[index] + 1
            clustersum[index] ← clustersum[index] + data[i]
        else if membership[i] ≠ index
            δ ← δ + 1
            clustersize[index] ← clustersize[index] + 1
            clustersize[membership[i]] ← clustersize[membership[i]] − 1
            clustersum[index] ← clustersum[index] + data[i]
            clustersum[membership[i]] ← clustersum[membership[i]] − data[i]
            membership[i] ← index
    for j from 0 to K−1
        center[j] ← clustersum[j] / clustersize[j]
} while (δ/N > threshold)
Note: the stop condition is δ/N < threshold, i.e., the algorithm stops once fewer than 1‰ of all data points change membership.
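To make the procedure concrete, the following is a minimal C sketch of one pass of this loop. It is a sketch rather than our exact implementation; the helper name dist2 and the use of the squared Euclidean distance for comparisons are our own choices.

/* Squared Euclidean distance between two D-dimensional points. */
static float dist2(const float *a, const float *b, int D) {
    float sum = 0.0f;
    for (int d = 0; d < D; d++) {
        float diff = a[d] - b[d];
        sum += diff * diff;
    }
    return sum;
}

/* One assignment pass with incremental centroid bookkeeping.
   data[i] is a D-vector; returns δ, the number of membership changes. */
int assign_points(int N, int K, int D, float **data, float **center,
                  float **clustersum, int *clustersize, int *membership,
                  int first_iteration)
{
    int delta = 0;
    for (int i = 0; i < N; i++) {
        int index = 0;
        float mindis = dist2(data[i], center[0], D);
        for (int j = 1; j < K; j++) {
            float distance = dist2(data[i], center[j], D);
            if (distance < mindis) { mindis = distance; index = j; }
        }
        if (first_iteration) {                 /* every point is "new" */
            membership[i] = index;
            clustersize[index]++;
            for (int d = 0; d < D; d++) clustersum[index][d] += data[i][d];
            delta = N;
        } else if (membership[i] != index) {   /* move point between clusters */
            int old = membership[i];
            delta++;
            clustersize[index]++;
            clustersize[old]--;
            for (int d = 0; d < D; d++) {
                clustersum[index][d] += data[i][d];
                clustersum[old][d]   -= data[i][d];
            }
            membership[i] = index;
        }
    }
    return delta;
}

The centers are then refreshed as center[j][d] = clustersum[j][d] / clustersize[j], and the loop repeats while δ/N > threshold.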
Complexity Analysis: The sequential code has complexity O(TKND), where T is the number of iterations,
K is the number of clusters, N is the number of data points, and D is the dimension of each data point. There are
two main parts in the code. The first part reassigns each data point to the nearest center; this requires
computing the distance between each of the N data points and each cluster center, so the
complexity is O(NKD) per iteration step. The second part computes the center of each new cluster
after the reassignment; it basically requires computing K groups of means over N data points, and the complexity
is O((N + K) ∗ D) per iteration step. Clearly, part 1 dominates the running time, and since
part 1 processes each data point independently, it is the natural target for parallelization. The platform for the
parallel code could be MPI, Cilk, OpenMP, or CUDA; we use CUDA on Comet to do the parallelization
because of the high efficiency of GPUs in processing large-scale data.
2 Parallelization Of K-Means Using CUDA
In this section, we first introduce CUDA and the GPU nodes on Comet, then discuss our parallelization
strategies and CUDA implementation on Comet, and finally use the CUDA Occupancy Calculator to
determine the optimal number of threads per block.
2.1 CUDA
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model
invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of
the graphics processing unit (GPU).
The graphics processing unit (GPU), as a specialized computer processor, addresses the demands of
real-time, high-resolution 3D graphics and other compute-intensive tasks. GPUs have evolved into highly parallel
multi-core systems allowing very efficient manipulation of large blocks of data. In the computer game
industry, GPUs are used for graphics rendering, and for game physics calculations (physical effects such as
debris, smoke, fire, fluids); examples include PhysX and Bullet. CUDA has also been used to accelerate
non-graphical applications in computational biology, cryptography and other fields by an order of magnitude
or more.
CUDA works with all Nvidia GPUs from the G8x series onwards, including GeForce, Quadro and the
Tesla line.
2.1.1 NVIDIA GPUs on Comet
The Comet infrastructure provides 36 GPU nodes, each containing NVIDIA K80 GPUs. This series is
popularly called the Tesla GPU series by NVIDIA.
GPUs                 2 NVIDIA K80
Cores per socket     12
Sockets              2
Clock speed          2.5 GHz
Memory capacity      128 GB DDR4 DRAM
Memory bandwidth     120 GB/s
Flash memory         320 GB
Table 1: GPU node on Comet
Figure 1: Left: how a CPU processes data; right: how a GPU processes data. On a CPU, each of p cores
processes n/p data points; on a GPU, each thread handles one single data point.
3
Nvidia Tesla is Nvidia's brand name for its products targeting stream processing and general-purpose
GPU computing.
Tesla is Nvidia’s first microarchitecture to implement unified shaders. It was used with GeForce 8 Series,
GeForce 9 Series, GeForce 100 Series, GeForce 200 Series, and GeForce 300 Series of GPUs manufactured
in 90 nm, 80 nm, 65 nm, and 55 nm. It also found use in the GeForce 405, and in the workstation market in
the Quadro FX, Quadro x000, Quadro NVS series, and Nvidia Tesla computing modules.
With their very high computational power (measured in floating point operations per second or FLOPS)
compared to microprocessors, the Tesla products target the high performance computing market.
Physical limits for GPU Compute Capability 3.7 (Tesla K80) are listed in Table 2.
Threads per Warp 32
Max Warps per Multiprocessor 64
Max Thread Blocks per Multiprocessor 16
Max Threads per Multiprocessor 2048
Maximum Thread Block Size 1024
Registers per Multiprocessor 131072
Max Registers per Thread Block 65536
Max Registers per Thread 255
Shared Memory per Multiprocessor (bytes) 114688
Max Shared Memory per Block 49152
Register allocation unit size 256
Register allocation granularity warp
Shared Memory allocation unit size 256
Warp allocation granularity 4
Table 2: Physical limits for GPU Compute Capability 3.7 (Tesla K80)
Figure 2: Thread Organization in CUDA
2.2 Parallelization of K-Means clustering on CUDA
CUDA uses the CPU as its HOST and the GPU as its DEVICE. Each GPU node has access to thousands of
threads, and each thread processes one single data point. Threads are grouped into blocks, and shared
memory is restricted to each block. HOST and DEVICE do not share memory. Under this configuration,
we have to manually communicate messages between HOST and DEVICE.
As explained at the end of Section 1.2, we aim to parallelize the reassignment step, which computes the
distance between each data point and each cluster center. The logic and order of the parallel algorithm are
exactly the same as in the original sequential algorithm, but we have to take the communication between
HOST and DEVICE into account:
Step 0: The HOST initializes the cluster centers and copies the N data coordinates to the DEVICE.
Step 1: The DEVICE copies the data memberships and the K cluster centers from the HOST.
Step 2: On the DEVICE, each thread processes a single data point, computes its distance to each cluster
center, and updates the data membership; the thread index is tid = blockDim.x * blockIdx.x + threadIdx.x.
(A sketch of this kernel follows the list.)
Step 3: The HOST copies the new data memberships from the DEVICE and recomputes the cluster centers.
Step 4: Repeat steps 1-3 until convergence, then go to step 5.
Step 5: The HOST frees the allocated memory.
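For reference, below is a minimal sketch of the DEVICE kernel for step 2. The entry point name find_nearest_cluster and its parameter types match the ptxas output in Section 2.3.1, but the parameter names and the body are our reconstruction under the flattened data layout described below, not the verbatim code.

// Sketch: each thread finds the nearest cluster for one data point.
// objects:  [numObjs * numCoords], flattened row-major
// clusters: [numClusters * numCoords], flattened row-major
__global__ void find_nearest_cluster(int numCoords, int numObjs, int numClusters,
                                     float *objects, float *clusters,
                                     int *membership, int *membershipChanged)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid >= numObjs) return;

    int   index  = 0;
    float mindis = 3.402823466e+38f;   /* FLT_MAX */
    for (int j = 0; j < numClusters; j++) {
        /* Squared Euclidean distance to cluster j. */
        float dist = 0.0f;
        for (int d = 0; d < numCoords; d++) {
            float diff = objects[tid * numCoords + d] - clusters[j * numCoords + d];
            dist += diff * diff;
        }
        if (dist < mindis) { mindis = dist; index = j; }
    }
    /* Record a 0/1 change flag so the HOST can count membership changes. */
    membershipChanged[tid] = (membership[tid] != index);
    membership[tid] = index;
}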
There are several crucial points in the parallel code. First, it is easier for GPU threads to handle a 1D array
than a 2D array, so we flatten the N data points (in D dimensions) from a 2D array to a 1D array on the HOST
and then send it to the DEVICE, i.e., DEVICEdata[i*numCoordinates+j] = HOSTdata[i][j] (the jth coordinate of
the ith data point). Secondly, since different blocks do not share memory, we have to reduce the number of
membership changes within each block and combine the per-block results to obtain the total number of
membership changes.
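A sketch of this flattening on the HOST (the names HOSTdata, DEVICEdata, and flat are illustrative, and the DEVICE buffer is assumed to be allocated already):

/* Flatten HOSTdata[N][numCoordinates] into a contiguous 1D array. */
float *flat = (float *)malloc((size_t)N * numCoordinates * sizeof(float));
for (int i = 0; i < N; i++)
    for (int j = 0; j < numCoordinates; j++)
        flat[i * numCoordinates + j] = HOSTdata[i][j];  /* jth coord of ith point */
cudaMemcpy(DEVICEdata, flat,
           (size_t)N * numCoordinates * sizeof(float), cudaMemcpyHostToDevice);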
In our implementation, we set:

    numThreadsPerClusterBlock = 128
    numClusterBlocks = (N + numThreadsPerClusterBlock − 1) / numThreadsPerClusterBlock
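In CUDA C, this configuration translates directly into the kernel launch. A sketch follows; deviceObjects, deviceClusters, deviceMembership, and deviceMembershipChanged are our placeholder names for the DEVICE arrays:

const unsigned int numThreadsPerClusterBlock = 128;
const unsigned int numClusterBlocks =
    (N + numThreadsPerClusterBlock - 1) / numThreadsPerClusterBlock;

/* One thread per data point; extra threads in the last block return early. */
find_nearest_cluster<<<numClusterBlocks, numThreadsPerClusterBlock>>>(
    numCoords, N, K, deviceObjects, deviceClusters,
    deviceMembership, deviceMembershipChanged);
cudaDeviceSynchronize();  /* matches the cudaDeviceSynchronize calls in the nvprof trace */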
We verify the correctness of the parallel algorithm by checking that it produces the same clustering
as the original sequential k-means algorithm. Since our implementation performs the same steps as the
sequential code in parallel without changing the logic, this is the expected behavior.
2.3 Determining Optimal Number of Threads per Block
We used the CUDA Occupancy Calculator provided by NVIDIA to determine the optimal number of
threads per block.
2.3.1 Code Analysis
To determine the resource usage for each of the CUDA threads for our nearest cluster determining kernel,
we compiled our code with the ptxas nvcc option. The following is the ouput of the compilation:
nvcc -g -pg -I. -DBLOCK_SHARED_MEM_OPTIMIZATION=0 --ptxas-options=-v \
    -o cuda_kmeans.o -c cuda_kmeans.cu
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z20find_nearest_clusteriiiPfS_PiS0_'
                for 'sm_20'
ptxas info    : Function properties for _Z20find_nearest_clusteriiiPfS_PiS0_
                0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 18 registers, 80 bytes cmem[0]
2.3.2 CUDA Occupancy Calculator
The CUDA Occupancy Calculator [3] allows you to compute the multiprocessor occupancy of a GPU by a
given CUDA kernel. The multiprocessor occupancy is the ratio of active warps to the maximum number of
warps supported on a multiprocessor of the GPU. Each multiprocessor on the device has a set of N registers
available for use by CUDA thread programs. These registers are a shared resource that are allocated among
the thread blocks executing on a multiprocessor. The CUDA compiler attempts to minimize register usage
to maximize the number of thread blocks that can be active in the machine simultaneously. If a program
tries to launch a kernel for which the registers used per thread times the thread block size is greater than N,
the launch will fail.
Maximizing the occupancy can help to cover latency during global memory loads that are followed by a
__syncthreads(). The occupancy is determined by the amount of shared memory and registers used by each
thread block. Because of this, programmers need to choose the size of thread blocks with care in order to
maximize occupancy. This GPU Occupancy Calculator can assist in choosing thread block size based on
shared memory and register requirements.
For any input size, the shared memory used by our program is null. We use the CUDA Occupancy
Calculator with the number of threads per block = 128. From the disassembly of the code, we find that our
kernel function requires 18 registers. We provide Compute Capability 3.7 (GK210, K80),
number of threads per block = 128, shared memory size = 112 KB (for 3.7), and number of registers
required per thread = 18 as input to the occupancy calculator. We indicate the results in the figures included.
A block size of 128 threads gave the maximum occupancy for our program.
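The same estimate can also be obtained programmatically with the CUDA runtime occupancy API (available since CUDA 6.5). The following fragment is a sketch under our kernel's parameters, not part of our measured code:

int numBlocksPerSM = 0;
/* Max resident blocks per SM for find_nearest_cluster at 128 threads/block
   and 0 bytes of dynamic shared memory. */
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSM,
                                              find_nearest_cluster, 128, 0);
int activeWarps = numBlocksPerSM * (128 / 32);  /* resident warps per SM */
/* Compute Capability 3.7 supports at most 64 resident warps per SM. */
printf("occupancy = %.1f%%\n", 100.0 * activeWarps / 64);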
Figure 3: Input to CUDA Occupancy Calculator
Figure 4: Output of CUDA Occupancy Calculator
Figure 5: Impact of varying Block Size
Figure 6: Impact of varying Register Count Per Thread
Figure 7: Impact of varying Shared Memory Usage Per Block
3 Parallel Performance Analysis
In this section, we present several experimental results to illustrate the parallel performance.
1. Experiment 1: Vary the size of the data set N with the number of clusters fixed at K=128. The
performance results are in Table 3 and Figure 8. The data dimension is 1000.
For fixed K=128, as the size N of the data set increases, the parallel speedup is around 40 and increases
gradually. The GPU memory capacity on Comet is 11 GB, and we tested an 8 GB data set
(2048000*1000): the parallel running time is 6175 sec (about 1.7 hours), while the sequential
code is too slow to time directly; we expect it to take around 3 days. The parallel code greatly
outperforms the sequential code.
Size (float values)    Sequential (sec)    Parallel (sec)    Speedup
51200*1000             463.09              11.73             39.5
76800*1000             857.73              19.25             44.55
89600*1000             1182.43             24.82             47.6
115200*1000            1676.96             35.22             47.6
128000*1000            1794.91             41.23             43.53
512000*1000            >4 hrs              405.56            NA
2048000*1000           -                   6174.72           -
Table 3: Experiment 1. Parallel performance when varying the data size, with K=128 fixed
Figure 8: Experiment 1. Parallel speedup versus data set size, K=128
2. Experiment 2: For fixed size 51200*1000, vary the number of clusters, using δ/N < 0.001 as the stop
condition. We tested K=4, 16, 64, 128, 256, 512, 1024, 2048; the results are in Table 4 and
Figures 9 and 10.
Figure 9 shows that the parallel speedup keeps increasing as K increases, while the slope of the
curve decreases. This matches our expectation. In each iteration, the computational cost of part 1
of the sequential code is O(N·D·K) and that of part 2 is O((N+K)·D); in the parallel code, only part 1 has been
parallelized. After running T iterations, the speedup is:

    t_1 / t_p = (O(NDKT) + O((N+K)DT)) / (parallelized part-1 time + O((N+K)DT))

As K grows larger, since N ≫ K and both N and D are fixed, the time consumed by part 2 stays roughly
constant, while the larger K is, the more speedup is earned by part 1. Thus the overall speedup
increases. Figure 10 shows that as K increases, for fixed N, the number of iterations needed to converge
decreases, which drives the parallel running time down at first even as K increases.
K (num of clusters)    Iterations    Sequential (sec)    Parallel (sec)    Speedup
4                      71            63.81               18.04             3.54
16                     51            155.64              15.06             10.33
64                     29            338.55              10.51             32.20
128                    20            463.38              11.06             41.88
256                    16            739.15              12.30             60.11
512                    12            1105.98             13.14             84.14
1024                   10            1842.00             18.98             97.04
2048                   6             2207.78             21.19             104.17
Table 4: Experiment 2. Performance when varying the number of clusters K for fixed data size
Figure 9: Experiment 2. Speedup versus number of clusters K for fixed data size
Figure 10: Experiment 2. Parallel running time and number of iterations versus number of clusters K for
fixed data size
3. Experiment 3: For fixed size 51200*1000, vary the number of clusters with the number of
iterations fixed at 30. We tested K=4, 8, 16, 32, 64, 128, 256, 512. The results are in Table 5 and Figures 11 and 12.
As in Experiment 2, we get an outstanding speedup, and the speedup goes up as K goes up, as shown in
Figure 11. Experiment 3 reveals another scaling property of the code: in Figure 12 we plot log2(K)
against log2(running time). For the sequential case, the plot is very close to a straight line
with slope 1, which coincides with the fact that as K doubles, the sequential running time doubles,
considering the complexity O(NDKT) + O((N+K)DT). For the parallel case, the plot is a curve with
increasing slope that is always less than 1, with a fitted slope of 0.28. This implies that the
available parallelism grows as K grows larger, gaining more speedup.
K (num of clusters)    Sequential (sec)    Parallel (sec)    Speedup
4                      27.82               7.72              3.60
8                      50.03               7.89              6.34
16                     94.50               8.33              11.34
32                     183.56              9.50              19.32
64                     361.92              12.31             29.40
128                    718.31              15.91             45.15
256                    1430.74             21.76             65.76
512                    2858.05             32.47             88.01
Table 5: Experiment 3. Performance when varying K, with the number of iterations fixed at 30
Figure 11: Experiment 3. Speedup versus number of clusters K for fixed data size and fixed iterations
Figure 12: Experiment 3. Rate of growth of the sequential and parallel implementations
3.1 Profiling
nvprof[2] presents an overview of the GPU kernels and memory copies in our program. The summary,
groups all calls to the same kernel together, presenting the total time and percentage of the total application
time for each kernel. In addition to summary mode, nvprof supports GPU-Trace and API-Trace modes that
let you see a complete list of all kernel launches and memory copies, and in the case of API-Trace mode, all
CUDA API calls.
We perform 4 mallocs - one for the input 2D data, one for 2D cluster data, a 1D membership and one 1D
membership changed array. In every iteration, we perform two Device-Host copies for - membership and
membershipChanged and one Host-Device copy, copying the new cluster centers.
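Below is a condensed sketch of the host loop consistent with these per-iteration copies; the names are our placeholders and error checking is omitted:

int delta = 0;
do {
    /* Push the freshly recomputed centers to the DEVICE. */
    cudaMemcpy(deviceClusters, clusters, K * numCoords * sizeof(float),
               cudaMemcpyHostToDevice);

    find_nearest_cluster<<<numClusterBlocks, numThreadsPerClusterBlock>>>(
        numCoords, N, K, deviceObjects, deviceClusters,
        deviceMembership, deviceMembershipChanged);
    cudaDeviceSynchronize();

    /* Pull the results back and count the changes on the HOST. */
    cudaMemcpy(membership, deviceMembership, N * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaMemcpy(membershipChanged, deviceMembershipChanged, N * sizeof(int),
               cudaMemcpyDeviceToHost);
    delta = 0;
    for (int i = 0; i < N; i++) delta += membershipChanged[i];

    /* Incremental centroid update as in Section 1.2 (omitted here). */
} while ((float)delta / N > threshold);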
==180922== NVPROF is profiling process 180922, command: ./test_driver
==180922== Profiling application: ./test_driver
N = 51200
dimension = 1000
k = 128
threshold = 0.0010
Type: Parallel
Computation timing = 11.8963 sec
Loop iterations = 21
==180922== Profiling result:
Time(%)   Time      Calls  Avg       Min       Max       Name
98.99%    4.11982s  21     196.18ms  195.58ms  197.96ms  find_nearest_cluster(int, int, int, float*, float*, int*, int*)
0.98%     40.635ms  23     1.7668ms  30.624us  39.102ms  [CUDA memcpy HtoD]
0.03%     1.2578ms  42     29.946us  28.735us  31.104us  [CUDA memcpy DtoH]
==180922== API calls:
Time(%)   Time      Calls  Avg       Min       Max       Name
93.06%    4.12058s  21     196.22ms  195.62ms  198.00ms  cudaDeviceSynchronize
5.79%     256.47ms  4      64.117ms  4.9510us  255.97ms  cudaMalloc
1.02%     45.072ms  65     693.42us  82.267us  39.230ms  cudaMemcpy
0.06%     2.6048ms  332    7.8450us  528ns     282.45us  cuDeviceGetAttribute
0.03%     1.3272ms  4      331.79us  297.63us  344.04us  cuDeviceTotalMem
0.02%     694.81us  21     33.086us  29.098us  47.856us  cudaLaunch
0.01%     505.33us  3      168.44us  7.3170us  330.65us  cudaFree
0.01%     237.02us  4      59.255us  56.222us  67.280us  cuDeviceGetName
0.00%     119.83us  147    815ns     533ns     12.646us  cudaSetupArgument
0.00%     28.884us  21     1.3750us  1.1180us  1.7890us  cudaConfigureCall
0.00%     16.784us  21     799ns     740ns     922ns     cudaGetLastError
0.00%     4.7380us  8      592ns     532ns     788ns     cuDeviceGet
0.00%     3.9500us  2      1.9750us  895ns     3.0550us  cuDeviceGetCount
4 Conclusion and Future Work
Our analysis shows that we obtain a significant speedup (45x on average) over the sequential execution of
K-Means clustering. In our project, we parallelized only the computation of the nearest cluster. We
optimized the calculation of the new cluster centers by adding each point that changed membership to its
new cluster group and subtracting it from its old one. This approach, as opposed to recalculating the
cluster centers from scratch, saved significant running time, although due to a shortage of time we were not
able to quantify this additional speedup. There is definitely scope for further speedup, especially as the
input dimension increases, if we also parallelize the new cluster center calculation using CUDA.
References
[1] Wei-keng Liao, K-Means Clustering: http://users.eecs.northwestern.edu/~wkliao/Kmeans/index.html,
Northwestern University, 2005
[2] NVPROF: http://docs.nvidia.com/cuda/profiler-users-guide/#axzz4SHzfjCkf
[3] CUDA Occupancy Calculator: https://devtalk.nvidia.com/default/topic/368105/cuda-occupancy-
calculator-helps-pick-optimal-thread-block-size/
[4] Understanding CUDA: https://courses.engr.illinois.edu/ece498al/Syllabus.html

Weitere ähnliche Inhalte

Was ist angesagt?

IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」Preferred Networks
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda enKohei KaiGai
 
第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)RCCSRENKEI
 
Let's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdwLet's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdwJan Holčapek
 
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelPerformance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelKoichi Shirahata
 
20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - EnglishKohei KaiGai
 
20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQLKohei KaiGai
 
20160407_GTC2016_PgSQL_In_Place
20160407_GTC2016_PgSQL_In_Place20160407_GTC2016_PgSQL_In_Place
20160407_GTC2016_PgSQL_In_PlaceKohei KaiGai
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolversinside-BigData.com
 
20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdw20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdwKohei KaiGai
 
Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaFerdinand Jamitzky
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementKyong-Ha Lee
 
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~Kohei KaiGai
 
Porting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsPorting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsIgor Sfiligoi
 
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)Kohei KaiGai
 
20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storageKohei KaiGai
 
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemShuai Yuan
 

Was ist angesagt? (20)

IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」IIBMP2019 講演資料「オープンソースで始める深層学習」
IIBMP2019 講演資料「オープンソースで始める深層学習」
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda en
 
Debugging CUDA applications
Debugging CUDA applicationsDebugging CUDA applications
Debugging CUDA applications
 
Cuda
CudaCuda
Cuda
 
第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)第11回 配信講義 計算科学技術特論A(2021)
第11回 配信講義 計算科学技術特論A(2021)
 
Let's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdwLet's turn your PostgreSQL into columnar store with cstore_fdw
Let's turn your PostgreSQL into columnar store with cstore_fdw
 
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelPerformance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
 
20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English
 
20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL
 
20160407_GTC2016_PgSQL_In_Place
20160407_GTC2016_PgSQL_In_Place20160407_GTC2016_PgSQL_In_Place
20160407_GTC2016_PgSQL_In_Place
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolvers
 
20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdw20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdw
 
Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cuda
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvement
 
Lrz kurs: big data analysis
Lrz kurs: big data analysisLrz kurs: big data analysis
Lrz kurs: big data analysis
 
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~
 
Porting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUsPorting and optimizing UniFrac for GPUs
Porting and optimizing UniFrac for GPUs
 
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
 
20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage
 
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
 

Andere mochten auch

Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...
Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...
Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...Igor Korkin
 
Implementazione di un vincolo table su un CSP solver GPU-based
Implementazione di un vincolo table su un CSP solver GPU-basedImplementazione di un vincolo table su un CSP solver GPU-based
Implementazione di un vincolo table su un CSP solver GPU-basedTommaso Campari
 
Schulung: Einführung in das GPU-Computing mit NVIDIA CUDA
Schulung: Einführung in das GPU-Computing mit NVIDIA CUDASchulung: Einführung in das GPU-Computing mit NVIDIA CUDA
Schulung: Einführung in das GPU-Computing mit NVIDIA CUDAJörn Dinkla
 
GPU, CUDA, OpenCL and OpenACC for Parallel Applications
GPU, CUDA, OpenCL and OpenACC for Parallel ApplicationsGPU, CUDA, OpenCL and OpenACC for Parallel Applications
GPU, CUDA, OpenCL and OpenACC for Parallel ApplicationsMarcos Gonzalez
 
PL/CUDA - GPU Accelerated In-Database Analytics
PL/CUDA - GPU Accelerated In-Database AnalyticsPL/CUDA - GPU Accelerated In-Database Analytics
PL/CUDA - GPU Accelerated In-Database AnalyticsKohei KaiGai
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduceVarad Meru
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce AlgorithmsAmund Tveit
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...Titus Damaiyanti
 
Social Media Basics & Application (for Indexers)
Social Media Basics & Application (for Indexers)Social Media Basics & Application (for Indexers)
Social Media Basics & Application (for Indexers)Sara Truscott
 
Grm 201 project
Grm 201 projectGrm 201 project
Grm 201 projectnmjameson
 
Intro to developing for @twitterapi (updated)
Intro to developing for @twitterapi (updated)Intro to developing for @twitterapi (updated)
Intro to developing for @twitterapi (updated)Raffi Krikorian
 
China's Younger Architects 2014
China's Younger Architects 2014China's Younger Architects 2014
China's Younger Architects 2014Joe Carter
 
Engaging Teens through Sprite Digital Campaign - Teen Till I Die
Engaging Teens through Sprite Digital Campaign - Teen Till I DieEngaging Teens through Sprite Digital Campaign - Teen Till I Die
Engaging Teens through Sprite Digital Campaign - Teen Till I DieNitin Karkara
 
Inspección de flores, etiquetas y facturas.
Inspección de flores, etiquetas y facturas. Inspección de flores, etiquetas y facturas.
Inspección de flores, etiquetas y facturas. ProColombia
 

Andere mochten auch (19)

Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...
Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...
Acceleration of Statistical Detection of Zero-day Malware in the Memory Dump ...
 
Implementazione di un vincolo table su un CSP solver GPU-based
Implementazione di un vincolo table su un CSP solver GPU-basedImplementazione di un vincolo table su un CSP solver GPU-based
Implementazione di un vincolo table su un CSP solver GPU-based
 
Schulung: Einführung in das GPU-Computing mit NVIDIA CUDA
Schulung: Einführung in das GPU-Computing mit NVIDIA CUDASchulung: Einführung in das GPU-Computing mit NVIDIA CUDA
Schulung: Einführung in das GPU-Computing mit NVIDIA CUDA
 
CUDA-Aware MPI
CUDA-Aware MPICUDA-Aware MPI
CUDA-Aware MPI
 
GPU, CUDA, OpenCL and OpenACC for Parallel Applications
GPU, CUDA, OpenCL and OpenACC for Parallel ApplicationsGPU, CUDA, OpenCL and OpenACC for Parallel Applications
GPU, CUDA, OpenCL and OpenACC for Parallel Applications
 
PL/CUDA - GPU Accelerated In-Database Analytics
PL/CUDA - GPU Accelerated In-Database AnalyticsPL/CUDA - GPU Accelerated In-Database Analytics
PL/CUDA - GPU Accelerated In-Database Analytics
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
 
Parallel-kmeans
Parallel-kmeansParallel-kmeans
Parallel-kmeans
 
Mapreduce Algorithms
Mapreduce AlgorithmsMapreduce Algorithms
Mapreduce Algorithms
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
MapReduce in Simple Terms
MapReduce in Simple TermsMapReduce in Simple Terms
MapReduce in Simple Terms
 
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
 
Social Media Basics & Application (for Indexers)
Social Media Basics & Application (for Indexers)Social Media Basics & Application (for Indexers)
Social Media Basics & Application (for Indexers)
 
Grm 201 project
Grm 201 projectGrm 201 project
Grm 201 project
 
Intro to developing for @twitterapi (updated)
Intro to developing for @twitterapi (updated)Intro to developing for @twitterapi (updated)
Intro to developing for @twitterapi (updated)
 
China's Younger Architects 2014
China's Younger Architects 2014China's Younger Architects 2014
China's Younger Architects 2014
 
Engaging Teens through Sprite Digital Campaign - Teen Till I Die
Engaging Teens through Sprite Digital Campaign - Teen Till I DieEngaging Teens through Sprite Digital Campaign - Teen Till I Die
Engaging Teens through Sprite Digital Campaign - Teen Till I Die
 
Inspección de flores, etiquetas y facturas.
Inspección de flores, etiquetas y facturas. Inspección de flores, etiquetas y facturas.
Inspección de flores, etiquetas y facturas.
 
Ank 48
Ank 48Ank 48
Ank 48
 

Ähnlich wie Parallel Implementation of K Means Clustering on CUDA

A Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelA Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelJenny Liu
 
Parallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationParallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationGeoffrey Fox
 
CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learningAmgad Muhammad
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
AI optimizing HPC simulations (presentation from  6th EULAG Workshop)AI optimizing HPC simulations (presentation from  6th EULAG Workshop)
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)byteLAKE
 
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCAccelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCinside-BigData.com
 
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel ComputingIRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel ComputingIRJET Journal
 
Efficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda cEfficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda ccsandit
 
Efficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda cEfficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda ccsandit
 
Paper id 25201467
Paper id 25201467Paper id 25201467
Paper id 25201467IJRAT
 
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...David Walker
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaRob Gillen
 
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETSFAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETScsandit
 
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...RSIS International
 
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...NVIDIA Taiwan
 

Ähnlich wie Parallel Implementation of K Means Clustering on CUDA (20)

A Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in ParallelA Tale of Data Pattern Discovery in Parallel
A Tale of Data Pattern Discovery in Parallel
 
Parallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel applicationParallel Computing 2007: Bring your own parallel application
Parallel Computing 2007: Bring your own parallel application
 
CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learning
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
An35225228
An35225228An35225228
An35225228
 
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
AI optimizing HPC simulations (presentation from  6th EULAG Workshop)AI optimizing HPC simulations (presentation from  6th EULAG Workshop)
AI optimizing HPC simulations (presentation from 6th EULAG Workshop)
 
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCAccelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
 
Ultra Fast SOM using CUDA
Ultra Fast SOM using CUDAUltra Fast SOM using CUDA
Ultra Fast SOM using CUDA
 
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel ComputingIRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
IRJET- Performance Analysis of RSA Algorithm with CUDA Parallel Computing
 
Project PPT
Project PPTProject PPT
Project PPT
 
Efficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda cEfficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda c
 
Efficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda cEfficient algorithm for rsa text encryption using cuda c
Efficient algorithm for rsa text encryption using cuda c
 
Paper id 25201467
Paper id 25201467Paper id 25201467
Paper id 25201467
 
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with Cuda
 
icwet1097
icwet1097icwet1097
icwet1097
 
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETSFAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
FAST ALGORITHMS FOR UNSUPERVISED LEARNING IN LARGE DATA SETS
 
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
Hardware Implementations of RS Decoding Algorithm for Multi-Gb/s Communicatio...
 
Data miningpresentation
Data miningpresentationData miningpresentation
Data miningpresentation
 
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
 

Kürzlich hochgeladen

Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGSIVASHANKAR N
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxfenichawla
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 

Kürzlich hochgeladen (20)

Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 

Parallel Implementation of K Means Clustering on CUDA

  • 1. CS 240A - Parallel Implementation of K Means Clustering on CUDA Lan Liu, Pritha D N December 9, 2016 Abstract K-Means clustering is a popular clustering algorithm in data mining. Clustering large data sets can be time consuming, and in an attempt to minimize this time, our project is a parallel implementation of K- Means clustering algorithm on CUDA using C. We present the performance analysis and implementation of our approach to parallelizing K-Means clustering. 1 K-Means Clustering In this section, we provide an overview of K-Means clustering, the mathematical description and the sequential algorithm has been presented, and the complexity of sequential code has been analyzed. 1.1 Description K-Means clustering is one of the most widely used clustering method used in data mining, it aims to partition N given data points into K clusters, in which each cluster has the most similarity, namely, each data point belongs to the cluster with the nearest mean. Assume there are N data points in d dimension, (x1, x2, ...., xn) ⊂ Rd, and we need to classify them into K clusters S = (S1, S2, ..., SK ), where K is generally fixed apriori. Define the center of each cluster Si as the mean of all data points x ∈ Si, i.e. µi x∈Si x |Si | , where |Si | denotes size of Si The goal of clustering is to minimize the total Euclidean distance from each data point to its cluster center, i.e. find clustering S to minimize the objective function: cost(S) = k i=1 x∈Si ||x − µi ||2 Finding the global minimum of the objective function is computationally challenging (NP-hard). The commonly used algorithm is really a heuristic which can find a local minimum instead of global minimum, and the trade off is that computational cost is cheaper. The commonly used k-means algorithm is as follows: Step1: Initialize cluster centroids µ1, µ2, ..., µk ∈ Rn randomly. Step2: Repeat until convergence:{ Assignment step: In order to assign each data point xi to the nearest cluster Si, define ci as membership index of xi, ci := argmink j=1||xi − µj ||2 Update centroids step: For each j, update µj := N i=1 1ci =j xi N i=1 1ci =j } Since generally the clustering problem is not convex, there might be a lot of local minimum. For any fixed initial condition, It can be easily proved that cost(S) decreases for every iteration step, and thus the 1
  • 2. algorithm converges to a unique local minimum depending on which initial condition is given. For example, set x = (2, 4, 5, 6), if initial centers are µ1 = 2, µ2 = 5, then S1 = (2), S2 = (4, 5, 6), if initial centers are µ1 = 4, µ2 = 5, then S1 = (2, 4), S2 = (5, 6), the first is global minimal with cost=2, and the second is a saddle with cost=2.5. Consider this, people might generate a couple of result using random initial centers, and choose the best with smallest objective function. And this also drives us to apply parallel computing to save running time. 1.2 Algorithm Based on the description, we apply the following heuristic K Means clustering sequential algorithm, this algorithm is based on the Professor Wei-keng Liao’s k-means clustering code [1]. We have modified the code about how to recompute the new cluster center. In that code it sumerizes all data points in the new cluster to compute center in every iteration step, considering that only a portion of data changes membership (and being less and less as iteration goes), we instead treat the changing data by adding the data into new cluster and removing it from old cluster, this will be more efficient and the result also verifies this. Step 1: Pick the first K data points as initial cluster centers. Step2: Attribute each data point to the nearest cluster. Step3: For each reassigned data point, membership change increase by 1. Step4: Set the position of each new cluster to be the mean of all data points belonging to that cluster. Step 5: Repeat steps 2-4 until convergence. The pseudo code is as follows: Let N be the number of data points, K be the number of clusters. data[N]: the array of data objects center[K]: the array of cluster centers membership[N]: the array of data point membership. clustersum[K]: the sum of data points in Kth cluster. clustersize[K]: the size of Kth cluster. δ: count the number of membership change. threshold: critical value to define stop condition, we set it be 0.001. for i from 0 to K-1 center[i] ←− data[i] do{ δ ←− 0 for i from 0 to N − 1 mindis=||data[i]-center[0]|| for j from 1 to K − 1 distance=||data[i]-center[j]|| if distance<mindis mindis←−distance index←−j if first iteration δ ←−N membership[i]←−index clustersize[index]←−clustersize[index]+1 clustersum[index]←−clustersum[index]+data[i] else if membership[i] index δ = δ + 1 clustersize[index]←−clustersize[index]+1 clustersize[membership[i]]←−clustersize[membership[i]]-1 clustersum[index]←−clustersum[index]+data[i] clustersum[membership[i]]←−clustersum[membership[i]]-data[i] membership[i]←−index for j from 0 to K − 1 center[j]←−clustersum[j]/clustersize[j] } While(δ/N>threshold) 2
Note: the stop condition is δ/N < threshold, i.e., the algorithm stops once fewer than 1‰ (0.1%) of all data points change membership in an iteration.

Complexity analysis: The sequential code has complexity O(TKND), where T is the number of iterations, K the number of clusters, N the number of data points, and D the dimension of each data point. The code has two main parts. The first reassigns each data point to the nearest center, which requires computing the distance between every data point and every cluster center; its cost is O(NKD) per iteration. The second computes the center of each new cluster after the reassignment; it amounts to computing K group means over N data points and costs O((N + K)D) per iteration. Part 1 clearly dominates the running time; for example, with N = 51200, D = 1000, K = 128 it performs on the order of 6.6 × 10^9 distance-coordinate operations per iteration, versus roughly 5 × 10^7 for part 2. Since part 1 treats each data point independently, it is the natural target for parallelization. The parallel platform could be MPI, Cilk, OpenMP, or CUDA; we use CUDA on Comet because of the high efficiency of GPUs in processing large-scale data.

2 Parallelization of K-Means Using CUDA

In this section we first introduce CUDA and the GPU nodes on Comet, then discuss our parallelization strategy and CUDA implementation on Comet, and finally use the CUDA Occupancy Calculator to determine the optimal number of threads per block.

2.1 CUDA

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU). The GPU, as a specialized processor, addresses the demands of real-time, high-resolution 3D graphics and other compute-intensive tasks. GPUs have evolved into highly parallel multi-core systems that manipulate large blocks of data very efficiently. In the computer game industry, GPUs are used for graphics rendering and for game physics calculations (physical effects such as debris, smoke, fire, and fluids); examples include PhysX and Bullet. CUDA has also been used to accelerate non-graphical applications in computational biology, cryptography, and other fields by an order of magnitude or more. CUDA works with all Nvidia GPUs from the G8x series onwards, including the GeForce, Quadro, and Tesla lines.

2.1.1 NVIDIA GPUs on Comet

The Comet infrastructure provides 36 GPU nodes containing NVIDIA K80 GPUs, a series popularly known as NVIDIA's Tesla GPUs.

    GPUs                2 NVIDIA K80
    Cores per socket    12
    Sockets             2
    Clock speed         2.5 GHz
    Memory capacity     128 GB DDR4 DRAM
    Memory bandwidth    120 GB/s
    Flash memory        320 GB

Table 1: GPU node on Comet

Figure 1: Left: how a CPU processes data; right: how a GPU processes data. On a CPU, each of p cores processes n/p data points; on a GPU, each thread gets access to one single data point.
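To make the thread-per-data-point model of Figure 1 concrete, here is a minimal illustrative CUDA fragment (our own sketch; the names are not from the project code):

    __global__ void process_points(const float *data, float *out, int n)
    {
        /* Global thread index: each thread owns exactly one data point. */
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        if (tid < n)                       /* guard: the last block may be partial */
            out[tid] = 2.0f * data[tid];   /* placeholder per-point work */
    }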
Nvidia Tesla is Nvidia's brand name for its products targeting stream processing and general-purpose GPU computing. Tesla was Nvidia's first microarchitecture to implement unified shaders. It was used in the GeForce 8, GeForce 9, GeForce 100, GeForce 200, and GeForce 300 series of GPUs, manufactured in 90 nm, 80 nm, 65 nm, and 55 nm processes. It also found use in the GeForce 405 and, in the workstation market, in the Quadro FX, Quadro x000, and Quadro NVS series and the Nvidia Tesla computing modules. With their very high computational power (measured in floating-point operations per second, or FLOPS) compared to microprocessors, the Tesla products target the high-performance computing market. The physical limits for GPU Compute Capability 3.7 (Tesla K80) are given in Table 2.

    Threads per Warp                            32
    Max Warps per Multiprocessor                64
    Max Thread Blocks per Multiprocessor        16
    Max Threads per Multiprocessor              2048
    Maximum Thread Block Size                   1024
    Registers per Multiprocessor                131072
    Max Registers per Thread Block              65536
    Max Registers per Thread                    255
    Shared Memory per Multiprocessor (bytes)    114688
    Max Shared Memory per Block                 49152
    Register allocation unit size               256
    Register allocation granularity             warp
    Shared Memory allocation unit size          256
    Warp allocation granularity                 4

Table 2: Physical limits for GPU Compute Capability 3.7 (Tesla K80)

Figure 2: Thread Organization in CUDA

2.2 Parallelization of K-Means Clustering on CUDA

CUDA uses the CPU as its HOST and the GPU as its DEVICE. Each GPU node gives access to thousands of threads, and each thread processes one single data point. Threads are grouped into blocks, and shared memory is restricted to each block. HOST and DEVICE do not share memory, so under this configuration we must manually communicate messages between HOST and DEVICE. As explained at the end of Section 1.2, we aim to parallelize the reassignment step, which computes the distance between each data point and each cluster center. The logic and order of the parallel algorithm are exactly the same as in the original sequential algorithm, but we must account for the HOST-DEVICE communication:

Step 0: HOST initializes the cluster centers and copies the N data coordinates to DEVICE.
Step 1: DEVICE copies the data memberships and K cluster centers from HOST.
Step 2: On DEVICE, each thread processes a single data point: it computes the distance to each cluster center and updates the data point's membership. The global thread index is tid = blockDim.x * blockIdx.x + threadIdx.x.
Step 3: HOST copies the new data memberships from DEVICE and recomputes the cluster centers.
Step 4: Repeat steps 1-3 until convergence, then go to step 5.
Step 5: HOST frees the allocated memory.

There are several crucial points in the parallel code. First, it is easier for GPU threads to handle a 1D array than a 2D array, so we convert the N data points (in D dimensions) from a 2D array to a 1D array on the HOST and then send it to DEVICE, i.e., DEVICEdata[i * numCoordinates + j] = HOSTdata[i][j] (the jth coordinate of the ith data point); a sketch of this flattening follows.
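The following is a minimal sketch of the flattening and transfer, with illustrative names (hostData, deviceObjects) that are our own:

    /* Flatten the HOST 2D data into a 1D row-major array, then copy it to
       the DEVICE once at startup; numObjs = N, numCoords = D. */
    size_t bytes = (size_t)numObjs * numCoords * sizeof(float);
    float *hostFlat = (float *)malloc(bytes);
    for (int i = 0; i < numObjs; i++)
        for (int j = 0; j < numCoords; j++)
            hostFlat[i * numCoords + j] = hostData[i][j];  /* jth coord of ith point */

    float *deviceObjects;
    cudaMalloc((void **)&deviceObjects, bytes);
    cudaMemcpy(deviceObjects, hostFlat, bytes, cudaMemcpyHostToDevice);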
Secondly, since different blocks do not share memory, the per-block counts of membership changes must be reduced to obtain the total number of membership changes. In our implementation we set:

    numThreadsPerClusterBlock = 128
    numClusterBlocks = (N + numThreadsPerClusterBlock - 1) / numThreadsPerClusterBlock

The correctness of the parallel algorithm is guaranteed in the sense that it produces the same clustering as the original sequential k-means algorithm: our implementation performs the same steps as the sequential code, in parallel, without changing the logic, so identical results are expected.

2.3 Determining the Optimal Number of Threads per Block

We used the CUDA Occupancy Calculator provided by NVIDIA to determine the optimal number of threads per block.

2.3.1 Code Analysis

To determine the per-thread resource usage of our nearest-cluster kernel, we compiled the code with nvcc's ptxas option. The compilation output is:

    nvcc -g -pg -I. -DBLOCK_SHARED_MEM_OPTIMIZATION=0 --ptxas-options=-v -o cuda_kmeans.o -c cuda_kmeans.cu
    ptxas info : 0 bytes gmem
    ptxas info : Compiling entry function '_Z20find_nearest_clusteriiiPfS_PiS0_' for 'sm_20'
    ptxas info : Function properties for _Z20find_nearest_clusteriiiPfS_PiS0_
        0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    ptxas info : Used 18 registers, 80 bytes cmem[0]

2.3.2 CUDA Occupancy Calculator

The CUDA Occupancy Calculator [3] computes the multiprocessor occupancy of a GPU for a given CUDA kernel. The multiprocessor occupancy is the ratio of active warps to the maximum number of warps supported on a multiprocessor of the GPU. Each multiprocessor on the device has a set of N registers available for use by CUDA thread programs. These registers are a shared resource allocated among the thread blocks executing on a multiprocessor. The CUDA compiler attempts to minimize register usage so as to maximize the number of thread blocks that can be active in the machine simultaneously. If a program tries to launch a kernel for which the registers used per thread times the thread block size exceeds N, the launch will fail. Maximizing occupancy can help cover latency during global memory loads that are followed by a __syncthreads(). Occupancy is determined by the amount of shared memory and the number of registers used by each thread block, so programmers need to choose the thread block size with care; the Occupancy Calculator assists in this choice based on shared memory and register requirements.

For any input size, our program uses no (static) shared memory. We ran the CUDA Occupancy Calculator with 128 threads per block. From the disassembly above, our kernel function requires 18 registers per thread. As input to the calculator we provide: Compute Capability 3.7 (GK210, K80), 128 threads per block, shared memory size 112 KB (for Compute Capability 3.7), and 18 registers per thread. The results are shown in the figures below; 128 threads per block gave maximum occupancy for our program.
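As a rough hand check of the calculator's result (our own arithmetic, applying the allocation rules of Table 2; the calculator's figures below are authoritative):

    warps per block        = 128 / 32 = 4
    registers per warp     = ceil((18 × 32) / 256) × 256 = 768
    register-limited warps = floor(131072 / 768) = 170   (not the binding limit)
    block-limited warps    = 16 blocks × 4 warps = 64

The binding limits are the 16 thread blocks and 64 warps per multiprocessor, giving 64 active warps out of a maximum of 64, i.e., 100% occupancy.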
Figure 3: Input to CUDA Occupancy Calculator

Figure 4: Output of CUDA Occupancy Calculator

Figure 5: Impact of varying Block Size

Figure 6: Impact of varying Register Count Per Thread

Figure 7: Impact of varying Shared Memory Usage Per Block
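Before turning to the performance results, we give a sketch of the nearest-cluster kernel and its launch configuration in the style of Sections 2.2-2.3. The signature matches the mangled entry name shown in Section 2.3.1 (find_nearest_cluster taking three ints, two float pointers, and two int pointers); the body and the block-level reduction are our hedged reconstruction rather than the verbatim project source, and the reduction uses dynamically sized shared memory requested at launch, which ptxas does not report as static shared memory usage.

    __global__ void find_nearest_cluster(int numCoords, int numObjs, int numClusters,
                                         float *objects,      /* [numObjs * numCoords], flattened */
                                         float *clusters,     /* [numClusters * numCoords] */
                                         int *membership,     /* [numObjs] */
                                         int *intermediates)  /* [numClusterBlocks] per-block change counts */
    {
        extern __shared__ int changed[];   /* dynamic shared memory: one int per thread */
        int tid = blockDim.x * blockIdx.x + threadIdx.x;

        changed[threadIdx.x] = 0;
        if (tid < numObjs) {
            int index = 0;
            float minDist = 3.402823466e+38f;   /* FLT_MAX */
            for (int j = 0; j < numClusters; j++) {
                float dist = 0.0f;
                for (int d = 0; d < numCoords; d++) {
                    float diff = objects[tid * numCoords + d] - clusters[j * numCoords + d];
                    dist += diff * diff;        /* squared Euclidean distance */
                }
                if (dist < minDist) { minDist = dist; index = j; }
            }
            if (membership[tid] != index) {     /* this data point changed cluster */
                membership[tid] = index;
                changed[threadIdx.x] = 1;
            }
        }

        /* Blocks cannot read each other's shared memory, so each block
           tree-reduces its flags to a single count (blockDim.x must be a
           power of two, which holds for 128); the HOST then sums the
           per-block counts to obtain delta. */
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                changed[threadIdx.x] += changed[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            intermediates[blockIdx.x] = changed[0];
    }

    /* HOST-side launch configuration, matching Section 2.2: */
    int numThreadsPerClusterBlock = 128;
    int numClusterBlocks =
        (numObjs + numThreadsPerClusterBlock - 1) / numThreadsPerClusterBlock;
    size_t sharedMemSize = numThreadsPerClusterBlock * sizeof(int);
    find_nearest_cluster<<<numClusterBlocks, numThreadsPerClusterBlock, sharedMemSize>>>(
        numCoords, numObjs, numClusters,
        deviceObjects, deviceClusters, deviceMembership, deviceIntermediates);
    cudaDeviceSynchronize();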
3 Parallel Performance Analysis

In this section we present several experiments that reveal the parallel performance.

1. Experiment 1: Vary the size of the data set N, with the number of clusters fixed at K = 128 and data dimension 1000. The results are in Table 3 and Figure 8. With K = 128 fixed, as the size of the data set N increases, the parallel speedup is around 40 and grows gradually. The GPU memory capacity on Comet is 11 GB, and we tested an 8 GB data set (2048000 × 1000): the parallel running time is 6175 sec (about 1.7 hours), while the sequential code is too slow to time; we expect it to take around 3 days. The parallel code vastly outperforms the sequential code.

    Size (float values)   Sequential (sec)   Parallel (sec)   Speedup
    51200 × 1000          463.09             11.73            39.5
    76800 × 1000          857.73             19.25            44.55
    89600 × 1000          1182.43            24.82            47.6
    115200 × 1000         1676.96            35.22            47.6
    128000 × 1000         1794.91            41.23            43.53
    512000 × 1000         > 4 hrs            405.56           N/A
    2048000 × 1000        -                  6174.72          -

Table 3: Experiment 1. Parallel performance when varying data size, with K = 128 fixed

Figure 8: Experiment 1. Parallel speedup versus size of the data set, K = 128

2. Experiment 2: For fixed size 51200 × 1000, vary the number of clusters, using δ/N < 0.001 as the stop condition. We tested K = 4, 16, 64, 128, 256, 512, 1024, 2048; the results are in Table 4 and Figures 9 and 10. Figure 9 shows that the parallel speedup keeps increasing as K increases while the slope of the curve decreases. This matches our expectation. In each iteration, the sequential cost of part 1 is O(NDK) and that of part 2 is O((N + K)D); in the parallel code only part 1 has been parallelized. After running T iterations, the speedup is

    t1 / tp = ( O(NDKT) + O((N + K)DT) ) / ( O(NDKT)_parallelized + O((N + K)DT) )

where the first term in the denominator is the parallelized cost of part 1. As K becomes larger (with N ≫ K and N, D both fixed), the cost of part 2 stays nearly constant, while a larger K puts more of the total work into the parallelized part 1, so the overall speedup increases. Figure 10 shows that, for fixed N, the number of iterations needed to converge decreases as K increases; this drives the parallel running time down at first even as K increases.
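A back-of-the-envelope consequence of this formula (our simplification, not a measurement from the report): suppose the parallelized part 1 runs s times faster on the GPU while part 2 remains sequential. Dropping the O-constants,

    t1 / tp = (NDK + (N + K)D) / (NDK/s + (N + K)D)  →  s(N + 1)/(N + s) ≈ s  as K → ∞ (for s ≪ N)

so the speedup climbs toward the kernel's own speedup factor s, with diminishing returns from each further increase of K; this is exactly the rising, flattening curve of Figure 9.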
    K (clusters)   Iterations   Sequential (sec)   Parallel (sec)   Speedup
    4              71           63.81              18.04            3.54
    16             51           155.64             15.06            10.33
    64             29           338.55             10.51            32.20
    128            20           463.38             11.06            41.88
    256            16           739.15             12.30            60.11
    512            12           1105.98            13.14            84.14
    1024           10           1842.00            18.98            97.04
    2048           6            2207.78            21.19            104.17

Table 4: Experiment 2. Performance when varying the number of clusters K for fixed data size

Figure 9: Experiment 2. Speedup versus number of clusters K for fixed data size

Figure 10: Experiment 2. Parallel running time and number of iterations versus number of clusters K for fixed data size
3. Experiment 3: For fixed size 51200 × 1000, vary the number of clusters with the number of iterations fixed at 30. We tested K = 4, 8, 16, 32, 64, 128, 256, 512; the results are in Table 5 and Figures 11 and 12. As in Experiment 2, we get an outstanding speedup that grows with K, as shown in Figure 11. Experiment 3 also reveals another scaling property of the code. In Figure 12 we plot log2(K) against log2(running time). For the sequential case the plot is very close to a straight line with slope 1, which matches the fact that doubling K doubles the sequential running time, given the complexity O(NDKT) + O((N + K)DT). For the parallel case the plot is a curve whose slope increases but always stays below 1, with a fitted slope of 0.28 (a slope of s on this log-log plot corresponds to running time growing like K^s). This implies that the available parallelism grows as K grows, yielding more speedup.

    K (clusters)   Sequential (sec)   Parallel (sec)   Speedup
    4              27.82              7.72             3.60
    8              50.03              7.89             6.34
    16             94.50              8.33             11.34
    32             183.56             9.50             19.32
    64             361.92             12.31            29.40
    128            718.31             15.91            45.15
    256            1430.74            21.76            65.76
    512            2858.05            32.47            88.01

Table 5: Experiment 3. Performance when varying K, with the number of iterations fixed at 30

Figure 11: Experiment 3. Speedup versus number of clusters K for fixed data size and fixed iterations

Figure 12: Experiment 3. Rate of growth of the sequential and parallel implementations
3.1 Profiling

nvprof [2] presents an overview of the GPU kernels and memory copies in our program. Its summary mode groups all calls to the same kernel together, presenting the total time and the percentage of total application time for each kernel. In addition to summary mode, nvprof supports GPU-Trace and API-Trace modes that give a complete list of all kernel launches and memory copies and, in the case of API-Trace mode, all CUDA API calls. We perform four mallocs: one for the input 2D data, one for the 2D cluster data, one for the 1D membership array, and one for the 1D membership-changed array. In every iteration we perform two Device-to-Host copies (membership and membershipChanged) and one Host-to-Device copy (the new cluster centers).

    ==180922== NVPROF is profiling process 180922, command: ./test_driver
    ==180922== Profiling application: ./test_driver
    N = 51200, dimension = 1000, k = 128, threshold = 0.0010
    Type: Parallel
    Computation timing = 11.8963 sec
    Loop iterations = 21
    ==180922== Profiling result:
    Time(%)   Time      Calls   Avg       Min       Max       Name
    98.99%    4.11982s  21      196.18ms  195.58ms  197.96ms  find_nearest_cluster(int, int, int, float*, float*, int*, int*)
    0.98%     40.635ms  23      1.7668ms  30.624us  39.102ms  [CUDA memcpy HtoD]
    0.03%     1.2578ms  42      29.946us  28.735us  31.104us  [CUDA memcpy DtoH]
    ==180922== API calls:
    Time(%)   Time      Calls   Avg       Min       Max       Name
    93.06%    4.12058s  21      196.22ms  195.62ms  198.00ms  cudaDeviceSynchronize
    5.79%     256.47ms  4       64.117ms  4.9510us  255.97ms  cudaMalloc
    1.02%     45.072ms  65      693.42us  82.267us  39.230ms  cudaMemcpy
    0.06%     2.6048ms  332     7.8450us  528ns     282.45us  cuDeviceGetAttribute
    0.03%     1.3272ms  4       331.79us  297.63us  344.04us  cuDeviceTotalMem
    0.02%     694.81us  21      33.086us  29.098us  47.856us  cudaLaunch
    0.01%     505.33us  3       168.44us  7.3170us  330.65us  cudaFree
    0.01%     237.02us  4       59.255us  56.222us  67.280us  cuDeviceGetName
    0.00%     119.83us  147     815ns     533ns     12.646us  cudaSetupArgument
    0.00%     28.884us  21      1.3750us  1.1180us  1.7890us  cudaConfigureCall
    0.00%     16.784us  21      799ns     740ns     922ns     cudaGetLastError
    0.00%     4.7380us  8       592ns     532ns     788ns     cuDeviceGet
    0.00%     3.9500us  2       1.9750us  895ns     3.0550us  cuDeviceGetCount

4 Conclusion and Future Work

Our analysis shows that we obtain a significant speedup (45× on average) over the sequential execution of K-Means clustering. In this project we parallelized only the nearest-cluster computation. We also optimized the calculation of the new cluster centers by adding points that changed membership to their new cluster's accumulator and subtracting them from their old cluster's; compared with recalculating the cluster centers from scratch, this saved significant running time, although due to shortage of time we were not able to quantify this additional speedup. There is definitely scope for further speedup, especially as the input dimension increases, if the new-cluster-center calculation is also parallelized using CUDA.
References

[1] Wei-keng Liao. KMeans Algorithm. Northwestern University, 2005. http://users.eecs.northwestern.edu/~wkliao/Kmeans/index.html

[2] NVIDIA. Profiler User's Guide (nvprof). http://docs.nvidia.com/cuda/profiler-users-guide/#axzz4SHzfjCkf

[3] NVIDIA. CUDA Occupancy Calculator. https://devtalk.nvidia.com/default/topic/368105/cuda-occupancy-calculator-helps-pick-optimal-thread-block-size/

[4] Understanding CUDA. https://courses.engr.illinois.edu/ece498al/Syllabus.html