Parallel Clustering on GPU's


The applications in science are creating huge amount of data sets. These data sets need to be classified into subsets in order to draw some meaningful conclusions. Data clustering is the statistical analysis process that groups similar objects into relatively homogeneous sets which are called clusters. The computational demands of data clustering grow rapidly. And it is very time consuming for single CPU to processing large data sets. To address this computational demands, we investigated a CUDA based high performance solution to two data clustering algorithms: K-means and C-means. The k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. Fuzzy C-means is an algorithm of clustering that allows one element to belong to two or more clusters with different probabilities.

For this project, we propose a low cost, user friendly, high performance solution that used to perform parallel clustering on GPU's. Two data clustering algorithms, Kmeans and Cmeans were implemented with our proposed solution. The optimization issues of our CUDA based implementation for these two clustering algorithms, such as tiling of input and intermediate data, and splitting of computational kernel, are studied. The illustration on how to implement the global reduction primitives, and how to apply them to Kmeans were given. The performance of sequential, single GPU, and multiple GPU implementations of the two data clustering algorithms are compared in detail.

The parallel clustering on GPUs project consists of three subprojects, GlobalReductions, Cmeans, and Runtime tool. The timeline and associated pdf reports are as follows:

1. GlobalReductions (Jan/2013~now) (PDF)
2. Cmeans (May/2012~Sep/2012) (PDF) (DataSet)
3. Runtime (Aug/2012~Dec/2012) (PDF) (PPT)

Intellectual Merit

We expect to develop a low cost, user friendly, high performance framework for parallel clustering on GPU's. In programmability side, it provided the developers with a unique global reduction programming interface that hide implementation and optimization details. In performance side, it showed the much better performance for Kmeans and Cmeans as compared with corresponding multi-core CPU implementation; and it showed scalable performance as opposed to single GPU results. This research work is expected to aware the neccessary and advantages of providing global reductions primitives for parallel clustering computation.

The source code of Cmeans and Runtime tool project are archived in Github at:
Cmeans Code:
Panda V0.3: (Aug2012~Dec2012)
Panda V0.4: (Jan2013~April2013)

Broader Impact

Three PhD students working on this project will make usage of high performance infrastructure on Delta cluster on FutureGrid. The research report will be archived as FG document and project code will be opened to FG users as well. In addition, some experience of this project have been written as tutorial pages in FutureGrid portal.

Use of FutureGrid

We will use FutureGrid to run experiments on multiple nodes of the Delta system. The software stack of our program including: CUDA, Pthread, OpenMP, MPI. Some middle size input data will be stored in file system on Detla nodes.

Scale Of Use

between 1~10 nodes on Delta cluster.



Here are some performance results of Cmeans, GlobalReductions, and Runtime tool projects.

Figure 1: Speedup of MPI/OpenMP implmenetation of C-means on multiple GPUs.

Figure 1 shows the speedup of MPI/OpenMP/CUDA implementation of C-means for 7 million events using up to 18 GPU cards (9nodes with 2 cards each) on GPU cluster. The kernel speedup is cacluated by only measuring the GPU kernel overhead, while overall speedup is caculated by measuring GPU kernel, CPU overhead, and memcpy between device and host memory. As expected, the kernel speedup is higher than overall speedup which contains overhead in sequetnail component. In addition, as showed in Figure 1, there is big performance fluctuation for different number of GPU nodes due to the memory coalesced issue related with input granularity.

Figure 2: performance of Kmeans with different runtime technologies.

We evaluated performance of Kmeans application with GlobalReduction method and different runtime technologies including mpi, hadoop and mahout on four nodes on Delta cluster. The results indicate that mpi-cuda implementation can give a speedup of 14 over mpi-openmp for large data sets. And hadoop-cuda is 1.15x and 1.04x faster than hadoop-openmp and hadoop-java respectively. The hadoop-cuda didn’t have much performance improvement because it has to load data from disk to memory and then to gpu device memory during each iterations, while the mpi implementation can cache the static data in device memory during each iterations. The results also showed that the standard implementation mahout is 1.76x slower than our hadoop implementation. This is because our Hadoop implementation uses much coarse granularity task, and it can get performance improvement by leveraging the local reduction, while mahout implementation uses much finer granularity for each map task, which trigger larger communication overhead during shuffle stage. The results also indicate that panda-cuda implementation is 132.13 times faster than Mahout, but is 2.37 times slower than mpi-cuda implementation

 Figure 3: Speedup Performance of Matrix Multiplication Jobs using Panda-1GPU-HostMap, Panda-1GPU-DeviceMap, Panda-1GPU-DeviceMap+24CPU, MAGAMA-1GPU, MAGMA-1GPU+24CPU, and CUDA-1GPU implementations on Delta machine.

Figure 3 shows the speedup performance of matrix multiplication jobs using Panda-1GPU-DeviceMap, Panda-1GPU-HostMap, Panda-24CPU, Panda-1GPU-DeviceMap+24CPU, MAGMA-1GPU, MAGMA-1GPU+24CPU, CUDA-1GPU, Mars-1GPU, and Phoenix-24CPU. The CUDA-1GPU implementation is around 1.52~1.94x faster than Panda-1GPU-DeviceMap for large matrices sizes. The Mars and Phoenix crashed when the matrices sizes larger than 5000 and 3000 respectively. For 3000x3000 matrix multiplication job, Panda-1GPU-DeviceMap achieves the speedup of 15.86x, and 7.68x over Phoenix and Mars respectively. Panda-1GPU-HostMap is only a little slower than CUDA-1GPU for large matrices. Panda 1GPU-DeviceMap+24CPU improve the performance by 5.1% over Panda-1GPU on average. The workload distribution among GPU and CPU is 90/10 as calculated by auto tuning utility. MAGMA-1GPU+24CPU increase the performance by 7.2% over MAGMA-1GPU, where the workload distribution among GPU and CPU is determined by its auto tuning utility
Gregor von Laszewski
Hui Li
Indiana University

Project Members

Fugang Wang
Gregor von Laszewski
Hui Li

FutureGrid Experts

Koji Tanaka