Load Partitioning for Matrix-Matrix Multiplication on a Cluster of CPU-GPU Nodes Using the Divisible Load Paradigm
Elhiny, Lamees
Description
A Master of Science thesis in Computer Engineering by Lamees Elhiny entitled "Load Partitioning for Matrix-Matrix Multiplication on a Cluster of CPU-GPU Nodes Using the Divisible Load Paradigm", submitted in November 2016. The thesis advisor is Dr. Gerassimos Barlas. Soft and hard copies are available.
Abstract
Matrix-matrix multiplication is a component of many numerical algorithms; however, it is a time-consuming operation. When the matrices are very large, performing the multiplication on a single processor is not sufficiently fast. An efficient approach to matrix-matrix multiplication can therefore improve the performance of the many applications that depend on it. The aim of this study is to improve the efficiency of matrix-matrix multiplication on a distributed network composed of heterogeneous nodes. Since load balancing between heterogeneous nodes is the biggest challenge, the performance model is derived using Divisible Load Theory (DLT). The proposed solution improves performance through: (a) reduced communication overhead, as DLT-derived load partitioning does not require synchronization between nodes during processing, and (b) high resource utilization, as both the Central Processing Unit (CPU) and the Graphics Processing Unit (GPU) are used in the computation. The experiments are conducted on a single node as well as on a cluster of nodes. The results show that the DLT equations balance the load between CPUs and GPUs. On a single node, the suggested hybrid approach outperforms the C Basic Linear Algebra Subprograms (cBLAS) and OpenMP Basic Linear Algebra Subprograms (openBLAS) approaches. On the other hand, the performance difference between the hybrid and GPU-only (CUDA Basic Linear Algebra Subprograms) approaches is small, as the majority of the load in the hybrid approach is allocated to the GPU. On a cluster of nodes, the computation time is reduced to almost half of the GPU-only processing time; however, the overall improvement is impeded by communication overhead. It is expected that faster communication media could reduce the overall time and further improve speedup.
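The load-balancing idea described in the abstract can be illustrated with a minimal sketch. It is not the performance model derived in the thesis: it ignores communication cost entirely and assumes each worker has a known, constant time per matrix row. The function name `partition_rows` and the cost figures are hypothetical, chosen only for illustration. Under those assumptions, giving each worker a share of rows proportional to its speed makes all workers finish at about the same time, which is the basic principle behind divisible-load partitioning.

```python
from fractions import Fraction


def partition_rows(total_rows, cost_per_row):
    """Split total_rows among workers so they finish at about the same time.

    cost_per_row maps a worker name to its (hypothetical) time per row.
    Communication cost is ignored in this simplified sketch.
    """
    # Speed is the inverse of the per-row cost; Fraction keeps the math exact.
    speeds = {name: 1 / Fraction(cost) for name, cost in cost_per_row.items()}
    total_speed = sum(speeds.values())

    # Ideal (possibly fractional) share: proportional to each worker's speed.
    shares = {name: total_rows * s / total_speed for name, s in speeds.items()}

    # Round down, then give leftover rows to the largest fractional remainders.
    rows = {name: int(share) for name, share in shares.items()}
    leftover = total_rows - sum(rows.values())
    by_remainder = sorted(shares, key=lambda n: shares[n] - rows[n], reverse=True)
    for name in by_remainder[:leftover]:
        rows[name] += 1
    return rows


# Assumed example: the GPU processes a row four times faster than the CPU.
split = partition_rows(1000, {"cpu": 4, "gpu": 1})
print(split)  # {'cpu': 200, 'gpu': 800}
```

With these assumed costs, both workers need 800 time units (200 × 4 and 800 × 1), so neither sits idle waiting for the other. This matches the abstract's observation that most of the load in the hybrid approach goes to the GPU.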