Accelerating GPU kernels for Dense Linear Algebra
Rajib Nath (Innovative Computing Laboratory, University of Tennessee, Knoxville)
Stanimire Tomov (Innovative Computing Laboratory, University of Tennessee, Knoxville)
Jack Dongarra (Innovative Computing Laboratory, University of Tennessee, Knoxville)
Abstract:
Implementations of the Basic Linear Algebra Subprograms (BLAS) interface
are major building blocks of dense linear algebra (DLA) libraries, and
therefore have to be highly optimized. We present some techniques and
implementations that significantly accelerate the corresponding routines
from currently available libraries for GPUs. In particular,
Pointer Redirecting -- a set of GPU-specific optimization techniques --
allows us to easily remove performance oscillations associated with problem
dimensions not divisible by fixed blocking sizes. For example, applied to the
matrix-matrix multiplication routines, this can lead to algorithms that are
two times faster, depending on the hardware configuration and routine parameters.
Similarly, the matrix-vector multiplication can be accelerated more than
two times in both single and double precision arithmetic. Additionally,
GPU-specific acceleration techniques are applied to develop new kernels
(e.g. syrk, symv) that are up to 20 times faster than the currently
available kernels. We present these kernels and also show their acceleration
effect on higher-level dense linear algebra routines. The accelerated
kernels are now freely available through the MAGMA BLAS library.
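To give a rough sense of the pointer redirecting idea described above, the following is a minimal, hypothetical CUDA sketch (not MAGMA's actual code; the kernel name, BLK size, and matrix-vector setting are illustrative assumptions). Threads of the last, partial block clamp their row index to the last valid row instead of exiting early, so every thread block performs the same full-block memory accesses regardless of whether the problem size divides the blocking size; only threads that map to a real row write results back.

// Hypothetical sketch of pointer redirecting for a simple blocked
// matrix-vector product y = A*x (A is m-by-n, column-major, leading
// dimension lda). Names and blocking size are illustrative only.
#define BLK 16

__global__ void gemv_redirect(int m, int n, const float *A, int lda,
                              const float *x, float *y)
{
    int row = blockIdx.x * BLK + threadIdx.x;

    // Pointer redirecting: out-of-range threads are redirected to the
    // last valid row rather than branching out, so all memory accesses
    // stay in bounds and every block does identical, full-block work.
    int r = (row < m) ? row : m - 1;

    float sum = 0.0f;
    for (int j = 0; j < n; ++j)
        sum += A[r + j * lda] * x[j];   // always an in-bounds read

    // Only threads corresponding to real rows write their result.
    if (row < m)
        y[row] = sum;
}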
Keywords:
Parallel and Distributed Computing, Numerical Algorithms for CS&E, Performance Analysis