This example illustrates the use of the rocBLAS Level 3 Strided Batched General Matrix Multiplication. The rocBLAS GEMM STRIDED BATCHED function performs a matrix-matrix operation for a batch of matrices as:

$C_i = \alpha \cdot f(A_i) \cdot f(B_i) + \beta \cdot C_i$

for each $i \in [0, batch\_count - 1]$, where $f(X)$ is one of the following:

- $f(X) = X$ or
- $f(X) = X^T$ (transpose $X$: $X_{ij}^T = X_{ji}$) or
- $f(X) = X^H$ (Hermitian $X$: $X_{ij}^H = \bar X_{ji}$).

The application proceeds as follows:

- Read in command-line parameters.
- Set $f$ operation, set sizes of matrices and get batch count.
- Allocate and initialize the host matrices. Set up $B$ matrix as an identity matrix.
- Initialize gold standard matrix.
- Compute CPU reference result with strided batched subvectors.
- Allocate device memory.
- Copy data from host to device.
- Create a rocBLAS handle.
- Invoke the rocBLAS GEMM STRIDED BATCHED function.
- Copy the result from device to host.
- Destroy the rocBLAS handle, release device memory.
- Validate the output by comparing it to the CPU reference result.
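
The CPU reference and validation steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the example's actual source: the helper names `op_element`, `cpu_gemm_strided_batched`, `set_identity` and `all_close` are hypothetical, matrices are assumed to be stored column-major, only the real-valued cases $f(X) = X$ and $f(X) = X^T$ are covered, and the absolute-tolerance comparison is just one possible validation criterion.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a rocBLAS function): element (row, col) of f(X) for a
// column-major matrix X with leading dimension ld. transposed selects f(X) = X^T.
inline float op_element(const float* x, int ld, bool transposed, int row, int col)
{
    return transposed ? x[col + static_cast<std::size_t>(row) * ld]
                      : x[row + static_cast<std::size_t>(col) * ld];
}

// Hypothetical CPU reference: C_i = alpha * f(A_i) * f(B_i) + beta * C_i
// for every i in [0, batch_count), with matrix i located at base + i * stride.
void cpu_gemm_strided_batched(bool trans_a, bool trans_b, int m, int n, int k,
                              float alpha,
                              const float* a, int lda, std::ptrdiff_t stride_a,
                              const float* b, int ldb, std::ptrdiff_t stride_b,
                              float beta,
                              float* c, int ldc, std::ptrdiff_t stride_c,
                              int batch_count)
{
    assert(m > 0 && n > 0 && k > 0 && batch_count >= 0);
    for(int i = 0; i < batch_count; ++i)
    {
        const float* a_i = a + i * stride_a;
        const float* b_i = b + i * stride_b;
        float*       c_i = c + i * stride_c;
        for(int col = 0; col < n; ++col)
        {
            for(int row = 0; row < m; ++row)
            {
                // Dot product of row `row` of f(A_i) with column `col` of f(B_i).
                float acc = 0.0f;
                for(int p = 0; p < k; ++p)
                {
                    acc += op_element(a_i, lda, trans_a, row, p)
                           * op_element(b_i, ldb, trans_b, p, col);
                }
                float& out = c_i[row + static_cast<std::size_t>(col) * ldc];
                out        = alpha * acc + beta * out;
            }
        }
    }
}

// Fill a column-major n x n matrix with the identity.
void set_identity(float* x, int n, int ld)
{
    for(int col = 0; col < n; ++col)
        for(int row = 0; row < n; ++row)
            x[row + static_cast<std::size_t>(col) * ld] = (row == col) ? 1.0f : 0.0f;
}

// Element-wise comparison against the reference with an absolute tolerance.
bool all_close(const std::vector<float>& result,
               const std::vector<float>& reference,
               float                     tolerance)
{
    if(result.size() != reference.size())
        return false;
    for(std::size_t i = 0; i < result.size(); ++i)
        if(std::fabs(result[i] - reference[i]) > tolerance)
            return false;
    return true;
}
```

Choosing $B_i$ as the identity keeps the expected result easy to reason about: the product $f(A_i) \cdot f(B_i)$ reduces to $f(A_i)$, so each output element is $\alpha$ times the matching element of $f(A_i)$ plus $\beta$ times the previous value of $C_i$.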

The application provides the following optional command line arguments:

- `-a` or `--alpha`. The scalar value $\alpha$ used in the GEMM operation. Its default value is 1.
- `-b` or `--beta`. The scalar value $\beta$ used in the GEMM operation. Its default value is 1.
- `-c` or `--count`. Batch count. Its default value is 3.
- `-m` or `--m`. The number of rows of matrices $f(A_i)$ and $C_i$, which must be greater than 0. Its default value is 5.
- `-n` or `--n`. The number of columns of matrices $f(B_i)$ and $C_i$, which must be greater than 0. Its default value is 5.
- `-k` or `--k`. The number of columns of matrix $f(A_i)$ and rows of $f(B_i)$.

The performance of numerical multi-linear algebra code can be improved substantially by using tensor contractions [ Y. Shi et al., HiPC, pp 193, 2016. ]; for this reason, most of the rocBLAS functions have `_batched` and `_strided_batched` [ C. Jhurani and P. Mullowney, JPDP Vol 75, pp 133, 2015. ] extensions.

We can apply the same multiplication operator to several matrices if we combine them into batched matrices. Batched matrix multiplication improves performance for a large number of small matrices. When the stride between consecutive matrices is constant, strided batched GEMM offers further acceleration.
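
The difference between the two layouts can be illustrated with plain pointer arithmetic. The sketch below is CPU-only and uses hypothetical helper names (`make_batched_pointers`, `strided_matrix`) that are not part of rocBLAS. It assumes the matrices sit in one contiguous buffer with a constant stride: a `_batched` call is given one pointer per matrix, whereas a `_strided_batched` call needs only the base pointer and the stride.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Batched layout: build an explicit table with one pointer per matrix.
// (Hypothetical helper; here the table is derived from a strided buffer,
// but in general each entry could point anywhere.)
std::vector<const float*> make_batched_pointers(const float*   base,
                                                std::ptrdiff_t stride,
                                                int            batch_count)
{
    assert(batch_count >= 0);
    std::vector<const float*> pointers(static_cast<std::size_t>(batch_count));
    for(int i = 0; i < batch_count; ++i)
        pointers[static_cast<std::size_t>(i)] = base + i * stride;
    return pointers;
}

// Strided batched layout: matrix i is located by arithmetic alone,
// so no pointer table has to be built or stored.
inline const float* strided_matrix(const float* base, std::ptrdiff_t stride, int i)
{
    return base + i * stride;
}
```

For a buffer holding three 2 x 2 matrices back to back, the stride is 4 elements, and `strided_matrix(base, 4, i)` yields the same address as entry `i` of the pointer table.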

- rocBLAS is initialized by calling `rocblas_create_handle(rocblas_handle*)` and it is terminated by calling `rocblas_destroy_handle(rocblas_handle)`.
- The pointer mode controls whether scalar parameters must be allocated on the host (`rocblas_pointer_mode_host`) or on the device (`rocblas_pointer_mode_device`). It is controlled by `rocblas_set_pointer_mode`.
- `rocblas_stride`: the type of the strides between matrices or vectors in strided_batched functions.
- `rocblas_[sdhcz]gemm_strided_batched`

  Depending on the character matched in `[sdhcz]`, the GEMM can be computed with different precisions:

  - `s` (single-precision: `rocblas_float`)
  - `d` (double-precision: `rocblas_double`)
  - `h` (half-precision: `rocblas_half`)
  - `c` (single-precision complex: `rocblas_float_complex`)
  - `z` (double-precision complex: `rocblas_double_complex`).

Input parameters (shown for the single-precision variant):

- `rocblas_handle handle`
- `rocblas_operation transA`: transformation operator on each $A_i$ matrix
- `rocblas_operation transB`: transformation operator on each $B_i$ matrix
- `rocblas_int m`: number of rows in $f(A_i)$ and $C_i$ matrices
- `rocblas_int n`: number of columns in $f(B_i)$ and $C_i$ matrices
- `rocblas_int k`: number of columns in $f(A_i)$ matrix and number of rows in $f(B_i)$ matrix
- `const float *alpha`: scalar multiplier of the $f(A_i) \cdot f(B_i)$ matrix product
- `const float *A`: pointer to the first matrix of the $A_i$ batch
- `rocblas_int lda`: leading dimension of each $A_i$ matrix
- `rocblas_stride stride_a`: stride from the start of one $A_i$ matrix to the next
- `const float *B`: pointer to the first matrix of the $B_i$ batch
- `rocblas_int ldb`: leading dimension of each $B_i$ matrix
- `rocblas_stride stride_b`: stride from the start of one $B_i$ matrix to the next
- `const float *beta`: scalar multiplier of the $C_i$ matrix addend
- `float *C`: pointer to the first matrix of the $C_i$ batch
- `rocblas_int ldc`: leading dimension of each $C_i$ matrix
- `rocblas_stride stride_c`: stride from the start of one $C_i$ matrix to the next
- `rocblas_int batch_count`: number of matrices
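
How the leading dimensions and strides relate to `m`, `n`, `k` and the transformation operators can be sketched as below. This is an illustration under assumptions, not code from the example: matrices are taken to be column-major and densely packed (no padding between columns or between consecutive matrices), the names `packed_layout` and `packed_operand_layout` are hypothetical, and `std::int64_t` stands in for `rocblas_stride`. Larger leading dimensions and strides are equally valid when the data is padded.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: leading dimension and stride of one operand.
struct packed_layout
{
    int          ld;     // rows of the matrix as stored in memory
    std::int64_t stride; // elements from one matrix of the batch to the next
};

// Layout of a densely packed column-major operand X whose transformed form
// f(X) has op_rows x op_cols elements. If X is transposed by f, it is stored
// with op_cols rows and op_rows columns.
inline packed_layout packed_operand_layout(bool transposed, int op_rows, int op_cols)
{
    assert(op_rows > 0 && op_cols > 0);
    const int stored_rows = transposed ? op_cols : op_rows;
    const int stored_cols = transposed ? op_rows : op_cols;
    return {stored_rows, static_cast<std::int64_t>(stored_rows) * stored_cols};
}
```

Since $f(A_i)$ is $m \times k$, $f(B_i)$ is $k \times n$ and $C_i$ is $m \times n$, the three layouts are `packed_operand_layout(trans_a, m, k)`, `packed_operand_layout(trans_b, k, n)` and `packed_operand_layout(false, m, n)`. For `m = 5`, `n = 3`, `k = 4` without transposition this gives `lda = 5`, `stride_a = 20`, `ldb = 4`, `stride_b = 12`, `ldc = 5` and `stride_c = 15`.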

Return value: `rocblas_status`

rocBLAS API used in this example:

- `rocblas_int`
- `rocblas_float`
- `rocblas_operation`
- `rocblas_operation_none`
- `rocblas_handle`
- `rocblas_create_handle`
- `rocblas_destroy_handle`
- `rocblas_set_pointer_mode`
- `rocblas_pointer_mode_host`
- `rocblas_stride`
- `rocblas_sgemm_strided_batched`

HIP runtime API used in this example:

- `hipMalloc`
- `hipFree`
- `hipMemcpy`
- `hipMemcpyHostToDevice`
- `hipMemcpyDeviceToHost`