rocBLAS Level 3 General Matrix Multiplication Strided Batched Example

Description

This example illustrates the use of the rocBLAS Level 3 strided batched general matrix multiplication (GEMM). The rocBLAS strided batched GEMM performs a matrix-matrix operation on every matrix of a batch:

$C[i] = \alpha \cdot f(A[i]) \cdot f(B[i]) + \beta \cdot (C[i])$

for each $i \in [0, batch - 1]$, where $X[i] = X + i \cdot strideX$ is the $i$-th matrix of the corresponding batch and $f(X)$ is one of the following:

  • $f(X) = X$ or
  • $f(X) = X^T$ (transpose $X$: $X_{ij}^T = X_{ji}$) or
  • $f(X) = X^H$ (conjugate transpose of $X$: $X_{ij}^H = \bar X_{ji}$).

$\alpha$ and $\beta$ are scalars, and $A$, $B$ and $C$ are the batches of matrices. For each $i$, $A[i]$, $B[i]$ and $C[i]$ are matrices such that $f(A[i])$ is an $m \times k$ matrix, $f(B[i])$ a $k \times n$ matrix and $C[i]$ an $m \times n$ matrix.
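The addressing rule $X[i] = X + i \cdot strideX$ can be made concrete with a small host-side sketch. It assumes column-major storage, as is conventional for BLAS, and uses two hypothetical helpers, `strided_offset` and `pack_strided`, which are not part of rocBLAS or of this example's source. It shows how separately stored matrices could be packed into one strided buffer, including optional padding through a leading dimension and a stride larger than the matrix itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: offset (in elements) of element (row, col) of the
// i-th matrix in a strided batch, assuming column-major storage with
// leading dimension ld and a constant stride between matrices.
std::size_t strided_offset(std::size_t i,
                           std::size_t row,
                           std::size_t col,
                           std::size_t ld,
                           std::size_t stride)
{
    return i * stride + col * ld + row;
}

// Hypothetical helper: copies dense column-major rows x cols matrices into
// a single zero-initialized buffer laid out as a strided batch.
std::vector<float> pack_strided(const std::vector<std::vector<float>>& matrices,
                                std::size_t                            rows,
                                std::size_t                            cols,
                                std::size_t                            ld,
                                std::size_t                            stride)
{
    // The leading dimension must cover one column and the stride one matrix.
    assert(ld >= rows && stride >= ld * cols);

    std::vector<float> buffer(matrices.size() * stride, 0.0f);
    for(std::size_t i = 0; i < matrices.size(); ++i)
    {
        assert(matrices[i].size() == rows * cols);
        for(std::size_t col = 0; col < cols; ++col)
        {
            for(std::size_t row = 0; row < rows; ++row)
            {
                buffer[strided_offset(i, row, col, ld, stride)]
                    = matrices[i][row + col * rows];
            }
        }
    }
    return buffer;
}
```

With tightly packed matrices one would typically choose `ld = rows` and `stride = rows * cols`; larger values leave unused padding elements between columns or between matrices.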

Application flow

  1. Read in command-line parameters.
  2. Set $f$ operation, set sizes of matrices and get batch count.
  3. Allocate and initialize the host matrices. Set up $B$ matrix as an identity matrix.
  4. Initialize gold standard matrix.
  5. Compute the CPU reference result for each matrix of the strided batch.
  6. Allocate device memory.
  7. Copy data from host to device.
  8. Create a rocBLAS handle.
  9. Invoke the rocBLAS GEMM STRIDED BATCHED function.
  10. Copy the result from device to host.
  11. Destroy the rocBLAS handle, release device memory.
  12. Validate the output by comparing it to the CPU reference result.
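Steps 3 to 5 and 12 of the flow above can be sketched on the host alone. The following is a minimal CPU reference under stated assumptions: column-major storage, no transposition ($f(X) = X$), and single precision. The names `gemm_strided_batched_reference` and `make_identity_batch` are hypothetical and are not taken from the example's source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: C[i] = alpha * A[i] * B[i] + beta * C[i]
// for every matrix i of the batch (column-major, no transposition).
void gemm_strided_batched_reference(int m, int n, int k, float alpha,
                                    const std::vector<float>& a, int lda, std::size_t stride_a,
                                    const std::vector<float>& b, int ldb, std::size_t stride_b,
                                    float beta,
                                    std::vector<float>& c, int ldc, std::size_t stride_c,
                                    int batch_count)
{
    assert(lda >= m && ldb >= k && ldc >= m);
    for(int batch = 0; batch < batch_count; ++batch)
    {
        // X[i] = X + i * strideX selects the i-th matrix of each batch.
        const float* a_i = a.data() + batch * stride_a;
        const float* b_i = b.data() + batch * stride_b;
        float*       c_i = c.data() + batch * stride_c;
        for(int col = 0; col < n; ++col)
        {
            for(int row = 0; row < m; ++row)
            {
                float acc = 0.0f;
                for(int p = 0; p < k; ++p)
                {
                    acc += a_i[row + p * lda] * b_i[p + col * ldb];
                }
                c_i[row + col * ldc] = alpha * acc + beta * c_i[row + col * ldc];
            }
        }
    }
}

// Hypothetical helper: a tightly packed batch of n x n identity matrices.
std::vector<float> make_identity_batch(int n, int batch_count)
{
    const std::size_t  size = static_cast<std::size_t>(n) * n;
    std::vector<float> batch(size * batch_count, 0.0f);
    for(int i = 0; i < batch_count; ++i)
    {
        for(int d = 0; d < n; ++d)
        {
            batch[i * size + d + static_cast<std::size_t>(d) * n] = 1.0f;
        }
    }
    return batch;
}
```

Because $B$ is an identity matrix, the expected result reduces to $\alpha \cdot A[i] + \beta \cdot C[i]$, which makes the device output easy to check element by element against the reference.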

Command line interface

The application provides the following optional command line arguments:

  • -a or --alpha. The scalar value $\alpha$ used in the GEMM operation. Its default value is 1.
  • -b or --beta. The scalar value $\beta$ used in the GEMM operation. Its default value is 1.
  • -c or --count. Batch count. Its default value is 3.
  • -m or --m. The number of rows of matrices $f(A_i)$ and $C_i$, which must be greater than 0. Its default value is 5.
  • -n or --n. The number of columns of matrices $f(B_i)$ and $C_i$, which must be greater than 0. Its default value is 5.
  • -k or --k. The number of columns of matrix $f(A_i)$ and the number of rows of matrix $f(B_i)$.

Key APIs and Concepts

  • The performance of numerical multi-linear algebra code can be improved considerably by using tensor contractions [ Y. Shi et al., HiPC, pp 193, 2016. ]; therefore, most rocBLAS functions have a _batched and a _strided_batched [ C. Jhurani and P. Mullowney, JPDC Vol 75, pp 133, 2015. ] extension.
    The same multiplication operator can be applied to several matrices if they are combined into a batch. Batched matrix multiplication improves performance when a large number of small matrices is multiplied. When the matrices are separated by a constant stride, strided batched GEMM offers further acceleration.
    Figure: strided matrix layout (strided-matrix-layout.svg)

  • rocBLAS is initialized by calling rocblas_create_handle(rocblas_handle*) and it is terminated by calling rocblas_destroy_handle(rocblas_handle).

  • The pointer mode controls whether scalar parameters must be allocated on the host (rocblas_pointer_mode_host) or on the device (rocblas_pointer_mode_device). It is controlled by rocblas_set_pointer_mode.

  • rocblas_stride is the type used for the stride between consecutive matrices or vectors in strided_batched functions.

  • rocblas_[sdhcz]gemm_strided_batched

    Depending on the character matched in [sdhcz], the matrix multiplication is performed with a different precision:

    • s (single-precision: rocblas_float)
    • d (double-precision: rocblas_double)
    • h (half-precision: rocblas_half)
    • c (single-precision complex: rocblas_float_complex)
    • z (double-precision complex: rocblas_double_complex).

    Input parameters (types shown for the single-precision variant):

    • rocblas_handle handle
    • rocblas_operation transA: transformation operator on $A_i$ matrix
    • rocblas_operation transB: transformation operator on $B_i$ matrix
    • rocblas_int m: number of rows in $f(A_i)$ and $C_i$ matrices
    • rocblas_int n: number of columns in $f(B_i)$ and $C_i$ matrices
    • rocblas_int k: number of columns in $f(A_i)$ matrix and number of rows in $f(B_i)$ matrix
    • const float *alpha: scalar multiplier of the $f(A_i) \cdot f(B_i)$ matrix product
    • const float *A: pointer to the first matrix $A_0$ of the batch
    • rocblas_int lda: leading dimension of each $A_i$ matrix
    • rocblas_stride stride_a: stride from the start of one $A_i$ matrix to the next
    • const float *B: pointer to the first matrix $B_0$ of the batch
    • rocblas_int ldb: leading dimension of each $B_i$ matrix
    • rocblas_stride stride_b: stride from the start of one $B_i$ matrix to the next
    • const float *beta: scalar multiplier of the $C_i$ matrix in the addition
    • float *C: pointer to the first matrix $C_0$ of the batch
    • rocblas_int ldc: leading dimension of each $C_i$ matrix
    • rocblas_stride stride_c: stride size for each $C_i$ matrix
    • rocblas_int batch_count: number of matrices

    Return value: rocblas_status
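The leading dimensions and strides in the parameter list above depend on whether each input is transposed. The sketch below computes them for tightly packed column-major matrices. It is only an illustration: the enum `op` is a hypothetical stand-in for rocblas_operation, and `gemm_layout` and `tight_layout` are hypothetical names that do not exist in rocBLAS.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for rocblas_operation (not the rocBLAS enum).
enum class op
{
    none,
    transpose
};

// Hypothetical aggregate of the layout arguments of a strided batched GEMM.
struct gemm_layout
{
    int         lda, ldb, ldc;
    std::size_t stride_a, stride_b, stride_c;
};

// Computes leading dimensions and strides for tightly packed column-major
// matrices, where f(A) is m x k, f(B) is k x n and C is m x n.
gemm_layout tight_layout(op trans_a, op trans_b, int m, int n, int k)
{
    assert(m > 0 && n > 0 && k > 0);

    // A is stored as m x k when not transposed, otherwise as k x m.
    const int a_rows = (trans_a == op::none) ? m : k;
    const int a_cols = (trans_a == op::none) ? k : m;
    // B is stored as k x n when not transposed, otherwise as n x k.
    const int b_rows = (trans_b == op::none) ? k : n;
    const int b_cols = (trans_b == op::none) ? n : k;

    gemm_layout layout{};
    layout.lda      = a_rows; // column-major: leading dimension = stored rows
    layout.ldb      = b_rows;
    layout.ldc      = m;
    layout.stride_a = static_cast<std::size_t>(layout.lda) * a_cols;
    layout.stride_b = static_cast<std::size_t>(layout.ldb) * b_cols;
    layout.stride_c = static_cast<std::size_t>(layout.ldc) * n;
    return layout;
}
```

Note that the number of elements per matrix, and therefore the tight stride, is the same with or without transposition; only the leading dimension changes. Larger leading dimensions or strides are also valid when the matrices are padded.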

Demonstrated API Calls

rocBLAS

  • rocblas_int
  • rocblas_float
  • rocblas_operation
  • rocblas_operation_none
  • rocblas_handle
  • rocblas_create_handle
  • rocblas_destroy_handle
  • rocblas_set_pointer_mode
  • rocblas_pointer_mode_host
  • rocblas_stride
  • rocblas_sgemm_strided_batched

HIP runtime

  • hipMalloc
  • hipFree
  • hipMemcpy
  • hipMemcpyHostToDevice
  • hipMemcpyDeviceToHost