DLPrimitives
DLPrimitives implements 3 variants of convolution algorithms.
The GEMM-based variant has the following optimization: the matrix is split into tiles, each handled by a separate work group. Typically a work group of 256 threads loads 2 tiles of 128x16 into local memory, and then each thread computes an 8x8 block of the result.
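A simplified OpenCL C sketch of this tiling scheme is shown below (illustrative only, not the actual dlprim kernel; the kernel name, row-major layout, the load pattern, and the assumption that M, N and K are divisible by the tile sizes are mine):

```
#define TILE_M  128
#define TILE_N  128
#define TILE_K  16
#define WG_DIM  16   /* work group is WG_DIM x WG_DIM = 256 threads */
#define PER_TH  8    /* each thread owns an 8x8 block of C          */

__kernel __attribute__((reqd_work_group_size(WG_DIM, WG_DIM, 1)))
void gemm_tiled(int M, int N, int K,
                __global const float *A,   /* M x K, row major */
                __global const float *B,   /* K x N, row major */
                __global float *C)         /* M x N, row major */
{
    __local float a_tile[TILE_M][TILE_K];  /* 128 x 16 tile of A */
    __local float b_tile[TILE_K][TILE_N];  /* 16 x 128 tile of B */

    int lr   = get_local_id(0);
    int lc   = get_local_id(1);
    int row0 = get_group_id(0) * TILE_M;   /* top-left of this group's C block */
    int col0 = get_group_id(1) * TILE_N;
    int lid  = lr * WG_DIM + lc;           /* flat thread id 0..255 */

    float acc[PER_TH][PER_TH] = {{0.0f}};  /* 8x8 accumulator in registers */

    for (int k0 = 0; k0 < K; k0 += TILE_K) {
        /* cooperative load: 256 threads fill 128*16 elements of each tile
           (8 elements per thread per tile); not tuned for coalescing here */
        for (int i = lid; i < TILE_M * TILE_K; i += WG_DIM * WG_DIM) {
            int r = i / TILE_K;
            int c = i % TILE_K;
            a_tile[r][c] = A[(row0 + r) * K + (k0 + c)];
            b_tile[c][r] = B[(k0 + c) * N + (col0 + r)];
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        /* each thread multiplies its 8 rows of a_tile by its 8 columns of b_tile */
        for (int k = 0; k < TILE_K; ++k)
            for (int i = 0; i < PER_TH; ++i)
                for (int j = 0; j < PER_TH; ++j)
                    acc[i][j] += a_tile[lr * PER_TH + i][k] * b_tile[k][lc * PER_TH + j];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    /* write the 8x8 register block back to global memory */
    for (int i = 0; i < PER_TH; ++i)
        for (int j = 0; j < PER_TH; ++j)
            C[(row0 + lr * PER_TH + i) * N + (col0 + lc * PER_TH + j)] = acc[i][j];
}
```

The two barriers separate the cooperative tile load from the per-thread multiply-accumulate phase.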
When the GEMM-based algorithm is used, GEMM is applied as a batched convolution, and matrix rows/columns are translated to image positions on load and store.
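To illustrate that translation, here is a hedged sketch of the load side for a forward convolution (the NCHW layout, parameter names, and helper function are assumptions for the example, not taken from the dlprim sources):

```
/* For one image, the convolution is computed as C = A * B with
     A = kernel           [out_channels x (in_channels*kern_h*kern_w)]
     B = "im2col(input)"  [(in_channels*kern_h*kern_w) x (out_h*out_w)]
   but B is never materialized: when the GEMM kernel needs B[row][col],
   it translates (row, col) to an input-tensor position on the fly. */
inline float load_b_element(__global const float *input, /* in_channels x in_h x in_w */
                            int row, int col,
                            int in_channels, int in_h, int in_w,
                            int kern_h, int kern_w,
                            int out_w,
                            int stride, int pad)
{
    int kw = row % kern_w;                /* kernel column      */
    int kh = (row / kern_w) % kern_h;     /* kernel row         */
    int ch = row / (kern_w * kern_h);     /* input channel      */
    int ox = col % out_w;                 /* output column      */
    int oy = col / out_w;                 /* output row         */
    int ix = ox * stride - pad + kw;      /* input column       */
    int iy = oy * stride - pad + kh;      /* input row          */
    if (ix < 0 || iy < 0 || ix >= in_w || iy >= in_h)
        return 0.0f;                      /* padding region     */
    return input[(ch * in_h + iy) * in_w + ix];
}
```

The store side is analogous: a row/column of the result matrix maps directly to an (output channel, output row, output column) position.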
For backpropagation, atomic operations are used for the "col2im" part, so working atomics are required for backpropagation.
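Standard OpenCL C has no built-in atomic add for floats, so a compare-and-swap loop over the 32-bit integer representation is the usual way to get one; the sketch below shows that generic pattern (not necessarily dlprim's exact helper), which a col2im-style scatter needs because several output-gradient positions can contribute to the same input pixel:

```
/* Generic CAS-based float atomic add; requires working 32-bit global
   int atomics (not necessarily the exact helper dlprim uses). */
inline void atomic_add_float(volatile __global float *ptr, float val)
{
    union { unsigned int u; float f; } old_val, new_val;
    do {
        old_val.f = *ptr;                 /* re-read on every retry      */
        new_val.f = old_val.f + val;      /* candidate updated value     */
    } while (atomic_cmpxchg((volatile __global unsigned int *)ptr,
                            old_val.u, new_val.u) != old_val.u);
}

/* In a col2im-style backward pass each work item scatters its gradient
   contribution into the input-gradient tensor; different work items may
   target the same input pixel, hence:
       atomic_add_float(&grad_input[input_offset], grad_value);           */
```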
For Intel GPUs no local memory is used, and subgroup sharing is used where possible. Each thread operates on its own small tile, because local memory performs poorly on these GPUs.
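A hedged sketch of the subgroup-sharing idea using the standard cl_khr_subgroups broadcast (the actual Intel path may use intel_subgroups shuffles and keeps a small register tile per thread; here each lane accumulates a single output value for brevity, and K is assumed to be a multiple of the subgroup size):

```
#pragma OPENCL EXTENSION cl_khr_subgroups : enable

/* Lanes of a subgroup cover consecutive columns of C and share one row,
   assuming the work-group x-dimension is a multiple of the subgroup size.
   A row values are loaded once per subgroup and broadcast between lanes,
   so no local memory is involved. */
__kernel void gemm_subgroup(__global const float *A,   /* M x K, row major */
                            __global const float *B,   /* K x N, row major */
                            __global float *C,         /* M x N, row major */
                            int K, int N)
{
    int col  = get_global_id(0);          /* differs per lane   */
    int row  = get_global_id(1);          /* shared by the subgroup */
    int lane = get_sub_group_local_id();
    int sgsz = get_sub_group_size();

    float acc = 0.0f;                     /* real kernels keep an 8x8-style
                                             register tile per thread       */
    for (int k0 = 0; k0 < K; k0 += sgsz) {
        /* the subgroup cooperatively loads sgsz values of the shared A row */
        float a_own = A[row * K + k0 + lane];
        for (int i = 0; i < sgsz; ++i) {
            /* the value loaded by lane i is shared with every other lane */
            float a = sub_group_broadcast(a_own, i);
            acc += a * B[(k0 + i) * N + col];
        }
    }
    C[row * N + col] = acc;
}
```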
There is a special case: when lda or ldb is a multiple of 1024, cache issues occur. For such cases tiles are reordered along a z-order curve to improve cache locality.
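The reordering can be illustrated with the standard Morton-decode bit trick: the linear work-group index is decoded into tile coordinates along a z-order curve, so consecutively scheduled work groups touch nearby rows and columns (helper names and the exact mapping below are illustrative, not dlprim's code):

```
/* Extract the even bits of a Morton code into contiguous low bits. */
inline unsigned int compact_bits(unsigned int x)
{
    x &= 0x55555555u;                   /* keep even-position bits */
    x = (x | (x >> 1)) & 0x33333333u;
    x = (x | (x >> 2)) & 0x0F0F0F0Fu;
    x = (x | (x >> 4)) & 0x00FF00FFu;
    x = (x | (x >> 8)) & 0x0000FFFFu;
    return x;
}

/* Map a linear work-group index to (tile_row, tile_col) along a z-order
   curve; nearby indices land on nearby tiles in both dimensions, which
   improves cache locality even when lda/ldb aliases cache sets. */
inline void zorder_tile(unsigned int linear_id,
                        unsigned int *tile_row, unsigned int *tile_col)
{
    *tile_col = compact_bits(linear_id);        /* even bits -> column */
    *tile_row = compact_bits(linear_id >> 1);   /* odd bits  -> row    */
}
```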
For these special lda/ldb cases, an additional reduction over K using atomic operations is also used.
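A hedged sketch of such a split-K reduction (not dlprim's kernel; the third grid dimension, kernel name, and per-element accumulation are assumptions for the example): each K-slice computes a partial result and adds it atomically into a zero-initialized output:

```
/* Same CAS-based float atomic add as in the col2im sketch above. */
inline void atomic_add_float(volatile __global float *ptr, float val)
{
    union { unsigned int u; float f; } old_val, new_val;
    do {
        old_val.f = *ptr;
        new_val.f = old_val.f + val;
    } while (atomic_cmpxchg((volatile __global unsigned int *)ptr,
                            old_val.u, new_val.u) != old_val.u);
}

/* Grid dimension 2 enumerates K-slices: each work item computes the partial
   dot product of its C element over one slice [k_begin, k_end) and adds it
   atomically; C must be zero-initialized before launch. */
__kernel void gemm_split_k(int M, int N, int K, int k_slices,
                           __global const float *A,   /* M x K, row major */
                           __global const float *B,   /* K x N, row major */
                           __global float *C)         /* M x N, pre-zeroed */
{
    int row   = get_global_id(0);
    int col   = get_global_id(1);
    int slice = get_global_id(2);
    if (row >= M || col >= N)
        return;
    int per_slice = (K + k_slices - 1) / k_slices;
    int k_begin   = slice * per_slice;
    int k_end     = min(k_begin + per_slice, K);
    float partial = 0.0f;
    for (int k = k_begin; k < k_end; ++k)
        partial += A[row * K + k] * B[k * N + col];
    atomic_add_float(&C[row * N + col], partial);
}
```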
The Winograd-based variant is currently applied for AMD and NVidia GPUs.
It is based on: Yan, Da, Wei Wang, and Xiaowen Chu. "Optimizing batched Winograd convolution on GPUs." Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2020.
https://www.cse.ust.hk/~weiwa/papers/yan-ppopp20.pdf
In comparison to the paper, dlprim uses 32x8 tile blocks instead of 64x8, since it is a generic OpenCL implementation targeting different GPUs and cannot optimize register usage as efficiently as manually written assembly.
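For readers unfamiliar with Winograd convolution, a minimal 1D F(2,3) sketch (textbook math, not dlprim code) shows the trade the tiled 2D kernels build on: two outputs of a 3-tap filter are produced with 4 multiplications instead of 6:

```
/* Minimal 1D Winograd F(2,3) transform (illustration only).
   Computes y[0..1] of a 3-tap correlation over d[0..3] using only the
   4 element-wise multiplications m0..m3 instead of 6. */
inline void winograd_f2_3(const float d[4], const float g[3], float y[2])
{
    /* filter transform u = G*g */
    float u0 = g[0];
    float u1 = 0.5f * (g[0] + g[1] + g[2]);
    float u2 = 0.5f * (g[0] - g[1] + g[2]);
    float u3 = g[2];
    /* input transform v = B^T*d */
    float v0 = d[0] - d[2];
    float v1 = d[1] + d[2];
    float v2 = d[2] - d[1];
    float v3 = d[1] - d[3];
    /* element-wise products: the only multiplications */
    float m0 = u0 * v0, m1 = u1 * v1, m2 = u2 * v2, m3 = u3 * v3;
    /* output transform y = A^T*m */
    y[0] = m0 + m1 + m2;   /* == d0*g0 + d1*g1 + d2*g2 */
    y[1] = m1 - m2 - m3;   /* == d1*g0 + d2*g1 + d3*g2 */
}
```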