DLPrimitives
Comparing GPUs from different Vendors

Base Line

Note we compare 3 different GPUs that have similar performance withing reasonable margins.

AMD RX 6600 XT, NVidia GTX 1080, NVidia RTX 2060 Super.

The basic flops performance measured using custom kernel. Flops performance of moden GPUs can be calculated as clock * cores * 2, however clock dependes on specific model and thermal performance so both manual measures used as base line and calculated theoretical expected flops measured using median clock observerd during benchmarks.

gpu GFlops GB/sCoresClock MhzExp GFlopsExp GB/sFlops %Mem %
6600xt 9,937 216 2048 2655 10,875 256 91.6% 84.4%
1080 8,970 242 2560 1809 9,262 320 96.9% 75.6%
2060s 8,263 396 2176 1905 8,290 448 98.7% 88.4%

So GPUS performance varies, also 2060s has 17-24% less GFlops that 6600xt it has much higher memory throghtput that helps in bandwidth limited algorithms like batch normalization of depthwise separable convolutions for mobilenet. 1080 has 10-15% lower GFlops but 12% more bandwidth.

Training Times

Measured in ms per batch, lower is better.

Framework gpu alexnetresnet18 resnet50 vgg16 mobilenet
dlprim 6600xt 83.73 231.27 716.06 1157.2 414.35
dlprim 1080 93.037 262.140 926.624^ 1348.9 614.016
dlprim 2060s 116.41 252.349 705.228^ 1681.3 355.212
keras/tf2 1080 70.561 200.582 684.426^ 633.05 437.844
keras/tf2 2060s 70.006 172.191 520.024^ 553.12 344.548
pytorch 1080 62.379 151.352 518.029 780.9 229.200
pytorch 2060s 41.116 121.182 377.803 621.1^ 143.225

Testing Times

Measured in ms per batch, lower is better.

Framework gpu alexnetresnet18 resnet50 vgg16 mobilenet
dlprim 6600xt 34.28 63.57 185.72 277.97 102.84
dlprim 1080 28.036 63.573 274.275 309.285131.745
dlprim 2060s 47.523 81.097 210.975 428.34997.800
keras/tf2 1080 40.554 80.646 199.389 189.075109.856
keras/tf2 2060s 47.953 75.736 165.315 174.27293.015
pytorch 1080 16.361 43.174 144.880 226.40760.135
pytorch 2060s 9.650 33.272 107.568 172.47235.551

^) Using half batch x32 twice, due to GPU memory limits