DLPrimitives
Note that we compare three different GPUs with similar performance within reasonable margins:
AMD RX 6600 XT, NVIDIA GTX 1080, and NVIDIA RTX 2060 Super.
The basic FLOPS performance was measured using a custom kernel. The theoretical FLOPS performance of modern GPUs can be calculated as clock * cores * 2; however, the actual clock depends on the specific model and its thermal behavior. Therefore the manual measurements are used as a baseline, and the theoretical expected FLOPS are calculated using the median clock observed during the benchmarks.
GPU | GFlops | GB/s | Cores | Clock (MHz) | Expected GFlops | Expected GB/s | Flops % | Mem % |
---|---|---|---|---|---|---|---|---|
6600xt | 9,937 | 216 | 2048 | 2655 | 10,875 | 256 | 91.6% | 84.4% |
1080 | 8,970 | 242 | 2560 | 1809 | 9,262 | 320 | 96.9% | 75.6% |
2060s | 8,263 | 396 | 2176 | 1905 | 8,290 | 448 | 98.7% | 88.4% |
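As a quick sanity check, the Expected GFlops column can be reproduced from the core counts and median clocks using the formula above. The short Python sketch below does exactly that; small deviations from the table's percentages come from rounding of the reported clocks and measurements.

```python
# Reproduce the "Expected GFlops" column: GFlops ~= cores * clock(MHz) * 2 / 1000,
# where the factor 2 accounts for one FMA (2 FLOPs) per core per cycle.
gpus = {
    #           cores, median clock MHz, measured GFlops
    "6600xt": (2048, 2655, 9937),
    "1080":   (2560, 1809, 8970),
    "2060s":  (2176, 1905, 8263),
}

for name, (cores, clock_mhz, measured_gflops) in gpus.items():
    expected_gflops = cores * clock_mhz * 2 / 1000.0
    efficiency = 100.0 * measured_gflops / expected_gflops
    print(f"{name}: expected {expected_gflops:,.0f} GFlops, "
          f"measured {measured_gflops:,} GFlops ({efficiency:.1f}% of peak)")
```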
So GPU performance varies. Although the 2060S has 17-24% lower GFlops than the 6600 XT (measured vs. theoretical), it has much higher memory throughput, which helps in bandwidth-limited algorithms such as batch normalization or the depthwise separable convolutions used in MobileNet. The 1080 has 10-15% lower GFlops but 12% more bandwidth.
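To make the bandwidth argument concrete, the compute-to-bandwidth ratio (FLOPs available per byte of memory traffic) can be derived directly from the measured numbers above; the lower the ratio, the better a GPU is positioned for bandwidth-limited layers. This is just arithmetic on the table values, not part of the benchmark itself.

```python
# Compute-to-bandwidth ratio from the measured numbers in the table above.
# Both values use "giga" prefixes, so they cancel and the result is
# FLOPs per byte of memory traffic the GPU can sustain.
measured = {
    #          GFlops, GB/s
    "6600xt": (9937, 216),
    "1080":   (8970, 242),
    "2060s":  (8263, 396),
}

for name, (gflops, gbps) in measured.items():
    print(f"{name}: {gflops / gbps:.1f} FLOPs per byte")
# 6600xt ~46, 1080 ~37, 2060s ~21: the 2060S has roughly twice the bandwidth
# per FLOP of the 6600 XT, which is why it pulls ahead on mobilenet.
```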
Measured in ms per batch, lower is better.
Framework | GPU | alexnet | resnet18 | resnet50 | vgg16 | mobilenet |
---|---|---|---|---|---|---|
dlprim | 6600xt | 83.73 | 231.27 | 716.06 | 1157.2 | 414.35 |
dlprim | 1080 | 93.037 | 262.140 | 926.624^ | 1348.9 | 614.016 |
dlprim | 2060s | 116.41 | 252.349 | 705.228^ | 1681.3 | 355.212 |
keras/tf2 | 1080 | 70.561 | 200.582 | 684.426^ | 633.05 | 437.844 |
keras/tf2 | 2060s | 70.006 | 172.191 | 520.024^ | 553.12 | 344.548 |
pytorch | 1080 | 62.379 | 151.352 | 518.029 | 780.9 | 229.200 |
pytorch | 2060s | 41.116 | 121.182 | 377.803 | 621.1^ | 143.225 |
Measured in ms per batch, lower is better.
Framework | GPU | alexnet | resnet18 | resnet50 | vgg16 | mobilenet |
---|---|---|---|---|---|---|
dlprim | 6600xt | 34.28 | 63.57 | 185.72 | 277.97 | 102.84 |
dlprim | 1080 | 28.036 | 63.573 | 274.275 | 309.285 | 131.745 |
dlprim | 2060s | 47.523 | 81.097 | 210.975 | 428.349 | 97.800 |
keras/tf2 | 1080 | 40.554 | 80.646 | 199.389 | 189.075 | 109.856 |
keras/tf2 | 2060s | 47.953 | 75.736 | 165.315 | 174.272 | 93.015 |
pytorch | 1080 | 16.361 | 43.174 | 144.880 | 226.407 | 60.135 |
pytorch | 2060s | 9.650 | 33.272 | 107.568 | 172.472 | 35.551 |
^) Measured by running a half batch of 32 twice, due to GPU memory limits.
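For context on how a ms-per-batch figure can be obtained, here is a minimal timing sketch for the PyTorch rows. It is not the actual benchmark harness: the model (resnet18), batch size of 64 (consistent with the half batch of 32 mentioned in the footnote), warm-up count, and iteration count are assumptions, and it times a full forward/backward step.

```python
import time
import torch
import torchvision

# Minimal ms-per-batch timing sketch; the real benchmark script may use a
# different model list, warm-up procedure, and number of iterations.
device = torch.device("cuda")
model = torchvision.models.resnet18().to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

batch = torch.randn(64, 3, 224, 224, device=device)     # assumed batch of 64
labels = torch.randint(0, 1000, (64,), device=device)   # random labels

def train_step():
    optimizer.zero_grad()
    loss = criterion(model(batch), labels)
    loss.backward()
    optimizer.step()

# Warm up so kernel selection and caching do not skew the measurement.
for _ in range(5):
    train_step()
torch.cuda.synchronize()

iters = 20
start = time.time()
for _ in range(iters):
    train_step()
torch.cuda.synchronize()  # wait for all queued GPU work before stopping the clock
print(f"{(time.time() - start) / iters * 1000:.2f} ms per batch")
```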