DLPrimitives
Performance Benchmarks

Summary

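Summary of the performance comparison of DLPrimitives to native PyTorch (cuda+cudnn or hip+miopen) and to the best existing OpenCL solution: Caffe OpenCL or Keras with PlaidML. The figures are the measured performance difference averaged over 5 networks: alexnet, resnet18, resnet50, vgg16 and mobilenet_v2. Each percentage is the other framework's time per batch divided by the DLPrimitives time, so values above 100% mean DLPrimitives is faster.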

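| GPU | Batch | Train, Cuda/HIP | Test, Cuda/HIP | Train, Plaidml/Caffe-OCL | Test, Plaidml/Caffe-OCL |
|---|---|---|---|---|---|
| Nvidia GTX 960 | 16 | 51% | 60.73% | 171% | 167.33% |
| Nvidia GTX 960 | 8 | 59% | 72.03% | 187% | 155.25% |
| Nvidia GTX 1080 | 16 | 42% | 41.34% | 207% | 137.52% |
| Nvidia RTX 2060S | 16 | 49% | 57.53% | 211% | 149.48% |
| AMD Rx 560 | 16 | 53% | 56.82% | 153% | 115.63% |
| AMD Rx 560 | 8 | 55% | 54.19% | 172% | 122.64% |
| Intel HD 530 | 8 | | | 109% | 66.12% |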
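The summary percentages can be reproduced from the per-network tables. The sketch below is one way to do it, assuming that "best of existing OpenCL solution" means the faster of Keras/PlaidML and Caffe/OpenCL for each network and that the average is a plain mean of the per-network ratios. The function names `relative_speed` and `average` are illustrative and not part of DLPrimitives. The input values are the Intel HD 530 train times from this page.

```python
from statistics import mean

# Train times for Intel HD 530, batch 8, in ms per batch (from the Intel HD 530 table).
NETWORKS = ["alexnet", "resnet18", "resnet50", "vgg16", "mobilenet_v2"]
dlprim = [691.178, 1549.528, 3601.626, 11487.195, 1717.316]
plaidml = [4804.843, 961.62, 3009.687, 18789.277, 2367.907]
caffe = [878.947, 2277.14, 5657.17, 15522, 5002.81]


def relative_speed(dlprim_ms, *others_ms):
    """Per network: fastest competitor time divided by the dlprim time.

    Entries that are None (a framework that did not run a network) are
    skipped; the result is None when no competitor has a measurement.
    """
    ratios = []
    for i, ours in enumerate(dlprim_ms):
        candidates = [other[i] for other in others_ms if other[i] is not None]
        ratios.append(min(candidates) / ours if candidates else None)
    return ratios


def average(ratios):
    """Mean of the available ratios (assumption: missing networks are left out)."""
    return mean(r for r in ratios if r is not None)


vs_opencl = relative_speed(dlprim, plaidml, caffe)
for name, ratio in zip(NETWORKS, vs_opencl):
    print(f"{name}: {ratio:.0%}")
print(f"average: {average(vs_opencl):.0%}")
```

With these inputs the average comes out at 109%, which matches the Intel HD 530 train entry in the summary table.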

DLPrimitives vs Other Frameworks

Tested using ResNet18 with a batch of 16 images of size 224x224. Units: milliseconds per batch.

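| Nvidia GTX 960 | train (ms) | test (ms) | train, % of dlprim time | test, % of dlprim time |
|---|---|---|---|---|
| dlprim | 196.6 | 50.7 | | |
| cudnn/caffe | 211.8 | 65.5 | 108% | 129% |
| cudnn/keras | 183.9 | 69.9 | 94% | 138% |
| cudnn/pytorch | 110.2 | 35.1 | 56% | 69% |

| AMD RX 560 | train (ms) | test (ms) | train, % of dlprim time | test, % of dlprim time |
|---|---|---|---|---|
| dlprim | 318 | 77.5 | | |
| rocm/caffe | 274 | 79.8 | 86% | 103% |
| rocm/keras | 240.7 | 82.8 | 76% | 107% |
| rocm/pytorch | 167.4 | 39.2 | 53% | 51% |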
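Milliseconds per batch can be converted to throughput when that is easier to reason about. The following is a small sketch, not part of the benchmark tooling; the helper name `images_per_second` is hypothetical. It also shows that a percentage in the table is the dlprim throughput relative to the other framework.

```python
def images_per_second(ms_per_batch, batch_size=16):
    """Convert a time per batch in milliseconds to images per second."""
    return batch_size * 1000.0 / ms_per_batch


# ResNet18 inference ("test") times on Nvidia GTX 960, ms per batch of 16,
# taken from the table above.
dlprim_ms = 50.7
pytorch_ms = 35.1

dlprim_ips = images_per_second(dlprim_ms)
pytorch_ips = images_per_second(pytorch_ms)
print(f"dlprim:  {dlprim_ips:.1f} images/s")
print(f"pytorch: {pytorch_ips:.1f} images/s")

# Throughput ratio equals pytorch_ms / dlprim_ms, the 69% entry in the table.
relative = dlprim_ips / pytorch_ips
print(f"dlprim relative to pytorch: {relative:.0%}")
```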

Methodology

The original networks are taken from PyTorch and exported to ONNX in train mode. The ONNX model is then converted to a dlprimitives model or to a Caffe model. The dlprimitives model (JSON) is also used to generate the Keras/PlaidML model. Times for both training and testing (inference) are compared. Every measurement uses 5 warmup iterations followed by 20 measured iterations. Measurement units are milliseconds per batch. The input image size is the standard ImageNet size, 224x224.
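The warmup and measurement scheme described above can be sketched as a small timing harness. This is an illustration only, not the actual benchmark tool: the `benchmark` function and the placeholder workload are hypothetical, and it assumes the reported figure is the mean over the measured iterations.

```python
import time


def benchmark(step, warmup=5, iterations=20):
    """Return the mean wall-clock time of step() in milliseconds.

    step is any callable that processes one batch. For a GPU framework it
    must block until the device has finished (assumption: the caller
    synchronizes inside step), otherwise only launch time is measured.
    """
    for _ in range(warmup):
        step()  # warmup runs are executed but not timed
    start = time.perf_counter()
    for _ in range(iterations):
        step()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / iterations


def fake_batch():
    """Placeholder workload standing in for one forward or training step."""
    return sum(i * i for i in range(10000))


print(f"{benchmark(fake_batch):.3f} ms per batch")
```

Excluding the warmup iterations keeps one-time costs such as kernel compilation and memory allocation out of the reported time.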

Since Caffe and Keras/PlaidML do not support ReLU6, ReLU is used as a substitute in the mobilenet_v2 benchmarks.
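For reference, the two activations differ only in the upper clamp. The scalar definitions below are a plain illustration written for this page, not code from any of the benchmarked frameworks.

```python
def relu(x):
    """ReLU: negative inputs become zero."""
    return max(0.0, x)


def relu6(x):
    """ReLU6: like ReLU, but the output is also capped at 6."""
    return min(max(0.0, x), 6.0)


for value in (-2.0, 3.0, 8.0):
    print(value, relu(value), relu6(value))
```

The outputs agree for inputs up to 6 and differ above it, so the substitution changes the numerical results of the network while leaving its layer structure unchanged.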

Benchmarks are published for master at commit 8559e6aae8f682e7fdb71379f49aef9b9db4d6fc.

Tool and useful instruments

Benchmarks AMD Rx 560

Train

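| Framework | GPU | Batch | alexnet | resnet18 | resnet50 | vgg16 | mobilenet_v2 | Average |
|---|---|---|---|---|---|---|---|---|
| PyTorch | rx560 | 16 | 74.763 | 167.852 | 539.06 | 1056.31 | 133.747 | |
| Keras/Plaidml | rx560 | 16 | 700.167 | 944.828 | - | - | 882.795 | |
| Caffe/OpenCL | rx560 | 16 | 115.57 | 442.022 | | 3239.57 | 1206.16 | |
| dlprim | rx560 | 16 | 117.991 | 315.032 | 867.819 | 1820.853 | 452.457 | |
| vs opencl | | | 98% | 140% | | 178% | 195% | 153% |
| vs native | | | 63% | 53% | 62% | 58% | 30% | 53% |

| Framework | GPU | Batch | alexnet | resnet18 | resnet50 | vgg16 | mobilenet_v2 | Average |
|---|---|---|---|---|---|---|---|---|
| PyTorch | rx560 | 8 | 48.218 | 86.292 | 288.387 | 556.417 | 76.692 | |
| Keras/Plaidml | rx560 | 8 | 602.172 | 517.257 | 1493.288 | - | 500.769 | |
| Caffe/OpenCL | rx560 | 8 | 75.2738 | 245.099 | 1186.12 | 1053.39 | 591.253 | |
| dlprim | rx560 | 8 | 80.037 | 166.648 | 429.012 | 955.522 | 214.052 | |
| vs opencl | | | 94% | 147% | 276% | 110% | 234% | 172% |
| vs native | | | 60% | 52% | 67% | 58% | 36% | 55% |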

Test

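| Framework | GPU | Batch | alexnet | resnet18 | resnet50 | vgg16 | mobilenet_v2 | Average |
|---|---|---|---|---|---|---|---|---|
| PyTorch | rx560 | 16 | 18.774 | 39.48 | 159.765 | 224.491 | 45.517 | |
| Keras/Plaidml | rx560 | 16 | 53.715 | 114.061 | - | - | 50.63 | |
| Caffe/OpenCL | rx560 | 16 | 40.087 | 141.598 | | 645.706 | 269.348 | |
| dlprim | rx560 | 16 | 35.11 | 76.409 | 204.144 | 455.075 | 88.577 | |
| vs opencl | | | 114% | 149% | | 142% | 57% | 115.63% |
| vs native | | | 53% | 52% | 78% | 49% | 51% | 56.82% |

| Framework | GPU | Batch | alexnet | resnet18 | resnet50 | vgg16 | mobilenet_v2 | Average |
|---|---|---|---|---|---|---|---|---|
| PyTorch | rx560 | 8 | 11.687 | 19.955 | 85.969 | 120.146 | 22.316 | |
| Keras/Plaidml | rx560 | 8 | 30.76 | 59.429 | 167.008 | - | 29.689 | |
| Caffe/OpenCL | rx560 | 8 | 28.769 | 78.3648 | 243.105 | 335.993 | 144.967 | |
| dlprim | rx560 | 8 | 22.976 | 44.121 | 111.317 | 240.059 | 46.898 | |
| vs opencl | | | 125% | 135% | 150% | 140% | 63% | 122.64% |
| vs native | | | 51% | 45% | 77% | 50% | 48% | 54.19% |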

Benchmarks RX 6600xt

Note: ROCm does not support the RX 6600 XT yet, so no comparison to PyTorch is given.

Train

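| Framework | GPU | Batch | alexnet | resnet18 | resnet50 | vgg16 | mobilenet_v2 |
|---|---|---|---|---|---|---|---|
| dlprim | RX 6600xt | 16 | 30.180 | 61.733 | 190.461 | 290.98 | 98.854 |
| Keras/Plaidml | RX 6600xt | 16 | 177.546 | 415.727 | 977.615 | 3094.2 | 355.140 |
| Caffe/OpenCL | RX 6600xt | 16 | 64.119 | 144.032 | 780.264 | 490.80 | 349.254 |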

Test

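| Framework | GPU | Batch | alexnet | resnet18 | resnet50 | vgg16 | mobilenet_v2 |
|---|---|---|---|---|---|---|---|
| dlprim | RX 6600xt | 16 | 10.816 | 17.696 | 48.823 | 70.773 | 27.083 |
| Keras/Plaidml | RX 6600xt | 16 | 89.684 | 190.738 | 273.087 | 1524.9 | 33.210 |
| Caffe/OpenCL | RX 6600xt | 16 | 14.337 | 39.371 | 138.089 | 159.98 | 92.9304 |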

Benchmarks For Nvidia GTX 960

Train

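| Framework | GPU | Batch | alexnet | resnet18 | resnet50 | vgg16 | mobilenet_v2 | Average |
|---|---|---|---|---|---|---|---|---|
| PyTorch | Gtx 960 | 16 | 41.496 | 109.986 | 350.57 | 510.312 | 154.39 | |
| Keras/Plaidml | Gtx 960 | 16 | 220.158 | 506.364 | - | - | 570.401 | |
| Caffe/OpenCL | Gtx 960 | 16 | 119.161 | 410.655 | | | 1007.95 | |
| dlprim | Gtx 960 | 16 | 84.693 | 197.737 | 599.814 | 1074.196 | 344.073 | |
| vs opencl | | | 141% | 208% | | | 166% | 171% |
| vs native | | | 49% | 56% | 58% | 48% | 45% | 51% |

| Framework | GPU | Batch | alexnet | resnet18 | resnet50 | vgg16 | mobilenet_v2 | Average |
|---|---|---|---|---|---|---|---|---|
| PyTorch | Gtx 960 | 8 | 33.346 | 67.673 | 196.468 | 347.423 | 82.467 | |
| Keras/Plaidml | Gtx 960 | 8 | 148.257 | 264.462 | 736.946 | - | 296.477 | |
| Caffe/OpenCL | Gtx 960 | 8 | 78.0372 | 216.396 | - | 1030.12 | 532.378 | |
| dlprim | Gtx 960 | 8 | 56.805 | 105.087 | 311.24 | 571.368 | 171.29 | |
| vs opencl | | | 137% | 206% | 237% | 180% | 173% | 187% |
| vs native | | | 59% | 64% | 63% | 61% | 48% | 59% |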

Test

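| Framework | GPU | Batch | alexnet | resnet18 | resnet50 | vgg16 | mobilenet_v2 | Average |
|---|---|---|---|---|---|---|---|---|
| PyTorch | Gtx 960 | 16 | 11.622 | 34.905 | 110.619 | 165.524 | 42.399 | |
| Keras/Plaidml | Gtx 960 | 16 | 42.45 | 91.004 | - | - | 44.615 | |
| Caffe/OpenCL | Gtx 960 | 16 | 42.0916 | 127.107 | - | 630.991 | 222.616 | |
| dlprim | Gtx 960 | 16 | 22.932 | 51.205 | 163.296 | 247.068 | 84.704 | |
| vs opencl | | | 184% | 178% | | 255% | 53% | 167.33% |
| vs native | | | 51% | 68% | 68% | 67% | 50% | 60.73% |

| Framework | GPU | Batch | alexnet | resnet18 | resnet50 | vgg16 | mobilenet_v2 | Average |
|---|---|---|---|---|---|---|---|---|
| PyTorch | Gtx 960 | 8 | 8.975 | 22.426 | 60.15 | 122.928 | 22.007 | |
| Keras/Plaidml | Gtx 960 | 8 | 23.366 | 48.312 | 102.89 | - | 25.916 | |
| Caffe/OpenCL | Gtx 960 | 8 | 28.5095 | 68.6208 | 199.54 | 331.147 | 119.853 | |
| dlprim | Gtx 960 | 8 | 14.109 | 27.543 | 86.552 | 128.643 | 43.979 | |
| vs opencl | | | 166% | 175% | 119% | 257% | 59% | 155.25% |
| vs native | | | 64% | 81% | 69% | 96% | 50% | 72.03% |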

Benchmarks GTX 1080

Train

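| Framework | GPU | Batch | alexnet | resnet18 | resnet50 | vgg16 | mobilenet_v2 | Average |
|---|---|---|---|---|---|---|---|---|
| PyTorch | GTX 1080 | 16 | 15.763 | 38.359 | 125.902 | 183.008 | 59.379 | |
| Keras/Plaidml | GTX 1080 | 16 | 92.172 | 235.163 | 702.904 | 972.166 | 330.331 | |
| Caffe/OpenCL | GTX 1080 | 16 | 40.536 | 165.449 | nan | 662.584 | 396.664 | |
| dlprim | GTX 1080 | 16 | 31.183 | 70.496 | 260.437 | 363.556 | 137.757 | |
| vs opencl | | | 130% | 235% | 270% | 182% | 240% | 211% |
| vs native | | | 51% | 54% | 48% | 50% | 43% | 49% |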

Test

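| Framework | GPU | Batch | alexnet | resnet18 | resnet50 | vgg16 | mobilenet_v2 | Average |
|---|---|---|---|---|---|---|---|---|
| PyTorch | GTX 1080 | 16 | 4.915 | 11.734 | 37.329 | 61.542 | 15.849 | |
| Keras/Plaidml | GTX 1080 | 16 | 19.157 | 36.124 | 74.961 | 186.724 | 22.21 | |
| Caffe/OpenCL | GTX 1080 | 16 | 14.203 | 47.517 | 140.202 | 207.394 | 84.103 | |
| dlprim | GTX 1080 | 16 | 8.128 | 18.404 | 81.524 | 84.023 | 35.723 | |
| vs opencl | | | 175% | 196% | 92% | 222% | 62% | 149.48% |
| vs native | | | 60% | 64% | 46% | 73% | 44% | 57.53% |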

Benchmarks RTX 2060S

Train

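| Framework | GPU | Batch | alexnet | resnet18 | resnet50 | vgg16 | mobilenet_v2 | Average |
|---|---|---|---|---|---|---|---|---|
| PyTorch | RTX 2060S | 16 | 11.078 | 30.969 | 100.094 | 148.916 | 37.829 | |
| Keras/Plaidml | RTX 2060S | 16 | 68.945 | 199.84 | 541.685 | 926.916 | 243.272 | |
| Caffe/OpenCL | RTX 2060S | 16 | 39.007 | 136.877 | - | 623.156 | 315.503 | |
| dlprim | RTX 2060S | 16 | 32.707 | 66.728 | 181.003 | 432.442 | 90.459 | |
| vs opencl | | | 119% | 205% | 299% | 144% | 269% | 207% |
| vs native | | | 34% | 46% | 55% | 34% | 42% | 42% |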

Test

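| Framework | GPU | Batch | alexnet | resnet18 | resnet50 | vgg16 | mobilenet_v2 | Average |
|---|---|---|---|---|---|---|---|---|
| PyTorch | RTX 2060S | 16 | 3.668 | 9.772 | 30.698 | 45.483 | 10.234 | |
| Keras/Plaidml | RTX 2060S | 16 | 21.884 | 38.825 | 73.711 | 200.792 | 21.007 | |
| Caffe/OpenCL | RTX 2060S | 16 | 16.123 | 41.711 | 117.588 | 196.366 | 70.61 | |
| dlprim | RTX 2060S | 16 | 12.972 | 21.64 | 56.827 | 112.533 | 26.355 | |
| vs opencl | | | 124% | 179% | 130% | 174% | 80% | 137.52% |
| vs native | | | 28% | 45% | 54% | 40% | 39% | 41.34% |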

Benchmarks Intel HD 530

Train

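| Framework | GPU | Batch | alexnet | resnet18 | resnet50 | vgg16 | mobilenet_v2 | Average |
|---|---|---|---|---|---|---|---|---|
| Keras/Plaidml | Intel HD 530 | 8 | 4804.843 | 961.62 | 3009.687 | 18789.277 | 2367.907 | |
| Caffe/OpenCL | Intel HD 530 | 8 | 878.947 | 2277.14 | 5657.17 | 15522 | 5002.81 | |
| dlprim | Intel HD 530 | 8 | 691.178 | 1549.528 | 3601.626 | 11487.195 | 1717.316 | |
| vs plaidml | | | 695% | 62% | 84% | 164% | 138% | 228% |
| vs caffe | | | 127% | 147% | 157% | 135% | 291% | 172% |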

Test

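| Framework | GPU | Batch | alexnet | resnet18 | resnet50 | vgg16 | mobilenet_v2 | Average |
|---|---|---|---|---|---|---|---|---|
| Keras/Plaidml | Intel HD 530 | 8 | 63.433 | 116.864 | 254.115 | 2258.903 | 199.77 | |
| Caffe/OpenCL | Intel HD 530 | 8 | 388.199 | 914.962 | 2106.76 | 5781.26 | 966.915 | |
| dlprim | Intel HD 530 | 8 | 169.902 | 258.331 | 642.287 | 1834.309 | 234.213 | |
| vs plaidml | | | 37% | 45% | 40% | 123% | 85% | 66.12% |
| vs caffe | | | 228% | 354% | 328% | 315% | 413% | 327.74% |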