Pytorch OpenCL backend - simplified

Thursday, October 27, 2022, by artyom ; Posted in: Releases; 2 comments

Installing the OpenCL backend for PyTorch is now really simple.

  1. Install the nightly version of pytorch for CPU in a virtual environment
  2. Clone the dlprim_backend repository and check out the true_out_of_tree_support branch
  3. Update the submodules
  4. Run a few commands inside the repo

     mkdir build
     cd build
     cmake -DCMAKE_PREFIX_PATH=$VIRTUAL_ENV/lib/python3.8/site-packages/torch/share/cmake/Torch ..
     make
     cd ..
    
  5. Run mnist training:

     python mnist.py --device=ocl:0
    

That's it.
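
For a quick sanity check beyond mnist.py, here is a minimal sketch of using the new device directly from Python. It assumes the build produces a shared library named libpt_ocl.so in the build directory; check your build output for the actual file name.

    import torch

    # Load the out-of-tree OpenCL backend; the library name and path are an
    # assumption - check your build directory for the actual file.
    torch.ops.load_library("build/libpt_ocl.so")

    dev = "ocl:0"  # same device string as used with mnist.py
    x = torch.randn(4, 3, device=dev)
    y = torch.randn(4, 3, device=dev)
    print((x + y).cpu())  # compute on the OpenCL device, copy the result back to CPU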

Stay tuned.

Inference of ONNX Models using DLPrimitives

Sunday, January 16, 2022, by artyom ; Posted in: Releases; 0 comments

I have been working on inference of ONNX models using DLPrimitives. It isn't a simple task, since the ONNX operator set is very rich and many things can be implemented in different ways.

After many revisions and improvements I managed to validate multiple imagenet-pretrained networks from pytorch and mxnet, plus a few from TensorFlow (see the issues with TF below).

How do you create a dlprimitives network from an ONNX model?

// dp is shorthand for the dlprim namespace

// load and parse the ONNX model
dp::ONNXModel model;
model.load(onnx_path);

// create a context for the target device and a network on it
dp::Context ctx(device_id);
dp::Net net(ctx);

// load the network structure and parameters from the model
net.load_model(model);

And you are ready to go.

I validated the following networks and frameworks:

  • Pytorch, op-sets 9, 11, 13, nets alexnet, vgg16, resnet18, resnext50_32x4d, wide_resnet50_2, efficientnet_b0, efficientnet_b4, regnet_y_400mf, squeezenet1_0, mobilenet_v2, densenet121
  • MXNet: vgg11_bn, alexnet, mobilenetv2_0.25, mobilenet0.25, densenet121, resnet18_v1, squeezenet1.0
  • TensorFlow: op-sets 9 and 11, limited initial support, channels-first format only: resnet50, densenet121

Some pytorch networks don't pass due to missing operators. The situation with TensorFlow is more complicated, and only a few networks worked correctly.

TensorFlow

When I started validating pretrained Keras networks I discovered a very surprising thing: TensorFlow uses asymmetric padding in some cases. Since in TF/Keras you don't provide the padding explicitly but rather give a vague specification of "same" or "valid", in some cases the padding may differ between the start and the end of the image.

Interestingly, cuDNN does not even provide an asymmetric padding option for convolutions. Looking into the code, TF does the padding manually in such cases (which is actually a huge waste of memory and memory bandwidth).

So supporting these convolutions will require implementing a new, simple padding layer just to make sure we can use dlprimitives for inference of TF models.
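
For illustration, this is the usual "same" padding computation (a plain-Python sketch of the standard formula, not dlprimitives code); note how the extra pixel ends up on one side only:

    import math

    def same_padding_1d(size, kernel, stride):
        """TF/Keras-style 'same' padding along one dimension: returns (begin, end)."""
        out = math.ceil(size / stride)
        total = max((out - 1) * stride + kernel - size, 0)
        begin = total // 2        # the smaller half goes first...
        end = total - begin       # ...so any odd pixel is padded at the end
        return begin, end

    # 224-pixel input, 3x3 kernel, stride 2 -> padding (0, 1): asymmetric
    print(same_padding_1d(224, 3, 2))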

To be continued...

Attempt to integrate with OneDNN

Sunday, November 21, 2021, by artyom ; Posted in: Internals; 0 comments

Intel's OneDNN is a great project that provides cuDNN-like inference/training primitives for Intel GPUs.

Also, it is called OneDNN... it should really be called IntelDNN, since it supports only Intel GPUs and CPUs.

Bottom line: I tried to add OneDNN-based convolutions for Intel GPUs, only to discover that my simple GEMM-based convolution works better. Why? Apparently Intel's implementation is optimized for the channels-last format only.

https://github.com/oneapi-src/oneDNN/issues/1194
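
For context, "GEMM based" here means the classic im2col + matrix-multiplication approach. Below is a rough numpy sketch of the idea, just an illustration of the technique, not dlprimitives' actual OpenCL kernel:

    import numpy as np

    def conv2d_gemm(x, w):
        """Schematic im2col + GEMM convolution: channels-first, stride 1, no padding."""
        B, C, H, W = x.shape
        O, _, KH, KW = w.shape
        OH, OW = H - KH + 1, W - KW + 1
        # im2col: every KHxKW patch becomes a column of length C*KH*KW
        cols = np.empty((B, C * KH * KW, OH * OW), dtype=x.dtype)
        for i in range(KH):
            for j in range(KW):
                patch = x[:, :, i:i + OH, j:j + OW].reshape(B, C, -1)
                cols[:, (i * KW + j)::(KH * KW), :] = patch
        # a single big matrix multiplication does the rest
        y = w.reshape(O, -1) @ cols
        return y.reshape(B, O, OH, OW)

    # 3x3 kernel, 64 input/output channels, 56x56 image - as in the benchmark below
    x = np.random.randn(1, 64, 56, 56).astype(np.float32)
    w = np.random.randn(64, 64, 3, 3).astype(np.float32)
    print(conv2d_gemm(x, w).shape)  # (1, 64, 54, 54)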

A simple convolution with a 3x3 kernel, 64 input and output channels, and an image dimension of 56, on an Intel HD 530 with roughly 400 GFLOPS capacity, gives:

  • 295.6 GFlops for OneDNN's channels-last format
  • 144.7 GFlops for dlprimitives' channels-first format
  • 33.4(!) GFlops for OneDNN's channels-first format
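
Relative to the roughly 400 GFLOPS peak of that GPU, a quick back-of-the-envelope check of the utilization:

    peak = 400.0  # approximate GFLOPS capacity of the Intel HD 530
    results = {
        "OneDNN, channels-last": 295.6,
        "dlprimitives, channels-first": 144.7,
        "OneDNN, channels-first": 33.4,
    }
    for name, gflops in results.items():
        print(f"{name}: {gflops / peak:.0%} of peak")
    # -> roughly 74%, 36% and 8% respectively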

The problem is that channels-first is the most common format, used by pytorch, mxnet, caffe and many other tools (including dlprimitives).

Ok... I'll check it again later, when one of two things happens:

  1. They fix channels-first performance
  2. I add support for the channels-last format internally

Pytorch Updates

Tuesday, November 16, 2021, by artyom ; Posted in: Internals; 0 comments

To speed up progress I started validating all the pretrained torchvision models one by one. I found several features I needed to implement, but more importantly I found several critical bugs I could fix.

https://pytorch.org/vision/stable/models.html#classification

At this point the following networks are validated against the CPU version in both forward and backward propagation:

  • alexnet
  • resnet18
  • resnet50
  • vgg16
  • densenet161
  • googlenet
  • squeezenet1_0
  • inception_v3 (fwd only - backward fails on cuda/cpu)
  • shufflenet_v2_x1_0
  • mobilenet_v2
  • mobilenet_v3_large
  • mobilenet_v3_small (fwd only - same failure on bwd on cuda)
  • resnext50_32x4d
  • wide_resnet50_2
  • mnasnet1_0
  • efficientnet_b0
  • efficientnet_b4
  • regnet_y_400mf

To be continued...

Update Nov 17, 2021: I implemented the ceil-rounding pooling mode, so googlenet and squeezenet1_0 now pass validation.
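
For reference, the difference between the two rounding modes is only in the output-size formula; a plain illustration with assumed example sizes:

    import math

    def pool_output_size(size, kernel, stride, pad=0, ceil_mode=False):
        """Standard pooling output-size formula with floor vs ceil rounding."""
        r = (size + 2 * pad - kernel) / stride
        return (math.ceil(r) if ceil_mode else math.floor(r)) + 1

    # e.g. 3x3 pooling with stride 2 over a 112-pixel dimension
    print(pool_output_size(112, 3, 2, ceil_mode=False))  # 55
    print(pool_output_size(112, 3, 2, ceil_mode=True))   # 56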

Pointwise Broadcast Reduce

Tuesday, November 16, 2021, by artyom ; Posted in: Internals; 0 comments

Lots of deep learning operations can be implemented as simple element-by-element operations over different tensors with numpy-style broadcasting, followed by a reduction. For example:

Adding a bias of shape [C] to a [B,C,H,W] image can be written in numpy as:

    x + bias.reshape((C,1,1))

The gradient of the bias can be calculated as:

    np.sum(dy, axis=(0,2,3))

These are simple reduction operations. Calculating the mean and variance in batch normalisation requires sums of x and x*x over all dimensions but C.
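
For example, in numpy (a plain illustration with made-up shapes, not the dlprimitives API):

    import numpy as np

    B, C, H, W = 32, 16, 28, 28            # illustrative shapes
    x = np.random.randn(B, C, H, W).astype(np.float32)

    # sums of x and x*x over all dimensions but C ...
    s1 = x.sum(axis=(0, 2, 3))
    s2 = (x * x).sum(axis=(0, 2, 3))

    # ... give the per-channel mean and variance
    n = B * H * W
    mean = s1 / n
    var = s2 / n - mean ** 2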

Observing this, I implemented a broadcast/reduce template API to simplify development: http://dlprimitives.org/docs/pointwise_8hpp_source.html

The idea is the following:

  • You provide input tensors and scalar parameters
  • You define the operation that needs to be performed on each operand
  • You provide the reduction operation

The OpenCL kernel code is auto-generated for you. For example, calculating the sums of x and x*x over all dimensions but channels would look like:

    auto op = dlprim::core::PointwiseOperationBroadcastReduce::create(
                ctx,
                {X.specs()},{Xsum.specs(),X2sum.specs()}, // input and output tensor specs
                0,dlprim::float_data,                     // no extra scalar parameters
                "y0=x0; y1=x0*x0;",                       // per-element operations
                "reduce_y0 = 0; reduce_y1 = 0",           // reduce init
                "reduce_y0 += y0; reduce_y1 += y1"        // reduction
               );
    // run on queue q: inputs {X}, outputs {Xsum,X2sum}, workspace tensor s
    op->enqueue({X},{Xsum,X2sum},s,{},{1,1},{0,0},q);

So the first output is the sum of x and the second is the sum of x*x. If you provide X with shape [B,C,H,W] and Xsum, X2sum with shape [C,1,1], which is broadcastable to X, you get the sums you need without writing custom reduction code or hand-writing kernels.

This vastly simplified writing many operators, especially ones that are expected to support numpy-style broadcasting in pytorch.
