FROM KERAS TO CAFFE – Deep Vision Consulting

Keras is a great tool for training deep learning models, but when it comes to deploying a trained model on an FPGA, Caffe models are still the de facto standard.

Unfortunately, one cannot simply take a model trained with Keras and import it into Caffe. The reason is twofold: first, Caffe doesn’t offer any function for importing Keras models; second, there are fundamental differences in how the two frameworks handle weights and apply operations.

In this brief post, we’ll introduce a toy network designed and trained in Keras and show how to test it with Caffe. To do so, we will see how to:

create the Caffe *.prototxt file from code, given a model definition in keras
create an empty Caffe model from the prototxt file and transfer weights from keras model
test the Caffe model

Everywhere below we’ll assume TensorFlow (TF) as the Keras backend and the TF image channel ordering (channels last).

CREATE THE PROTOTXT

Suppose we have the following keras model, where \(W=H=32\).

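from keras.layers import Input, Conv2D, BatchNormalization, LeakyReLU, MaxPooling2D
from keras.models import Model

H = W = 32  # input height and width

input_stream = Input(shape=(H, W, 3))
x = input_stream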

x = Conv2D(8, (3, 3), strides=(1, 1), padding='same', use_bias=False, name='conv1')(x)
x = BatchNormalization(name='bn1')(x)
x = LeakyReLU(alpha=0.1)(x)
x = MaxPooling2D(pool_size=(2, 2), strides=(2, 2))(x)

x = Conv2D(16, (3, 3), strides=(1, 1), padding='same', use_bias=True, name='conv2')(x)
x = BatchNormalization(name='bn2')(x)
x = LeakyReLU(alpha=0.1)(x)
x = MaxPooling2D(pool_size=(2, 2), strides=(1, 1), padding='same')(x)

output_stream = x
model = Model(inputs=input_stream, outputs=output_stream)

It’s a basic setup with Conv2D, BatchNormalization, LeakyReLU and MaxPooling2D layers. Among these, only Conv2D and BatchNormalization are learnable and thus have weights we wish to transfer to Caffe. To automate the transfer, we assign a unique identifier to each layer with trainable weights through the “name” parameter.
We then need to define a network through the Caffe interface that mimics the one described above. To do this, Caffe provides the NetSpec() class.

import caffe
from caffe import layers as cl, params as cp

net = caffe.NetSpec()
net.data = cl.Input(shape=[dict(dim=[1, 3, H, W])])

net.conv1 = cl.Convolution(net.data, name='conv1', kernel_size=3, stride=1, num_output=8, pad=1)
net.bn1 = cl.BatchNorm(net.conv1, name='bn1', use_global_stats=True)
net.sc1 = cl.Scale(net.bn1, name='bn1_sc', bias_term=True)
net.lr1 = cl.ReLU(net.sc1, negative_slope=0.1)
net.mp1 = cl.Pooling(net.lr1, kernel_size=2, stride=2, pool=cp.Pooling.MAX)

net.conv2 = cl.Convolution(net.mp1, name='conv2', kernel_size=3, stride=1, num_output=16, pad=1)
net.bn2 = cl.BatchNorm(net.conv2, name='bn2', use_global_stats=True)
net.sc2 = cl.Scale(net.bn2, name='bn2_sc', bias_term=True)
net.lr2 = cl.ReLU(net.sc2, negative_slope=0.1)
net.mp2 = cl.Pooling(net.lr2, kernel_size=2, stride=1, pad=1, pool=cp.Pooling.MAX)
net.cr1 = cl.Crop(net.mp2, net.lr2, axis=2, offset=1)

Once the network has been defined, we can simply dump it to the prototxt file with the following line of code:

with open('model_test.prototxt', 'w') as f: f.write(str(net.to_proto()))

There are a few important differences that are worth highlighting. Before diving into them, note that we require all learnable layers to be named like their counterparts in Keras; this will allow us to associate them while transferring weights.

BATCHNORM
After each BatchNorm, we have to add a Scale layer in Caffe. The reason is that the Caffe BatchNorm layer only subtracts the mean from the input data and divides by the standard deviation; it does not include the \(\gamma\) and \(\beta\) parameters that respectively scale and shift the normalized distribution ¹. Conversely, the Keras BatchNormalization layer includes and applies all of the parameters mentioned above. Adding a Caffe Scale layer with the “bias_term” parameter set to True is a safe way to reproduce the exact behavior of the Keras version.
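As a quick sanity check of this decomposition, here is a minimal sketch in plain Python on scalar values, with no Caffe or Keras involved. The function names and the sample numbers are illustrative assumptions and not part of either framework, and the same epsilon is used on both sides purely for simplicity. It only shows that normalizing first and applying \(\gamma\) and \(\beta\) afterwards yields the same number as the single Keras-style formula.

```python
import math

def keras_style_bn(x, mean, var, gamma, beta, eps=1e-3):
    """One-step batchnorm: normalize, then scale by gamma and shift by beta."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta

def caffe_style_batchnorm(x, mean, var, eps=1e-3):
    """Normalization only, i.e. a batchnorm without gamma and beta."""
    return (x - mean) / math.sqrt(var + eps)

def caffe_style_scale(x, gamma, beta):
    """Scale layer with a bias term: y = gamma * x + beta."""
    return gamma * x + beta

# Illustrative values (assumptions, not taken from a trained model)
x, mean, var, gamma, beta = 0.7, 0.2, 0.5, 1.5, -0.3

one_step = keras_style_bn(x, mean, var, gamma, beta)
two_steps = caffe_style_scale(caffe_style_batchnorm(x, mean, var), gamma, beta)
print(one_step, two_steps)  # the two values agree up to floating-point rounding
```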

STRIDING AND PADDING
Keras offers a very convenient padding option for strided operations, i.e. padding=’same’. It ensures that the spatial size of the activation map is preserved up to the striding factor: if the stride is \(1\), the input and output tensors have the same \((W, H)\) size, while the output is exactly half the size when the stride is \(2\) (for even \(W\) and \(H\)). Intuitively, it adds just the right amount of padding to allow all operations to be carried out meaningfully. In Caffe, you have to state the padding value explicitly instead.

Now consider a max pooling with stride \(1\), like the second max pooling in the toy example above, over an input tensor of size \((W/2, H/2)\). If we tell Caffe not to add any padding, the output map will be one pixel smaller, \((W/2-1, H/2-1)\), than its Keras counterpart, because a \(2\times2\) kernel cannot be applied over the last row and last column of the input. On the other hand, if we add a padding of one, Caffe pads one row and one column on each side of the input and the output is a \((W/2+1, H/2+1)\)-sized tensor. To obtain the same output shape, we thus need to crop the padded output of the Caffe pooling layer. With the TF backend, padding=’same’ places the extra row and column at the bottom and right in this case, so we wish to discard the first row and the first column of the Caffe output, and that’s exactly why we added a Crop layer in the network definition.
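The size arithmetic above can be reproduced with a few lines of plain Python. This is a sketch under assumptions: the helper names are made up for illustration, and the formulas are our reading of how Caffe's Pooling layer (which rounds up by default) and Keras 'same' padding compute output sizes.

```python
import math

def keras_same_output_size(size, stride):
    """Output length of a 'same'-padded operation: ceil(size / stride)."""
    return math.ceil(size / stride)

def caffe_pool_output_size(size, kernel, stride, pad=0):
    """Output length of a Caffe Pooling layer, which rounds up (ceil)."""
    out = math.ceil((size + 2 * pad - kernel) / stride) + 1
    # With padding, the last window is dropped if it would start inside the padding
    if pad > 0 and (out - 1) * stride >= size + pad:
        out -= 1
    return out

W = 32
half = caffe_pool_output_size(W, kernel=2, stride=2)              # first pooling
no_pad = caffe_pool_output_size(half, kernel=2, stride=1)         # second pooling, no padding
padded = caffe_pool_output_size(half, kernel=2, stride=1, pad=1)  # second pooling, pad=1
target = keras_same_output_size(half, stride=1)                   # what Keras produces

print(half, no_pad, padded, target)  # 16 15 17 16
```

Cropping the padded output with an offset of one removes one row and one column, which brings the 17-sized map back to the 16 expected from Keras.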

TRANSFER THE WEIGHTS

Now that we have defined our empty network, we can fill it with weights! Let’s consider a function that takes two inputs, the Keras model and the Caffe model, and returns the Caffe model with the transferred weights.

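    caffe_net = caffe.Net('model_test.prototxt', caffe.TEST)  # empty Caffe model built from the prototxt
    net_with_weights = keras_weights_to_caffe_model(keras_model=model, caffe_model=caffe_net)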

This function loops through all the layers of the Keras model and processes only those whose name matches one of the layers in the Caffe model. In the toy example above, such a layer is either a conv layer or a batchnorm layer.

    for layer in keras_model.layers:

        # skip if there is no caffe layer named accordingly
        if layer.name not in list(caffe_model._layer_names):
            continue

        if isinstance(layer, keras.layers.Conv2D):
            ...  # convert convolutions (detailed below)

        elif isinstance(layer, keras.layers.BatchNormalization):
            ...  # convert batchnorm layers (detailed below)

2D CONVOLUTIONS
If the layer is a convolution, there are a couple of things to notice. First, keras.layers.Convolution2D.get_weights() returns a list with one or two elements, depending on the presence of the bias term. If the bias is missing, we need to fill the corresponding Caffe blob with a zero-valued array. Caffe keeps the same parameter ordering for convolutions: the first element is the kernel, the second one is the bias. But the biggest difference is in the axis ordering. The dimensions of a kernel tensor in Keras have the following semantics: \((h, w, c_\text{in}, c_\text{out})\), where \(h\times w\) is the kernel size, \(c_\text{in}\) is the number of channels of the previous layer and \(c_\text{out}\) is the number of feature maps to output. Caffe, on the other hand, expects a kernel tensor ordered as \((c_\text{out}, c_\text{in}, h, w)\). Special care is needed if you’re not using the TF image channel ordering, since the Keras kernel tensor might be laid out differently.

        # convert convolutions
        weights = layer.get_weights()  # [kernel] or [kernel, bias]
        w = weights[0]
        b = weights[1] if len(weights) > 1 else np.zeros(w.shape[-1])  # the bias term might not be present
        caffe_model.params[layer.name][0].data[...] = np.transpose(w, (3, 2, 0, 1))  # Caffe wants (c_out, c_in, h, w)
        caffe_model.params[layer.name][1].data[...] = b
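
To make the axis permutation concrete, the following sketch performs the same reordering as np.transpose(w, (3, 2, 0, 1)) on nested Python lists. The helper name and the toy kernel are assumptions made for illustration; each entry of the toy kernel stores its own Keras index, so it is easy to see where it ends up in the Caffe layout.

```python
def keras_kernel_to_caffe(kernel):
    """Reorder a nested-list kernel from (h, w, c_in, c_out) to (c_out, c_in, h, w)."""
    h, w = len(kernel), len(kernel[0])
    c_in, c_out = len(kernel[0][0]), len(kernel[0][0][0])
    return [[[[kernel[y][x][i][o] for x in range(w)]
              for y in range(h)]
             for i in range(c_in)]
            for o in range(c_out)]

# Toy 3x3 kernel with 2 input and 4 output channels (illustrative sizes);
# each entry is the tuple of its own (y, x, c_in, c_out) index
h, w, c_in, c_out = 3, 3, 2, 4
keras_kernel = [[[[(y, x, i, o) for o in range(c_out)]
                  for i in range(c_in)]
                 for x in range(w)]
                for y in range(h)]

caffe_kernel = keras_kernel_to_caffe(keras_kernel)
caffe_shape = (len(caffe_kernel), len(caffe_kernel[0]),
               len(caffe_kernel[0][0]), len(caffe_kernel[0][0][0]))

print(caffe_shape)               # (4, 2, 3, 3)
print(caffe_kernel[3][1][0][2])  # (0, 2, 1, 3): row 0, column 2, input 1, output 3
```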

BATCHNORM
Converting a batchnorm layer is somewhat trickier. First, recall that a batchnorm consists of a set of 4 parameters: mean, variance, \(\beta\) and \(\gamma\). In keras.layers.BatchNormalization.get_weights(), these parameters are ordered as [0] \(\gamma\), [1] \(\beta\), [2] mean and [3] variance. However, it is important to know that the BatchNorm layer in Caffe only accounts for mean and variance, and has an additional scaling factor (that divides both mean and variance) that we don’t find in Keras but can safely be set to 1 without loss of generality. For the \(\gamma\) and \(\beta\) parameters, we need to find the Scale layer paired with the Caffe BatchNorm layer. It can be retrieved through the BatchNorm name if a proper naming convention was used when defining the network structure (here, the “_sc” suffix). Of utmost importance is to know that Keras adds a regularizing constant, \(10^{-3}\) by default, to the variance. This value is automatically added inside the Keras BatchNormalization layer and ensures that the denominator is never too small, so that the normalized output is somewhat more stable across batches. This constant is not small enough to be neglected and needs to be added to the variance during the transfer to obtain comparable outputs from this layer in Caffe and Keras. Note that the Caffe BatchNorm layer also adds its own, much smaller eps (\(10^{-5}\) by default), so a tiny residual difference remains.

        # convert batchnorm layers
        gamma, beta, mean, variance = layer.get_weights()
        caffe_model.params[layer.name][0].data[...] = mean
        caffe_model.params[layer.name][1].data[...] = variance + 1e-3
        caffe_model.params[layer.name][2].data[...] = 1  # always set scale factor to 1
        caffe_model.params['{}_sc'.format(layer.name)][0].data[...] = gamma  # scale
        caffe_model.params['{}_sc'.format(layer.name)][1].data[...] = beta  # bias
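
To see why the regularizing constant cannot be ignored, here is a scalar sketch in plain Python. The function caffe_bn_plus_scale is a hypothetical model of what a Caffe BatchNorm layer (test phase, default eps of 1e-5) followed by a Scale layer computes from its three blobs; it is not Caffe code, and the sample values are assumptions chosen with a small variance to make the effect visible.

```python
import math

KERAS_EPS = 1e-3  # Keras BatchNormalization default epsilon
CAFFE_EPS = 1e-5  # Caffe BatchNorm default eps

def keras_bn(x, gamma, beta, mean, var, eps=KERAS_EPS):
    """Keras-style batchnorm at inference time."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta

def caffe_bn_plus_scale(x, blobs, gamma, beta, eps=CAFFE_EPS):
    """Hypothetical scalar model of Caffe BatchNorm followed by Scale."""
    mean_blob, var_blob, scale_factor = blobs
    factor = 0.0 if scale_factor == 0 else 1.0 / scale_factor
    normalized = (x - mean_blob * factor) / math.sqrt(var_blob * factor + eps)
    return gamma * normalized + beta

# Illustrative values (assumptions, not taken from a trained model)
x, gamma, beta, mean, var = 0.7, 1.5, -0.3, 0.2, 0.01

reference = keras_bn(x, gamma, beta, mean, var)
with_eps = caffe_bn_plus_scale(x, (mean, var + KERAS_EPS, 1.0), gamma, beta)
without_eps = caffe_bn_plus_scale(x, (mean, var, 1.0), gamma, beta)

print(abs(with_eps - reference))     # small residual caused by Caffe's own eps
print(abs(without_eps - reference))  # much larger error when 1e-3 is not transferred
```

If the small residual matters, one option that should work is to set the eps field of the Caffe layer's batch_norm_param to the Keras value and transfer the raw variance instead; we have not verified this on the toy network.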

CONCLUSIONS

While Keras is super useful at training time, we often need to move to other frameworks for deployment. For embedded devices and FPGA, Caffe is typically the supported reference framework but transferring a learned model from Keras to Caffe requires careful supervision. In this post we highlighted some of the differences we have encountered and how to solve them.

¹ Ioffe, Sergey, and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift.” International Conference on Machine Learning. 2015.