
Implementation of the popular Inception v3 network in NumPy, an implementation of the AdaSmooth optimizer, and a comparison of the optimizers on the Cars dataset.


Table of contents

Inception v3

arxiv: https://arxiv.org/abs/1512.00567

The advantages of the architecture are as follows:

  • Avoiding bottlenecks, i.e., layers with a sharp reduction in dimensionality, especially early in the network.
  • Wider layers (higher-dimensional representations) learn faster.
  • Neighboring pixels in an image are highly correlated, so the dimensionality in front of a convolutional layer can be reduced without losing representational power.
  • A proportional increase in both the depth and the width of the network is more efficient than increasing depth alone.

The architecture consists of inception blocks of different types:

  • Inception A

    A 5x5 convolution and two consecutive 3x3 convolutions have the same receptive field, so the block processes the input in both ways in parallel and then concatenates the results.

    A pooling layer can be placed before or after the convolutions; in both cases the output dimension is the same. In the first case, however, there is a sharp drop in the number of activations (a bottleneck), and the second is inefficient from the point of view of computation. Therefore, the pooling layer is used in parallel with the other two branches, and its output is concatenated as well.

  • Inception B

    The scheme is similar to the Inception A block; convolutional layers with strides are used.

  • Inception C

    Instead of an nxn convolution, we can apply an nx1 convolution followed by a 1xn convolution. This makes the architecture more computationally efficient (see the sketch after this list). In this block, n is equal to 7.

  • Inception D

  • Inception E

    The same trick with 1xn and nx1 convolutions is used, but with n equal to 3.

The resulting model contains 24 388 980 trainable parameters.
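
As a rough illustration of the ideas above (a hypothetical sketch, not code from this repository), the snippet below compares the weight counts of the factorized convolutions and shows how parallel branches are merged along the channel axis:

```python
import numpy as np

# Weight count of a kh x kw convolution with c_in input and c_out output channels
# (biases ignored).
def conv_weights(kh, kw, c_in, c_out):
    return kh * kw * c_in * c_out

c = 192
print(conv_weights(5, 5, c, c))                             # single 5x5: 921,600
print(2 * conv_weights(3, 3, c, c))                         # two stacked 3x3: 663,552
print(conv_weights(7, 1, c, c) + conv_weights(1, 7, c, c))  # 7x1 + 1x7: 516,096

# Branches with the same spatial size are concatenated along the channel axis,
# as in the Inception A block.
branch_a = np.zeros((8, 64, 35, 35))     # e.g. the 1x1 branch
branch_b = np.zeros((8, 96, 35, 35))     # e.g. the double-3x3 branch
branch_pool = np.zeros((8, 32, 35, 35))  # the pooling branch
block_out = np.concatenate([branch_a, branch_b, branch_pool], axis=1)
print(block_out.shape)  # (8, 192, 35, 35)
```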

Implementation

The NumPy model was trained on the MNIST dataset for 1750 steps with a batch size of 8. Training took 42 hours. The plot shows that the loss decreases, i.e., the model is learning.

Tests

There are tests that check that the outputs and gradients of the modules written in NumPy match those of the corresponding torch modules: tests/test_layers.py

To run tests, use the command:

pytest tests/test_layers.py

AdaSmooth

arxiv: https://arxiv.org/abs/2204.00825

The main advantage of the approach is that it eliminates the need to tune hyperparameters by hand, except for the learning rate.

In addition, unlike other optimizers, this method takes into account not only the absolute values of the gradients but also the overall movement in each dimension.

Effective Ratio

This is achieved through the efficiency ratio (ER), a coefficient often used in finance.

$$e_t=\frac{s_t}{n_t} = \frac{\text{Total move for a period}}{\text{Sum of absolute move for each bar}} = \frac{|x_t - x_{t - M}|}{\sum\limits_{i=0}^{M - 1} |x_{t - i} - x_{t - 1 - i}|}$$

Or

$$e_t = \frac{|\sum\limits_{i=0}^{M - 1}{\Delta x_{t - 1 -i}}|}{\sum\limits_{i=0}^{M - 1}{|\Delta x_{t - 1 - i}|}}$$

where $x_i$ is the vector of all trainable model parameters at step $i$ and $\Delta x_i$ is the gradient-descent update applied at step $i$.

A large value of this coefficient in a given dimension means that gradient descent moves consistently in one direction in that dimension; smaller values mean that it moves in a zigzag pattern.

To simplify the calculations, $M$ is set equal to the batch index within each epoch. In this case, there is no need to store the last $M$ parameter vectors; the ER value can be recalculated in an accumulated fashion.
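
A minimal sketch of this accumulated computation (an assumption about the implementation, not the repository's code) could look as follows:

```python
import numpy as np

class EffectiveRatio:
    """Accumulates the effective ratio e_t over one epoch without storing
    the last M parameter vectors."""

    def __init__(self, params):
        self.epoch_start = params.copy()        # x_{t-M}: parameters at the start of the epoch
        self.abs_moves = np.zeros_like(params)  # running sum of |Δx| per dimension

    def update(self, params, step):
        # step is the update Δx_t that was just applied to the parameters
        self.abs_moves += np.abs(step)
        total_move = np.abs(params - self.epoch_start)
        return total_move / (self.abs_moves + 1e-8)  # elementwise e_t

# Toy usage: random "updates" applied to a parameter vector.
params = np.zeros(4)
er = EffectiveRatio(params)
for _ in range(5):
    step = 0.1 * np.random.randn(4)
    params = params + step
    e_t = er.update(params, step)
print(e_t)
```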

AdaSmooth

In the AdaDelta (RMSProp) algorithm, the update step is calculated as follows:

$$\Delta x_t = - \frac{\eta}{\text{RMS}[g]_t} \odot g_t$$

where

$$\text{RMS}[g]_t = \sqrt{E[g^2]_t + \epsilon}$$

$$E[g^2]_t = \rho E[g^2]_{t - 1} + (1 - \rho) g^2_t$$

The constant $1 - \rho$ is also known as the smoothing constant (SC) and can be written as

$$\text{SC} = \frac{2}{N + 1}$$

and the period $N$ can be thought of as the number of past values used in the moving-average calculation.

Let's take a small period value and a large period value, $N_1 = 3$ and $N_2 = 199$ (taking values outside this range is inefficient). With a small period, only a few past steps are taken into account, and gradient descent quickly moves in a new direction. With a large period, gradient descent tends to keep moving in the direction of the accumulated gradients and changes direction slowly.

Let's denote

$$\text{fast SC} = \frac{2}{N_1 + 1} = 1 - \rho_1 = 0.5$$

$$\text{slow SC} = \frac{2}{N_2 + 1} = 1 - \rho_2 = 0.01$$

The effective period is then determined at each step by the behavior of gradient descent, which is expressed by the ER coefficient.

$$c_t = (\rho_2 - \rho_1) \times e_t + (1 - \rho_2)$$

The value of $E[g^2]_t$, in contrast to the AdaDelta method, is calculated as follows:

$$E[g^2]_t = c^2_t \odot g^2_t + (1 - c^2_t) \odot E[g^2]_{t - 1}$$

The rest is calculated in the same way as in the AdaDelta method.
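
Putting the pieces together, a single AdaSmooth update might look like the sketch below (a hypothetical illustration based on the formulas above, not the repository's optimizer code):

```python
import numpy as np

def init_state(param):
    # Per-parameter optimizer state; epoch_start and abs_moves are reset at each new epoch.
    return {
        "epoch_start": param.copy(),
        "abs_moves": np.zeros_like(param),
        "prev_step": np.zeros_like(param),
        "E_g2": np.zeros_like(param),
    }

def adasmooth_step(param, grad, state, lr=1e-3, rho1=0.5, rho2=0.99, eps=1e-8):
    # Effective ratio e_t accumulated since the start of the epoch.
    state["abs_moves"] += np.abs(state["prev_step"])
    e_t = np.abs(param - state["epoch_start"]) / (state["abs_moves"] + eps)

    # Smoothing coefficient c_t interpolates between fast SC (1 - rho1) and slow SC (1 - rho2).
    c_t = (rho2 - rho1) * e_t + (1 - rho2)

    # Moving average of squared gradients with per-dimension coefficient c_t^2.
    state["E_g2"] = c_t**2 * grad**2 + (1 - c_t**2) * state["E_g2"]

    step = -lr * grad / np.sqrt(state["E_g2"] + eps)
    state["prev_step"] = step
    return param + step

# Toy usage: a few AdaSmooth steps on f(x) = ||x||^2 (gradient is 2x).
x = np.ones(3)
state = init_state(x)
for _ in range(100):
    x = adasmooth_step(x, 2 * x, state)
print(x)
```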

Implementations

Experiments and results

The torch implementation of Inception v3 was trained on the Cars dataset with two optimizers: AdaSmooth and Adam. The dataset contains 196 classes (car brands).

The models were trained for 50 epochs with a batch size of 128 and a learning rate of 1e-3 for both optimizers. The Adam implementation is taken from the torch library, with beta parameters equal to 0.9 and 0.999.
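
For reference, the Adam side of this setup can be reproduced roughly as follows (a hedged sketch: the repository trains its own torch Inception v3 module, so torchvision's model is only a stand-in here):

```python
import torch
import torchvision

# Stand-in model with 196 output classes; the repository uses its own Inception v3 module.
model = torchvision.models.inception_v3(weights=None, num_classes=196)

# Adam with the hyperparameters stated above.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
```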

Training one epoch took 13:51 with the Adam optimizer and 14:31 with AdaSmooth.

The values of the loss function are shown in the plot below.

As the plot shows, AdaSmooth converges more slowly than Adam. However, when training with AdaSmooth, the accuracy on the validation set reaches higher values for the same number of steps (0.685 vs 0.50625):

These results follow from how AdaSmooth works: it adapts its updates to the behavior of gradient descent, so it needs time for the ER coefficient to settle, which makes its convergence slower.

At the same time, because the method takes into account not only the absolute values of the gradients but also the overall shift in each direction, this extra information helps it find better model parameters, so the final quality with this optimizer is higher.
