arxiv: https://arxiv.org/abs/1512.00567
The advantages of the architecture are as follows:
- Avoiding representational bottlenecks, that is, layers with a sharp reduction in dimensionality, especially early in the network.
- Wider layers learn faster.
- Neighboring pixels in an image are highly correlated, so the dimensionality in front of a convolutional layer can be reduced without losing representational power.
- Increasing the depth and width of the network proportionally is more efficient than increasing depth alone.
The architecture consists of inception blocks of different types:
- Inception A
A 5x5 convolution and two consecutive 3x3 convolutions have the same receptive field, so the block processes the input along both paths and concatenates the results (a minimal sketch of such a block is shown after this list).
A pooling layer could be placed before or after the convolutions; in both cases the output dimension is the same. In the first case, however, the number of activations drops sharply, and the second case is computationally inefficient. Therefore, the pooling layer is used in parallel with the other two branches and its output is concatenated as well.
- Inception B
The scheme is similar to the Inception A block, but convolutional layers with stride are used to reduce the spatial dimensions.
- Inception C
Instead of an nxn convolution, an nx1 convolution followed by a 1xn convolution can be applied, which makes the architecture more computationally efficient. In this block, n equals 7.
- Inception D
- Inception E
The same 1xn and nx1 factorization is used, but with n equal to 3.
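To make the branch structure concrete, below is a minimal PyTorch sketch of an Inception-A-style block that also uses the 1xn/nx1 factorization in one branch. It is illustrative only: the channel counts and the `BasicConv2d`/`SimplifiedInceptionBlock` names are assumptions, not the configuration used in numpy_modules or torch_modules.

```python
import torch
import torch.nn as nn


class BasicConv2d(nn.Module):
    """Convolution followed by batch norm and ReLU, as used throughout Inception V3."""

    def __init__(self, in_channels, out_channels, **kwargs):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, bias=False, **kwargs)
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        return torch.relu(self.bn(self.conv(x)))


class SimplifiedInceptionBlock(nn.Module):
    """Illustrative Inception-style block: parallel branches concatenated along channels.

    Branch 1: 1x1 convolution.
    Branch 2: 1x1 bottleneck, then two 3x3 convolutions (same receptive field as a 5x5).
    Branch 3: 1x1 bottleneck, then a factorized 1x3 + 3x1 pair instead of a full 3x3.
    Branch 4: average pooling followed by a 1x1 convolution.
    """

    def __init__(self, in_channels):
        super().__init__()
        self.branch1 = BasicConv2d(in_channels, 64, kernel_size=1)
        self.branch2 = nn.Sequential(
            BasicConv2d(in_channels, 48, kernel_size=1),
            BasicConv2d(48, 64, kernel_size=3, padding=1),
            BasicConv2d(64, 64, kernel_size=3, padding=1),
        )
        self.branch3 = nn.Sequential(
            BasicConv2d(in_channels, 48, kernel_size=1),
            BasicConv2d(48, 64, kernel_size=(1, 3), padding=(0, 1)),
            BasicConv2d(64, 64, kernel_size=(3, 1), padding=(1, 0)),
        )
        self.branch4 = nn.Sequential(
            nn.AvgPool2d(kernel_size=3, stride=1, padding=1),
            BasicConv2d(in_channels, 32, kernel_size=1),
        )

    def forward(self, x):
        # All branches preserve the spatial size, so the outputs can be concatenated by channel.
        return torch.cat(
            [self.branch1(x), self.branch2(x), self.branch3(x), self.branch4(x)], dim=1
        )
```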
The resulting model contains 24 388 980 trainable parameters.
- Numpy: numpy_modules
- Torch: torch_modules (based on the implementation from the torchvision library)
The NumPy model was trained on the MNIST dataset for 1750 steps with a batch size of 8; training took 42 hours. The plot shows that the model is learning: the loss decreases.
There are tests that check that the outputs and gradients of the modules written in NumPy match those of the corresponding modules from the torch library: test/test_layers.py
To run tests, use the command:
pytest tests/test_layers.py
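As an illustration of this kind of parity check, here is a self-contained sketch that compares a manual NumPy forward and backward pass of a linear layer against torch.nn.Linear. The actual tests exercise the modules from this repository; the layer, tolerances, and test name below are assumptions.

```python
import numpy as np
import torch


def test_linear_matches_torch():
    """Forward output and input gradient of a NumPy linear layer must match torch.nn.Linear."""
    rng = np.random.default_rng(0)
    x_np = rng.standard_normal((8, 16)).astype(np.float32)

    layer = torch.nn.Linear(16, 4)
    w_np = layer.weight.detach().numpy()
    b_np = layer.bias.detach().numpy()

    # NumPy forward pass: y = x W^T + b.
    y_np = x_np @ w_np.T + b_np

    # Torch forward and backward with a simple sum loss.
    x_t = torch.tensor(x_np, requires_grad=True)
    y_t = layer(x_t)
    y_t.sum().backward()

    # NumPy backward pass for the same loss: dL/dx = dL/dy @ W, with dL/dy = 1.
    grad_x_np = np.ones_like(y_np) @ w_np

    assert np.allclose(y_np, y_t.detach().numpy(), atol=1e-5)
    assert np.allclose(grad_x_np, x_t.grad.numpy(), atol=1e-5)
```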
arxiv: https://arxiv.org/abs/2204.00825
The main advantage of the approach is that it eliminates the need to tune hyperparameters by hand, apart from the learning rate.
In addition, unlike other optimizers, this method takes into account not only the absolute values of the gradients but also the overall movement in each dimension.
This is achieved through the efficiency ratio (ER), a coefficient often used in finance:
$$e_t = \frac{s_t}{n_t} = \frac{|x_t - x_{t-M}|}{\sum_{i=0}^{M-1} |x_{t-i} - x_{t-1-i}|}$$
or, equivalently, in terms of the parameter updates:
$$e_t = \frac{\left|\sum_{i=0}^{M-1} \Delta x_{t-i}\right|}{\sum_{i=0}^{M-1} \left|\Delta x_{t-i}\right|}$$
where $x_t$ is the vector of model parameters at step $t$, $\Delta x_t = x_t - x_{t-1}$, $M$ is the period, and all operations are element-wise.
A large value of this coefficient in a given dimension means that the parameters move consistently in one direction along that dimension; a value close to zero means that the steps mostly cancel each other out, i.e., the parameter oscillates around a point.
To simplify the calculations, these sums are accumulated incrementally during training rather than recomputed from stored history at every step.
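A tiny NumPy example of the element-wise ER over a window of M = 4 steps (the numbers are made up for illustration): one coordinate moves steadily in one direction, the other oscillates around its value.

```python
import numpy as np

# Parameter updates over the last M = 4 steps for two coordinates.
# Column 0 moves steadily by +0.1; column 1 oscillates around its value.
deltas = np.array([
    [0.1,  0.1],
    [0.1, -0.1],
    [0.1,  0.1],
    [0.1, -0.1],
])

# ER = |total movement| / sum of |per-step movements|, element-wise per coordinate.
er = np.abs(deltas.sum(axis=0)) / (np.abs(deltas).sum(axis=0) + 1e-12)
print(er)  # ~[1.0, 0.0] -> trending coordinate vs oscillating coordinate
```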
In AdaDelta (RMSProp), the update step is computed as
$$\Delta x_t = -\frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \odot g_t$$
where the running average of the squared gradients is
$$E[g^2]_t = \rho E[g^2]_{t - 1} + (1 - \rho)\, g^2_t$$
The constant $\rho$ and the period $N$ of the exponential moving average are related by
$$\rho = 1 - \frac{2}{N + 1}$$
Let's take a small period value $M_1$ and a large period value $M_2$, and denote the corresponding constants by $\rho_1$ and $\rho_2$. The scaled smoothing constant is then defined as
$$c_t = (\rho_2 - \rho_1) \odot e_t + (1 - \rho_2)$$
so the effective period is determined by the behavior of gradient descent at each step, as expressed by the ER coefficient: the more consistent the movement in a dimension, the larger $c_t$ and the faster the running average reacts in that dimension.
The value of $E[g^2]_t$ is then computed as
$$E[g^2]_t = c^2_t \odot g^2_t + (1 - c^2_t) \odot E[g^2]_{t - 1}$$
The rest is calculated in the same way as in the AdaDelta method.
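Below is a compact NumPy sketch of one AdaSmooth-style update step following the formulas above. The values ρ1 = 0.5 and ρ2 = 0.99, the state layout, and the fixed window start are illustrative assumptions; the implementation in ada_smooth.py may organize this differently.

```python
import numpy as np


def adasmooth_step(x, grad, state, lr=1e-3, rho1=0.5, rho2=0.99, eps=1e-6):
    """One AdaSmooth-style parameter update (illustrative sketch).

    `state` keeps the quantities needed for the ER and the running average:
    the parameter value at the start of the window, the accumulated absolute
    per-step movement, and E[g^2].
    """
    # Efficiency ratio: net movement over the window divided by total movement.
    # NOTE: for simplicity the window start is fixed at initialization here,
    # whereas the method above uses a window of the last M steps.
    s = np.abs(x - state["x_window_start"])
    n = state["abs_movement"] + eps
    er = s / n

    # Scaled smoothing constant: interpolate between the slow (1 - rho2)
    # and fast (1 - rho1) constants according to the ER.
    c = (rho2 - rho1) * er + (1.0 - rho2)

    # Running average of squared gradients with a per-dimension constant.
    state["eg2"] = c**2 * grad**2 + (1.0 - c**2) * state["eg2"]

    # RMSProp-style step with the adaptive accumulator.
    step = -lr * grad / np.sqrt(state["eg2"] + eps)
    state["abs_movement"] += np.abs(step)
    return x + step


# Usage on a toy quadratic objective f(x) = 0.5 * ||x||^2, so grad = x.
x = np.array([1.0, -2.0])
state = {
    "x_window_start": x.copy(),
    "abs_movement": np.zeros_like(x),
    "eg2": np.zeros_like(x),
}
for _ in range(100):
    x = adasmooth_step(x, grad=x, state=state)
print(x)  # both coordinates move toward 0
```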
- Numpy: ada_smooth.py
- Torch: torch_ada_smooth.py
The Torch implementation of Inception V3 was trained on the Cars dataset with two optimizers: AdaSmooth and Adam. The dataset contains 196 classes of car models.
The models were trained for 50 epochs with a batch size of 128. The learning rate for both optimizers is 1e-3. The implementation of Adam is taken from the torch library; its beta parameters are 0.9 and 0.999.
Training one epoch took 13:51 with the Adam optimizer and 14:31 with AdaSmooth.
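For reference, the optimizer setup for such an experiment looks roughly like the sketch below; torchvision's Inception V3 stands in for the model from torch_modules, and the AdaSmooth import is an assumption about the class name in torch_ada_smooth.py.

```python
import torch
import torchvision

# Inception V3 from torchvision as a stand-in for the implementation in torch_modules.
model = torchvision.models.inception_v3(weights=None, num_classes=196)

# Adam as used in the experiment: lr = 1e-3, betas = (0.9, 0.999).
adam = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

# AdaSmooth from this repository; the class name and constructor are assumptions.
# from torch_ada_smooth import AdaSmooth
# ada_smooth = AdaSmooth(model.parameters(), lr=1e-3)
```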
The values of the loss function are shown in the plot below.
As you can see, AdaSmooth converges more slowly than Adam. However, with the same number of steps, the model trained with AdaSmooth reaches higher accuracy on the validation set (0.685 vs 0.50625):
Such results are explained by how AdaSmooth works: it adjusts its coefficients depending on how gradient descent behaves, so it needs time to set the ER coefficient correctly, which is why the method converges more slowly.
At the same time, the method takes into account not only the absolute values of the gradients but also the overall shift in each direction. This more accurate information allows it to find better parameters for the model, which is why the final quality with this optimizer is higher.