Models -- such as generalised linear regressions -- can be made more powerful by first transforming the data so that it better fits the model assumptions. We can go further and fit the parameters of the transformation and the regression jointly, so that the transformation is not "blind" but serves directly to improve model fit. Moreover, we can chain transformations and models together into flexible parameterised pipelines, all parts of which are jointly optimised.
There are two main facilitators for this way of modelling: (1) differentiable programming, à la autodiff and gradient descent, and (2) global optimisation. The first is well suited to fitting neural network architectures and well-behaved continuous parametric functions. The second is closer to how scientists have been fitting all sorts of parametric models since the 80s, and it is the approach I've taken in this demonstration.
Statistical models are typically conceived as invertible parametric functions. That is, we start with some function which we imagine generates the data we are seeing, and then implement a procedure to recover the function's parameters from the data. Take for example a power transform attached to a linear regression:

$$y = \psi(ax + b;\ \lambda) + \varepsilon$$

where $\psi(\cdot\,;\lambda)$ is a one-parameter power transform, $a$ and $b$ are the regression coefficients, and $\varepsilon$ is a noise term.
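To make the "jointly fitted" idea concrete, here is a minimal Haskell sketch of such a model, with the transform parameter and the regression coefficients collected into one parameter vector. The names are mine and purely illustrative, not taken from the code used later in this post:

```haskell
-- A one-parameter power transform composed with a linear regression.
-- All free parameters (lambda, a, b) live in a single vector so that the
-- transform and the regression are optimised together.
type Transform = Double -> Double -> Double   -- lambda -> value -> transformed value

model :: Transform -> [Double] -> Double -> Double
model psi [lambda, a, b] x = psi lambda (a * x + b)
model _   _              _ = error "model: expected parameters [lambda, a, b]"
```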
In this demonstration I will use the following data generating function:
- Start with a straight line: $y = ax + b$.
- Pick a point somewhere along the line and divide the line into two segments.
- Power transform each segment: I use the Yeo-Johnson transform because its parameter is not range bound, making it convenient for optimisation.
- Add some Gaussian noise to each data point. I use the Box-Muller transform to convert points sampled from a uniform distribution into Gaussians (both transforms are sketched in code below).
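Putting those steps together, a sketch of the generator might look like the following. I'm guessing at the names, the noise scale `sigma`, and the fact that the grid of x values is passed in; treat it as an illustration rather than the contents of analysis.hs:

```haskell
import System.Random (mkStdGen, randomRs)

-- Yeo-Johnson transform; unlike Box-Cox, its input need not be positive and
-- its parameter lambda is unconstrained, which suits the optimiser.
yeoJohnson :: Double -> Double -> Double
yeoJohnson lambda y
  | y >= 0 && lambda /= 0 = ((y + 1) ** lambda - 1) / lambda
  | y >= 0                = log (y + 1)
  | lambda /= 2           = -(((1 - y) ** (2 - lambda) - 1) / (2 - lambda))
  | otherwise             = -(log (1 - y))

-- Box-Muller: turn two uniform (0,1] samples into one standard Gaussian sample.
boxMuller :: Double -> Double -> Double
boxMuller u1 u2 = sqrt (-2 * log u1) * cos (2 * pi * u2)

-- Piecewise generator: a line y = a*x + b split at x0, each segment pushed
-- through Yeo-Johnson with its own lambda, plus Gaussian noise of scale sigma.
generate :: Int -> (Double, Double, Double, Double, Double) -> Double -> [Double] -> [(Double, Double)]
generate seed (x0, l1, l2, a, b) sigma xs = zipWith point xs noise
  where
    us    = randomRs (1e-12, 1.0) (mkStdGen seed)
    pairs (u1 : u2 : rest) = (u1, u2) : pairs rest
    pairs _                = []
    noise = [ sigma * boxMuller u1 u2 | (u1, u2) <- pairs us ]
    point x eps =
      let lam = if x <= x0 then l1 else l2
      in (x, yeoJohnson lam (a * x + b) + eps)
```

For example, `generate 42 (50, 0.7, 1.4, 2, 100) 50 [0, 0.1 .. 100]` would produce roughly a thousand noisy points in the spirit of the plots below; the seed, noise scale, and x grid here are all my own guesses.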
The result is a non-linear function with a vast variety of possible shapes. I won't pursue it here, but in the general case we could replace the hard split with a sigmoid-weighted blend of the two transformed segments,

$$y = \bigl(1 - \sigma(x - x_0)\bigr)\,\psi(ax + b;\ \lambda_1) + \sigma(x - x_0)\,\psi(ax + b;\ \lambda_2) + \varepsilon$$

where $\sigma$ is the logistic function and $x_0$ is the split point.
Here is a plot of what it looks like with the following parameters: $x_0 = 50$, $\lambda_1 = 0.7$, $\lambda_2 = 1.4$, $a = 2$, $b = 100$.
The practical applications of this exercise emerge in the reverse direction: we have the data, assume that it fits our model, and wish to recover the model parameters from it. Whether the model is true does not necessarily matter; as long as the shape of the data falls within the range of shapes the model can produce, it can still be put to useful work.
Since the sigmoid is essentially flat on either side of the split point, the smooth version behaves much like the hard split away from $x_0$, so the simpler piecewise model is what I fit below.
Here is a sketch of the solution:
- Write down the cost function: $\sum_i (y_i - \hat{y_i})^2$. It's just the sum of squared differences between the prediction $\hat{y_i}$ and the actual value $y_i$.
- Write down the perturbation function, which generates the next candidate solution taking the current solution as an argument. The function herein just adds some random noise to the current solution to generate a new candidate, and it progressively modifies fewer parameters as the optimisation goes on (see Notes).
- Implement threshold accepting annealing (a sketch follows this list).
- Apply it to some data generated from the function given above.
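Here is a minimal sketch of those pieces in Haskell, reusing `yeoJohnson` from the generator sketch above. The function names, the shape of the cost function, and the way progress is passed to the perturbation function are my own assumptions, not the exact contents of analysis.hs:

```haskell
import System.Random (StdGen)

-- Sum of squared errors between predictions and observations.
sse :: (Double -> Double) -> [(Double, Double)] -> Double
sse predict pts = sum [ (y - predict x) ^ (2 :: Int) | (x, y) <- pts ]

-- Cost of a parameter vector [x0, l1, l2, a, b] against the observed points,
-- using the piecewise Yeo-Johnson model from the generator sketch above.
costFor :: [(Double, Double)] -> [Double] -> Double
costFor pts [x0, l1, l2, a, b] = sse predict pts
  where predict x = yeoJohnson (if x <= x0 then l1 else l2) (a * x + b)
costFor _ _ = 1 / 0  -- malformed parameter vector: infinitely bad

-- Threshold accepting: accept a candidate whenever it is no worse than the
-- current solution by more than the current threshold; thresholds decrease
-- monotonically and each is used for a fixed number of iterations.
thresholdAccept
  :: (Double -> StdGen -> [Double] -> (StdGen, [Double]))  -- perturbation, given progress in [0,1]
  -> ([Double] -> Double)                                  -- cost function
  -> [Double]                                              -- decreasing thresholds
  -> Int                                                   -- iterations per threshold
  -> StdGen
  -> [Double]                                              -- starting solution
  -> [Double]
thresholdAccept perturb cost thresholds iters gen0 start =
  go gen0 start (cost start) (zip [0 ..] schedule)
  where
    schedule = [ t | t <- thresholds, _ <- [1 .. iters] ]
    total    = fromIntegral (length schedule)
    go _   sol _    []              = sol
    go gen sol cSol ((i, t) : rest) =
      let progress     = fromIntegral (i :: Int) / total
          (gen', cand) = perturb progress gen sol
          cCand        = cost cand
      in if cCand <= cSol + t
           then go gen' cand cCand rest
           else go gen' sol  cSol  rest
```

With these pieces, a call might look like `thresholdAccept perturb (costFor pts) [2 ** k | k <- [20, 19.8 .. 0]] 50 (mkStdGen 1) [0, 1, 1, 1, 1]`, where `perturb` is a perturbation function like the one sketched in the Notes below; the schedule and iteration count here are illustrative only.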
The results are very good, especially given an algorithm of such simplicity. If I set the noise to zero, the recovered parameters are:
```
SSE = 2.41737473692175
x0 = 50.00012595144228
l1 = 0.7002187213885587
l2 = 1.4006478656514114
a = 1.9949945599502708
b = 100.06270480468355
```
Meanwhile, here are the results for 1000 points fitted from the noisy function:
```
SSE = 2463477.4443425345
x0 = 50.00456239383456
l1 = 0.6207606244772877
l2 = 1.4132729980396663
a = 1.89497964824292
b = 105.48395165961668
```
To build and run the analysis and regenerate the plots:

```
ghc -O2 analysis.hs && ./analysis
gnuplot plot-example.gp > example.png
gnuplot plot-fitted.gp > fitted.png
```
Notes:

- $\lambda_1, \lambda_2$ are very sensitive to small differences, so using the `Double` data type is essential.
- The starting vector used was `[0,1,1,1,1]`: nowhere near the actual values. The algorithm is surprisingly robust to initial values.
- The threshold schedule is very important for convergence. I found that a small number of iterations spent over a large number of monotonically decreasing thresholds makes convergence most likely.
- The perturbation function stochastically reduces the number of parameters modified as the optimisation progresses (see the sketch below). This is essential for best performance: if the solution is already close to the optimum, perturbing fewer parameters is more likely to produce a better state than perturbing all parameters at once.
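For that last point, a perturbation function in this spirit might look like the following. Again, this is a sketch under my own assumptions (the 0.8 modification floor and the step scaling are arbitrary choices), not the code from analysis.hs:

```haskell
import System.Random (StdGen, randomR)

-- Add uniform noise to a random subset of the parameters. `progress` runs
-- from 0 at the start of the schedule to 1 at the end, shrinking both the
-- expected number of modified parameters and the step size as it grows.
perturb :: Double -> StdGen -> [Double] -> (StdGen, [Double])
perturb progress gen0 sol = foldr tweak (gen0, []) sol
  where
    tweak p (gen, acc) =
      let (u, gen')  = randomR (0, 1) gen   :: (Double, StdGen)
          (d, gen'') = randomR (-1, 1) gen' :: (Double, StdGen)
          -- Late in the run most parameters are left untouched.
          modifyIt = u < 1 - 0.8 * progress
          -- Steps shrink over time and scale with the parameter's magnitude.
          step     = (1 - progress) * d * (1 + abs p)
      in (gen'', (if modifyIt then p + step else p) : acc)
```

Here `progress` is supplied by the annealing loop sketched earlier, so the perturbation naturally becomes more conservative as the thresholds fall.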