loss functions
Softmax: a differentiable version of argmax
Formula : softmax(x)_i = e^{x_i} / \sum_{j} e^{x_j}
If one of the x_j is much higher than the others, the output values will be close to 0 for all indices except that j, for which the value will be close to one.
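A minimal NumPy sketch of this formula; the function name and the max-subtraction trick for numerical stability are standard conventions, not part of the notes above:

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; it cancels out in the ratio.
    e = np.exp(x - np.max(x))
    return e / e.sum()

# One logit much larger than the others -> output is close to one-hot,
# i.e. a "soft" argmax.
print(softmax(np.array([1.0, 2.0, 10.0])))  # ~[0.0001, 0.0003, 0.9995]
```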
See this friendly post, which starts from information entropy to explain cross entropy as a choice of loss, and shows that minimizing the cross entropy is equivalent to minimizing the negative log likelihood
H(y) = \sum_i y_i log(1/y_i) = -\sum_i y_i log(y_i)
Interpretation : mean bit size of encoding samples from y with the optimal code for y
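A small worked example of H(y), a sketch assuming log base 2 so the result reads directly in bits (the uniform and peaked distributions are just illustrative values):

```python
import numpy as np

def entropy(y):
    # H(y) = -sum_i y_i * log2(y_i), with the convention 0 * log(0) = 0
    y = np.asarray(y, dtype=float)
    y = y[y > 0]
    return -np.sum(y * np.log2(y))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes: 2.0 bits
print(entropy([0.97, 0.01, 0.01, 0.01]))  # peaked distribution: ~0.24 bits
```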
Ground truth distribution: y, estimated distribution: y_hat
C(y, y_hat) = - \sum_i y_i log(y_hat_i)
Interpretation : mean bit size of encoding samples from y under the wrong distribution y_hat
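A sketch of this cross entropy for a one-hot ground truth, using the same base-2 convention as above; the eps guard against log(0) and the example values are assumptions:

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    # C(y, y_hat) = -sum_i y_i * log2(y_hat_i); eps guards against log(0)
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return -np.sum(y * np.log2(y_hat + eps))

y     = [1.0, 0.0, 0.0]          # one-hot ground truth
y_hat = [0.7, 0.2, 0.1]          # estimated distribution
print(cross_entropy(y, y_hat))   # -log2(0.7) ~ 0.515 bits
print(cross_entropy(y, y))       # ~0: equals H(y) for a one-hot y
```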
KL(y, y_hat) = cross_entropy(y, y_hat) - entropy(y)
Minimizing KL with respect to y_hat is the same as minimizing cross_entropy, since entropy(y) does not depend on y_hat
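A quick numerical check of this identity, computing KL directly from its definition KL(y, y_hat) = \sum_i y_i log(y_i / y_hat_i); the two example distributions are arbitrary:

```python
import numpy as np

y     = np.array([0.5, 0.3, 0.2])   # ground truth distribution (illustrative)
y_hat = np.array([0.4, 0.4, 0.2])   # estimated distribution (illustrative)

H  = -np.sum(y * np.log2(y))          # entropy(y)
C  = -np.sum(y * np.log2(y_hat))      # cross_entropy(y, y_hat)
KL = np.sum(y * np.log2(y / y_hat))   # KL(y, y_hat) from its definition

print(KL, C - H)   # both ~0.036: KL = cross_entropy - entropy
```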
Minimizing the cross entropy is also the same as maximizing the likelihood (or minimizing the negative log likelihood) of the training data:
\prod_{(x, y)} P(x, y | \theta)
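A minimal illustration of why maximizing this product is the same as minimizing the negative log likelihood; the probabilities assigned to the three training samples are made-up values:

```python
import numpy as np

# Hypothetical probabilities P(x, y | theta) assigned to three training samples
p = np.array([0.7, 0.9, 0.6])

likelihood = np.prod(p)            # product over the dataset
nll        = -np.sum(np.log(p))    # negative log likelihood

# log is monotonic and turns the product into a sum, so the argmax of the
# likelihood is the argmin of the NLL.
print(likelihood, np.exp(-nll))    # both ~0.378
```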