Multi-Class and Cross Entropy Loss

Two basic functions (softmax and the negative log-likelihood), plus a few related notes:

  1. Softmax: Logits -> Exponential Function -> Normalization -> Probability

  2. Negative log-likelihood: -P(true) * log(predicted probability), which for a one-hot target reduces to -log(probability of the correct class). It measures how small the target-class probability is: the smaller that probability, the larger the loss.

  3. Cross Entropy loss: if the right class is predicted with probability 1, the loss is 0; if the right class is predicted with probability 0 (totally wrong), the loss is infinity.

  4. At the first iteration every class probability is roughly 1/C, so the expected initial loss is -log(1/C) = -(log(1) - log(C)) = log(C) - log(1) = log(C). This is a good sanity check (verified in the first code sketch after this list).

  5. SVM Loss: For an individual example, the loss Li is a sum over all classes j except the true class yi, comparing the correct-class score with each incorrect-class score (write sj for the score of class j). If the correct class score exceeds an incorrect class score by at least a margin of 1, that incorrect class contributes zero loss; otherwise it contributes sj + 1 - syi. In one formulation, Li = sum over all j != yi of max(0, sj - syi + 1) (see the second sketch after this list).

  6. Difference between multi-class SVM loss and cross-entropy loss: the multi-class SVM loss mainly measures how wrong the non-target classes are (it only wants the target class score to be larger than every other score by a margin, and once the target score already beats the others by that margin, jiggling the scores won’t change the loss); cross-entropy loss, by contrast, always pushes the target probability to be as close to 1 as possible.

  7. Will the Cross Entropy Loss ever reach its maximum (infinity)? No. That would require the normalized probability of the correct class to be EXACTLY zero, which means the exponential term before normalization would have to be zero, which in turn means the raw logit before the exponential would have to be negative infinity, and finite-precision floating point never produces that. So an infinite loss is not expected in practice.
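As a concrete reference for points 1, 2 and 4, here is a minimal NumPy sketch of softmax and cross-entropy; the function names and the C = 10 example are illustrative assumptions, not from any particular framework.

```python
import numpy as np

def softmax(logits):
    """Point 1: logits -> exponential -> normalization -> probabilities."""
    shifted = logits - np.max(logits)      # shift by the max logit for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)

def cross_entropy(logits, target):
    """Point 2: -log of the probability assigned to the target class."""
    probs = softmax(logits)
    return -np.log(probs[target])

# Point 4 sanity check: with C classes and roughly uniform scores,
# every class gets probability ~1/C, so the initial loss should be ~log(C).
C = 10
uniform_logits = np.zeros(C)
print(cross_entropy(uniform_logits, target=0))   # ~2.3026
print(np.log(C))                                 # 2.3026 = log(10)
```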
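And a matching sketch of the multi-class SVM loss from point 5, assuming `scores` is a plain vector of class scores and the margin is 1 (the example numbers are made up).

```python
import numpy as np

def svm_loss(scores, target, margin=1.0):
    """Sum of hinge terms max(0, sj - s_target + margin) over the incorrect classes."""
    correct = scores[target]
    margins = np.maximum(0.0, scores - correct + margin)
    margins[target] = 0.0                  # the true class contributes no loss
    return np.sum(margins)

scores = np.array([3.2, 5.1, -1.7])        # hypothetical class scores
print(svm_loss(scores, target=0))          # max(0, 5.1-3.2+1) + max(0, -1.7-3.2+1) = 2.9
```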

For example, the 3-element vector [1.0, 2.0, 3.0] gets transformed into [0.09, 0.24, 0.67]. The order of elements by relative size is preserved, and they add up to 1.0. Let's tweak this vector slightly into [1.0, 2.0, 5.0]. We get the output [0.02, 0.05, 0.93], which still preserves these properties. Note that as the last element moves farther away from the first two, its softmax value comes to dominate the overall slice of size 1.0 in the output. Intuitively, the softmax function is a "soft" version of the maximum function: instead of just selecting one maximal element, softmax breaks the vector up into parts of a whole (1.0), with the maximal input element getting a proportionally larger chunk, but the other elements getting some of it as well.
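Reusing the softmax sketch above, the two vectors from this example can be checked directly; the numbers quoted in the paragraph are these outputs rounded or truncated to two decimals.

```python
import numpy as np

def softmax(logits):
    exp = np.exp(logits - np.max(logits))  # stable: shift by the max logit
    return exp / np.sum(exp)

print(softmax(np.array([1.0, 2.0, 3.0])))  # ~[0.090, 0.245, 0.665]
print(softmax(np.array([1.0, 2.0, 5.0])))  # ~[0.017, 0.047, 0.936]
```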
