Derivatives of Softmax
Before diving into computing the derivative of softmax, let's start with some preliminaries from vector calculus.
Softmax is fundamentally a vector function. It takes a vector as input and produces a vector as output; in other words, it has multiple inputs and multiple outputs. Therefore, we cannot just ask for "the derivative of softmax"; we should instead specify:
Which component (output element) of softmax we want the derivative of.
Which input element the partial derivative is taken with respect to, since softmax has multiple inputs.
If this sounds complicated, don't worry. This is exactly why the notation of vector calculus was developed. What we're looking for are the partial derivatives of each output element with respect to each input element:
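In symbols, and assuming notation introduced here only for illustration (S for the softmax function, a for its input vector, and indices i and j running from 1 to N), this is the partial derivative of the i-th output with respect to the j-th input:

```latex
\frac{\partial S_i}{\partial a_j}, \qquad i, j \in \{1, \dots, N\}
```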
Since softmax is a function that maps an N-element vector to an N-element vector, the most general derivative we compute for it is the Jacobian matrix:
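As a sketch, assuming the notation S for softmax and a for its N-element input (symbols chosen here for illustration), the Jacobian collects all N^2 partial derivatives. Row i holds the derivatives of output S_i, and column j holds the derivatives with respect to input a_j:

```latex
DS =
\begin{bmatrix}
\dfrac{\partial S_1}{\partial a_1} & \cdots & \dfrac{\partial S_1}{\partial a_N} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial S_N}{\partial a_1} & \cdots & \dfrac{\partial S_N}{\partial a_N}
\end{bmatrix}
```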
In ML literature, the term "gradient" is commonly used to stand in for the derivative. Strictly speaking, gradients are only defined for scalar functions (such as loss functions in ML); for vector functions like softmax it's imprecise to talk about a "gradient". The Jacobian is the fully general derivative of a vector function, but in most places I'll just be saying "derivative".
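To make the Jacobian concrete, here is a minimal NumPy sketch that approximates it numerically with central differences. This is an illustration, not the analytic derivation: the helper names `softmax` and `numerical_jacobian`, the step size `eps`, and the sample input are all assumptions made for this example.

```python
import numpy as np


def softmax(a):
    """Standard softmax; subtracting the max is for numerical stability only."""
    shifted = a - np.max(a)
    exps = np.exp(shifted)
    return exps / np.sum(exps)


def numerical_jacobian(f, a, eps=1e-6):
    """Approximate the Jacobian of f at a with central differences.

    Entry [i, j] approximates d f(a)[i] / d a[j].
    """
    a = np.asarray(a, dtype=float)
    n_out = f(a).shape[0]
    jac = np.zeros((n_out, a.shape[0]))
    for j in range(a.shape[0]):
        step = np.zeros_like(a)
        step[j] = eps
        # Column j: how every output moves when only input j is nudged.
        jac[:, j] = (f(a + step) - f(a - step)) / (2 * eps)
    return jac


a = np.array([1.0, 2.0, 3.0])  # hypothetical sample input
J = numerical_jacobian(softmax, a)
print(J.shape)        # one row per output, one column per input
print(J.sum(axis=0))  # each column sums to roughly 0, since the outputs always sum to 1
```

The result is a square matrix rather than a vector, which is why "gradient" is not quite the right word for it. A finite-difference approximation like this is mostly useful as a sanity check against a hand-derived formula.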