Consider again a classification problem with $K$ labels. The $K$-dimensional regression function for the one-hot encoded target $(Y^{(1)}, \dots, Y^{(K)}) \in \mathbb{R}^K$ is given by

$$\mu(x) = (\mu_1(x), \dots, \mu_K(x)),$$

where

$$\mu_k(x) = \mathbb{P}(Y = C_k \mid X = x).$$
For the multivariate target $(Y^{(1)}, \dots, Y^{(K)})$, our label set $\mathcal{Y}$ should be a subset of the vectors obeying the axioms of probability, that is,

$$\mathcal{Y} \subset P_K := \left\{ y \in [0,1]^K : \sum_{k=1}^K y_k = 1 \right\},$$

where $y_k$ denotes the $k$-th entry of the vector $y$. Note that $P_K$ is the space of distributions on $(C_1, \dots, C_K)$.
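As a quick illustration, here is a small Python/NumPy sketch (the number of classes and the posterior probabilities are made up for the example, and the helper names are ours) that one-hot encodes a label and checks membership in $P_K$:

```python
import numpy as np

def one_hot(k, K):
    """Return the one-hot encoding (Y^(1), ..., Y^(K)) of the k-th class (0-indexed)."""
    y = np.zeros(K)
    y[k] = 1.0
    return y

def in_simplex(y, tol=1e-9):
    """Check the axioms of probability: entries in [0, 1] summing to 1."""
    return bool(np.all(y >= 0) and np.all(y <= 1) and abs(y.sum() - 1.0) < tol)

K = 3
y = one_hot(1, K)                 # label C_2 encoded as (0, 1, 0)
mu_x = np.array([0.2, 0.5, 0.3])  # hypothetical posterior probabilities mu_k(x)

print(y, in_simplex(y))           # [0. 1. 0.] True
print(mu_x, in_simplex(mu_x))     # [0.2 0.5 0.3] True
```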
What loss functions make the multivariate regression rule $\mu(x)$ ideal for the one-hot encoded target? A quick answer is the half squared loss function $\ell_2 : \mathbb{R}^K \times \mathbb{R}^K \to [0,\infty)$ given by

$$\ell_2(f(x), y) = \frac{1}{2} \|f(x) - y\|_2^2 = \frac{1}{2} \sum_{k=1}^K \big(f_k(x) - y_k\big)^2,$$

where $f_k(x)$ denotes the $k$-th entry of a candidate multivariate function $f(x)$.
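A minimal sketch of the half squared loss in Python/NumPy, with a made-up prediction $f(x)$ and a one-hot target $y$:

```python
import numpy as np

def half_squared_loss(f_x, y):
    """Half squared loss: 0.5 * ||f(x) - y||_2^2 = 0.5 * sum_k (f_k(x) - y_k)^2."""
    return 0.5 * np.sum((f_x - y) ** 2)

f_x = np.array([0.2, 0.5, 0.3])   # candidate prediction f(x)
y   = np.array([0.0, 1.0, 0.0])   # one-hot target for class C_2

print(half_squared_loss(f_x, y))  # 0.5 * (0.04 + 0.25 + 0.09) = 0.19
```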
Alternatively, one may use the negative conditional log-likelihood based on the principle of maximum likelihood, namely,

$$\ell_{\mathrm{CE}}(f(x), y) = -\log L(f(x) \mid y, x),$$

where $L(f(x) \mid y, x)$, for $y \in \{C_1, \dots, C_K\}$ and $x \in \mathcal{X}$, is the (conditional) likelihood function given by

$$L(f(x) \mid C_k, x) = f_k(x) \overset{\text{model}}{=} \mathbb{P}(Y = C_k \mid X = x), \quad k = 1, \dots, K.$$
Below is an equivalent expression of the cross-entropy.
Cross Entropy Loss
For a given number of classes $K$, the cross-entropy loss function $\ell_{\mathrm{CE}} : P_K \times P_K \to [0,\infty)$ is given by

$$\ell_{\mathrm{CE}}(f(x), y) = -\sum_{k=1}^K y_k \log f_k(x).$$
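The sketch below (Python/NumPy, with made-up numbers) evaluates the loss both ways: as the negative conditional log-likelihood $-\log f_k(x)$ of the observed class and as the sum $-\sum_k y_k \log f_k(x)$ over a one-hot target; the two values agree.

```python
import numpy as np

def cross_entropy(f_x, y):
    """Cross-entropy loss: -sum_k y_k * log f_k(x)."""
    return -np.sum(y * np.log(f_x))

f_x = np.array([0.2, 0.5, 0.3])    # predicted probabilities f(x), assumed strictly positive
y   = np.array([0.0, 1.0, 0.0])    # one-hot target: the observed label is C_2 (index 1)

nll = -np.log(f_x[1])              # negative conditional log-likelihood of the observed class
print(nll, cross_entropy(f_x, y))  # both equal -log 0.5 ≈ 0.6931
```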
In information theory, the cross-entropy $\ell_{\mathrm{CE}}(f(x), y)$ measures the discrepancy between two distributions over the labels $(C_1, \dots, C_K)$:

- one with the posterior probabilities $(f_1(x), \dots, f_K(x))$; and
- the other with the probabilities $(y_1, \dots, y_K)$.
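As a small numeric illustration of this reading (the distributions below are made up), the cross-entropy is smallest when the predicted probabilities coincide with the target probabilities and grows as the two distributions drift apart:

```python
import numpy as np

def cross_entropy(f_x, y):
    """Cross-entropy between target probabilities y and predicted probabilities f(x)."""
    return -np.sum(y * np.log(f_x))

y      = np.array([0.1, 0.7, 0.2])  # target distribution over (C_1, C_2, C_3)
f_good = np.array([0.1, 0.7, 0.2])  # matches the target exactly
f_bad  = np.array([0.5, 0.3, 0.2])  # far from the target

print(cross_entropy(f_good, y))     # ≈ 0.8018, the smallest attainable value for this y
print(cross_entropy(f_bad, y))      # ≈ 1.2340, larger because the distributions disagree
```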