Cross Entropy Loss

Consider again a classification problem with $K$ labels. The $K$-dimensional regression function for the one-hot encoded target $(Y^{(1)},\ldots,Y^{(K)})\in\mathbb{R}^K$ is given by $\mu(x)=(\mu_1(x),\ldots,\mu_K(x))$ where
$$\mu_k(x)=\mathbb{P}\left( Y=\mathcal{C}_k\mid X=x\right).$$
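For concreteness, here is a minimal NumPy sketch of one-hot encoding class labels into vectors in $\mathbb{R}^K$ (the function name `one_hot` and the example labels are illustrative, not part of the text):

```python
import numpy as np

def one_hot(labels, K):
    """Encode integer class labels 0, ..., K-1 as one-hot vectors in R^K."""
    Y = np.zeros((len(labels), K))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

# Example with K = 3 classes: labels 1, 0, 2 stand for C_2, C_1, C_3.
print(one_hot([1, 0, 2], K=3))
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]]
```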

For the multivariate target $(Y^{(1)},\ldots,Y^{(K)})$, our label set $\mathcal{Y}$ should be a subset of the vectors obeying the axioms of probability, that is,
$$\mathcal{Y}\subset\mathcal{P}_K:=\left\lbrace y\in (0,1)^K: \sum_{k=1}^{K}y_k=1\right\rbrace,$$
where $y_k$ denotes the $k$-th entry of the vector $y$. Note that $\mathcal{P}_K$ is the space of distributions on $(\mathcal{C}_1,\ldots,\mathcal{C}_K)$.

What loss functions make the multivariate regression rule $\mu(x)$ ideal for the one-hot encoded target? A quick answer is the half squared loss function $\ell_2:\mathbb{R}^K\times\mathbb{R}^K\rightarrow [0,\infty)$ given by
\begin{align*}
\ell_2(f(x),y)&=\frac{1}{2}\left\|f(x)-y\right\|^2\\
&=\frac{1}{2}\sum_{k=1}^{K}\left(f_k(x)-y_k\right)^2
\end{align*}
where $f_k(x)$ denotes the $k$-th entry of a candidate multivariate function $f(x)$.
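As a small sketch of this loss (assuming NumPy; the function name is illustrative), the half squared loss simply sums the squared coordinate-wise errors between the prediction and the one-hot target:

```python
import numpy as np

def half_squared_loss(f_x, y):
    """Half squared loss 0.5 * ||f(x) - y||^2 between a prediction and a one-hot target."""
    f_x, y = np.asarray(f_x, dtype=float), np.asarray(y, dtype=float)
    return 0.5 * np.sum((f_x - y) ** 2)

# Example with K = 3: the target is class C_1, the prediction puts most mass on C_1.
print(half_squared_loss([0.7, 0.2, 0.1], [1.0, 0.0, 0.0]))  # 0.5 * (0.09 + 0.04 + 0.01) ≈ 0.07
```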

Alternatively, one may use the negative conditional log-likelihood function based on the principle of maximum likelihood, namely,
$$\ell_{\operatorname{CE}}(f(x),y)=-\log \mathcal{L}(f(x)\mid y,x)$$
where $\mathcal{L}(f(x)\mid y,x)$, with $y\in\{\mathcal{C}_1,\ldots,\mathcal{C}_K\}$ and $x\in\mathcal{X}$, is the (conditional) likelihood function given by
\begin{align*}
\mathcal{L}(f(x)\mid\mathcal{C}_k,x)&=f_k(x)\\
&\stackrel{\text{model}}{=}\mathbb{P}(Y=\mathcal{C}_k\mid X=x),\quad k=1,\ldots,K.
\end{align*}
Below is an equivalent expression of the cross-entropy.
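As a quick numeric illustration before the equivalent form below (a sketch assuming NumPy; names are illustrative), the negative conditional log-likelihood of a single observation with observed class $\mathcal{C}_k$ reduces to $-\log f_k(x)$:

```python
import numpy as np

def neg_log_likelihood(f_x, k):
    """Negative conditional log-likelihood -log L(f(x) | C_k, x) = -log f_k(x)."""
    return -np.log(f_x[k])

# Example with K = 3: the observed class is C_2 (index 1).
f_x = np.array([0.2, 0.7, 0.1])
print(neg_log_likelihood(f_x, 1))  # -log(0.7) ≈ 0.357
```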

Cross Entropy Loss

For a given number of classes $K$, the cross-entropy loss function $\ell_{\operatorname{CE}}:\mathcal{P}_K\times\mathcal{P}_K\rightarrow [0,\infty)$ is given by
$$\ell_{\operatorname{CE}}(f(x),y)=-\sum_{k=1}^{K}y_k \log f_k(x).$$
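A minimal NumPy sketch of this formula (the function name and the small clipping constant are illustrative additions for numerical safety, not part of the definition); with a one-hot target it reproduces the negative log-likelihood computed above:

```python
import numpy as np

def cross_entropy(f_x, y, eps=1e-12):
    """Cross-entropy loss sum_k -y_k * log f_k(x) for f(x), y in the simplex P_K."""
    f_x = np.clip(np.asarray(f_x, dtype=float), eps, 1.0)  # eps guards against log(0)
    return -np.sum(np.asarray(y, dtype=float) * np.log(f_x))

# With a one-hot target y = e_k, the loss reduces to -log f_k(x),
# matching the negative log-likelihood of the observed class.
print(cross_entropy([0.2, 0.7, 0.1], [0.0, 1.0, 0.0]))  # -log(0.7) ≈ 0.357
```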

In information theory, the cross-entropy $\ell_{\operatorname{CE}}(f(x),y)$ measures the dissimilarity between two distributions over the labels $(\mathcal{C}_1,\ldots,\mathcal{C}_K)$ (see the numeric sketch after this list):

  • One with the posterior probabilities $(f_1(x),\ldots,f_K(x))$; and
  • The other one with probabilities $(y_1,\ldots,y_K)$.
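To make this interpretation concrete, the following self-contained sketch (assuming NumPy; the example distributions are made up for illustration) shows that, for a fixed target distribution $y$, the cross-entropy is smallest when the predicted distribution equals $y$, where it reduces to the entropy of $y$:

```python
import numpy as np

def cross_entropy(f_x, y):
    """Cross-entropy sum_k -y_k * log f_k(x) between two distributions in P_K."""
    return -np.sum(np.asarray(y, dtype=float) * np.log(np.asarray(f_x, dtype=float)))

y       = np.array([0.5, 0.3, 0.2])  # target distribution over (C_1, C_2, C_3)
f_match = np.array([0.5, 0.3, 0.2])  # prediction equal to the target
f_off   = np.array([0.1, 0.1, 0.8])  # prediction far from the target

# For fixed y, the cross-entropy is minimized at f(x) = y, where it equals the entropy of y.
print(cross_entropy(f_match, y))  # ≈ 1.030 (entropy of y, in nats)
print(cross_entropy(f_off, y))    # ≈ 1.887, strictly larger
```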