Cross Entropy Loss

Consider again a classification problem with $K$ labels. The $K$-dimensional regression function for the one-hot encoded target $$(Y^{(1)},\ldots,Y^{(K)})\in\mathbb{R}^K$$ is given by $$\mu(x)=(\mu_1(x),\ldots,\mu_K(x))$$ where $$\mu_k(x)=\mathbb{P}\left( Y=\mathcal{C}_k\mid X=x\right).$$

For the multivariate target $(Y^{(1)},\ldots,Y^{(K)})$, our label set $\mathcal{Y}$ should be a subset of the vectors obeying the axioms of probability, that is, $$\mathcal{Y}\subset\mathcal{P}_K:=\left\lbrace y\in [0,1]^K: \sum_{k=1}^{K}y_k=1\right\rbrace $$ where $y_k$ denotes the $k$-th entry of the vector $y$. Note that $\mathcal{P}_K$ is the space of distributions on $(\mathcal{C}_1,\ldots,\mathcal{C}_K)$, and that the one-hot vectors are its vertices.

Which loss functions make the multivariate regression function $\mu(x)$ optimal for the one-hot encoded target? A quick answer is the half squared loss function $\ell_2:\mathbb{R}^K\times\mathbb{R}^K\rightarrow [0,\infty)$ given by $$\begin{align*}\ell_2(f(x),y)&=\frac{1}{2}\left\|f(x)-y\right\|^2\\&=\frac{1}{2}\sum_{k=1}^{K}(f_k(x)-y_k)^2\end{align*}$$ where $f_k(x)$ denotes the $k$-th entry of a candidate multivariate function $f(x)$.
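As a small numerical illustration of the half squared loss, here is a minimal NumPy sketch. The function name `half_squared_loss` and the example vectors (three classes, true label $\mathcal{C}_2$) are assumptions made purely for illustration.

```python
import numpy as np


def half_squared_loss(f_x, y):
    """Half squared loss 0.5 * ||f(x) - y||^2 for K-dimensional vectors."""
    f_x = np.asarray(f_x, dtype=float)
    y = np.asarray(y, dtype=float)
    return 0.5 * np.sum((f_x - y) ** 2)


# Hypothetical example with K = 3 classes; the true label is C_2.
y = np.array([0.0, 1.0, 0.0])    # one-hot encoded target
f_x = np.array([0.2, 0.7, 0.1])  # candidate class probabilities f(x)

loss = half_squared_loss(f_x, y)
print(loss)
```

The loss is zero exactly when $f(x)$ equals the one-hot target, and it grows with the squared Euclidean distance between the two vectors.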

Alternatively, following the principle of maximum likelihood, one may use the negative conditional log-likelihood function, namely, $$ \ell_{\operatorname{CE}}(f(x),y)=-\log \mathcal{L}(f(x)|y,x)$$ where $\mathcal{L}(f(x)|y,x)$, for $y\in\{\mathcal{C}_1,\ldots,\mathcal{C}_K\}$ and $x\in\mathcal{X}$, is the (conditional) likelihood function given by $$\begin{align*}\mathcal{L}(f(x)|\mathcal{C}_k,x)&=f_k(x)\\&\stackrel{\text{model}}{=}\mathbb{P}(Y=\mathcal{C}_k|X=x),\quad k=1,\ldots,K.\end{align*}$$ In the one-hot encoding, the label $\mathcal{C}_k$ corresponds to the vector with $y_k=1$ and $y_j=0$ for $j\neq k$, so that $-\log f_k(x)=-\sum_{j=1}^{K}y_j\log f_j(x)$. Below is the resulting equivalent expression of the cross-entropy.

Cross Entropy Loss
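For a given number of classes $K$, the cross-entropy loss function $\ell_{\operatorname{CE}}:\mathcal{P}_K\times\mathcal{P}_K\rightarrow [0,\infty]$ is given by $$\ell_{\operatorname{CE}}(f(x),y)=-\sum_{k=1}^{K}y_k\cdot \log f_k(x),$$ with the convention $0\cdot\log 0=0$ for entries on the boundary. The loss is infinite if $f_k(x)=0$ for some class with $y_k>0$.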
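The following NumPy sketch evaluates this formula and checks that, for a one-hot target, it reduces to the negative log-likelihood $-\log f_k(x)$ of the true class. The helper name `cross_entropy` and the example vectors are illustrative assumptions; entries with $y_k=0$ are skipped to implement the convention $0\cdot\log 0=0$.

```python
import numpy as np


def cross_entropy(f_x, y):
    """Cross-entropy -sum_k y_k * log f_k(x) for two probability vectors."""
    f_x = np.asarray(f_x, dtype=float)
    y = np.asarray(y, dtype=float)
    mask = y > 0  # skip terms with y_k = 0 (convention 0 * log 0 = 0)
    return -np.sum(y[mask] * np.log(f_x[mask]))


# Hypothetical example with K = 3 classes; the true label is C_2.
y = np.array([0.0, 1.0, 0.0])    # one-hot encoded target
f_x = np.array([0.2, 0.7, 0.1])  # candidate class probabilities f(x)

ce = cross_entropy(f_x, y)
nll = -np.log(f_x[1])  # negative log-likelihood of the true class C_2

print(ce, nll)  # the two values coincide for a one-hot target
```

Only the probability assigned to the true class enters the loss when the target is one-hot; the remaining entries of $f(x)$ matter only through the constraint that they sum to one.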

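In information theory, the cross-entropy $\ell_{\operatorname{CE}}(f(x),y)$ measures the dissimilarity between two distributions over the labels $(\mathcal{C}_1,\ldots,\mathcal{C}_K)$:

One with the posterior probabilities $(f_1(x),\ldots,f_K(x))$; and
The other with the probabilities $(y_1,\ldots,y_K)$.

It is not a distance in the metric sense: it is not symmetric in its two arguments, and it does not vanish when the two distributions coincide unless they are degenerate.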
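To make the comparison of the two distributions concrete, the sketch below checks numerically the standard decomposition of the cross-entropy into the entropy of $y$ plus the Kullback-Leibler divergence from $y$ to $f(x)$, and shows that swapping the arguments changes the value. The helper names and the soft-label example are assumptions for illustration only.

```python
import numpy as np


def cross_entropy(f_x, y):
    """Cross-entropy -sum_k y_k * log f_k(x), with 0 * log 0 = 0."""
    f_x = np.asarray(f_x, dtype=float)
    y = np.asarray(y, dtype=float)
    mask = y > 0
    return -np.sum(y[mask] * np.log(f_x[mask]))


def entropy(p):
    """Shannon entropy of p, i.e. the cross-entropy of p with itself."""
    return cross_entropy(p, p)


def kl_divergence(p, q):
    """Kullback-Leibler divergence sum_k p_k * log(p_k / q_k)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))


# Hypothetical soft labels and predictions over K = 3 classes.
y = np.array([0.5, 0.3, 0.2])
f_x = np.array([0.2, 0.7, 0.1])

ce_fy = cross_entropy(f_x, y)  # l_CE(f(x), y)
ce_yf = cross_entropy(y, f_x)  # arguments swapped

print(ce_fy, entropy(y) + kl_divergence(y, f_x))  # equal up to rounding
print(ce_fy, ce_yf)                               # differ: not symmetric
```

Since the entropy of $y$ does not depend on $f$, minimizing the cross-entropy over $f(x)$ is equivalent to minimizing the Kullback-Leibler divergence from $y$ to $f(x)$.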

Cross Entropy Loss