Unified View of Regression and Classification
One-Hot Encoding
Bayes Classifier VS Regression Function
Consider, without loss of generality, the two-class classification problem with the label set $\mathcal{Y}=\{-1,1\}$. The regression function for the binary variable $Y$ is given by $$\begin{align*} \mu(x)&=\mathbb{E}[Y\mid X=x]\\ &=\mathbb{P}(Y=1\mid X=x)\cdot 1+\mathbb{P}(Y=-1\mid X=x)\cdot (-1)\\ &=\mathbb{P}(Y=1\mid X=x)-\mathbb{P}(Y=-1\mid X=x).\end{align*}$$ Hence $\mu(x)>0$ exactly when $\mathbb{P}(Y=1\mid X=x)>\mathbb{P}(Y=-1\mid X=x)$, so the Bayes classifier is nothing but the sign of the regression function, $$ \underset{y\in\{-1,1\}}{\operatorname{argmax}}~\mathbb{P}(Y=y\mid X=x) =\operatorname{sign}(\mu(x)), $$ except at the decision boundary $\{x:\mu(x)=0\}$, where both labels are equally likely and either one can be assigned arbitrarily.
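The identity between the Bayes classifier and the sign of the regression function can be checked numerically. The following is a minimal sketch in plain Python; the function names and the sample probabilities are illustrative assumptions, and ties at the boundary are resolved as $+1$ purely by convention.

```python
def regression_function(p_plus: float) -> float:
    """mu(x) = P(Y=1|x) * 1 + P(Y=-1|x) * (-1) for labels in {-1, 1}."""
    p_minus = 1.0 - p_plus
    return p_plus * 1 + p_minus * (-1)


def bayes_classifier(p_plus: float) -> int:
    """argmax over y in {-1, 1} of P(Y=y|x); ties go to +1 (arbitrary choice)."""
    p_minus = 1.0 - p_plus
    return 1 if p_plus >= p_minus else -1


def sign(value: float) -> int:
    """Sign with the boundary value 0 mapped to +1 (same arbitrary choice)."""
    return 1 if value >= 0 else -1


# Hypothetical values of P(Y=1 | X=x) at a few feature points x.
for p in (0.1, 0.35, 0.5, 0.8, 1.0):
    mu = regression_function(p)
    # The two decision rules agree at every point.
    assert bayes_classifier(p) == sign(mu)
    print(f"P(Y=1|x)={p:.2f}  mu(x)={mu:+.2f}  label={sign(mu):+d}")
```

Only the case $p=0.5$ sits on the decision boundary, where the assigned label depends on the tie-breaking convention rather than on the probabilities.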
Softmax Function
Consider a classification problem with $K$ labels and the one-hot encoded target $(Y^{(1)},\ldots,Y^{(K)}) \in\{0,1\}^K$. Fitting a candidate prediction rule, say $f_k(x)$, separately to each regression function $\mathbb{P}(Y^{(k)}=1\mid X=x)$ may violate the axioms of probability.
Axioms of Probability
For all $x\in\mathcal{X}$: $$\begin{aligned} f_{1}(x),\ldots,f_{K}(x)&\geq 0,\\ \sum_{k=1}^{K} f_{k}(x)&=1.\end{aligned}$$ Relaxing these conditions complicates the interpretation of our estimators and can worsen the statistical performance of our predictions. One way to impose the axioms is to model the posterior probabilities jointly by
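To illustrate how a joint model can enforce the axioms, the sketch below assumes the standard softmax map $f_k(x)=e^{s_k(x)}/\sum_{j=1}^{K}e^{s_j(x)}$ named in the section heading; the exact formula is not given in the text above, and the score values and helper names are hypothetical.

```python
import math


def softmax(scores):
    """Map K real-valued scores to K non-negative numbers that sum to one."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def satisfies_axioms(values, tol=1e-9):
    """Check non-negativity and summation to one, up to a tolerance."""
    return all(v >= 0 for v in values) and abs(sum(values) - 1.0) <= tol


# Hypothetical outputs of K = 3 rules fitted separately at one point x:
# one is negative and the total is not one.
separate_fits = [0.7, 0.5, -0.1]
print(satisfies_axioms(separate_fits))

# Hypothetical unconstrained scores s_1(x), s_2(x), s_3(x) passed through softmax.
joint_fit = softmax([2.0, -1.0, 0.5])
print(joint_fit, satisfies_axioms(joint_fit))
```

Because the normalization couples all $K$ outputs, the constraints hold by construction for any real-valued scores, which is what fitting each $f_k$ in isolation cannot guarantee.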
Cross Entropy Loss
Consider again a classification problem with $K$ labels. The $K$-dimensional regression function for the one-hot encoded target $$(Y^{(1)},\ldots,Y^{(K)})\in\{0,1\}^K$$ is given by $$\mu(x)=(\mu_1(x),\ldots,\mu_K(x))$$ where $$\mu_k(x)=\mathbb{E}\left[Y^{(k)}\mid X=x\right]=\mathbb{P}\left( Y=\mathcal{C}_k\mid X=x\right).$$ For the multivariate target $(Y^{(1)},\ldots,Y^{(K)})$, our label set $\mathcal{Y}$ should be a subset of the vectors obeying the axioms of probability, that is, $$\mathcal{Y}\subset\mathcal{P}_K:=\left\lbrace y\in [0,1]^K: \sum_{k=1}^{K}y_k=1\right\rbrace $$ where $y_k$ denotes the $k$-th entry of the vector $y$. The interval is closed so that the one-hot vectors themselves belong to $\mathcal{P}_K$. Note that $\mathcal{P}_K$ is the space of distributions on $\{\mathcal{C}_1,\ldots,\mathcal{C}_K\}$.
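As a small illustration of the set $\mathcal{P}_K$, the sketch below encodes labels as one-hot vectors and checks membership in the simplex. It assumes the closed interval $[0,1]$ for each entry so that one-hot vectors are included; classes are indexed from $0$ for convenience, and all function names and sample labels are hypothetical.

```python
def one_hot(k, num_classes):
    """Return the one-hot vector for class index k in {0, ..., num_classes - 1}."""
    if not 0 <= k < num_classes:
        raise ValueError("class index out of range")
    return [1.0 if j == k else 0.0 for j in range(num_classes)]


def in_simplex(y, tol=1e-9):
    """Check that y has entries in [0, 1] (assumed closed) and sums to one."""
    return all(0.0 <= v <= 1.0 for v in y) and abs(sum(y) - 1.0) <= tol


def class_frequencies(labels, num_classes):
    """Average the one-hot targets: a crude estimate of mu at a single x."""
    encoded = [one_hot(k, num_classes) for k in labels]
    n = len(encoded)
    return [sum(row[j] for row in encoded) / n for j in range(num_classes)]


# Hypothetical labels observed at the same feature value x, with K = 3.
labels = [0, 2, 2, 1]
estimate = class_frequencies(labels, 3)

print(one_hot(2, 3), in_simplex(one_hot(2, 3)))
print(estimate, in_simplex(estimate))
```

Averaging one-hot targets always stays inside the simplex, which is consistent with $\mu(x)$ being a vector of class probabilities.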