Unified View of Regression and Classification
One-Hot Encoding
Bayes Classifier vs Regression Function

Consider the two-class classification problem with label set $\mathcal{Y}=\{-1,1\}$, without loss of generality. The regression function for the binary variable $Y$ is $$\begin{align*} \mu(x)&=\mathbb{E}[Y\mid X=x]\\&=\mathbb{P}(Y=1\mid X=x)\cdot 1+\mathbb{P}(Y=-1\mid X=x)\cdot (-1)\\&=\mathbb{P}(Y=1\mid X=x)-\mathbb{P}(Y=-1\mid X=x).\end{align*}$$ The Bayes classifier is then nothing but the sign of the regression function, $$\underset{y\in\{-1,1\}}{\operatorname{argmax}}~\mathbb{P}(Y=y\mid X=x)=\operatorname{sign}(\mu(x)),$$ except at the decision boundary $\{x:\mu(x)=0\}$, where either label can be assigned arbitrarily.
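The identity above can be checked numerically. Below is a minimal sketch with a hypothetical posterior $\mathbb{P}(Y=1\mid X=x)$ evaluated on a few feature values; away from the decision boundary, the sign of $\mu(x)$ reproduces the Bayes rule.

```python
import numpy as np

# Hypothetical posterior probabilities P(Y=1 | X=x) at four feature values.
p_pos = np.array([0.9, 0.6, 0.5, 0.2])

# Regression function: mu(x) = P(Y=1|x)*1 + P(Y=-1|x)*(-1) = 2*P(Y=1|x) - 1.
mu = p_pos * 1 + (1 - p_pos) * (-1)

# Bayes classifier: the label y in {-1, 1} with the larger posterior
# (ties at the boundary are broken arbitrarily, here in favor of +1).
bayes = np.where(p_pos >= 1 - p_pos, 1, -1)

# Away from the boundary {x : mu(x) = 0}, sign(mu) agrees with the Bayes rule.
off_boundary = mu != 0
print(np.sign(mu)[off_boundary] == bayes[off_boundary])
```

The third feature value sits exactly on the boundary ($\mu(x)=0$), illustrating the case where the label assignment is arbitrary.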
Softmax Function
Consider a classification problem with $K$ labels and the one-hot encoded target $(Y^{(1)},\ldots,Y^{(K)})\in\{0,1\}^K$. Fitting a candidate prediction rule, say $f_k(x)$, separately to each regression function $\mathbb{P}[Y^{(k)}=1\mid X=x]$ may violate the axioms of probability.

Axioms of Probability

For all $x\in\mathcal{X}$: $$f_{1}(x),\ldots,f_{K}(x)\geq 0,\qquad \sum_{k=1}^{K} f_{k}(x)=1.$$

Relaxing this condition complicates the interpretation of our estimators and worsens the statistical performance of our predictions. One way to impose the axioms is to model the posterior probabilities jointly via the softmax function.
Cross Entropy Loss
Consider again a classification problem with $K$ labels. The $K$-dimensional regression function for the one-hot encoded target $(Y^{(1)},\ldots,Y^{(K)})\in\{0,1\}^K$ is given by $$\mu(x)=(\mu_1(x),\ldots,\mu_K(x))$$ where $$\mu_k(x)=\mathbb{P}\left( Y=\mathcal{C}_k\mid X=x\right).$$ For the multivariate target $(Y^{(1)},\ldots,Y^{(K)})$, our label set $\mathcal{Y}$ should be a subset of the vectors obeying the axioms of probability, that is, $$\mathcal{Y}\subset\mathcal{P}_K:=\left\lbrace y\in [0,1]^K: \sum_{k=1}^{K}y_k=1\right\rbrace,$$ where $y_k$ denotes the $k$-th entry of the vector $y$. Note that $\mathcal{P}_K$ is the space of distributions on $(\mathcal{C}_1,\ldots,\mathcal{C}_K)$, and the one-hot targets are its vertices.
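The teaser stops before the loss itself, but under the usual definition the cross-entropy between a one-hot target $y\in\mathcal{Y}$ and a prediction $p\in\mathcal{P}_K$ is $-\sum_k y_k\log p_k$. A minimal sketch with made-up predictions:

```python
import numpy as np

def cross_entropy(y_onehot, p):
    """Cross-entropy between a one-hot target and a predicted distribution in P_K."""
    eps = 1e-12  # guard against log(0) at the vertices of the simplex
    return -np.sum(y_onehot * np.log(p + eps))

y = np.array([0.0, 1.0, 0.0])        # one-hot target for class C_2
p_good = np.array([0.1, 0.8, 0.1])   # prediction concentrated on the true class
p_bad = np.array([0.7, 0.2, 0.1])    # prediction concentrated on a wrong class

# The loss rewards putting probability mass on the observed class.
print(cross_entropy(y, p_good) < cross_entropy(y, p_bad))
```

For a one-hot target the sum collapses to $-\log p_k$ for the true class $\mathcal{C}_k$, which is why this loss is also called the negative log-likelihood.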