Log Odds

Consider a two-class classification problem with the label set $\mathcal{Y}=\{-1,1\}$. Define the log odds function by $$a(x)=\log\frac{\mathbb{P}(Y=1|X=x)}{\mathbb{P}(Y=-1|X=x)},$$ i.e., the logarithm of the odds $$\mathbb{P}(Y=1|X=x)/\mathbb{P}(Y=-1|X=x).$$

By construction, $$\begin{align*}&a(x)>0\\&\Leftrightarrow \mathbb{P}(Y=1|X=x)>\mathbb{P}(Y=-1|X=x)\end{align*}$$ and $$\begin{align*}&a(x)<0\\&\Leftrightarrow \mathbb{P}(Y=1|X=x)<\mathbb{P}(Y=-1|X=x).\end{align*}$$ Therefore, the Bayes classifier is equal to the sign of the log odds: $$C^{\operatorname{Bayes}}(x)=\operatorname{sign}(a(x)).$$ Note that $a:\mathcal{X}\rightarrow \mathbb{R}$ is a real-valued and usually continuous function. This brings us back to the unified view of classification and regression problems: one can first estimate the log odds $a(x)\in\mathbb{R}$ by regression methods and then take its sign as an estimator of the Bayes classifier.
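A minimal NumPy sketch of this two-step view (the posterior probabilities below are made up for illustration): compute the log odds from $\mathbb{P}(Y=1|X=x)$ and read off the Bayes classifier as its sign.

```python
import numpy as np

# Hypothetical posterior probabilities P(Y=1 | X=x) at a few sample points x.
p_pos = np.array([0.9, 0.6, 0.2, 0.05])

# Log odds a(x) = log( P(Y=1|x) / P(Y=-1|x) ).
log_odds = np.log(p_pos / (1.0 - p_pos))

# Bayes classifier: the sign of the log odds.
bayes = np.where(log_odds > 0, 1, -1)

print(np.round(log_odds, 3))  # [ 2.197  0.405 -1.386 -2.944]
print(bayes)                  # [ 1  1 -1 -1]
```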

Loss Functions for Log Odds Estimation #

To follow the empirical risk minimization paradigm, we need to find a loss function that identifies the log odds function as its ideal prediction rule. As with the support vector machine, we need to extend the label set to $\mathcal{Y}=\mathbb{R}$ to allow for a real-valued prediction rule $f:\mathcal{X}\rightarrow\mathbb{R}=\mathcal{Y}$, although our target variable $Y\in\{-1,1\}\subset \mathbb{R}$ is binary.

Can we use the 0-1 loss by exploiting the relation between the Bayes classifier and the log odds mentioned above? In particular, consider the following loss function:

$$\begin{align*}\ell(f(x),y)=&\ell_{0-1}(\operatorname{sign}(f(x)),y)\\=&\phi(f(x)\cdot y)\end{align*}$$ with the step function $$\phi(x)=\mathbf{1}[x<0].$$

Note that $f(x)$ is a candidate prediction rule for the log odds $a(x)$ here rather than a classifier.

Unfortunately, this loss function is not a good choice because the ideal prediction rule is not unique: multiplying $f(x)$ by any positive constant yields the same loss. This is not surprising, as we lose the information about the magnitude of $a(x)$ when converting it into the Bayes classifier.
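A quick numerical illustration of this scale invariance (hypothetical scores and labels, NumPy only): rescaling $f(x)$ by any positive constant leaves the 0-1 loss of $\operatorname{sign}(f(x))$ unchanged, so this loss cannot recover the magnitude of the log odds.

```python
import numpy as np

def zero_one_loss(f_values, y):
    # phi(f(x) * y) with the step function phi(t) = 1[t < 0], for labels y in {-1, 1}
    return (f_values * y < 0).astype(float)

# Hypothetical real-valued scores f(x) and labels y.
f_values = np.array([2.3, -0.4, 1.1, -3.0])
y = np.array([1, 1, -1, -1])

for c in [0.1, 1.0, 100.0]:
    print(c, zero_one_loss(c * f_values, y).mean())  # same average loss for every c > 0
```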

Similar issues apply to the hinge loss function, which uses the different function $$\phi(x)=\max\{0,1-x\},$$ convex but not strictly convex. We omit the details.

To identify both the sign and the magnitude of the log odds, we need to choose a strictly convex function $\phi$.

Exponential Loss #

One such choice is the exponential loss $$\ell_{\exp}(f(x),y)=\phi_{\exp}(f(x)\cdot y)$$ with the strictly convex function $\phi_{\exp}$ given by $$\phi_{\exp}(x)=\exp\left(-\frac{1}{2}x\right).$$ A pointwise minimization of the conditional expected loss $$\mathbb{P}(Y=1|X=x)\,\phi_{\exp}(f(x))+\mathbb{P}(Y=-1|X=x)\,\phi_{\exp}(-f(x))$$ shows that the ideal prediction rule is exactly the log odds $a(x)$; the factor $\frac{1}{2}$ in the exponent ensures that the minimizer is $a(x)$ itself rather than $\frac{1}{2}a(x)$.
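A small numerical check of this claim (a made-up posterior probability; SciPy's scalar minimizer is assumed to be available):

```python
import numpy as np
from scipy.optimize import minimize_scalar

p = 0.7  # hypothetical posterior probability P(Y=1 | X=x)

def expected_exp_loss(f):
    # E[ exp(-Y f / 2) | X=x ] = p * exp(-f/2) + (1-p) * exp(f/2)
    return p * np.exp(-f / 2) + (1 - p) * np.exp(f / 2)

f_star = minimize_scalar(expected_exp_loss).x
print(f_star, np.log(p / (1 - p)))  # both approximately 0.847
```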

Another possibility is the logistic loss function. Using the axioms of probability, $$\begin{align*}a(x)=&\log \frac{\mathbb{P}(Y=1|X=x)}{1-\mathbb{P}(Y=1|X=x)}\\=&\operatorname{logit}(\mathbb{P}(Y=1|X=x))\end{align*}$$ where $\operatorname{logit}$ denotes the logit function given by $$\operatorname{logit}(p)=\log \frac{p}{1-p}.$$ Conversely, $$\mathbb{P}(Y=1|X=x)=\sigma(a(x))$$ for the sigmoid function $$\sigma(x)=\frac{1}{1+\exp(-x)}.$$ Using the axioms of probability again, $$\begin{align*}\mathbb{P}(Y=-1|X=x)=&1-\sigma(a(x))\\=&\sigma(-a(x)).\end{align*}$$ Using this sigmoid reparameterization of the posterior probabilities, which are the ideal prediction for the cross-entropy loss, we can show that the log odds $a(x)$ is also the ideal prediction rule for the logistic loss:

Logistic Loss #

$$\begin{align*}&\ell_{\operatorname{Logit}}(f(x),y)\\=&\ell_{\operatorname{CE}}\left(\begin{pmatrix}\sigma(f(x))\\\sigma(-f(x))\end{pmatrix},\begin{pmatrix}\mathbf{1}[y=1]\\\mathbf{1}[y=-1]\end{pmatrix}\right)\\=&\begin{cases}-\log\sigma(f(x))&y=1\\-\log\sigma(-f(x))&y=-1\end{cases}\\=&-\log\sigma(f(x)\cdot y)\\=&\phi_{\operatorname{Logit}}(f(x)\cdot y)&\end{align*}$$ where $\ell_{\operatorname{CE}}$ is the cross-entropy loss, $\sigma$ is the sigmoid function, and the strictly convex function $\phi_{\operatorname{Logit}}$ is given by $$\phi_{\operatorname{Logit}}(x)=\log (1+\exp(-x)).$$
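As with the exponential loss, a quick numerical check (same made-up posterior probability, SciPy assumed) confirms that the pointwise minimizer of the conditional expected logistic loss is the log odds:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def phi_logit(x):
    # phi_Logit(x) = log(1 + exp(-x)); logaddexp avoids overflow for very negative x
    return np.logaddexp(0.0, -x)

p = 0.7  # hypothetical posterior probability P(Y=1 | X=x)

def expected_logistic_loss(f):
    # E[ phi_Logit(Y f) | X=x ] = p * phi_Logit(f) + (1-p) * phi_Logit(-f)
    return p * phi_logit(f) + (1 - p) * phi_logit(-f)

f_star = minimize_scalar(expected_logistic_loss).x
print(f_star, np.log(p / (1 - p)))  # both approximately 0.847
```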
