Why the ReLU Function

In deep learning, the ReLU activation function $\sigma(x)=\max\{0,x\}$ is far more common than other activation functions.

Thresholding Effects

One reason is that its derivative $\sigma'(x)=\mathbf{1}[x>0]$ is binary: it passes gradients only for inputs above the threshold zero. Strictly speaking, $\operatorname{ReLU}$ is not differentiable at the origin, but we can set $\sigma'(0)=0$ by convention in gradient descent algorithms.

The zero derivatives produce an effect similar to dropout by shutting down some neurons in the gradient calculation, though deterministically rather than randomly.
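
As a minimal NumPy sketch (the function names `relu` and `relu_grad` are illustrative, not from any particular library), the binary derivative acts as a mask that zeroes out the gradient of every neuron whose pre-activation is not positive:

```python
import numpy as np

def relu(x):
    # ReLU activation: elementwise max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # Binary derivative: 1 where x > 0, 0 elsewhere,
    # using the convention sigma'(0) = 0 at the origin
    return (x > 0).astype(x.dtype)

pre_activation = np.array([-2.0, 0.0, 3.0])
upstream_grad = np.array([0.5, 0.5, 0.5])

print(relu(pre_activation))                       # [0. 0. 3.]
# Backward pass: neurons with non-positive inputs are shut off
print(upstream_grad * relu_grad(pre_activation))  # [0.  0.  0.5]
```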

Projection Property

Furthermore, the ReLU function enjoys the so-called projection property:

$$\sigma(\sigma(x))=\sigma(x).$$

Applying the activation repeatedly leaves the signal unchanged, so it can pass through several layers intact:

$$\underbrace{\sigma(\sigma(\ldots \sigma}_{k~\text{times}}(x)))=\sigma(x).$$
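
This idempotence is easy to verify numerically. The sketch below (reusing the illustrative `relu` helper, rewritten here so the snippet stays self-contained) applies the activation ten times and compares the result with a single application:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
x = rng.standard_normal(5)

once = relu(x)
many = x
for _ in range(10):             # apply the activation k = 10 times
    many = relu(many)

print(np.allclose(once, many))  # True: repeated ReLU equals a single ReLU
```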

In contrast, the sigmoid function $\sigma(x)=\frac{1}{1+\exp(-x)}$ violates this property, and the signal degrades when it is applied too many times across layers:

$$\underbrace{\sigma(\sigma(\ldots \sigma}_{k~\text{times}}(x)))\rightarrow 0.659\ldots,~\text{as}~k\rightarrow\infty.$$

The limit is a constant (the fixed point of the sigmoid), so the original input $x$ becomes irrelevant. The same holds for the hyperbolic tangent function, except its fixed point is zero.
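
A quick numerical sketch of this degradation: iterating the sigmoid from any starting value converges to its fixed point $x^*\approx 0.659$, while iterating tanh decays (slowly) toward zero. The starting values below are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 5.0                   # arbitrary input signal
for _ in range(100):      # pass it through 100 sigmoid "layers"
    x = sigmoid(x)
print(x)                  # ~0.6590, the sigmoid's fixed point

y = 5.0
for _ in range(100_000):  # tanh also forgets its input, but decays slowly
    y = np.tanh(y)
print(y)                  # ~0.004, drifting toward the fixed point 0
```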
