In deep learning, the ReLU activation function $\sigma(x)=\max\{0,x\}$ is far more common than other activation functions.
Thresholding Effects #
One reason is that its derivative $\sigma'(x)=\mathbf{1}[x>0]$ is binary: it passes only the signals above the threshold of zero. Strictly speaking, $\operatorname{ReLU}$ is not differentiable at the origin, but we can set $\sigma'(0)=0$ by convention in gradient descent algorithms.
The zero derivatives create an effect similar to dropout by shutting down some neurons in the gradient calculation, although the selection is deterministic rather than random.
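As a quick illustration, here is a minimal NumPy sketch (my own, not tied to any particular framework) of how the binary derivative acts as a data-dependent mask on the backward signal:

```python
import numpy as np

def relu(x):
    """ReLU activation: max{0, x} applied elementwise."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Binary derivative 1[x > 0]; the value at x == 0 is set to 0 by convention."""
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
upstream = np.ones_like(x)            # gradient arriving from the next layer
print(relu(x))                        # [0.  0.  0.  0.5 2. ]
print(relu_grad(x) * upstream)        # [0. 0. 0. 1. 1.]  <- neurons with x <= 0 are shut down
```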
Projection Property #
Furthermore, the ReLU function enjoys the so-called projection property:
$$\sigma(\sigma(x))=\sigma(x).$$
Applying the activation repeatedly does not alter the signal, which can therefore pass through several layers unchanged:
$$\underbrace{\sigma(\sigma(\ldots \sigma}_{k~\text{times}}(x)))=\sigma(x).$$
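The property is easy to check numerically. The following NumPy sketch (again an illustration rather than library code) applies ReLU ten times and compares the result with a single application:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
x = rng.standard_normal(5)

once = relu(x)
many = x
for _ in range(10):                   # apply the activation k = 10 times
    many = relu(many)

print(np.allclose(once, many))        # True: sigma^k(x) == sigma(x)
```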
In contrast, the sigmoid function $\sigma(x)=\frac{1}{1+\exp(-x)}$ violates this property, and the signal degrades when it is applied too many times across layers:
$$\underbrace{\sigma(\sigma(\ldots \sigma}_{k~\text{times}}(x)))\rightarrow 0.659\ldots,~\text{as}~k\rightarrow\infty.$$
The limit is a constant, which makes the input $x$ irrelevant. The same holds for the hyperbolic tangent function, except that its limiting constant is $0$.
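A short numerical check (same NumPy setup as above) shows the sigmoid iterates settling at the same constant regardless of the starting input, while the tanh iterates drift toward zero:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x0 in (-5.0, 0.1, 7.0):
    s = t = x0
    for _ in range(100):              # pass the value through k = 100 activations
        s, t = sigmoid(s), np.tanh(t)
    print(f"start {x0:+.1f}: sigmoid -> {s:.6f}, tanh -> {t:.4f}")
# sigmoid settles at ~0.659046 for every start;
# tanh decreases toward 0, though slowly, since tanh'(0) = 1.
```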