Shallow Neural Network
Feedforward Networks
Why Nonlinear Models # Consider a scalar target variable $Y\in\mathbb{R}$ and two independent dummy features $$X=(X_1,X_2)\in\{0,1\}^2.$$ Suppose that $$\mathbb{P}(X_j=1)=\mathbb{P}(X_j=0)=0.5,~j\in\{1,2\},$$ and the true regression function equals the Exclusive Or (XOR) function given by $$\mu(x)=\mathbf{1}[x_1\neq x_2].$$ In practice, however, we do not know this population regression function, and suppose we restrict ourselves to linear models for convenience. In other words, we only consider prediction rules $f$ from the space \begin{align*}\mathcal{H}=\{ &f(x;\theta)=\beta_1x_1+\beta_2x_2+\beta_0:\\& \theta=(\beta_0,\beta_1,\beta_2)\in\mathbb{R}^{3}\},\end{align*} which, unfortunately, excludes the XOR function.
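To see the limitation concretely, one can fit a least-squares linear model to the four XOR input/output pairs; the best fit in $\mathcal{H}$ turns out to be the constant prediction $0.5$. A minimal NumPy sketch (variable names are illustrative, not from the text):

```python
import numpy as np

# The four possible inputs and the XOR targets mu(x) = 1[x1 != x2].
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

# Design matrix with an intercept column for f(x) = b0 + b1*x1 + b2*x2.
A = np.column_stack([np.ones(4), X])

# Least-squares estimate of theta = (b0, b1, b2).
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
preds = A @ theta

print(theta)  # approximately [0.5, 0.0, 0.0]
print(preds)  # constant 0.5: no linear model can separate XOR
```

By symmetry the optimal coefficients are $\beta_1=\beta_2=0$ and $\beta_0=0.5$, so the linear model predicts $0.5$ on every input and incurs a squared error of $0.25$ at each of the four points.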
Universal Approximation
Universal Approximation Theorem # The XOR function is merely one example of the limitations of linear models. In real-life problems, we do not know the true regression function, which can be highly nonlinear in many situations. Neural networks form a systematic family of models thanks to their universal approximation property: for any sufficiently smooth function $\mu$ on a compact set with finitely many discontinuities, there exists a feedforward network $f$ that approximates it arbitrarily well if: …
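The statement is qualitative, but it is easy to illustrate numerically. The sketch below (an assumption-laden illustration, not part of the text) builds a one-hidden-layer ReLU network on $[0,1]$ with fixed hidden weights and fits only the output layer by least squares; the approximation error shrinks as the number of hidden units grows:

```python
import numpy as np

# Target: a smooth function on the compact set [0, 1].
mu = lambda x: np.sin(2 * np.pi * x)

M = 50                                 # number of hidden units
knots = np.linspace(0.0, 1.0, M, endpoint=False)
x = np.linspace(0.0, 1.0, 500)

# Hidden layer: ReLU features relu(x - t_k) plus a bias column.
# Fixing the hidden weights and solving only for the output layer by
# least squares suffices for this one-dimensional illustration.
H = np.column_stack([np.ones_like(x)] + [np.maximum(x - t, 0.0) for t in knots])
coef, *_ = np.linalg.lstsq(H, mu(x), rcond=None)
approx = H @ coef

max_err = np.max(np.abs(approx - mu(x)))
print(max_err)  # small, and it shrinks further as M grows
```

The network here is piecewise linear with breakpoints at the knots, so the fit improves at the rate of piecewise-linear interpolation as $M$ increases.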
Multiple Outputs
Shallow Feedforward Networks # We can stack scalar feedforward networks to allow for multiple outputs. This extension is essential when dealing with multivariate regression functions or one-hot encoded targets. From now on, we call this general form a shallow feedforward network. A shallow feedforward network is a function $f:\mathbb{R}^{d}\rightarrow \mathbb{R}^K$ of the form $$f(x)=\begin{pmatrix}f_{1}(x)\\\vdots\\f_{K}(x)\end{pmatrix} =\boldsymbol{\sigma} \begin{pmatrix}a_{1}^{(2)}(x)\\\vdots\\ a_{K}^{(2)}(x)\end{pmatrix},$$ where we extend the notation $\boldsymbol{\sigma}:\mathbb{R}^K\rightarrow\mathbb{R}^{K}$ to allow it to be: the softmax function $\boldsymbol{\sigma}(a)=\operatorname{SoftMax}(a)$ for classification problems; the identity function $\boldsymbol{\sigma}(a)=a$ for regression problems; or $\boldsymbol{\sigma}(a)=(\sigma(a_1),\ldots,\sigma(a_K))^T$ for some element-wise activation $\sigma$.
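Under these definitions, the forward pass of a shallow network with a softmax output can be sketched as follows (the layer sizes and function names are illustrative assumptions):

```python
import numpy as np

def softmax(a):
    # Subtract the row max for numerical stability before exponentiating.
    z = np.exp(a - a.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def shallow_forward(x, W1, b1, W2, b2, output="softmax"):
    """Shallow feedforward network f: R^d -> R^K.

    W1: (M, d) hidden weights, b1: (M,) hidden biases,
    W2: (K, M) output weights, b2: (K,) output biases.
    """
    h = np.maximum(x @ W1.T + b1, 0.0)   # element-wise ReLU activation
    a2 = h @ W2.T + b2                   # pre-activations a^{(2)}(x)
    return softmax(a2) if output == "softmax" else a2

rng = np.random.default_rng(0)
d, M, K = 3, 5, 4
x = rng.normal(size=(2, d))              # a batch of two inputs
W1, b1 = rng.normal(size=(M, d)), rng.normal(size=M)
W2, b2 = rng.normal(size=(K, M)), rng.normal(size=K)

p = shallow_forward(x, W1, b1, W2, b2)
print(p.shape)        # (2, 4)
print(p.sum(axis=1))  # with softmax, each output row sums to 1
```

Passing `output="identity"` (any value other than `"softmax"` here) returns the raw pre-activations, matching the regression case above.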
Training Shallow Neural Networks
Weight Decay # The training algorithms of neural networks follow the empirical risk minimization paradigm. Given the network architecture (i.e., the number of units in each layer) and the activation function, we parameterize the shallow feedforward network as $$f(x;w,b),$$ where the vector $w\in\mathbb{R}^{KM+Md}$ collects all the weights and the vector $b\in\mathbb{R}^{M+K}$ collects all the biases. The weight decay method controls the squared norm of the weights, $\left\|w\right\|^2=\sum_{i,j,l}(w_{i,j}^{(l)})^2$, and minimizes the penalized empirical risk …
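With this parameterization, the penalized objective adds $\lambda\|w\|^2$ to the empirical risk; note that only the weights, not the biases, enter the penalty. A small sketch (the helper names and the value of $\lambda$ are assumptions):

```python
import numpy as np

def empirical_risk(preds, y):
    # Squared-error empirical risk (1/n) * sum_i (f(x_i) - y_i)^2.
    return np.mean((preds - y) ** 2)

def penalized_risk(preds, y, weights, lam):
    # Weight decay: empirical risk + lambda * ||w||^2, summing the
    # squared entries of every weight matrix; biases are unpenalized.
    penalty = sum(np.sum(W ** 2) for W in weights)
    return empirical_risk(preds, y) + lam * penalty

rng = np.random.default_rng(1)
preds, y = rng.normal(size=10), rng.normal(size=10)
W1, W2 = rng.normal(size=(5, 3)), rng.normal(size=(2, 5))

lam = 0.01
print(penalized_risk(preds, y, [W1, W2], lam))
```

In a gradient step, the penalty contributes the term $2\lambda w$ to the gradient, so each update shrinks the weights multiplicatively toward zero, which is where the name "weight decay" comes from.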
Implicit Regularization
The weight decay method is an example of a so-called explicit regularization method. For neural networks, implicit regularization methods are also popular in applications for their effectiveness and simplicity, despite their less developed theoretical properties. In this section, we discuss three implicit regularization methods. Stochastic Gradient Descent # A first example is the stochastic gradient descent (SGD) method. At each step, SGD uses a minibatch $B_t\subset \{1,\ldots,n\}$ to replace the batch gradient $g(w(t),b(t))$ with a minibatch gradient given by …
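For a loss that averages over examples, the minibatch gradient averages over $B_t$ only, and averaging the minibatch gradients across a disjoint partition into equal-size batches recovers the full-batch gradient. A sketch on a least-squares problem (the function names and step size are illustrative):

```python
import numpy as np

def full_gradient(w, X, y):
    # Gradient of the empirical risk (1/n) * sum_i (x_i^T w - y_i)^2.
    n = len(y)
    return 2.0 / n * X.T @ (X @ w - y)

def minibatch_gradient(w, X, y, batch):
    # Same formula, averaged over the minibatch B_t only.
    return full_gradient(w, X[batch], y[batch])

rng = np.random.default_rng(2)
n, d = 120, 3
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
w = rng.normal(size=d)

# One SGD step with a randomly sampled minibatch of size 20.
batch = rng.choice(n, size=20, replace=False)
w_next = w - 0.1 * minibatch_gradient(w, X, y, batch)

# Sanity check: equal-size disjoint batches average to the full gradient.
batches = np.split(rng.permutation(n), 6)
avg = np.mean([minibatch_gradient(w, X, y, b) for b in batches], axis=0)
print(np.allclose(avg, full_gradient(w, X, y)))  # True
```

Because each example appears in exactly one batch of the partition, the minibatch gradient is an unbiased estimate of the batch gradient when $B_t$ is sampled uniformly.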