#### Weight Decay #

The training algorithms of neural networks follow the empirical risk minimization paradigm. Given the network architecture (i.e., the number of units for all layers) and the activation function, we parameterize the shallow feedforward network by
$$f(x;w,b)$$
where the vector $w\in\mathbb{R}^{KM+Md}$ collects all the *weights* and the vector $b\in\mathbb{R}^{M+K}$ collects the *biases*.

The *weight decay* method controls the total weight length $\left\|w\right\|^2=\sum_{i,j,l}(w_{i,j}^{(l)})^2$, and minimizes the
penalized empirical risk

$$\begin{align*}&L_S(w,b) +\frac{\lambda}{2}\left\|w\right\|^2\\&=\frac{1}{n}\sum_{i=1}^{n} \ell(f(X_i;w,b),Y_i)+\frac{\lambda}{2}\left\|w\right\|^2,~\lambda\geq 0.\end{align*}$$

We do *not* penalize the biases for ensuring our estimation is shift-equivariant in every layer. Moreover, we typically choose $\ell$ as the half squared loss for the regression problems but cross-entropy loss for classification problems.

#### Gradient Descent #

It is difficult, if possible, to solve the weight decay estimator analytically, so we resort to numerical solutions. We typically use an iterative method. Starting with an initial value $(w(0),b(0))$, the *batch gradient descent* algorithm updates the parameters for each step $t$:

$$ \begin{bmatrix}w(t+1)\\b(t+1)\end{bmatrix} =\begin{bmatrix}w(t)\\b(t)\end{bmatrix} -\eta\cdot g(w(t),b(t))$$

where the hyperparameter $\eta>0$ is called the **learning rate** and $g$ is the gradient function given by

$$ \begin{align*}&g(w,b)\\=& \begin{bmatrix}\nabla_{w} (L_{S}(w,b)+\frac{\lambda}{2}\left\|w\right\|^2 )\\\nabla_{b} (L_{S}(w,b)+\frac{\lambda}{2}\left\|w\right\|^2 )\end{bmatrix} \\=&\frac{1}{n}\sum_{i=1}^{n} \begin{bmatrix} \nabla_{w}\ell(f(X_i;w,b),Y_i)\\ \nabla_{b}\ell(f(X_i;w,b),Y_i) \end{bmatrix} +\lambda\begin{bmatrix}w\\\mathbf{0}\end{bmatrix}.\end{align*}$$

When the algorithm converges to a local minimum, meaning that $w(t+1)\approx w(t)$ and $b(t+1)\approx b(t)$ for large $t$, the update rule implies the first-order condition $g(w(t),b(t))\approx 0$. To compute the gradients $\nabla_{w}\ell(f(X_i;w,b),Y_i)$ and $\nabla_{b}\ell(f(X_i;w,b),Y_i)$, one needs to use backpropagation or automatic differentiation. We do not pursue more details here, but refer interested readers to Section 5.3 of Bishop(2006) or Section 18.2 of Efron and Hastie (2021) for backpropagation, and Baydin et al.(2018) for automatic differentiation.

#### Weight Space Symmetry #

There is **no guarantee** that our solution will be a global optimum. We can swap the weights by permuting the hidden units while keeping the outputs the same, and this property is called the *weight-space symmetry*. The lack of perfect identification creates many local minimum points, and so the penalized errors become **non-convex** in the weights $
\{w_{m,j}^{(1)}\}$ for the hidden layer. A typical symptom is that the prediction may be sensitive to the choice of the initial values for our gradient descent method. In practice, we may try different (random) initial values and then average the predictions.