Regularization means calming down your model so that it doesn’t start modeling noise. Any data we are interested in can be broken down into pattern plus noise. Regularization enables our model to capture the pattern without absorbing too much of the noise, at the cost of some model flexibility. One way to do so is to keep the model weights from growing too large, so that the model's predictions stay smooth.
The ultimate goal of regularization is to improve generalization. In L1 or L2 regularization, for example, the regularization term prevents model parameters from getting too big. With the L2 penalty in particular, the weights are encouraged to spread out across many parameters rather than concentrate in a few. This prevents the model from relying too heavily on any particular parameter.
In practice, when the data are not complex enough to keep a large model from overfitting, it might be tempting to choose a smaller network with less capacity. However, it is usually better to use a larger network with strong regularization in this case. The reason has to do with optimization. A simple model with fewer parameters usually converges to a local minimum faster, but that minimum is often a bad one: the loss varies widely across the local minima of such a model, so there is a high chance of landing in one with a large loss. A more expressive model with more parameters usually has many more local minima and takes longer to converge, but the variance of their loss is small. With proper regularization, we usually end up in a good one. In short, regularization is the preferred way to control overfitting in a neural network.
Regularization adds a penalty term to the final loss function. The penalty is usually a function of all the parameters that measures their size. It is scaled by a hyperparameter $\lambda$ that controls the strength of the penalty and therefore the degree of model flexibility: a larger $\lambda$ means a less flexible model. The value of $\lambda$ is often chosen by cross-validation.
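As a rough illustration of the penalty term, here is a minimal sketch that assumes a linear model without a bias term and mean squared error as the base loss. The function names (`mse`, `l1_penalty`, `l2_penalty`, `regularized_loss`) and the argument `lam` (standing in for $\lambda$) are made up for this example.

```python
def mse(weights, xs, ys):
    """Mean squared error of a linear model (no bias term, for brevity)."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = sum(w * xi for w, xi in zip(weights, x))
        total += (pred - y) ** 2
    return total / len(xs)


def l1_penalty(weights):
    """Sum of absolute values of the weights."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Sum of squared weights."""
    return sum(w * w for w in weights)


def regularized_loss(weights, xs, ys, lam, penalty):
    """Base loss plus lam times a penalty on the size of the weights."""
    return mse(weights, xs, ys) + lam * penalty(weights)


# Toy data that the weights fit perfectly, so only the penalty contributes.
xs = [[1.0, 0.0], [0.0, 1.0]]
ys = [1.0, -2.0]
weights = [1.0, -2.0]

loss_l1 = regularized_loss(weights, xs, ys, lam=0.1, penalty=l1_penalty)
loss_l2 = regularized_loss(weights, xs, ys, lam=0.1, penalty=l2_penalty)
```

Setting `lam` to zero recovers the unregularized loss, while increasing it makes large weights more expensive relative to fitting the data.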
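Note that to use regularization, we usually need to normalize the data first. The penalty depends on the size of the weights, and the size of a weight depends on the scale of its feature, so features measured on different scales would otherwise be penalized unequally.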
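One common way to normalize is to standardize each feature to zero mean and unit variance. The sketch below assumes that choice (other schemes, such as min-max scaling, exist); the helper name `standardize` is hypothetical.

```python
import statistics


def standardize(column):
    """Rescale one feature column to zero mean and unit (population) variance."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]


# The same measurements in two different units.
in_kilometers = [1.0, 2.0, 3.0]
in_meters = [1000.0, 2000.0, 3000.0]

scaled_km = standardize(in_kilometers)
scaled_m = standardize(in_meters)
```

After standardization the two columns coincide, so the penalty on the corresponding weight no longer depends on the unit the feature happened to be recorded in.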
The L2 norm shrinks the parameters toward zero but rarely makes any of them exactly zero, so all features typically stay in the model.
The L1 norm tends to shrink the coefficients of less important features to exactly zero, effectively removing those features from the model. This is a way to achieve feature selection.
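The difference between the two penalties can be seen in a simplified one-dimensional setting. Assume a single weight `w` whose unregularized best value is `z`, with the data-fit term written as `0.5 * (w - z)**2`. Under that assumption the penalized minimizers have closed forms, sketched below; the function names `l1_shrink` and `l2_shrink` are made up for this example.

```python
def l1_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*abs(w), known as soft-thresholding."""
    magnitude = max(abs(z) - lam, 0.0)
    return magnitude if z >= 0 else -magnitude


def l2_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + 0.5*lam*w**2."""
    return z / (1.0 + lam)


lam = 0.5
small, large = 0.3, 2.0

# L1 sets the small coefficient to exactly zero and subtracts lam from the large one.
l1_small = l1_shrink(small, lam)
l1_large = l1_shrink(large, lam)

# L2 scales both coefficients by the same factor; neither becomes zero.
l2_small = l2_shrink(small, lam)
l2_large = l2_shrink(large, lam)
```

In this setting, any coefficient with `abs(z) <= lam` is zeroed out by the L1 penalty, whereas the L2 penalty only rescales it, which matches the behavior described above.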