Europe/Lisbon —

Ard Louis

Ard Louis, University of Oxford

One of the most surprising properties of deep neural networks (DNNs) is that they perform best in the overparameterized regime. We are taught early on that having more parameters than data points is a terrible idea. So why do DNNs work so well in a regime where classical learning theory predicts they should heavily overfit? By adapting the coding theorem from algorithmic information theory (which every physicist should learn about!) we show that DNNs are exponentially biased at initialisation to functions that have low descriptional (Kolmogorov) complexity. In other words, DNNs have an inbuilt Occam's razor, a bias towards simple functions. We next show that stochastic gradient descent (SGD), the most popular optimisation method for DNNs, follows the same bias, and so does not itself explain the good generalisation of DNNs. Our approach naturally leads to a marginal-likelihood PAC-Bayes generalisation bound which performs better than any other bounds on the market. Finally, we discuss why this bias towards simplicity allows DNNs to perform so well, and speculate on what this may tell us about the natural world.