09.10.2024 (Wednesday)

DS Why do overparameterized networks generalize well?

regular seminar Peter Latham (UCL)

at:
13:30 - 14:30
KCL, Strand
room: S5.20
abstract:

Most modern deep networks are overparameterized: the number of training
examples, P, is much smaller than the number of parameters, N. According
to classical learning theory, these kinds of overparameterized networks
should overfit, but they tend not to: increasing both depth and width
almost always decreases generalization error. While we don't have a
complete theory of why this happens, we do have a theory of why it should
not be surprising. The theory draws heavily on linear regression, y = w·x,
where it is well known that generalization error can be small for
overparameterized models if the true weight, w, lies in the subspace
spanned by the large-eigenvalue eigenvectors of the input covariance, and
the eigenvalue spectrum
is sufficiently nonuniform. Our main contribution is to calculate the
eigenvalue spectrum of the linearized dynamics of deep networks and show
that for large N and P the spectrum is approximately a power law -- at any
point in learning.
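
As a rough illustration of the linear-regression argument above (a minimal
sketch, not code from the talk), the snippet below fits an overparameterized
minimum-norm regressor to data whose input covariance has a power-law
eigenvalue spectrum and whose true weight lies in the large-eigenvalue
subspace; the values of N, P, alpha, and k are illustrative choices of mine,
not the speaker's.

import numpy as np

rng = np.random.default_rng(0)

N = 2000      # number of parameters (features); illustrative value
P = 200       # number of training examples, P << N; illustrative value
alpha = 1.5   # assumed power-law exponent of the eigenvalue spectrum
k = 20        # true weight supported on the top-k eigen-directions

# Power-law covariance spectrum: lambda_i proportional to i^(-alpha)
eigvals = np.arange(1, N + 1, dtype=float) ** (-alpha)

# True weight concentrated in the large-eigenvalue subspace
w_true = np.zeros(N)
w_true[:k] = rng.normal(size=k)

# Inputs x ~ N(0, diag(lambda)); noiseless targets y = x . w_true
X = rng.normal(size=(P, N)) * np.sqrt(eigvals)
y = X @ w_true

# Minimum-norm interpolating solution, i.e. what gradient descent from a
# zero initialization converges to in the overparameterized linear case
w_hat = np.linalg.pinv(X) @ y

# Generalization error E_x[(x . (w_hat - w_true))^2] = sum_i lambda_i * delta_i^2
gen_err = np.sum(eigvals * (w_hat - w_true) ** 2)
signal = np.sum(eigvals * w_true ** 2)
print(f"relative generalization error: {gen_err / signal:.3e}")

With these (assumed) settings the relative generalization error comes out
small even though P << N; by the same argument, flattening the spectrum
(alpha = 0) or moving w_true into the small-eigenvalue tail should drive it
toward one.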

Keywords: