TY - THES
AU - Fischer, Kirsten Margaret
TI - Mechanics of deep neural networks beyond the Gaussian limit
VL - 110
PB - RWTH Aachen University
M3 - Dissertation
CY - Jülich
M1 - RWTH-2025-03269
SN - 978-3-95806-815-5
T2 - Schriften des Forschungszentrums Jülich. Reihe Information
SP - 1 online resource (xvi, 138 pages) : illustrations, diagrams
PY - 2025
N1 - Print edition: 2025. - Online edition: 2025. - Also published on the publication server of RWTH Aachen University
N1 - Dissertation, RWTH Aachen University, 2025
AB - Current developments in artificial intelligence and neural network technology outpace our theoretical understanding of these networks. In the limit of infinite width, networks at initialization are well described by the neural network Gaussian process (NNGP): the distribution of outputs is a zero-mean Gaussian characterized by its covariance, or kernel, across data samples. In the lazy learning regime, where network parameters change only slightly from their initial values, networks trained with gradient descent are characterized by the neural tangent kernel. Despite the success of these Gaussian limits for deep neural networks, they do not capture important properties such as network trainability or feature learning. In this work, we go beyond the Gaussian limits of deep neural networks by obtaining higher-order corrections from field-theoretic descriptions of neural networks. From a statistical point of view, two complementary averages have to be considered: the distribution over data samples and the distribution over network parameters. We investigate both cases, gaining insights into the working mechanisms of deep neural networks. In the former case, we study how data statistics are transformed across network layers to solve classification tasks. We find that, while the hidden layers are well described by a non-linear mapping of the Gaussian statistics, the input layer extracts information from higher-order cumulants of the data. The developed theoretical framework allows us to investigate the relevance of different cumulant orders for classification: on MNIST, Gaussian statistics account for most of the classification performance, and higher-order cumulants are required only to fine-tune the networks for the last few percentage points. In contrast, more complex data sets such as CIFAR-10 require the inclusion of higher-order cumulants for reasonable performance, which explains why fully-connected networks underperform compared to convolutional networks. In the latter case, we investigate two different aspects. First, we derive the network kernels for the Bayesian network posterior of fully-connected networks and observe a non-linear adaptation of the kernels to the target, which is not present in the NNGP. These feature corrections result from fluctuation corrections to the NNGP in finite-size networks, which allow the networks to adapt to the data. While fluctuations become larger near criticality, we uncover a trade-off between criticality and feature-learning scales in networks as a driving mechanism for feature learning. Second, we study the trainability of residual networks by deriving the network prior at initialization. From this, we obtain the response function as a leading-order correction to the NNGP, which describes signal propagation in the network. We find that scaling the residual branch by a hyperparameter improves signal propagation, since it avoids saturation of the non-linearity and thus information loss. Finally, we observe a strong dependence of the optimal scaling of the residual branch on the network depth but only a weak dependence on other network hyperparameters, which explains the universal success of depth-dependent scaling of the residual branch. Overall, we derive statistical field theories for deep neural networks that allow us to obtain systematic corrections to the Gaussian limits. In this way, we take a step towards a better mechanistic understanding of information processing and data representations in neural networks.
LB - PUB:(DE-HGF)11 ; PUB:(DE-HGF)3
DO - DOI:10.18154/RWTH-2025-03269
UR - https://publications.rwth-aachen.de/record/1008953
ER -