1st Israeli Phys-4-DL Day

Applying physics-style research methodologies to theory and phenomenology of deep learning


Location: Steinhardt Museum of Natural History, Tel Aviv University

Tuesday, June 18, 2024


We are happy to announce the 1st Israeli Phys-4-DL meeting! The aim is to have an informal gathering, bringing together the local folks interested in applying physics-style research methodologies to deep learning. The meeting is sponsored by The Center for AI & Data Science at Tel Aviv University.

Program (abstracts below)

9:30 Coffee, light refreshments and mingling
The last several years have seen many exciting new developments in microscopic theories of DNNs: NTK, Eigen-learning, Kernel Scaling, Saad & Solla approaches, Kernel Adaptation, and DMFT, to name a few. Still, having so many drastically different formal approaches makes it difficult to draw parallels and insights, and limits cross-talk in the field. In condensed matter physics, a field which faced similar issues, field theory has emerged not only as a powerful analytical tool but also as a common meeting ground. In the first half of this talk, I’ll give a pedagogical introduction to how two-layer DNNs in the feature learning regime can be written as a field theory and illustrate how this may unify many of those different approaches. I’ll then describe how one can map the phenomenon of Grokking to that of first-order phase transitions using this formalism and discuss its implications for sample complexity in the proportional scaling regime.
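For readers who want a concrete handle on one of the kernel objects mentioned above, here is a minimal sketch (not material from the talk) of the empirical neural tangent kernel of a two-layer network at initialization, assuming a standard 1/sqrt(N) readout scaling and tanh activations:

```python
# Minimal sketch (an illustration, not the talk's code): the empirical NTK of a
# two-layer network f(x) = a . phi(Wx) / sqrt(N), computed from the analytic
# gradients with respect to the readout weights a and input weights W.
import numpy as np

def empirical_ntk(X, W, a, phi=np.tanh, dphi=lambda z: 1 - np.tanh(z) ** 2):
    """Theta[i, j] = grad_theta f(x_i) . grad_theta f(x_j)."""
    N = W.shape[0]                    # hidden width
    Z = X @ W.T                       # pre-activations, shape (P, N)
    H, Hp = phi(Z), dphi(Z)           # activations and their derivatives
    # readout-weight contribution + input-weight contribution
    return (H @ H.T + ((Hp * a) @ (Hp * a).T) * (X @ X.T)) / N

rng = np.random.default_rng(0)
d, N, P = 5, 2000, 8                  # input dim, width, number of samples
X = rng.normal(size=(P, d)) / np.sqrt(d)
W = rng.normal(size=(N, d))
a = rng.normal(size=N)
Theta = empirical_ntk(X, W, a)
print(Theta.shape, np.allclose(Theta, Theta.T))   # (8, 8) True
```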
Deep learning is delivering unprecedented performance when applied to various data modalities, yet there are data distributions over which it utterly fails. The question of what makes a data distribution suitable for deep learning is a fundamental open problem in the field. In this talk I will present a recent theory aiming to address the problem via tools from quantum physics. The theory establishes that certain neural networks are capable of accurate prediction over a data distribution if and only if the data distribution admits low quantum entanglement under certain partitions of features. This brings forth practical methods for adaptation of data to neural networks, and vice versa. Experiments with widespread models over various datasets will demonstrate the findings. An underlying theme of the talk will be the potential of physics to advance our understanding of the relation between deep learning and real-world data. The talk is based on NeurIPS 2023 papers co-authored with my students Noam Razin, Yotam Alexander, Nimrod De La Vega and Tom Verbin.
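As a rough illustration of the kind of quantity involved, the following sketch computes a generic bipartite entanglement entropy: reshape a normalized state across a partition of features into a matrix and take the entropy of its squared singular values. This is not the specific measure used in the papers; the dimensions and states are arbitrary examples.

```python
# Minimal sketch: entanglement entropy of a normalized vector psi across a
# bipartition of features, via the SVD of its matricization.
import numpy as np

def entanglement_entropy(psi, dim_left, dim_right):
    """psi: normalized vector of length dim_left * dim_right."""
    M = psi.reshape(dim_left, dim_right)
    s = np.linalg.svd(M, compute_uv=False)
    p = s ** 2
    p = p[p > 1e-12]
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
# product state -> zero entanglement; generic random state -> high entanglement
u, v = rng.normal(size=4), rng.normal(size=4)
product = np.kron(u, v); product /= np.linalg.norm(product)
generic = rng.normal(size=16); generic /= np.linalg.norm(generic)
print(entanglement_entropy(product, 4, 4))   # ~ 0
print(entanglement_entropy(generic, 4, 4))   # > 0
```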
Neural networks have recently attracted much interest as useful representations of quantum many-body ground states. Most attention has been directed at their representability properties, while possible limitations on finding the desired optimal state have not been suitably explored. By leveraging well-established results applicable in the context of infinite width, specifically the renowned neural tangent kernel and conjugate kernel, we conduct a comprehensive analysis of the convergence and initialization characteristics of the method. We illustrate how these characteristics depend on the interplay among these kernels, the Hamiltonian, and the basis used for its representation. We introduce and motivate novel performance metrics and explore the conditions for their optimization. Building on these findings, we elucidate a substantial dependence of the effectiveness of this approach on the selected basis, demonstrating that so-called “stoquastic” Hamiltonians are more amenable to solution through neural networks.
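For context, here is a minimal sketch of the standard definition of stoquasticity (all off-diagonal matrix elements real and non-positive in the chosen basis) and of its basis dependence; this is generic background, not code from the talk, and the 2x2 example is illustrative.

```python
# Minimal sketch: check whether a Hamiltonian matrix is stoquastic in a given basis,
# and show that the property depends on the basis.
import numpy as np

def is_stoquastic(H, tol=1e-12):
    """True if all off-diagonal matrix elements are real and non-positive."""
    off = H - np.diag(np.diag(H))
    return bool(np.all(np.abs(off.imag) < tol) and np.all(off.real <= tol))

# a transverse-field term -X is stoquastic in the computational basis ...
X = np.array([[0.0, 1.0], [1.0, 0.0]])
H = -X
print(is_stoquastic(H))          # True

# ... but flipping the sign of one basis state maps -X to +X, which is not
D = np.diag([1.0, -1.0])
print(is_stoquastic(D @ H @ D))  # False
```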
Deep learning models, such as wide neural networks, can be conceptualized as nonlinear dynamical physical systems characterized by a multitude of interacting degrees of freedom. Such systems, in the limit of an infinite number of degrees of freedom, tend to exhibit simplified dynamics. This work examines gradient-descent-based learning algorithms that display a linear structure in their parameter dynamics, reminiscent of the neural tangent kernel. We establish that this apparent linearity arises from weak correlations between the first and higher-order derivatives of the hypothesis function with respect to the parameters, taken around their initial values, suggesting that these weak correlations are the underlying cause of the observed linearization in such systems. As a case in point, we showcase this structure of weak correlations in neural networks in the large-width limit. Utilizing this relationship between linearity and weak correlations, we derive a bound on the deviation from linearity during the training trajectory of stochastic gradient descent. To facilitate our proof, we introduce a novel method to characterize the asymptotic behavior of random tensors. We empirically verify our findings and present a comparison between the linearization of the system and the observed correlations.
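As a toy illustration of the linearization being discussed (a sketch under standard NTK-style scaling assumptions, not the paper's construction), one can compare a two-layer network after a single gradient step with its first-order Taylor expansion around initialization and watch the gap shrink with width:

```python
# Minimal sketch: deviation of a two-layer network from its linearization after one
# gradient step on a single example (x, y), for increasing width N.
import numpy as np

def net(x, W, a):
    return a @ np.tanh(W @ x) / np.sqrt(len(a))

def grads(x, W, a):
    h = np.tanh(W @ x)
    ga = h / np.sqrt(len(a))                            # df/da
    gW = (a * (1 - h ** 2))[:, None] * x / np.sqrt(len(a))   # df/dW
    return gW, ga

rng = np.random.default_rng(2)
d, eta, y = 8, 0.5, 1.0
x = rng.normal(size=d) / np.sqrt(d)
for N in [10, 100, 1000, 10000]:
    W, a = rng.normal(size=(N, d)), rng.normal(size=N)
    gW, ga = grads(x, W, a)
    err = net(x, W, a) - y
    dW, da = -eta * err * gW, -eta * err * ga           # one SGD step on (x, y)
    exact = net(x, W + dW, a + da)
    linear = net(x, W, a) - eta * err * (np.sum(gW * gW) + ga @ ga)
    print(f"N={N:6d}  |exact - linear| = {abs(exact - linear):.2e}")
```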
12:00 Light lunch
How is it that we can describe the complex universe with simple, fundamental rules?
The equivalence between vastly different, complex physical systems, when observed from afar, allows us to make accurate predictions without analyzing the microscopic details. Conversely, by reducing such systems to their minimal constituents, we can describe phenomena that would otherwise seem inscrutable. In this talk, I will discuss how these notions of universality and reductionism extend beyond the natural universe, to the synthetic world of neural networks. First, I will discuss how aspects of universality appear in the data used to train models, as well as in the models themselves. I will then adopt a reductionist approach to explain emergent phenomena in deep learning, taking Grokking, or delayed generalization, as a case study. Finally, I will explore how aspects of criticality in physical systems correlate with some surprising features of neural networks.
A key property of deep neural networks (DNNs) is their ability to learn new features during training. This intriguing aspect of deep learning stands out most clearly in recently reported Grokking phenomena. While mainly reflected as a sudden increase in test accuracy, Grokking is also believed to be a beyond-lazy-learning/Gaussian Process (GP) phenomenon involving feature learning. Here we apply a recent development in the theory of feature learning, the adaptive kernel approach, to two teacher-student models with cubic-polynomial and modular addition teachers. We provide analytical predictions on feature learning and Grokking properties of these models and demonstrate a mapping between Grokking and the theory of phase transitions. We show that after Grokking, the state of the DNN is analogous to the mixed phase following a first-order phase transition. In this mixed phase, the DNN generates useful internal representations of the teacher that are sharply distinct from those before the transition.
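For concreteness, here is a minimal sketch of the standard modular-addition teacher-student dataset used in Grokking experiments; the modulus, train fraction, and one-hot encoding are illustrative choices and may differ from the talk's exact setup.

```python
# Minimal sketch: a modular-addition teacher, (i + j) mod p, with the student
# trained on only a random fraction of the p*p input pairs.
import numpy as np

def modular_addition_dataset(p=23, train_frac=0.5, seed=0):
    rng = np.random.default_rng(seed)
    pairs = np.array([(i, j) for i in range(p) for j in range(p)])
    labels = (pairs[:, 0] + pairs[:, 1]) % p
    perm = rng.permutation(len(pairs))
    n_train = int(train_frac * len(pairs))
    train, test = perm[:n_train], perm[n_train:]
    # one-hot encode the two input tokens
    X = np.concatenate([np.eye(p)[pairs[:, 0]], np.eye(p)[pairs[:, 1]]], axis=1)
    return (X[train], labels[train]), (X[test], labels[test])

(train_X, train_y), (test_X, test_y) = modular_addition_dataset()
print(train_X.shape, test_X.shape)   # (264, 46) (265, 46)
```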
A main theoretical puzzle is why over-parameterized Neural Networks (NNs) generalize well when trained to zero error (i.e., so they interpolate the data). Usually, the NN is trained with Stochastic Gradient Descent (SGD) or one of its variants. However, recent empirical work examined the generalization of a random NN that interpolates the data: the NN was sampled from a seemingly uniform prior over the parameters, conditioned on the NN perfectly classifying the training set. Interestingly, such an NN sample typically generalized as well as SGD-trained NNs. I will talk about our new paper, where we prove that such a random NN interpolator typically generalizes well if there exists an underlying narrow “teacher NN” that agrees with the labels. Specifically, we show that such a ‘flat’ prior over the NN parametrization induces a rich prior over the NN functions, due to the redundancy in the NN structure. In particular, this creates a bias towards simpler functions, which require fewer relevant parameters to represent — enabling learning with a sample complexity approximately proportional to the complexity of the teacher (roughly, the number of non-redundant parameters), rather than the student’s.
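A toy "guess-and-check" version of the sampling procedure described above can be sketched as follows; this is an illustration on a tiny synthetic problem with a narrow teacher (the sign of the first coordinate), not the paper's experiments, and the width and sample sizes are arbitrary.

```python
# Minimal sketch: sample network parameters from a Gaussian prior until the sampled
# network classifies every training point correctly, then measure test accuracy.
import numpy as np

rng = np.random.default_rng(3)

def sample_interpolator(X, y, width=16, max_tries=200_000):
    d = X.shape[1]
    for _ in range(max_tries):
        W = rng.normal(size=(width, d))
        a = rng.normal(size=width)
        preds = np.sign(np.tanh(X @ W.T) @ a)
        if np.all(preds == y):            # keep only perfect interpolators
            return W, a
    raise RuntimeError("no interpolator found; relax the problem or raise max_tries")

# tiny training set labelled by a narrow "teacher": sign of the first coordinate
X = rng.normal(size=(8, 3))
y = np.sign(X[:, 0])
W, a = sample_interpolator(X, y)
test_X = rng.normal(size=(1000, 3))
test_acc = np.mean(np.sign(np.tanh(test_X @ W.T) @ a) == np.sign(test_X[:, 0]))
print(f"test accuracy of the sampled interpolator: {test_acc:.2f}")
```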
Recurrent neural networks (RNNs) are powerful tools for studying learning and computations in the brain. A classical type of RNN studied in neuroscience is the attractor network, which maps external states to stable internal network states known as attractors. These networks serve as canonical models for diverse brain functions like motor control, sensory amplification, memory, decision-making, and spatial navigation. For example, continuous attractor networks are a dominant model used for understanding how analog variables, such as position or head direction, are encoded in the brain. Within these networks, variables are represented along a continuum of persistent neuronal states, forming a manifold attractor. The prevailing framework supporting such attractor manifolds in RNNs assumes a symmetric connectivity structure of the network. However, this is inconsistent with the diverse synaptic connectivity and neuronal representations observed in experiments. I will present a new theory of manifold attractors that do not rely on symmetries in connectivity but instead emerge in trained RNNs. In such trained networks, a continuous representational space emerges from a small set of stimuli used for training, reminiscent of the inductive bias property of neural networks. Moreover, the theory shows how the geometry of the representation and the level of synaptic heterogeneity affect the network's response to external inputs in an analytically tractable way. Our work demonstrates how manifold attractors can cope with diverse neuronal responses, imperfections in the geometry of the manifold attractor, and a high level of synaptic heterogeneity. It suggests that the functional properties of manifold attractors in the brain can be inferred from the overlooked asymmetries in connectivity and in the low-dimensional representation of the encoded variable.
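For background, here is a minimal simulation of the classical symmetric ring attractor that the talk contrasts with; the cosine connectivity, saturating transfer function, and parameter values are illustrative assumptions, not the talk's trained-RNN model.

```python
# Minimal sketch: a symmetric ring network forms a bump of activity whose position
# depends only on the initial condition -- a manifold of attractors encoding an angle.
import numpy as np

N, J0, J1, I, dt, tau = 200, -2.0, 4.0, 1.0, 0.1, 1.0
theta = np.linspace(0, 2 * np.pi, N, endpoint=False)
J = (J0 + J1 * np.cos(theta[:, None] - theta[None, :])) / N   # symmetric connectivity

def simulate(r0, steps=2000):
    r = r0.copy()
    for _ in range(steps):
        r += dt / tau * (-r + np.clip(J @ r + I, 0.0, 1.0))    # rate dynamics
    return r

rng = np.random.default_rng(4)
for _ in range(3):
    r = simulate(rng.uniform(0, 1, size=N))
    print("bump peak at angle:", round(float(theta[np.argmax(r)]), 2))
```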
15:10 Coffee and cookies
The problem of auto-associative memory was historically addressed by constructing neural networks with gradient-based dynamics, where the dynamics follow an energy function. From the seminal work of Hopfield through its many extensions, memory stability is achieved by designing the energy so that memories are close to local minima, but such dynamics are far from biologically plausible (e.g., they might assume symmetric connectivity) and thus fail to offer insights about auto-associative memory in the brain. We take a different path by asking whether a simple, biologically plausible dynamical system can solve this task. We provide an affirmative answer to this question by framing it as an optimisation problem and showing in simulations that it is possible to store memories in the network connectivity such that they are stable fixed points of the recurrent dynamics. Using the replica method from statistical physics, we derive a new analytical theory for the ability to store sparse, graded activity patterns as fixed points of recurrent dynamics and – crucially – for the dynamic stability of those memories during recall. We show how memory stability depends on surprising factors such as the shape of neurons’ input-output function, thus making concrete, experimentally testable predictions about brain circuits hypothesised to be involved in a biological implementation of auto-associative memory. From a broader perspective, questions of dynamical stability in recurrent neural networks have usually been addressed only for networks with random connectivity, whereas we provide a novel analytical method capable of describing cases where the connectivity is a non-random solution to an optimisation problem.
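For reference, the classical Hopfield baseline that the abstract departs from can be sketched in a few lines (Hebbian storage and asynchronous recall); this is the textbook model, not the talk's biologically plausible alternative, and the network size and load are illustrative.

```python
# Minimal sketch: store binary patterns with the Hebbian rule and recall one of them
# from a corrupted cue via asynchronous sign updates that descend the Hopfield energy.
import numpy as np

rng = np.random.default_rng(5)
N, P = 200, 10
patterns = rng.choice([-1, 1], size=(P, N))
W = (patterns.T @ patterns) / N               # Hebbian rule
np.fill_diagonal(W, 0.0)                      # no self-connections

def recall(cue, steps=5 * N):
    s = cue.copy()
    for i in rng.integers(0, N, size=steps):  # asynchronous updates
        s[i] = 1 if W[i] @ s >= 0 else -1
    return s

cue = patterns[0].copy()
flip = rng.choice(N, size=N // 5, replace=False)
cue[flip] *= -1                               # corrupt 20% of the bits
print("overlap with stored memory:", (recall(cue) @ patterns[0]) / N)
```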
We study inductive bias in Transformers in the infinitely over-parameterized Gaussian process limit and argue that Transformers tend to be biased towards more permutation-symmetric functions in sequence space. We show that the representation theory of the symmetric group can be used to give quantitative analytical predictions when the dataset is symmetric under permutations of tokens. We show that in common setups, one can derive tight bounds in the form of a scaling law for the learnability as a function of the context length. Finally, we argue that the WikiText dataset does indeed possess a degree of permutation symmetry.
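The symmetry in question can be seen directly in a single self-attention layer without positional encodings, which is equivariant to permutations of the tokens; the sketch below is an illustration of that fact, not the paper's Gaussian-process calculation, and the dimensions are arbitrary.

```python
# Minimal sketch: permuting the input tokens of a (positional-encoding-free)
# self-attention layer simply permutes its outputs.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]))
    return A @ V

rng = np.random.default_rng(6)
T, d = 5, 8                                   # context length, model dimension
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
perm = rng.permutation(T)
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)
print(np.allclose(out[perm], out_perm))       # True: permutation equivariance
```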
This talk explores the intrinsic low-dimensional nature of real-world tasks and environments and its implications for neural network structure. I will discuss two key studies highlighting this principle in deep and recurrent neural networks. The first study demonstrates that low-dimensional error signals suffice to effectively train deep network layers, simplifying computational demands and yielding biologically plausible features. The second study shows that large low-rank neural networks can approximate arbitrary dynamical systems, with error bounds decreasing exponentially with the rank. Together, these studies reveal that aligning network architectures with the low-dimensional constraints of tasks enhances efficiency and biological relevance.
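As a minimal illustration of the low-rank principle (a generic rank-2 recurrent network, not the constructions of the studies above), activity components outside the span of the connectivity's output vectors decay, so the dynamics collapse onto a low-dimensional subspace; the correlation between the m and n vectors below is an assumption made so that activity is sustained rather than decaying to zero.

```python
# Minimal sketch: with rank-R connectivity J = M N^T / num_neurons, any activity
# component orthogonal to span{m_r} decays, confining the dynamics to R dimensions.
import numpy as np

rng = np.random.default_rng(7)
N, R, dt = 500, 2, 0.05
M = rng.normal(size=(N, R))                    # output directions m_r
Nv = 2.4 * M + 0.5 * rng.normal(size=(N, R))   # selection directions n_r, correlated with m_r
J = (M @ Nv.T) / N                             # rank-R connectivity

P = M @ np.linalg.pinv(M)                      # projector onto span{m_r}
x = rng.normal(size=N)
for step in range(1, 401):
    x += dt * (-x + J @ np.tanh(x))            # rate dynamics
    if step % 100 == 0:
        frac = np.linalg.norm(x - P @ x) / np.linalg.norm(x)
        print(f"t = {step * dt:4.1f}   fraction of activity outside span(m): {frac:.1e}")
```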
17:15 Farewell

Organizers

Zohar Ringel

Yohai Bar Sinai
