Planned seminars


Ben Edelman, Harvard University

How do deep neural networks learn to construct useful features? Why do self-attention-based networks such as transformers perform so well on combinatorial tasks such as language learning? Why do some capabilities of networks emerge "discontinuously" as the computational resources used for training are scaled up? We will present perspectives on these questions through the lens of a particular class of simple synthetic tasks: learning sparse Boolean functions. In part one, we will show that the hypothesis class of one-layer transformers can learn these functions in a statistically efficient manner. This leads to a view of each layer of a transformer as creating new "variables" out of sparse combinations of the previous layer's outputs. In part two, we will focus on the classic task of learning sparse parities, which is statistically easy but computationally difficult. We will demonstrate that SGD on various neural networks (transformers, MLPs, etc.) successfully learns sparse parities, with computational efficiency close to known lower bounds. Moreover, the training loss curves show no apparent progress for a long time and then drop sharply late in training. We show that despite this apparently delayed breakthrough in performance, hidden progress is being made throughout the course of training.

Based on joint work with Surbhi Goel, Sham Kakade, Cyril Zhang, Boaz Barak, and Eran Malach.
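To make the sparse-parity task from part two concrete, here is a minimal sketch of the data distribution; the dimensions and sample count are toy values chosen for illustration, not the settings used in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sparse parity: inputs are uniform over {-1, +1}^n and the label is the
# product of the bits in a small hidden subset S, |S| = k << n.
# (Toy parameters for illustration; the talk's settings may differ.)
n, k, m = 30, 3, 1000
S = rng.choice(n, size=k, replace=False)  # hidden relevant coordinates

X = rng.choice([-1.0, 1.0], size=(m, n))
y = np.prod(X[:, S], axis=1)              # label in {-1, +1}

# Statistically easy, computationally hard: for k >= 2 every individual
# coordinate is uncorrelated with the label, so no first-order statistic
# reveals which coordinates are in S.
corrs = np.array([np.mean(X[:, j] * y) for j in range(n)])
```

The near-zero correlations are what make gradient-based learners appear stuck for a long stretch of training before the breakthrough.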


Sara A. Solla, Northwestern University, Department of Neuroscience and Department of Physics and Astronomy

The ability to simultaneously record the activity of tens to thousands of neurons has allowed us to analyze the computational role of population activity as opposed to single-neuron activity. Recent work on a variety of cortical areas suggests that neural function may be built on the activation of population-wide activity patterns, the neural modes, rather than on the independent modulation of individual neural activity. These neural modes, the dominant covariation patterns within the neural population, define a low-dimensional neural manifold that captures most of the variance in the recorded neural activity. We refer to the time-dependent activation of the neural modes as their latent dynamics and argue that latent cortical dynamics within the manifold are the fundamental and stable building blocks of neural population activity.
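As a toy illustration of the neural-mode picture (not the speaker's analysis pipeline), one can simulate a population driven by a few shared latent signals and recover the dominant covariation patterns with PCA; all parameter values here are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate N neurons whose firing is a mixture of d shared latent signals
# plus private noise. (Invented toy setup, not recorded data.)
T, N, d = 500, 80, 3                                  # time bins, neurons, latents
latents = np.cumsum(rng.normal(size=(T, d)), axis=0)  # slowly drifting latents
W = rng.normal(size=(d, N))                           # mixing of modes into neurons
rates = latents @ W + 0.1 * rng.normal(size=(T, N))   # population activity

# PCA via SVD of the mean-centered activity: the leading right singular
# vectors are the neural modes, and projecting onto them gives the
# time-dependent latent dynamics on the low-dimensional manifold.
X = rates - rates.mean(axis=0)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
var_explained = s**2 / np.sum(s**2)
modes = Vt[:d]                   # dominant covariation patterns
latent_dynamics = X @ modes.T    # activation of the modes over time
```

Because only three latents drive the whole population, the top three modes capture nearly all the variance, which is the sense in which the manifold is low-dimensional.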


Valentin De Bortoli, Center for Sciences of Data, ENS Ulm, Paris

Generative modeling is the task of drawing new samples from an underlying distribution known only through an empirical measure. A myriad of models exist to tackle this problem, with applications in image and speech processing, medical imaging, forecasting, and protein modeling, to name a few. Among these methods, diffusion models are a powerful new class of generative models that exhibit remarkable empirical performance. They consist of a “noising” stage, whereby a diffusion is used to gradually add Gaussian noise to the data, and a generative model, which entails a “denoising” process defined by approximating the time-reversal of the diffusion. In this talk we discuss three aspects of diffusion models. First, we will dive into the methodology behind diffusion models. Second, we will present some of their theoretical guarantees, with an emphasis on their behavior under the so-called manifold hypothesis. These guarantees are non-vacuous and provide insight into the empirical behavior of the models. Finally, we will present an extension of diffusion models to the optimal transport setting and introduce Diffusion Schrödinger Bridges.
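The noising/denoising structure can be sketched in a setting where no learning is needed: for 1-D Gaussian data the score of every noised marginal is known in closed form, so the reverse process can be run exactly (a real diffusion model would approximate this score with a neural network; all numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Data distribution N(mu, sigma^2); T noising steps with per-step rate beta.
mu, sigma = 2.0, 0.5
T, beta = 200, 0.02

# Forward ("noising") step: x_t = sqrt(1-beta) x_{t-1} + sqrt(beta) eps,
# so after t steps x_t ~ N(a*mu, a^2 sigma^2 + 1 - a^2), a = (1-beta)^(t/2).
def score(x, t):
    a = (1.0 - beta) ** (t / 2.0)
    var = a**2 * sigma**2 + 1.0 - a**2
    return (a * mu - x) / var  # grad log density of the noised marginal

# Reverse ("denoising") process: ancestral sampling driven by the score,
# starting from the reference Gaussian that the forward process converges to.
x = rng.normal(size=5000)
for t in range(T, 0, -1):
    x = (x + beta * score(x, t)) / np.sqrt(1.0 - beta)
    if t > 1:  # no noise on the final denoising step
        x = x + np.sqrt(beta) * rng.normal(size=x.shape)
```

After the reverse pass, the samples approximately recover the original N(mu, sigma^2), up to discretization error in the time-reversal.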


Memming Park, Champalimaud Foundation

Neural dynamical systems with stable attractor structures, such as point attractors and continuous attractors, are widely hypothesized to underlie meaningful temporal behavior that requires working memory. However, perhaps counterintuitively, good working memory is not sufficient to support the learning signals needed to adapt to changes in the temporal structure of the environment. We show that, in addition to the well-known continuous attractors, periodic and quasi-periodic attractors are also fundamentally capable of supporting the learning of arbitrarily long temporal relationships. Given the fine-tuning problem of continuous attractors and their lack of temporal fluctuations, we believe the less explored quasi-periodic attractors are uniquely qualified for learning to produce temporally structured behavior. Our theory has wide implications for the design of artificial learning systems and makes predictions about the observable signatures of biological neural dynamics that can support temporal-dependence learning. Based on our theory, we developed a new initialization scheme for artificial recurrent neural networks that outperforms standard methods on tasks requiring the learning of temporal dynamics. Finally, we speculate on biological implementations and make predictions about neuronal dynamics.
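One way to give a recurrent network quasi-periodic dynamics at initialization is to build its recurrent matrix from 2x2 rotation blocks, so that all eigenvalues sit on the unit circle at (generically incommensurate) frequencies. This is an illustrative sketch under that assumption, not the speaker's published scheme:

```python
import numpy as np

rng = np.random.default_rng(3)

def rotation_init(n_units, freqs=None):
    """Block-diagonal rotation matrix: orthogonal, eigenvalues on the
    unit circle, so autonomous linear dynamics neither decay nor explode.
    (Hypothetical helper for illustration.)"""
    assert n_units % 2 == 0
    if freqs is None:
        freqs = rng.uniform(0.01, 0.5, size=n_units // 2)  # radians per step
    W = np.zeros((n_units, n_units))
    for i, th in enumerate(freqs):
        c, s = np.cos(th), np.sin(th)
        W[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return W

W = rotation_init(8)

# Orthogonality preserves the state norm for arbitrarily long horizons,
# which keeps signals (and gradients) from vanishing or exploding across
# long temporal gaps.
h = rng.normal(size=8)
norms = [np.linalg.norm(np.linalg.matrix_power(W, t) @ h) for t in (1, 100, 1000)]
```

Unlike a continuous attractor, the rotating state keeps fluctuating in time, which is the property the abstract argues is needed to carry useful learning signals about temporal structure.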