
Ben Edelman, Harvard University

How do deep neural networks learn to construct useful features? Why do self-attention-based networks such as transformers perform so well on combinatorial tasks such as language learning? Why do some capabilities of networks emerge "discontinuously" as the computational resources used for training are scaled up? We will present perspectives on these questions through the lens of a particular class of simple synthetic tasks: learning sparse Boolean functions. In part one, we will show that the hypothesis class of one-layer transformers can learn these functions in a statistically efficient manner. This leads to a view of each layer of a transformer as creating new "variables" out of sparse combinations of the previous layer's outputs. In part two, we will focus on the classic task of learning sparse parities, which is statistically easy but computationally hard. We will demonstrate that SGD on various neural networks (transformers, MLPs, etc.) successfully learns sparse parities, with computational efficiency close to known lower bounds. Moreover, the training curves display no apparent progress for a long time and then drop sharply late in training. We show that despite this apparent delayed breakthrough in performance, hidden progress is in fact being made throughout the course of training.
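As a concrete illustration (not part of the talk itself), here is a minimal sketch of the sparse parity task in NumPy: inputs are uniform over {-1, +1}^n, and the label is the product of a fixed subset of k coordinates. The choice of the first k coordinates as the relevant subset, and the parameter values, are arbitrary assumptions for the example.

```python
import numpy as np

def sparse_parity_data(n_samples, n, k, rng):
    """Sample the (n, k)-sparse parity task.

    Inputs are uniform in {-1, +1}^n; the label is the parity (product)
    of k fixed coordinates -- here coordinates 0..k-1, an arbitrary choice.
    """
    X = rng.choice([-1, 1], size=(n_samples, n))
    y = np.prod(X[:, :k], axis=1)  # +1 iff an even number of -1s among the k bits
    return X, y

rng = np.random.default_rng(0)
X, y = sparse_parity_data(1000, n=50, k=3, rng=rng)
```

Because the labels depend on only k of the n coordinates, the task is statistically easy (few samples identify the subset in principle), yet any single coordinate is uncorrelated with the label, which is the source of the computational difficulty the talk discusses.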

Based on joint work with Surbhi Goel, Sham Kakade, Cyril Zhang, Boaz Barak, and Eran Malach:
https://arxiv.org/abs/2110.10090
https://arxiv.org/abs/2207.08799