Deep learning has transformed Machine Learning and Artificial Intelligence in the past decade. It raises fundamental questions for mathematics and theory of computer science, since it relies upon solving large-scale nonconvex problems via gradient descent and its variants. This talk will be an introduction to mathematical questions raised by deep learning, and some partial understanding obtained in recent years with respect to optimization, generalization, self-supervised learning, privacy etc.

Given a set of distances amongst points, determining what metric representation is most "consistent" with the input distances or the metric that best captures the relevant geometric features of the data is a key step in many machine learning algorithms. In this talk, we focus on 3 specific metric constrained problems, a class of optimization problems with metric constraints: metric nearness (Brickell et al. (2008)), weighted correlation clustering on general graphs (Bansal et al. (2004)), and metric learning (Bellet et al. (2013); Davis et al. (2007)).

Because of the large number of constraints in these problems, however, these and other researchers have been forced to restrict either the kinds of metrics learned or the size of the problem that can be solved. We provide an algorithm, PROJECT AND FORGET, that uses Bregman projections with cutting planes, to solve metric constrained problems with many (possibly exponentially) inequality constraints. We also prove that our algorithm converges to the global optimal solution. Additionally, we show that the optimality error decays asymptotically at an exponential rate. We show that using our method we can solve large problem instances of three types of metric constrained problems, out-performing all state of the art methods with respect to CPU times and problem sizes.

Finally, we discuss the adaptation of PROJECT AND FORGET to specific types of metric constraints, namely tree and hyperbolic metrics.

In this talk I will review essentials of quantum field theory (QFT) and demonstrate how the function-space distribution of many neural networks (NNs) shares similar properties. This allows, for instance, computation of correlators of neural network outputs in terms of Feynman diagrams and a direct analogy between non-Gaussian corrections in NN distributions and particle interactions. Some cases yield divergences in perturbation theory, requiring the introduction of regularization and renormalization. Potential advantages of this perspective will be discussed, including a duality between function-space and parameter-space descriptions of neural networks.

Graph neural networks (GNNs) have become the standard toolkit for analyzing and learning from data on graphs. As the field grows, it becomes critical to identify key architectures and validate new ideas that generalize to larger, more complex datasets. Unfortunately, it has been increasingly difficult to gauge the effectiveness of new models in the absence of a standardized benchmark with consistent experimental settings. In this work, we introduce a reproducible GNN benchmarking framework, with the facility for researchers to add new models conveniently for arbitrary datasets. We demonstrate the usefulness of our framework by presenting a principled investigation into the recent Weisfeiler-Lehman GNNs (WL-GNNs) compared to message passing-based graph convolutional networks (GCNs) for a variety of graph tasks with medium-scale datasets.

Algorithmic decisions are now being used on a daily basis, and based on Machine Learning (ML) processes that may be complex and biased. This raises several concerns given the critical impact that biased decisions may have on individuals or on society as a whole. Not only unfair outcomes affect human rights, they also undermine public trust in ML and AI. In this talk, we will address fairness issues of ML models based on decision outcomes, and we will show how the simple idea of feature dropout followed by an ensemble approach can improve model fairness without compromising its accuracy. To illustrate we will present a general workflow that relies on explainers to tackle process fairness, which essentially measures a model's reliance on sensitive or discriminatory features. We will present different applications and empirical settings that show improvements not only with respect to process fairness but also other fairness metrics.

Massive data collection holds the promise of a better understanding of complex phenomena and ultimately, of better decisions. An exciting opportunity in this regard stems from the growing availability of perturbation / intervention data (drugs, knockouts, overexpression, etc.) in biology. In order to obtain mechanistic insights from such data, a major challenge is the development of a framework that integrates observational and interventional data and allows predicting the effect of yet unseen interventions or transporting the effect of interventions observed in one context to another. I will present a framework for causal structure discovery based on such data and highlight the role of overparameterized autoencoders. We end by demonstrating how these ideas can be applied for drug repurposing in the current SARS-CoV-2 crisis.

Linear (and generalized linear) regression (LR) is an old, but still essential, statistical tool: its goal is to learn to predict a (response) variable from a linear combination of other (explanatory) variables. A central problem in LR is the selection of relevant variables, because using fewer variables tends to yield better generalization and because this identification may be meaningful (e.g., which genes are relevant to predict a certain disease). In the past quarter-century, variable selection (VS) based on sparsity-inducing regularizers has been a central paradigm, the most famous example being the LASSO, which has been intensively studied, extended, and applied.

In many contexts, it is natural to have highly-correlated variables (e.g., several genes that are strongly co-regulated), thus simultaneously relevant as predictors. In this case, sparsity-based VS may fail: it may select an arbitrary subset of these variables and it is unstable. Moreover, it is often desirable to identify all the relevant variables, not just an arbitrary subset thereof, a goal for which several approaches have been proposed. This talk will be devoted to a recent class of such approaches, called ordered weighted l1 (OWL). The key feature of OWL is that it is provably able to explicitly identify (i.e. cluster) sufficiently-correlated features, without having to compute these correlations. Several theoretical results characterizing OWL will be presented, including connections to the mathematics of economic inequality. Computational and optimization aspects will also be addressed, as well as recent applications in subspace clustering, learning Gaussian graphical models, and deep neural networks.

Identifying the relevant coarse-grained degrees of freedom in a complex physical system is a key stage in developing effective theories. The renormalization group (RG) provides a framework for this task, but its practical execution in unfamiliar systems is fraught with ad hoc choices. Machine learning approaches, on the other hand, though promising, often lack formal interpretability: it is unclear what relation, if any, the architecture- and training-dependent learned "relevant" features bear to standard objects of physical theory. I will present recent results addressing both issues. We develop a fast algorithm, the RSMI-NE, employing state-of-art results in machine-learning-based estimation of information-theoretic quantities to construct the optimal coarse-graining. We use it to develop a new approach to identifying the most relevant field theory operators describing a statistical system, which we validate on the example of interacting dimer model. I will also discuss formal results underlying the method: we establish equivalence between the information-theoretic notion of relevance defined in the Information Bottleneck (IB) formalism of compression theory, and the field-theoretic relevance of the RG. We show analytically that for statistical physical systems the "relevant" degrees of freedom found using IB compression indeed correspond to operators with the lowest scaling dimensions, providing a dictionary connecting two distinct theoretical toolboxes.

The past few decades have witnessed a significant research effort in the field of Lyapunov model based control design. In parallel, optimal control and optimization model based design have also expanded their range of applications, and nowadays, receding horizon approaches can be considered a mature field for particular classes of control systems.

In this talk, I will argue that Lyapunov based techniques play an important role for analysis of model based optimization methodologies and moreover, both approaches can be combined for control design resulting in powerful frameworks with formal guarantees of robustness, stability, performance, and safety. Illustrative examples in the area of motion control of autonomous robotic vehicles will be presented for Autonomous Underwater Vehicles (AUVs), Autonomous Surface Vehicles (ASVs) and Unmanned Aerial Vehicles (UAVs).

We compare the complexity of training classical and quantum machine learning (ML) models for predicting outcomes of physical experiments. The experiments depend on an input parameter x and involve the execution of a (possibly unknown) quantum process $E$. Our figure of merit is the number of runs of $E$ needed during training, disregarding other measures of complexity. A classical ML performs a measurement and records the classical outcome after each run of $E$, while a quantum ML can access $E$ coherently to acquire quantum data; the classical or quantum data is then used to predict outcomes of future experiments. We prove that, for any input distribution $D(x)$, a classical ML can provide accurate predictions on average by accessing $E$ a number of times comparable to the optimal quantum ML. In contrast, for achieving accurate prediction on all inputs, we show that exponential quantum advantage exists in certain tasks. For example, to predict expectation values of all Pauli observables in an $n-$qubit system, we present a quantum ML using only $O(n)$ data and prove that a classical ML requires $2^{\Omega(n)}$ data.

In the last two decades the field of nonequilibrium quantum many-body physics has seen a rapid development driven, in particular, by the remarkable progress in quantum simulators, which today provide access to dynamics in quantum matter with an unprecedented control. However, the efficient numerical simulation of nonequilibrium real-time evolution in isolated quantum matter still remains a key challenge for current computational methods especially beyond one spatial dimension. In this talk I will present a versatile and efficient machine learning inspired approach. I will first introduce the general idea of encoding quantum many-body wave functions into artificial neural networks. I will then identify and resolve key challenges for the simulation of real-time evolution, which previously imposed significant limitations on the accurate description of large systems and long-time dynamics. As a concrete example, I will consider the dynamics of the paradigmatic two-dimensional transverse field Ising model, where we observe collapse and revival oscillations of ferromagnetic order and demonstrate that the reached time scales are comparable to or exceed the capabilities of state-of-the-art tensor network methods.

Many tasks in fluid mechanics, such as design optimization and control, are challenging because fluids are nonlinear and exhibit a large range of scales in both space and time. This range of scales necessitates exceedingly high-dimensional measurements and computational discretization to resolve all relevant features, resulting in vast data sets and time-intensive computations. Indeed, fluid dynamics is one of the original big data fields, and many high-performance computing architectures, experimental measurement techniques, and advanced data processing and visualization algorithms were driven by decades of research in fluid mechanics. Machine learning constitutes a growing set of powerful techniques to extract patterns and build models from this data, complementing the existing theoretical, numerical, and experimental efforts in fluid mechanics. In this talk, we will explore current goals and opportunities for machine learning in fluid mechanics, and we will highlight a number of recent technical advances. Because fluid dynamics is central to transportation, health, and defense systems, we will emphasize the importance of machine learning solutions that are interpretable, explainable, generalizable, and that respect known physics.

In this presentation, I will introduce some traditional Reinforcement Learning problems and algorithms, and analyze how some problems can be avoided and convergence results obtained using a two-time scale variation of the usual stochastic approximation approach.

This variation was inspired by the practical successes of Deep Q-Learning in attaining superhuman performance at some classical Atari games by Deepmind's research team in 2015. Machine Learning practical successes like this often have no corresponding explaining theory. The work that will be presented intends to contribute to that goal.

Joint work with Diogo Carvalho and Francisco Melo from INESC-ID.

Optimal transport (OT) has recently gained lot of interest in machine learning. It is a natural tool to compare in a geometrically faithful way probability distributions. It finds applications in both supervised learning (using geometric loss functions) and unsupervised learning (to perform generative model fitting). OT is however plagued by the curse of dimensionality, since it might require a number of samples which grows exponentially with the dimension. In this talk, I will explain how to leverage entropic regularization methods to define computationally efficient loss functions, approximating OT with a better sample complexity.

More information and references can be found on the website of our book "Computational Optimal Transport", https://optimaltransport.github.io/

Autonomous robots that can assist humans in situations of daily life have been a long standing vision of robotics, artificial intelligence, and cognitive sciences. A first step towards this goal is to create robots that can learn tasks triggered by environmental context or higher level instruction. However, learning techniques have yet to live up to this promise as only few methods manage to scale to high-dimensional manipulator or humanoid robots. In this talk, we investigate a general framework suitable for learning motor skills in robotics which is based on the principles behind many analytical robotics approaches. It involves generating a representation of motor skills by parameterized motor primitive policies acting as building blocks of movement generation, and a learned task module that transforms these movements into motor commands. We discuss learning on three different levels of abstraction, i.e., learning for accurate control is needed to execute, learning of motor primitives is needed to acquire simple movements, and learning of the task-dependent „hyperparameters“ of these motor primitives allows learning complex tasks. We discuss task-appropriate learning approaches for imitation learning, model learning and reinforcement learning for robots with many degrees of freedom. Empirical evaluations on a several robot systems illustrate the effectiveness and applicability to learning control on an anthropomorphic robot arm. These robot motor skills range from toy examples (e.g., paddling a ball, ball-in-a-cup, juggling) to playing robot table tennis against a human being and manipulation of various objects.

Recent empirical successes of deep learning have exposed significant gaps in our fundamental understanding of learning and optimization mechanisms. Modern best practices for model selection are in direct contradiction to the methodologies suggested by classical analyses. Similarly, the efficiency of SGD-based local methods used in training modern models, appeared at odds with the standard intuitions on optimization.

First, I will present evidence, empirical and mathematical, that necessitates revisiting classical notions, such as over-fitting. I will continue to discuss the emerging understanding of generalization, and, in particular, the "double descent" risk curve, which extends the classical U-shaped generalization curve beyond the point of interpolation.

Second, I will discuss why the landscapes of over-parameterized neural networks are generically never convex, even locally. Instead, as I will argue, they satisfy the Polyak-Lojasiewicz condition across most of the parameter space instead, which allows SGD-type methods to converge to a global minimum.

A key piece of the puzzle remains - how does optimization align with statistics to form the complete mathematical picture of modern ML?

Many challenging image processing tasks can be described by an ill-posed linear inverse problem: deblurring, deconvolution, inpainting, compressed sensing, and superresolution all lie in this framework. Recent advances in machine learning and image processing have illustrated that it is often possible to learn a regularizer from training data that can outperform more traditional approaches by large margins. In this talk, I will describe the central prevailing themes of this emerging area and a taxonomy that can be used to categorize different problems and reconstruction methods. We will also explore mechanisms for model adaptation; that is, given a network trained to solve an initial inverse problem with a known forward model, we propose novel procedures that adapt the network to a perturbed forward model, even without full knowledge of the perturbation. Finally, I will describe a new class of approaches based on "infinite-depth networks" that can yield up to a 4dB PSNR improvement in reconstruction accuracy above state-of-the-art alternatives and where the computational budget can be selected at test time to optimize context-dependent trade-offs between accuracy and computation.

Cyber-physical systems (CPS) comprise interacting digital, analog, physical, and human components engineered for function through integrated physics and logic. Incorporating intelligence in CPS, however, makes their physical components more exposed to adversaries that can potentially cause failure or malfunction through actuation attacks. As a result, augmenting CPS with resilient control and design methods is of grave significance, especially if an actuation attack is stealthy. Towards this end, in the first part of the talk, I will present a receding horizon controller, which can deal with undetectable actuation attacks by solving a game in a moving horizon fashion. In fact, this controller can guarantee stability of the equilibrium point of the CPS, even if the attackers have an information advantage. The case where the attackers are not aware of the decision-making mechanism of one another is also considered, by exploiting the theory of bounded rationality. In the second part of the talk, and for CPS that have partially unknown dynamics, I will present an online actuator placement algorithm, which chooses the actuators of the CPS that maximize an attack security metric. It can be proved that the maximizing set of actuators is found in finite time, despite the CPS having uncertain dynamics.

Most problems in Earth sciences aim to do inferences about the system, where accurate predictions are just a tiny part of the whole problem. Inferences mean understanding variables relations, deriving models that are physically interpretable, that are simple parsimonious, and mathematically tractable. Machine learning models alone are excellent approximators, but very often do not respect the most elementary laws of physics, like mass or energy conservation, so consistency and confidence are compromised. I will review the main challenges ahead in the field, and introduce several ways to live in the Physics and machine learning interplay that allows us (1) to encode differential equations from data, (2) constrain data-driven models with physics-priors and dependence constraints, (3) improve parameterizations, (4) emulate physical models, and (5) blend data-driven and process-based models. This is a collective long-term AI agenda towards developing and applying algorithms capable of discovering knowledge in the Earth system.

Automatic differentiation (autodiff) has revolutionized machine learning. It allows expressing complex computations by composing elementary ones in creative ways and removes the burden of computing their derivatives by hand. More recently, differentiation of optimization problem solutions has attracted widespread attention with applications such as optimization as a layer, and in bi-level problems such as hyper-parameter optimization and meta-learning. However, the formulas for these derivatives often involve case-by-case tedious mathematical derivations. In this work, we propose a unified, efficient and modular approach for implicit differentiation of optimization problems. In our approach, the user defines (in Python in the case of our implementation) a function F capturing the optimality conditions of the problem to be differentiated. Once this is done, we leverage autodiff of F and implicit differentiation to automatically differentiate the optimization problem. Our approach thus combines the benefits of implicit differentiation and autodiff. It is efficient as it can be added on top of any state-of-the-art solver and modular as the optimality condition specification is decoupled from the implicit differentiation mechanism. We show that seemingly simple principles allow to recover many recently proposed implicit differentiation methods and create new ones easily. We demonstrate the ease of formulating and solving bi-level optimization problems using our framework. We also showcase an application to the sensitivity analysis of molecular dynamics.

Computational imaging is a rapidly growing area that seeks to enhance the capabilities of imaging instruments by viewing imaging as an inverse problem. There are currently two distinct approaches for designing computational imaging methods: model-based and learning-based. Model-based methods leverage analytical signal properties and often come with theoretical guarantees and insights. Learning-based methods leverage data-driven representations for best empirical performance through training on large datasets. This talk presents Regularization by Artifact Removal (RARE), as a framework for reconciling both viewpoints by providing a learning-based extension to the classical theory. RARE relies on pre-trained “artifact-removing deep neural nets” for infusing learned prior knowledge into an inverse problem, while maintaining a clear separation between the prior and physics-based acquisition model. Our results indicate that RARE can achieve state-of-the-art performance in different computational imaging tasks, while also being amenable to rigorous theoretical analysis. We will focus on the applications of RARE in biomedical imaging, including magnetic resonance and tomographic imaging.

This talk will be based on the following references

J. Liu, Y. Sun, C. Eldeniz, W. Gan, H. An, and U. S. Kamilov, “RARE: Image Reconstruction using Deep Priors Learned without Ground Truth,” IEEE J. Sel. Topics Signal Process., vol. 14, no. 6, pp. 1088-1099, October 2020.

Z. Wu, Y. Sun, A. Matlock, J. Liu, L. Tian, and U. S. Kamilov, “SIMBA: Scalable Inversion in Optical Tomography using Deep Denoising Priors,” IEEE J. Sel. Topics Signal Process., vol. 14, no. 6, pp. 1163-1175, October 2020.

J. Liu, Y. Sun, W. Gan, X. Xu, B. Wohlberg, and U. S. Kamilov, “SGD-Net: Efficient Model-Based Deep Learning with Theoretical Guarantees,” IEEE Trans. Comput. Imag., in press.

This work develops a class of relaxations in between the big-M and convex hull formulations of disjunctions, drawing advantages from both. We show that this class leads to mixed-integer formulations for trained ReLU neural networks. The approach balances model size and tightness by partitioning node inputs into a number of groups and forming the convex hull over the partitions via disjunctive programming. At one extreme, one partition per input recovers the convex hull of a node, i.e., the tightest possible formulation for each node. For fewer partitions, we develop smaller relaxations that approximate the convex hull, and show that they outperform existing formulations. Specifically, we propose strategies for partitioning variables based on theoretical motivations and validate these strategies using extensive computational experiments. Furthermore, the proposed scheme complements known algorithmic approaches, e.g., optimization-based bound tightening captures dependencies within a partition.

Policy optimization, which learns the policy of interest by maximizing the value function via large-scale optimization techniques, lies at the heart of modern reinforcement learning (RL). In addition to value maximization, other practical considerations arise commonly as well, including the need of encouraging exploration, and that of ensuring certain structural properties of the learned policy due to safety, resource and operational constraints. These considerations can often be accounted for by resorting to regularized RL, which augments the target value function with a structure-promoting regularization term, such as Shannon entropy, Tsallis entropy, and log-barrier functions. Focusing on an infinite-horizon discounted Markov decision process, this talk first shows that entropy-regularized natural policy gradient methods converge globally at a linear convergence that is near independent of the dimension of the state-action space. Next, a generalized policy mirror descent algorithm is proposed to accommodate a general class of convex regularizers beyond Shannon entropy. Encouragingly, this general algorithm inherits similar convergence guarantees, even when the regularizer lacks strong convexity and smoothness. Our results accommodate a wide range of learning rates, and shed light upon the role of regularization in enabling fast convergence in RL.

One of the most surprising properties of deep neural networks (DNNs) is that they perform best in the overparameterized regime. We are taught early on that having more parameters than data points is a terrible idea. So why do DNNs work so well in a regime where classical learning theory predicts they should heavily overfit? By adapting the coding theorem from algorithmic information theory (which every physicist should learn about!) we show that DNNs are exponentially biased at initialisation to functions that have low descriptional (Kolmogorov) complexity. In other words, DNNs have an inbuilt Occam's razor, a bias towards simple functions. We next show that stochastic gradient descent (SGD), the most popular optimisation method for DNNs, follows the same bias, and so does not itself explain the good generalisation of DNNs. Our approach naturally leads to a marginal-likelihood PAC-Bayes generalisation bound which performs better than any other bounds on the market. Finally, we discuss why this bias towards simplicity allows DNNs to perform so well, and speculate on what this may tell us about the natural world.

In many emerging applications, it is of paramount interest to learn hidden parameters from data. For example, self-driving cars may use onboard cameras to identify pedestrians, highway lanes, or traffic signs in various light and weather conditions. Problems such as these can be framed as classification, regression, or risk minimization in general, at the heart of which lies stochastic optimization and machine learning. In many practical scenarios, distributed and decentralized learning methods are preferable as they benefit from a divide-and-conquer approach towards data at the expense of local (short-range) communication. In this talk, I will present our recent work that develops a novel algorithmic framework to address various aspects of decentralized stochastic first-order optimization methods for non-convex problems. A major focus will be to characterize regimes where decentralized solutions outperform their centralized counterparts and lead to optimal convergence guarantees. Moreover, I will characterize certain desirable attributes of decentralized methods in the context of linear speedup and network independent convergence rates. Throughout the talk, I will demonstrate such key aspects of the proposed methods with the help of provable theoretical results and numerical experiments on real data.

Representation learning has been widely used in many applications. In this talk, I will present our work, which uncovers when and why representation learning provably improves the sample efficiency, from a statistical learning point of view. I will show

the existence of a good representation among all tasks, and

the diversity of tasks are key conditions that permit improved statistical efficiency via multi-task representation learning.

These conditions provably improve the sample efficiency for functions with certain complexity measures as the representation. If time permits, I will also talk about leveraging the theoretical insights to improve practical performance.