Europe/Lisbon —

Yuejie Chi

Yuejie Chi, Carnegie Mellon University

Policy optimization, which learns the policy of interest by maximizing the value function via large-scale optimization techniques, lies at the heart of modern reinforcement learning (RL). In addition to value maximization, other practical considerations arise commonly as well, including the need of encouraging exploration, and that of ensuring certain structural properties of the learned policy due to safety, resource and operational constraints. These considerations can often be accounted for by resorting to regularized RL, which augments the target value function with a structure-promoting regularization term, such as Shannon entropy, Tsallis entropy, and log-barrier functions. Focusing on an infinite-horizon discounted Markov decision process, this talk first shows that entropy-regularized natural policy gradient methods converge globally at a linear convergence that is near independent of the dimension of the state-action space. Next, a generalized policy mirror descent algorithm is proposed to accommodate a general class of convex regularizers beyond Shannon entropy. Encouragingly, this general algorithm inherits similar convergence guarantees, even when the regularizer lacks strong convexity and smoothness. Our results accommodate a wide range of learning rates, and shed light upon the role of regularization in enabling fast convergence in RL.