Stochastic Latent Actor-Critic: Deep Reinforcement Learning with a Latent Variable Model
Alex X. Lee, Anusha Nagabandi, Pieter Abbeel, Sergey Levine
University of California, Berkeley
Code: [GitHub] · Paper: [arXiv preprint]


Abstract

Deep reinforcement learning (RL) algorithms can use high-capacity deep networks to learn directly from image observations. However, these kinds of observation spaces present a number of challenges in practice, since the policy must now solve two problems: representation learning and task learning. In this paper, we aim to explicitly learn representations that can accelerate reinforcement learning from images. We propose the stochastic latent actor-critic (SLAC) algorithm: a sample-efficient and high-performing RL algorithm for learning policies for complex continuous control tasks directly from high-dimensional image inputs. SLAC learns a compact latent representation space using a stochastic sequential latent variable model, and then learns a critic model within this latent space. By learning the critic within a compact state space, SLAC can learn much more efficiently than standard RL methods. The proposed model also substantially improves performance over alternative representations, such as variational autoencoders. In fact, our experimental evaluation demonstrates that the sample efficiency of our resulting method is comparable to that of model-based RL methods that directly use a similar type of model for control. Furthermore, our method outperforms both model-free and model-based alternatives in final performance and sample efficiency on a range of difficult image-based control tasks.
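
To make the structure described above concrete, here is a minimal PyTorch sketch of the two components the abstract names: a stochastic sequential latent variable model trained with an evidence lower bound, and a critic that consumes latent samples rather than raw images. Every architectural detail below (fully connected networks, a single Gaussian latent, vector observations, and all sizes and names) is an illustrative assumption; the paper's actual model uses convolutional encoders and decoders and a richer latent structure, and the full algorithm trains the actor and critic with soft actor-critic-style updates.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.distributions import Normal, kl_divergence

    class LatentModel(nn.Module):
        """Sequential latent variable model: prior p(z_t | z_{t-1}, a_{t-1}),
        posterior q(z_t | x_t, z_{t-1}, a_{t-1}), and decoder p(x_t | z_t).
        Simplified sketch; not the paper's architecture."""
        def __init__(self, obs_dim=32, act_dim=4, z_dim=16, hid=128):
            super().__init__()
            self.z_dim, self.act_dim = z_dim, act_dim
            self.prior_net = nn.Sequential(
                nn.Linear(z_dim + act_dim, hid), nn.ReLU(), nn.Linear(hid, 2 * z_dim))
            self.post_net = nn.Sequential(
                nn.Linear(obs_dim + z_dim + act_dim, hid), nn.ReLU(), nn.Linear(hid, 2 * z_dim))
            self.decoder = nn.Sequential(
                nn.Linear(z_dim, hid), nn.ReLU(), nn.Linear(hid, obs_dim))

        @staticmethod
        def _gaussian(params):
            mean, log_std = params.chunk(2, dim=-1)
            return Normal(mean, log_std.clamp(-5, 2).exp())

        def loss(self, obs, acts):
            """Negative evidence lower bound (reconstruction + KL) for a batch of
            sequences. obs: (B, T, obs_dim); acts: (B, T, act_dim), where
            acts[:, t] is the action taken after observing obs[:, t]."""
            B, T, _ = obs.shape
            z = obs.new_zeros(B, self.z_dim)          # initial latent state
            a_prev = obs.new_zeros(B, self.act_dim)   # no action before the first step
            recon, kl = 0.0, 0.0
            for t in range(T):
                prior = self._gaussian(self.prior_net(torch.cat([z, a_prev], -1)))
                post = self._gaussian(self.post_net(torch.cat([obs[:, t], z, a_prev], -1)))
                z = post.rsample()                    # reparameterized sample
                recon = recon + F.mse_loss(self.decoder(z), obs[:, t])
                kl = kl + kl_divergence(post, prior).sum(-1).mean()
                a_prev = acts[:, t]
            return recon + kl

    class Critic(nn.Module):
        """Q-function defined on latent samples instead of images, so value
        learning happens in a compact state space."""
        def __init__(self, z_dim=16, act_dim=4, hid=128):
            super().__init__()
            self.q = nn.Sequential(
                nn.Linear(z_dim + act_dim, hid), nn.ReLU(), nn.Linear(hid, 1))
        def forward(self, z, a):
            return self.q(torch.cat([z, a], -1))

    model, critic = LatentModel(), Critic()
    obs = torch.randn(8, 10, 32)    # 8 sequences of 10 observation vectors (stand-in for images)
    acts = torch.randn(8, 10, 4)
    print(model.loss(obs, acts))    # model loss; the critic is trained alongside it

The key design point this sketch illustrates is that the critic never sees pixels: the model loss shapes the latent space from replayed sequences, and the Q-function is fit on latent samples, which is what enables the efficiency gains described in the abstract.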

Paper

A. X. Lee, A. Nagabandi, P. Abbeel, S. Levine
Stochastic Latent Actor-Critic: Deep Reinforcement Learning with a Latent Variable Model.
arXiv (preprint).

[Bibtex]


DeepMind Control Suite Results


[Videos: example image sequences and samples from the model, showing ground truth observations, posterior samples, conditional prior samples, and prior samples.]

OpenAI Gym Results


[Videos: example image sequences and samples from the model, showing ground truth observations, posterior samples, conditional prior samples, and prior samples.]

Manipulation Results

[Videos: example image sequences and samples from the model, showing ground truth observations, posterior samples, conditional prior samples, and prior samples.]

Acknowledgments

We thank Marvin Zhang, Abhishek Gupta, and Chelsea Finn for useful discussions and feedback, and we thank Kristian Hartikainen, Danijar Hafner, and Maximilian Igl for providing timely assistance with SAC, PlaNet, and DVRL, respectively. We also thank Deirdre Quillen, Tianhe Yu, and Chelsea Finn for providing us with their suite of Sawyer manipulation tasks. This research was supported by the National Science Foundation through IIS-1651843 and IIS-1700697, as well as ARL DCIST CRA W911NF-17-2-0181 and the Office of Naval Research. Compute support was provided by NVIDIA.