Abstract
Many Reinforcement Learning methods based on Policy Gradients have been developed over the last decade. Unlike methods that search for optimal state and action values, from which an optimal policy is then derived, Policy Gradient methods optimize a parameterized policy directly. This approach provides significant advantages, such as better convergence guarantees to a locally optimal policy and the ability to handle large or continuous action spaces. Combined with recent deep learning methods, some of these algorithms make it possible to tackle complicated problems, such as navigation in robotics, directly and with no prior knowledge. In theory, these algorithms make few, if any, assumptions about the policy parameterization or the specifics of the tasks to be solved, yet these choices can significantly influence the results. We therefore study and empirically compare three selected Policy Gradient methods with respect to their performance in different environments, under varying policy architectures and combinations of hyperparameters. The environments are provided by the OpenAI Gym toolkit for reinforcement learning research, and the algorithms are implemented with the PyTorch framework, which allows gradients to be computed easily through the backpropagation algorithm.
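To make the core idea concrete, the following is a minimal sketch of a REINFORCE-style policy gradient update in PyTorch, not the monograph's actual implementation. It assumes the classic Gym API (reset returning an observation and step returning four values), the CartPole-v1 environment, and illustrative choices for the network size, learning rate, and discount factor.

```python
# Minimal REINFORCE-style policy gradient sketch (illustrative assumptions:
# classic Gym API, CartPole-v1, a small two-layer policy, lr=1e-2, gamma=0.99).
import gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
policy = nn.Sequential(                       # parameterized policy pi_theta(a|s)
    nn.Linear(env.observation_space.shape[0], 64),
    nn.Tanh(),
    nn.Linear(64, env.action_space.n),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for episode in range(500):
    log_probs, rewards = [], []
    obs, done = env.reset(), False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))          # log pi_theta(a_t | s_t)
        obs, reward, done, _ = env.step(action.item())
        rewards.append(reward)

    # Monte Carlo returns G_t = sum_{k >= t} gamma^(k-t) * r_k
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + 0.99 * G
        returns.insert(0, G)
    returns = torch.as_tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # variance reduction

    # Minimizing -sum_t log pi(a_t|s_t) * G_t lets backpropagation produce
    # the REINFORCE estimate of the policy gradient for a gradient ascent step.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The three methods compared in this work differ mainly in how they estimate this gradient and constrain each update, but all follow the same pattern of differentiating a loss built from log-probabilities of the sampled actions.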
Personal appreciation
This was challenging work for me. While I appreciate the theoretical foundations of the algorithms and methods that I study, I also like to implement them in practice and spend quite a bit of time polishing the implementations to my liking. At some moments during the development of this work, these interests came into conflict, with my focus alternating between very theoretical and very practical aspects of the methods studied here. This ended up taking precious time that could have been used to better structure the experiments and the monograph. Nevertheless, I am very glad to have learned the methods used by organizations such as OpenAI and DeepMind in modern AI. On that note, this monograph can help others learn the fundamentals of these methods as well. It excites me to know that there are more advanced algorithms than the ones studied here, and I am eager to test them myself in the near future.