PFRL baselines for the MineRL Competition
Shinya Shiroshita
This article introduces one application of PFRL, the reinforcement learning library that Preferred Networks, Inc. (PFN) has been developing, through our activities in the 2019 and 2020 NeurIPS competition “The MineRL Competition for Sample-Efficient Reinforcement Learning.”
About
Minecraft is the top-selling video game of all time, and MineRL is a competition built on a reinforcement learning environment based on it. The objective of the competition is “to develop a sample-efficient reinforcement learning algorithm.” Participants compete on how well their agents perform with a limited number of interactions with the environment.
The central task of the competition is to “Obtain a diamond,” which poses the following challenges:
- To obtain a diamond, an agent needs to achieve several intermediate goals (e.g., get logs, craft an iron pickaxe, etc.). Hierarchical planning may alleviate the difficulty.
- Rewards are given when an agent gets certain items. However, simple random actions seldom earn rewards, so you need to consider learning methods or data utilization.
- The MineRL environment generates the world with a different seed for each episode, which changes the surrounding objects. For example, a forest biome has many trees, while a desert biome consists of huge amounts of sand blocks. Therefore, an agent needs to learn generalized behavior applicable to various situations.
Moreover, it is prohibited to encode human domain knowledge, such as “you can get a log by doing commands A, B, and C in order.” Agents must learn behaviors only from the rewards obtained while playing the game or from the human demonstrations.
The characteristics of this competition are as follows:
- Your agent can interact with the environment at most 8,000,000 times during training.
- You can utilize human demonstrations consisting of more than 60 million state-action pairs.
- You can train an agent on subtasks other than ObtainDiamond, such as collecting 64 log blocks in a forest (Treechop) and going to a specified position (Navigate).
An overview and an analysis of the 2019 competition are available on the following pages:
- Overview: https://arxiv.org/abs/1904.10079
- Analysis: https://arxiv.org/abs/2003.05012
Increased difficulty for the 2020 competition
In the 2020 competition, non-camera observations (mainly the agent’s inventory) and actions are obfuscated. This prevents competitors from seeding their agents with human knowledge of
- which items they have and which items to collect, and
- the meaning of each action.
Since agents must learn these from the data or from interactions with the environment, the task is more challenging than last year’s.
The 2020 competition is hosted at the following link:
PFN’s activities
As one of the organizers, PFN has prepared baseline algorithms that help participants quickly get accustomed to the environment.
This year, we have used our reinforcement learning library PFRL, which runs on the PyTorch framework. PFRL inherits the tradition of its predecessor, ChainerRL: it reproduces various state-of-the-art algorithms and techniques with special care taken to match the experimental parameters of the original papers.
You can find a detailed introduction to PFRL in this article:
This year’s baseline consists of the following four algorithms:
- Rainbow
- Soft Q Imitation Learning (SQIL)
- Deep Q-learning from Demonstrations (DQfD)
- Prioritized Double Dueling DQN (PDDDQN)
All of these implementations use actions discretized by the K-Means method, which groups similar data points (here, actions) into a fixed number of clusters and then uses one representative action per cluster as the discrete action set.
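For illustration, a minimal sketch of this kind of action discretization is shown below, assuming scikit-learn’s KMeans and a NumPy array of demonstration action vectors; the file name, the cluster count, and the action dictionary format are assumptions for illustration, not the baselines’ exact code.

```python
import numpy as np
from sklearn.cluster import KMeans

NUM_ACTIONS = 30  # illustrative cluster count; in practice this is a hyperparameter

# Continuous action vectors gathered from the human demonstration dataset,
# stored here in a hypothetical NumPy dump of shape (num_frames, action_dim).
demo_actions = np.load("demo_actions.npy")

# Group similar demonstration actions into NUM_ACTIONS clusters.
kmeans = KMeans(n_clusters=NUM_ACTIONS, random_state=0).fit(demo_actions)
action_set = kmeans.cluster_centers_  # one representative action per cluster

def discrete_to_env_action(action_index):
    """Map the agent's discrete action index to a continuous environment action."""
    # The obfuscated MineRL environments take actions as {"vector": ...}.
    return {"vector": action_set[action_index]}
```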
We will briefly describe the details of each baseline implementation here.
Rainbow
Rainbow is a DQN-based algorithm that combines the following improvements over DQN:
- Double Q-learning: Reduce overestimation bias by decoupling action selection from action evaluation across two networks
- Prioritized experience replay: Prioritize samples with larger TD errors
- Dueling networks: Decompose Q-values into state values and action advantages
- Multi-step loss function: Update the Q-function with n-step returns
- Distributional RL: Learn a distribution of Q values instead of a single expected value
- Noisy Nets: Inject learnable noise into the network weights so that exploration depends on the current observation
This implementation uses the existing algorithm but transforms the state and action spaces to follow the new rules (a configuration sketch follows after the list):
- The state space is limited to pov, i.e., the agent’s point of view. In this setting, agents cannot obtain hints about the goal because the inventory information is not available. On the other hand, this strict limitation lets agents focus on learning the basic task, chopping trees, in the MineRLObtainDiamondDenseVectorObf-v0 environment, which yields a reward for every log obtained. We hope participants can find a cue to start the competition by improving the Rainbow baseline, which learns a very basic policy, or by comparing their agents against it.
- As described above, agents are trained on the action space obtained by clustering the actions of the human demonstration dataset. The typical exploration scheme for most RL algorithms, including Rainbow, is to select actions uniformly at random, but that does not work in a sparse action space with unknown encodings like the one in this competition.
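As a reference, such an agent can be assembled from PFRL’s building blocks. The following is only a minimal sketch under stated assumptions, not the submission’s exact code: the cluster count, the hyperparameters, and the resizing of the pov frame to 84x84 (so that PFRL’s default Atari-style network applies) are all illustrative choices.

```python
import cv2
import numpy as np
import torch
from pfrl import agents, explorers, nn as pnn, q_functions, replay_buffers

n_actions = 30                        # number of K-Means clusters (assumption)
n_atoms, v_min, v_max = 51, -10.0, 10.0

# Distributional dueling network over the 3-channel pov image.
q_func = q_functions.DistributionalDuelingDQN(
    n_actions, n_atoms, v_min, v_max, n_input_channels=3)
pnn.to_factorized_noisy(q_func, sigma_scale=0.5)  # Noisy Nets
explorer = explorers.Greedy()                     # exploration comes from the noise

opt = torch.optim.Adam(q_func.parameters(), lr=6.25e-5, eps=1.5e-4)
rbuf = replay_buffers.PrioritizedReplayBuffer(
    10 ** 6, alpha=0.5, beta0=0.4, betasteps=2 * 10 ** 6, num_steps=3)

def phi(obs):
    # Resize the pov frame to 84x84 and convert HWC uint8 to normalized CHW float32.
    frame = cv2.resize(obs["pov"], (84, 84))
    return np.asarray(frame, dtype=np.float32).transpose(2, 0, 1) / 255.0

agent = agents.CategoricalDoubleDQN(
    q_func, opt, rbuf, gamma=0.99, explorer=explorer, gpu=-1,
    minibatch_size=32, replay_start_size=10 ** 4,
    target_update_interval=10 ** 4, update_interval=4, phi=phi)
```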
Baseline URL: https://github.com/keisuke-nakata/minerl2020_submission
SQIL
Soft Q Imitation Learning (SQIL) is an imitation learning algorithm based on Soft Q-learning. It keeps both demonstrations and experiences in its replay buffer, with a 50% population of each, and labels demonstration frames with a reward of 1 and experience frames with a reward of 0.
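The core idea can be sketched as follows; the buffer and transition representations here are assumptions for illustration, and the baseline modifies both the sampling proportion and the demonstration reward as described below.

```python
import random

def sample_sqil_minibatch(demo_buffer, experience_buffer, batch_size=32):
    """Build a minibatch that is half demonstrations, half agent experience."""
    batch = []
    # Demonstration frames are relabeled with reward 1.
    for transition in random.sample(demo_buffer, batch_size // 2):
        batch.append({**transition, "reward": 1.0})
    # The agent's own experience frames are relabeled with reward 0.
    for transition in random.sample(experience_buffer, batch_size // 2):
        batch.append({**transition, "reward": 0.0})
    random.shuffle(batch)
    return batch
```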
This baseline includes the following improvements:
- Since it is difficult to collect enough frames for the initial training phase with uniform sampling, we controlled the sampling proportion so that demonstrations and experiences have similar cumulative-reward distributions. We divided each demonstration episode into ten subtasks by cumulative reward, with boundaries chosen so that each subtask contains a similar number of demonstration frames.
- When we created the action list, we split all actions into two subgroups (actions that modify `observation$vector` and actions that do not) and applied K-Means to each subgroup independently. With this split, we can capture one-frame actions that appear only a few times in an episode, such as “craft” and “smelt,” which are difficult to sample with a uniform sampler without manual weight control.
- We set the demonstration reward to 10 to balance the exploration probability.
Baseline URL: https://github.com/s-shiroshita/minerl2020_sqil_submission
DQfD
Deep Q-learning from Demonstrations (DQfD) is a modification of DQN that learns not only from experiences but also from demonstrations.
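Its learning from demonstrations relies on a large-margin supervised loss that pushes the Q-value of the demonstrated action above every other action by at least a margin. A minimal PyTorch sketch follows; the margin value and function name are illustrative, not the baseline’s exact code.

```python
import torch

def large_margin_loss(q_values, demo_actions, margin=0.8):
    """DQfD-style supervised loss on demonstration transitions.

    q_values: tensor of shape (batch, n_actions)
    demo_actions: tensor of shape (batch,) with the demonstrated action indices
    """
    # l(a_E, a): add a margin to every action except the demonstrated one.
    margins = torch.full_like(q_values, margin)
    margins.scatter_(1, demo_actions.unsqueeze(1), 0.0)
    q_demo = q_values.gather(1, demo_actions.unsqueeze(1)).squeeze(1)
    # max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E), averaged over the batch.
    return ((q_values + margins).max(dim=1).values - q_demo).mean()
```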
As an initial step, we worked on reproducing the results on the Atari benchmark. This is not entirely possible, since the original demonstrations are not available, but it can be done to a certain extent using other demonstration datasets such as Atari-HEAD. Additionally, we contacted the authors to confirm implementation details such as the replay buffer, which contains a mix of 1-step and n-step transitions, and the optimization hyperparameters.
An improvement added to the baseline was the use of NoisyNets, which not only helped exploration in Treechop but also improved the results of pre-training in ObtainDiamond.
A possible improvement to this baseline is to increase the maximum number of demonstrations per agent, which is currently set to 64. To do this, it may be necessary to increase the number of pre-training steps, or reduce the priority given to them once the agent starts to interact with the environment.
Baseline URL: https://github.com/marioyc/minerl2020_dqfd_submission
PDDDQN
Prioritized Double Dueling DQN (PDDDQN) is a DQN variant that combines prioritized sampling of experience during training, the separation of the Q-function into a value function and action advantages (dueling), and double-DQN-style updates.
This baseline uses the same state- and action-space modifications described for the Rainbow baseline. While we found the default parameters of the original algorithm to be an adequate baseline, we noticed that a significant improvement can be obtained by using a 10-step return, which we attribute to the relatively long time horizon of the task.
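For reference, the n-step return bootstraps from the greedy Q-value n steps ahead instead of one step ahead. A minimal sketch with n = 10, ignoring episode-termination handling:

```python
def n_step_return(rewards, bootstrap_q, gamma=0.99, n=10):
    """Compute sum_{k=0}^{n-1} gamma^k * r_{t+k} + gamma^n * max_a Q(s_{t+n}, a).

    rewards: the next n rewards r_t, ..., r_{t+n-1}
    bootstrap_q: the greedy Q-value max_a Q(s_{t+n}, a)
    """
    g = 0.0
    for k, r in enumerate(rewards[:n]):
        g += (gamma ** k) * r
    return g + (gamma ** n) * bootstrap_q
```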
We also tried the continuous action space (i.e., without clustering actions) through the use of Normalized Advantage Functions (NAF), but were unable to get better results despite adding heuristics for exploration.
Baseline URL: https://github.com/ummavi/minerl2020_submission
To the challenge!
We enjoyed applying PFRL’s algorithms to the MineRL Competition, and we will continue to enrich PFRL’s features and algorithms.
We hope you find PFRL useful in your next reinforcement learning research projects and competitions!