__Off-policy__ algorithms like DQN, DDPG, and SAC are designed to utilize a data buffer (known as replay buffer) of interactions. However, there are still limitations when it comes to only learning from a fixed dataset. One of the problem being, there is no possibility of improving exploration: exploration is outside the scope of the algorithm, so if the dataset does not contain transitions that illustrate high-reward regions of the state space, it may be impossible to discover those high-reward regions. Another problem is distributional shift: while a function approximator (policy, value function, or model) might be trained under one distribution, it will be evaluated on a different distribution, due both to the change in visited states for the new policy and, more subtly, by the act of maximizing the expected return. Once the policy enters one of out-of-distribution states, it will keep making mistakes and may remain out-of-distribution for the remainder of the trial.
0 commit comments