I am looking at tf-agents to learn about reinforcement learning. I am following this tutorial. There is a different policy used, called collect_policy for training than for evaluation (policy).
The tutorial states there is a difference, but in IMO it does not describe the why of having 2 policies as it does not describe a functional difference.
Agents contain two policies:
agent.policy — The main policy that is used for evaluation and deployment.
agent.collect_policy — A second policy that is used for data collection.
I've looked at the source code of the agent. It says
policy: An instance of
tf_policy.Baserepresenting the Agent's current policy.collect_policy: An instance of
tf_policy.Baserepresenting the Agent's current data collection policy (used to setself.step_spec).
But I do not see self.step_spec anywhere in the source file. The next closest thing I find is time_step_spec. But that is the first ctor argument of the TFAgent class, so that makes no sense to set via a collect_policy.
So the only thing I can think of was: put it to the test. So I used policy instead of collect_policy for training. And the agent reached the max score in the environment nonetheless.
So what is the functional difference between the two policies?