Proximal Policy Optimization with Expert Trajectories (PPO) is a form of imitation fine-tuning, a reinforcement learning technique in which a pre-trained agent learns to imitate an expert’s behavior from expert trajectories (demonstrations). PPO uses these trajectories to fine-tune the agent’s policy, allowing it to mimic the expert’s actions and improve its performance on a particular task.
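As a rough illustration of how expert demonstrations can plug into PPO’s clipped surrogate objective, here is a minimal PyTorch sketch. Everything in it — the `PolicyNet` architecture, the `ppo_imitation_step` name, and the idea of treating the expert’s actions as the taken actions weighted by an advantage signal — is an assumption for illustration, not a reference implementation of any specific library or paper.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Small policy network producing action logits (hypothetical stand-in
    for a pre-trained model)."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh())
        self.head = nn.Linear(64, n_actions)

    def forward(self, states):
        return self.head(self.body(states))

def ppo_imitation_step(policy, old_policy, states, expert_actions,
                       advantages, optimizer, clip_eps=0.2):
    """One PPO-style clipped update toward expert actions.

    The expert's demonstrated actions are treated as the actions taken,
    and `advantages` weights how strongly each transition is reinforced.
    """
    logits = policy(states)
    log_probs = torch.log_softmax(logits, dim=-1)
    new_lp = log_probs.gather(1, expert_actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        old_lp = torch.log_softmax(old_policy(states), dim=-1)
        old_lp = old_lp.gather(1, expert_actions.unsqueeze(1)).squeeze(1)

    # PPO clipped surrogate objective, evaluated on the expert transitions.
    ratio = torch.exp(new_lp - old_lp)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    loss = -torch.min(unclipped, clipped).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch, `old_policy` is a frozen copy of the policy from before the update round; the clipping keeps the fine-tuned policy from drifting too far from it in a single step, which is the core idea PPO contributes on top of plain imitation.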
In PPO fine-tuning, you typically freeze some of the model’s lower layers, which encode more general language understanding, and fine-tune the upper layers to be task-specific. This way, you benefit from the pre-trained language model’s general grasp of language while specializing it for the particular task at hand.
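A minimal sketch of that freezing pattern, assuming a small stack of transformer blocks as a stand-in for a pre-trained language model (the layer split and hyperparameters here are illustrative assumptions):

```python
import torch
from torch import nn, optim

# Hypothetical model: four transformer blocks standing in for a
# pre-trained language model.
model = nn.Sequential(
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True),  # lower
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True),  # lower
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True),  # upper
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True),  # upper
)

# Freeze the lower half (general language understanding)...
for layer in model[:2]:
    for p in layer.parameters():
        p.requires_grad = False

# ...and fine-tune only the upper, task-specific layers.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.Adam(trainable, lr=1e-5)
```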
In summary, then, PPO is not a form of full fine-tuning but a method that updates a subset of the model’s weights to adapt it to a specific task while retaining the knowledge gained during pre-training.
See Also: PETM, Prompt Tuning