A dynamical clipping approach with task feedback for Proximal Policy Optimization
Authors: Ziqi Zhang, Jingzehua Xu, Zifeng Zhuang, Jinxin Liu, Donglin Wang, Shuai Zhang
Summary: Proximal Policy Optimization (PPO) has been broadly applied to various domains, including Large Language Model (LLM) optimization, robotics learning, etc. However, PPO is limited by a fixed setting for the clipping bound. Specifically, there is no theoretical proof that the optimal clipping bound remains consistent throughout the entire training process, nor that truncating the ratio of the new and old policies with a single clipping bound ensures stable training and achieves the best training performance. Moreover, previous research suggests that a fixed clipping bound limits the agent's exploration. Therefore, investigating a dynamical clipping bound to enhance PPO's performance can be highly beneficial. Different from previous clipping approaches, we regard increasing the maximum cumulative return of a reinforcement learning (RL) task as the preference of that task, and propose a bi-level proximal policy optimization paradigm that not only optimizes the policy but also dynamically adjusts the clipping bound to reflect the preference of the RL task, further improving the training outcomes and stability of PPO. Based on this bi-level proximal policy optimization paradigm, we introduce a new algorithm named Preference-based Proximal Policy Optimization (Pb-PPO). Pb-PPO uses a multi-armed bandit algorithm to reflect RL preferences (we also validate that this approach can be applied to reflect human preferences), recommending the optimal clipping bound for PPO at each epoch, thereby achieving more stable and better training outcomes.
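To make the bi-level idea concrete, the following is a minimal sketch, not the authors' released implementation: a UCB-style multi-armed bandit chooses among a few candidate clipping bounds, each arm is credited with the cumulative return obtained when PPO trains under that bound, and the selected bound is plugged into the standard PPO clipped surrogate. The candidate set, the UCB exploration constant, and the `run_ppo_epoch` helper are illustrative assumptions.

```python
# Sketch of Pb-PPO's outer loop under stated assumptions (not the official code).
import numpy as np
import torch

CANDIDATE_CLIPS = [0.1, 0.2, 0.3]  # assumed discrete set of clipping bounds (bandit arms)

class ClipBandit:
    """UCB-style bandit over candidate clipping bounds."""
    def __init__(self, n_arms, c=2.0):
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)  # running mean of the return observed per arm
        self.c = c

    def select(self):
        # Play each arm once before applying the UCB rule.
        if (self.counts == 0).any():
            return int(np.argmin(self.counts))
        total = self.counts.sum()
        ucb = self.values + self.c * np.sqrt(np.log(total) / self.counts)
        return int(np.argmax(ucb))

    def update(self, arm, episode_return):
        self.counts[arm] += 1
        self.values[arm] += (episode_return - self.values[arm]) / self.counts[arm]

def ppo_clip_loss(ratio, advantage, clip_bound):
    """Standard PPO clipped surrogate, with the clipping bound supplied per epoch."""
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_bound, 1.0 + clip_bound) * advantage
    return -torch.min(unclipped, clipped).mean()

# Illustrative training loop skeleton (run_ppo_epoch is a hypothetical helper that
# collects rollouts and optimizes ppo_clip_loss with the chosen bound):
#
# bandit = ClipBandit(len(CANDIDATE_CLIPS))
# for epoch in range(num_epochs):
#     arm = bandit.select()
#     clip_bound = CANDIDATE_CLIPS[arm]
#     episode_return = run_ppo_epoch(clip_bound)
#     bandit.update(arm, episode_return)
```

In this reading, the bandit plays the role of the upper level of the bi-level paradigm: it adapts the clipping bound epoch by epoch using the task's return as the preference signal, while the lower level remains ordinary PPO policy optimization.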