Extend and fix reward/return normalizations · thu-ml/tianshou#927

(0 comments) (0 reactions) (0 assignees) Python (1,072 forks)

enhancement, good first issue

Repository metrics

Stars: 7,121
PR merge metrics: no PRs merged in the last 30 days

Description

There are several ways in which reward/return/value normalization could be improved, but the most pressing one is the following:

Currently, PGPolicy instantiates self.ret_rms = RunningMeanStd(), and RunningMeanStd has a default value of clip_max=10. Users cannot adjust this (except through monkey-patching, of course).

This might work well for some standard environments, but the clipping value is arbitrary, and making it non-configurable is a major hindrance for users, most of whom are probably unaware of it.
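To make the effect of the clip concrete, here is a minimal, self-contained sketch of a running mean/variance tracker whose clip is a constructor argument. It is not tianshou's RunningMeanStd: the class name RunningMeanStdSketch, the eps parameter, and the convention that clip_max=None disables clipping are all assumptions made for illustration.

```python
import math
from typing import Iterable, Optional


class RunningMeanStdSketch:
    """Running mean/variance with a configurable clip (hypothetical, not tianshou's class)."""

    def __init__(self, clip_max: Optional[float] = 10.0, eps: float = 1e-8) -> None:
        self.mean = 0.0
        self.var = 1.0
        self.count = 0
        self.clip_max = clip_max  # assumption: None means "do not clip"
        self.eps = eps

    def update(self, batch: Iterable[float]) -> None:
        """Merge a batch's statistics into the running ones (parallel variance formula)."""
        xs = list(batch)
        n = len(xs)
        if n == 0:
            return
        b_mean = sum(xs) / n
        b_var = sum((x - b_mean) ** 2 for x in xs) / n
        total = self.count + n
        delta = b_mean - self.mean
        new_mean = self.mean + delta * n / total
        m2 = self.var * self.count + b_var * n + delta ** 2 * self.count * n / total
        self.mean, self.var, self.count = new_mean, m2 / total, total

    def norm(self, x: float) -> float:
        """Standardize x with the running statistics, then clip if a clip is set."""
        z = (x - self.mean) / math.sqrt(self.var + self.eps)
        if self.clip_max is None:
            return z
        return max(-self.clip_max, min(self.clip_max, z))


# Returns with mean 1 and variance 1, followed by one unusually large return.
clipped = RunningMeanStdSketch(clip_max=10.0)
unclipped = RunningMeanStdSketch(clip_max=None)
for rms in (clipped, unclipped):
    rms.update([0.0, 2.0] * 50)

print(clipped.norm(50.0))    # saturates at the clip value, 10.0
print(unclipped.norm(50.0))  # roughly 49 standard deviations above the mean
```

With the clip fixed at 10, any return more than ten standard deviations from the running mean is mapped to the same value, so the learner cannot tell a large return from a very large one. Whether that is desirable depends on the environment, which is the argument for letting users set the value.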

More generally, how best to normalize quantities in RL is an active discussion, and normalization can play an important role in performance. I believe tianshou should be extended to accommodate various normalization strategies.
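One way such an extension could look is to let the caller pass the normalization strategy into the policy instead of hard-coding it. The sketch below is only an illustration of that design, not a proposal for tianshou's actual API: PolicySketch, the ret_normalizer argument, identity, and make_batch_standardizer are hypothetical names. For brevity it standardizes each batch on its own rather than keeping running statistics.

```python
import math
from typing import Callable, List, Optional, Sequence

# A normalization strategy maps a batch of returns to a batch of normalized returns.
Normalizer = Callable[[Sequence[float]], List[float]]


def identity(returns: Sequence[float]) -> List[float]:
    """Strategy that leaves returns untouched."""
    return list(returns)


def make_batch_standardizer(clip_max: Optional[float] = None,
                            eps: float = 1e-8) -> Normalizer:
    """Build a strategy that standardizes each batch and optionally clips the result."""

    def normalize(returns: Sequence[float]) -> List[float]:
        n = len(returns)
        if n == 0:
            return []
        mean = sum(returns) / n
        std = math.sqrt(sum((r - mean) ** 2 for r in returns) / n)
        out = [(r - mean) / (std + eps) for r in returns]
        if clip_max is not None:
            out = [max(-clip_max, min(clip_max, z)) for z in out]
        return out

    return normalize


class PolicySketch:
    """Hypothetical policy that receives its return normalizer from the caller."""

    def __init__(self, ret_normalizer: Normalizer = identity) -> None:
        self.ret_normalizer = ret_normalizer

    def process_returns(self, returns: Sequence[float]) -> List[float]:
        return self.ret_normalizer(returns)


# The user, not the library, decides whether and how hard to clip.
policy = PolicySketch(ret_normalizer=make_batch_standardizer(clip_max=1.0))
print(policy.process_returns([0.0, 0.0, 0.0, 0.0, 100.0]))
```

Other strategies, for example one based on running statistics, would fit the same slot as long as they accept and return a batch of returns.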

Contributor guide

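Research direction: Investigate the current RunningMeanStd class and how its clip_max is used. Consider exposing clip_max as a parameter of PGPolicy and other policies. Also explore other normalization techniques, such as PopArt or adaptive normalization.
Tech stack: Python
Domain: backend
Issue type: feature
Difficulty: 2
Estimated time: 1-3 hours
Activity status: recently opened for contribution
Clarity: clear
Prerequisites: Python, PyTorch, Git
Beginner friendliness: 70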
