
Update documentation for reward scaling wrappers #1285

Merged · 1 commit into Farama-Foundation:main · Jan 13, 2025

Conversation

@keraJLi (Contributor) commented on Jan 2, 2025

Description

Changes the documentation of reward scaling wrappers. It mainly removes incorrect or unsubstantiated information.
Affected wrappers are wrappers/stateful_reward.py and wrappers/vector/stateful_reward.py.

Fixes #1272

Type of change

Please delete options that are not relevant.

  • Documentation only change (no code changed)

Checklist:

  • I have run the pre-commit checks with pre-commit run --all-files (see CONTRIBUTING.md instructions to set it up)
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing unit tests pass locally with my changes

@pseudo-rnd-thoughts (Member) left a comment

Thanks for the PR @keraJLi. To clarify, what do you mean by "their exponential moving average"?
It isn't clear to me what the expected mean is, or what exactly the rewards are normalised by.

@keraJLi (Contributor, Author) commented on Jan 6, 2025

The rewards are scaled like this (a code sketch follows at the end of this comment):

  1. At the start of every episode we set $\mu_0 = 0$ and then accumulate an EMA of the rewards, $\mu_t = r_t + \gamma \mu_{t-1}$. Note that here $\gamma$ does not correspond to the discount factor; I simply kept the notation of the original paper.
  2. We track the running variance of $\mu$ across episodes. In the limit, it approaches $\mathbb{E}_t[\text{Var}(\mu_t)]$.
  3. We divide rewards by this value.

This means

  • The reward's expected mean differs between environments and policies since it changes based on the reward variance.
  • The reward's variance differs between environments and policies. In rare cases, it might even be increased.
  • The same holds for the mean and variance of the episodic return.

Sadly, you cannot even say the EMA has variance one, because rewards are divided by a running estimate of the EMA's variance that changes from one time step to the next.

To me, it seems like you can't really draw any general conclusions about the properties of this method without some assumptions (e.g. if your reward's autocorrelation decays exponentially and the episode length distribution is geometric, you get a nice upper bound on the reward variance).
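To make the steps above concrete, here is a minimal, self-contained sketch of the scheme as I understand it. It is not Gymnasium's actual wrapper code: the class names RunningVariance and RewardScaler are made up for illustration, and I assume the division uses the square root of the running variance plus a small epsilon, as in the baselines-style implementations.

```python
import numpy as np

class RunningVariance:
    """Running mean/variance of a stream of scalars (Welford's algorithm)."""

    def __init__(self):
        self.mean = 0.0
        self.m2 = 0.0
        self.count = 0

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def var(self) -> float:
        # Population variance; defaults to 1 before any updates.
        return self.m2 / self.count if self.count > 0 else 1.0


class RewardScaler:
    """Scales rewards by the running standard deviation of the EMA mu_t."""

    def __init__(self, gamma: float = 0.99, epsilon: float = 1e-8):
        self.gamma = gamma      # EMA coefficient; *not* the MDP discount factor
        self.epsilon = epsilon  # numerical-stability constant
        self.mu = 0.0           # mu_0 = 0, reset at the start of every episode
        self.stats = RunningVariance()

    def reset(self) -> None:
        self.mu = 0.0  # reset the EMA, but keep the variance statistics

    def scale(self, reward: float) -> float:
        # 1. EMA of rewards: mu_t = r_t + gamma * mu_{t-1}
        self.mu = reward + self.gamma * self.mu
        # 2. running variance of mu, accumulated across steps and episodes
        self.stats.update(self.mu)
        # 3. divide the raw reward by sqrt(var + epsilon)
        return reward / np.sqrt(self.stats.var + self.epsilon)


# Toy usage: the EMA resets every episode, the variance statistics do not.
scaler = RewardScaler(gamma=0.99)
for _ in range(3):
    scaler.reset()
    for r in [1.0, 0.5, -0.2]:
        print(scaler.scale(r))
```

Note that nothing here guarantees a particular mean or unit variance for the scaled rewards, which is the point of the bullet list above.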

@pseudo-rnd-thoughts (Member) commented

Thanks for the reply @keraJLi

I think your rewording of the docstring makes sense now; I'm happy to merge if you want.
Would we want to specify that epsilon is a reward-scaling parameter, not the discount factor, as I think I originally believed?

Looking at the paper again and, in particular, at its code listing (https://openreview.net/pdf?id=r1etN1rtPB#page=11.12), it appears that the paper and Gym/Gymnasium's implementation differ significantly: we add the -rs.mean term, which makes the function look more like a traditional normalising function than the paper's version, something the paper explicitly says it is not doing.

Looking at the baselines repo, which I suspect might be the first implementation (https://github.com/openai/baselines/blob/master/baselines/common/vec_env/vec_normalize.py): they don't include the -rs.mean in step, but, confusingly, it does appear in reset.
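For illustration only, a hedged sketch of the two variants being contrasted here; the function names and the running-statistics arguments are placeholders, not the exact code from either repository:

```python
import numpy as np

def scale_paper_style(reward: float, running_var: float, epsilon: float = 1e-8) -> float:
    # Scaling only: divide by the running std of the return statistic, no centering.
    return reward / np.sqrt(running_var + epsilon)

def normalise_with_centering(reward: float, running_mean: float, running_var: float,
                             epsilon: float = 1e-8) -> float:
    # Centering + scaling: subtracting the running mean (the "-rs.mean" term) makes this
    # look like traditional normalisation, which the paper says it is not doing.
    return (reward - running_mean) / np.sqrt(running_var + epsilon)

# The two variants differ only in whether the running mean is subtracted first.
print(scale_paper_style(1.0, running_var=4.0))                           # ~0.5
print(normalise_with_centering(1.0, running_mean=0.5, running_var=4.0))  # ~0.25
```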

@pseudo-rnd-thoughts merged commit eaccbb5 into Farama-Foundation:main on Jan 13, 2025
13 checks passed

Successfully merging this pull request may close these issues.

[Proposal/Question] Incorrect documentation of NormalizeReward wrapper