
Update documentation for reward scaling wrappers #1285

Merged · 1 commit into Farama-Foundation:main · Jan 13, 2025

Conversation

@keraJLi (Contributor) commented on Jan 2, 2025

Description

Changes the documentation of reward scaling wrappers. It mainly removes incorrect or unsubstantiated information.
Affected wrappers are wrappers/stateful_reward.py and wrappers/vector/stateful_reward.py.

Fixes #1272

Type of change

Please delete options that are not relevant.

  • Documentation only change (no code changed)

Checklist:

  • I have run the pre-commit checks with pre-commit run --all-files (see CONTRIBUTING.md instructions to set it up)
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing unit tests pass locally with my changes

@pseudo-rnd-thoughts (Member) left a comment

Thanks for the PR @keraJLi. To clarify, what do you mean by "their exponential moving average"?
It isn't clear to me what the expected mean is, or what exactly the rewards are normalised by.

@keraJLi (Contributor, Author) commented on Jan 6, 2025

The rewards are scaled like this (a code sketch follows at the end of this comment):

  1. At the start of every episode we set $\mu_0 = 0$ and then accumulate an EMA of the rewards, $\mu_t = r_t + \gamma \mu_{t-1}$. Note that here $\gamma$ does not correspond to the discount factor; I simply kept the notation of the original paper.
  2. We track the running variance of $\mu$ across episodes. In the limit, it approaches $\mathbb{E}_t[\text{Var}(\mu_t)]$.
  3. We divide rewards by this value.

This means

  • The reward's expected mean differs between environments and policies since it changes based on the reward variance.
  • The reward's variance differs between environments and policies. In rare cases, it might even be increased.
  • The same holds for the mean and variance of the episodic return.

Sadly, you cannot even say the EMA has variance one, because rewards are divided by a running estimate of the EMA's variance that changes from one time step to the next.

To me, it seems like you can't really draw any general conclusions about the properties of this method without some assumptions (e.g. if your reward's autocorrelation decays exponentially and the episode length distribution is geometric, you get a nice upper bound on the reward variance).
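To make the steps above concrete, here is a minimal, self-contained sketch of the scheme as I understand it. It is not Gymnasium's actual wrapper code: the class names RunningVariance and RewardScaler are made up for illustration, and I assume the division uses the square root of the running variance plus a small epsilon, as in the baselines-style implementations.

```python
import numpy as np

class RunningVariance:
    """Running mean/variance of a stream of scalars (Welford's algorithm)."""

    def __init__(self):
        self.mean = 0.0
        self.m2 = 0.0
        self.count = 0

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def var(self) -> float:
        # Population variance; defaults to 1 before any updates.
        return self.m2 / self.count if self.count > 0 else 1.0


class RewardScaler:
    """Scales rewards by the running standard deviation of the EMA mu_t."""

    def __init__(self, gamma: float = 0.99, epsilon: float = 1e-8):
        self.gamma = gamma      # EMA coefficient; *not* the MDP discount factor
        self.epsilon = epsilon  # numerical-stability constant
        self.mu = 0.0           # mu_0 = 0, reset at the start of every episode
        self.stats = RunningVariance()

    def reset(self) -> None:
        self.mu = 0.0  # reset the EMA, but keep the variance statistics

    def scale(self, reward: float) -> float:
        # 1. EMA of rewards: mu_t = r_t + gamma * mu_{t-1}
        self.mu = reward + self.gamma * self.mu
        # 2. running variance of mu, accumulated across steps and episodes
        self.stats.update(self.mu)
        # 3. divide the raw reward by sqrt(var + epsilon)
        return reward / np.sqrt(self.stats.var + self.epsilon)


# Toy usage: the EMA resets every episode, the variance statistics do not.
scaler = RewardScaler(gamma=0.99)
for _ in range(3):
    scaler.reset()
    for r in [1.0, 0.5, -0.2]:
        print(scaler.scale(r))
```

Note that nothing here guarantees a particular mean or unit variance for the scaled rewards, which is the point of the bullet list above.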

@pseudo-rnd-thoughts (Member) commented

Thanks for the reply @keraJLi

I think your rewording of the docstring makes sense now; I'm happy to merge if you want.
Would we want to specify that epsilon is a reward-scaling parameter, not the discount factor, as I think I originally believed?

Looking at the paper again and, in particular, at its code listing (https://openreview.net/pdf?id=r1etN1rtPB#page=11.12), it appears that the paper and Gym/Gymnasium's implementation differ significantly: we add the -rs.mean term, which makes the function look more like a traditional normalising function than the paper's version, something the paper explicitly says it is not doing.

Looking at the baselines repo, which I suspect might be the first implementation (https://github.com/openai/baselines/blob/master/baselines/common/vec_env/vec_normalize.py): they don't include the -rs.mean in step, but, confusingly, it does appear in reset.
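For illustration only, a hedged sketch of the two variants being contrasted here; the function names and the running-statistics arguments are placeholders, not the exact code from either repository:

```python
import numpy as np

def scale_paper_style(reward: float, running_var: float, epsilon: float = 1e-8) -> float:
    # Scaling only: divide by the running std of the return statistic, no centering.
    return reward / np.sqrt(running_var + epsilon)

def normalise_with_centering(reward: float, running_mean: float, running_var: float,
                             epsilon: float = 1e-8) -> float:
    # Centering + scaling: subtracting the running mean (the "-rs.mean" term) makes this
    # look like traditional normalisation, which the paper says it is not doing.
    return (reward - running_mean) / np.sqrt(running_var + epsilon)

# The two variants differ only in whether the running mean is subtracted first.
print(scale_paper_style(1.0, running_var=4.0))                           # ~0.5
print(normalise_with_centering(1.0, running_mean=0.5, running_var=4.0))  # ~0.25
```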

@pseudo-rnd-thoughts merged commit eaccbb5 into Farama-Foundation:main on Jan 13, 2025
13 checks passed

Successfully merging this pull request may close these issues.

[Proposal/Question] Incorrect documentation of NormalizeReward wrapper