Out-of-distribution generalization #166

Open
Expertium opened this issue Jan 25, 2025 · 6 comments

@Expertium
Contributor

Expertium commented Jan 25, 2025

Some time ago I plotted this for FSRS-4.5:

https://i.imgur.com/DSjaW5e.png

https://i.imgur.com/1JsW2Jy.png

(can't upload images directly for some reason)

While the exact numbers will vary depending on the dataset and the outlier filter and whatnot, the trend is clear: lower decay is better.

...as long as retention stays high.

https://discord.com/channels/368267295601983490/1282005522513530952/1332619385440964670

If I use the simulated data to optimize FSRS with a trainable forgetting decay, the decay never approaches 0.5, which is the default value of forgetting decay and the value used to generate the simulated data.
The optimizer prefers flat forgetting curves, and I don't know the reason.

https://discord.com/channels/368267295601983490/1282005522513530952/1332670300013199411

New finding: the decay increases when I decrease the desired retention in the simulation config.
So the decay is highly dependent on the retention of the training data.

If true R is always >50%, then there is no reason for the algorithm to learn to predict R lower than 50%. So the "optimal" value of decay depends on the training data. And most Anki users have retention around 80-95%, without too many insanely overdue cards. This means that any attempt to fit the curve better to the dataset is bound to cause issues in practice, since the algorithm will vastly overestimate R for overdue cards, or worse, for cards that are scheduled with a low DR.

If someone had average retention of 95%, then switched to FSRS and is now using desired retention=70%, they are screwed.

More generally speaking, what shape of the curve is optimal depends on retention.
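
For concreteness, here is a minimal sketch (not part of the original discussion) of how the decay exponent changes predictions for overdue cards. It assumes an FSRS-4.5-style power curve R(t, S) = (1 + factor * t/S)^decay with the factor chosen so that R(t = S) = 0.9; the constants are my assumption based on published FSRS-4.5 values. A flatter curve (decay closer to 0) predicts much higher R at large t/S, which is exactly the overestimation-for-overdue-cards problem described above.

```python
def power_forgetting_curve(t, s, decay):
    # Power curve of the FSRS-4.5 form; the factor is chosen so that
    # R(t = s) = 0.9, i.e. stability is the interval at which R drops to 90%.
    factor = 0.9 ** (1.0 / decay) - 1.0
    return (1.0 + factor * t / s) ** decay

s = 10.0  # stability in days
for t in [10, 30, 100]:  # on time, 3x overdue, 10x overdue
    row = [f"decay={d}: R={power_forgetting_curve(t, s, d):.3f}" for d in (-0.1, -0.5, -1.0)]
    print(f"t={t:>3}d  " + "  ".join(row))
```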

So, what do we do about it?

@L-M-Sherlock @1DWalker

EDIT: to avoid confusion, let me clarify - I believe this is a problem both on the benchmark level and on the level of individual users. On the benchmark level, the "optimal" value of decay depends on the average retention across all collections. On an individual level I suspect that even with a fixed value of decay FSRS still learns to never predict low values of R by predicting very large S. This matches my experience with r/Anki posts, where even people with good metrics, like 2-3% RMSE, report >5% discrepancy between desired retention and true retention. The only explanation is that FSRS doesn't generalize well between different levels of retention.

@Expertium
Contributor Author

I forgot @user1823

IMO this is a serious problem. "If we don't solve it, we're f###ed" kind of serious. I'd like to have as many people pay attention to it as possible.

@Expertium
Contributor Author

Expertium commented Jan 26, 2025

My idea that I shared on Discord:

How about this:

  1. Calculate retention within a given batch, let's call it retention_batch.
  2. Use an exponential moving average, like this: retention_ma = retention_batch * a + retention_ma * (1-a), where retention_ma is the moving average that gets updated after every batch. For example, if retention in a new batch was 0.8, retention_ma was 0.9, and a = 0.5, the new value of retention_ma would be 0.85. a is a hyper-parameter that needs to be fine-tuned.
  3. Adjust the loss like this: loss = loss * abs(retention_ma_new - retention_ma_prev) * b, where retention_ma_new is the new value of the moving average after it was updated on the current batch, and retention_ma_prev is the previous value of the moving average. b is a hyperparameter that needs to be fine-tuned. If necessary, the abs() term can be clamped to some non-zero value, to ensure that the loss is never too close to 0.

The idea is to update parameters harder if there is drift in retention. If the difference between the latest and the previous values of the moving average is large, that means that retention is changing and we need to update parameters really hard. If the difference is small, that means that retention is stable, and we don't need to update too hard.

This is a more adaptive alternative to recency weighting.
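
A minimal sketch of how steps 1-3 could look in a PyTorch training loop. The function name, the default values of a, b, and the clamp are all placeholders of mine, not part of the proposal:

```python
import torch

def retention_drift_weight(labels, state, a=0.5, b=10.0, floor=0.1):
    """Loss multiplier derived from the drift in batch retention.

    labels: 0/1 tensor of recall outcomes in the current batch.
    state:  dict carrying the running moving average between batches.
    a, b:   the smoothing and scaling hyper-parameters from the proposal.
    floor:  lower clamp so the loss never gets too close to 0 when retention is stable.
    """
    retention_batch = labels.float().mean().item()
    retention_ma_prev = state.get("retention_ma", retention_batch)
    retention_ma_new = retention_batch * a + retention_ma_prev * (1 - a)
    state["retention_ma"] = retention_ma_new
    drift = abs(retention_ma_new - retention_ma_prev)
    return max(drift * b, floor)

# usage inside a training loop:
# state = {}
# loss = torch.nn.functional.binary_cross_entropy(pred_r, labels)
# loss = loss * retention_drift_weight(labels, state)
# loss.backward()
```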

A slightly different approach proposed by 1DWalker:

Take v for velocity, update it using exponential moving average based on abs(retention_ma_new - retention_ma_prev)^t. Then we have loss = loss * (1 + sv) or something
s is a scaling factor
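
A corresponding sketch of the velocity variant; the smoothing constant and the values of s and t are placeholders, since the proposal above leaves them open:

```python
def velocity_weight(drift, state, alpha=0.5, s=10.0, t=1.0):
    # Keep a "velocity" v as an EMA of drift ** t and scale the loss by (1 + s * v).
    v = alpha * (drift ** t) + (1 - alpha) * state.get("v", 0.0)
    state["v"] = v
    return 1.0 + s * v
```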

This is a solution, but it treats the symptoms, not the underlying problem. It would just make FSRS oscillate between being good at one level of retention and being good at another level of retention. What we actually want is for FSRS to be good at ALL levels of retention simultaneously.

@Expertium
Contributor Author

Expertium commented Jan 26, 2025

Another idea: we could ask Dae for a new dataset (again), but this time ask him to calculate retention for each user and make sure the dataset has the following properties:

  1. The average retention across all users is roughly 50%
  2. The number of users with retention higher than 50% is roughly the same as the number of users with retention lower than that

I'm sure among millions of Anki users it's possible to find 5k people with <50% retention.
Then we will be able to fine-tune the shape of the forgetting curve without having to worry about overfitting to people with high retention. We would still have to worry about the lack of generalization on the individual level, but at least this would be a step forward.

Even better: make a dataset with uniformly distributed retention, such that any value of retention is equally likely to appear in the dataset. Idk if Dae would want to put in the work to do that, and since he's the only one who has access to all that data, nobody can do it for him.
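
A sketch of what such a sampling step could look like, assuming whoever has the raw data can compute one overall retention number per user; the bin count and per-bin sample size are made-up:

```python
import random
from collections import defaultdict

def sample_uniform_retention(user_retention, per_bin=500, bins=10, seed=42):
    """Pick users so that retention is roughly uniformly distributed.

    user_retention: dict of user_id -> that user's overall retention (0..1),
    which would have to be computed server-side.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for user_id, retention in user_retention.items():
        bucket = min(int(retention * bins), bins - 1)  # 0.0-0.1, ..., 0.9-1.0
        buckets[bucket].append(user_id)
    sampled = []
    for bucket in sorted(buckets):
        users = buckets[bucket]
        # low-retention buckets will likely be under-filled; take what exists
        sampled.extend(rng.sample(users, min(per_bin, len(users))))
    return sampled
```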

@JSchoreels

JSchoreels commented Jan 26, 2025

Anecdotally, I encountered that issue when I used "Compute my Optimal Retention", which recommended dropping from .80 DR to .70. The problem was that my 5-day average retention of .80 became .60 very quickly, and on one specific day it dropped as low as ~52%; at that point I went back to .80 to avoid digging my own grave.

My gut feeling was, and still is, that unfortunately, while FSRS can be good at predicting what it has been trained to predict, the forgetting curve as a whole might not really capture how a prediction made at .80 DR maps to .70 DR.

I also think that until this is at least under control, any kind of "Automatic Desired Retention" should be avoided. FSRS is able to predict well around a certain DR, but is far from being able to predict well at others.

@user1823
Contributor

I really don't know how to test this, but maybe we need to go back to the exponential forgetting curve. Perhaps the better metrics that we saw with the power curve were just due to this bias in the evaluation.

For those who are unaware, FSRS v3 used an exponential forgetting curve; FSRS v4 switched to a power curve (with -1 decay), and FSRS-4.5 switched to -0.5 decay.
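
A quick side-by-side of those curve shapes, normalized so that R(t = S) = 0.9 in every version; the exact forms are my understanding of the version history above, not something stated in this thread:

```python
def r_exponential(t, s):
    return 0.9 ** (t / s)                # FSRS v3-style exponential curve

def r_power(t, s, decay):
    factor = 0.9 ** (1.0 / decay) - 1.0  # keeps R(t = s) = 0.9
    return (1.0 + factor * t / s) ** decay

s = 10.0
for t in [5, 10, 30, 100]:
    print(t, round(r_exponential(t, s), 3),
          round(r_power(t, s, -1.0), 3),   # FSRS v4
          round(r_power(t, s, -0.5), 3))   # FSRS-4.5
```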

Also see:

@Expertium
Contributor Author

Expertium commented Jan 26, 2025

That would just make the metrics worse without really telling us anything new. I think we need a new dataset as described here: #166 (comment)

Then we can test whether the exponential curve is better on that dataset.
