Out-of-distribution generalization #166
I forgot to tag @user1823. IMO, this is a serious problem. "If we don't solve it, we're f###ed" kind of serious. I'd like as many people as possible to pay attention to it.
My idea, which I shared on Discord:
The idea is to update the parameters harder if there is drift in retention. If the difference between the latest and previous values of a moving average of retention is large, retention is changing and we need to update the parameters hard; if the difference is small, retention is stable and we don't need to update as hard. This is a more adaptive alternative to recency weighting (a sketch is included after this comment). A slightly different approach was proposed by 1DWalker:
This is a solution, but it treats the symptoms, not the underlying problem. It would just make FSRS oscillate between being good at one level of retention and being good at another. What we actually want is for FSRS to be good at ALL levels of retention simultaneously.
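A minimal sketch of the drift-scaled update idea above, assuming per-review recall outcomes in chronological order; the resulting weights would multiply each review's contribution to the optimizer's loss. The function name and the `ema_alpha` / `drift_gain` constants are illustrative, not part of FSRS:

```python
def drift_weights(recalled, ema_alpha=0.05, drift_gain=10.0):
    """recalled: chronological list of 0/1 review outcomes.
    Returns one loss weight per review; larger when retention is drifting."""
    weights = []
    ema = None
    for r in recalled:
        prev = ema
        # Exponential moving average of retention.
        ema = r if ema is None else (1 - ema_alpha) * ema + ema_alpha * r
        drift = 0.0 if prev is None else abs(ema - prev)
        # Baseline weight of 1.0, boosted in proportion to how fast the
        # moving-average retention is changing.
        weights.append(1.0 + drift_gain * drift)
    return weights
```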
Another idea: we could ask Dae for a new dataset (again), but this time ask him to calculate retention for each user and make sure the dataset has the following properties:
I'm sure that among millions of Anki users it's possible to find 5k people with <50% retention. Even better: make a dataset with uniformly distributed retention, such that any value of retention is equally likely to appear in the dataset. Idk if Dae would want to put in the work to do that, and since he's the only one who has access to all that data, nobody can do it for him.
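A rough sketch of what "uniformly distributed retention" could look like in practice, assuming per-user retention has already been computed; the bucket width and per-bucket sample size here are arbitrary placeholders:

```python
import random
from collections import defaultdict

def uniform_retention_sample(user_retention, per_bucket=500, bucket_width=0.05, seed=0):
    """user_retention: dict mapping user_id -> overall retention in [0, 1].
    Returns user_ids with (at most) per_bucket users per retention bucket."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for user, r in user_retention.items():
        # Clamp r = 1.0 into the top bucket.
        buckets[min(int(r / bucket_width), int(1 / bucket_width) - 1)].append(user)
    sampled = []
    for users in buckets.values():
        rng.shuffle(users)
        sampled.extend(users[:per_bucket])
    return sampled
```

Low-retention buckets will likely still be underpopulated, which is exactly the scarcity problem: the dataset can only be as uniform as the rarest bucket allows.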
Anecdotally, I ran into that issue when I used "Compute my Optimal Retention", which recommended that I drop from 0.80 DR to 0.70. The problem was that my 5-day average retention quickly fell from 0.80 to 0.60, and on one specific day it dropped as low as ~52%; at that point I went back to 0.80 to avoid digging my own grave. My gut feeling was, and still is, that while FSRS can be good at predicting what it has been trained to predict, the forgetting curve unfortunately might not capture how a prediction made at 0.80 DR maps to 0.70 DR. I also think that until this is at least under control, any kind of "Automatic Desired Retention" should be avoided. FSRS is able to predict at a certain DR, but is far from being able to predict at others.
I really don't know how to test this, but maybe we need to go back to the exponential forgetting curve. Probably, the better metrics that we saw with a power curve were just due to this bias in the evaluation. For those who are unaware: FSRS v3 used an exponential forgetting curve, FSRS v4 switched to a power curve (with -1 decay), and FSRS-4.5 switched to -0.5 decay. Also see:
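For reference, here are the three curve families mentioned above, written out side by side with the constants chosen so that R = 0.9 at t = S; this is my reading of the published formulas, so treat it as a sketch rather than the canonical implementation:

```python
def r_exponential(t, s):
    """FSRS v3 style: R = 0.9 ** (t / S)."""
    return 0.9 ** (t / s)

def r_power_v4(t, s):
    """FSRS v4 style: decay -1, factor 1/9, so that R(S) = 0.9."""
    return (1 + t / (9 * s)) ** -1

def r_power_v45(t, s):
    """FSRS-4.5 style: decay -0.5, factor 19/81, so that R(S) = 0.9."""
    return (1 + 19 / 81 * t / s) ** -0.5

# All three agree at t = S, but diverge badly for very overdue cards.
for mult in (1, 2, 5, 10):
    print(mult, round(r_exponential(mult, 1.0), 3),
          round(r_power_v4(mult, 1.0), 3), round(r_power_v45(mult, 1.0), 3))
```

At 10x the stability, the exponential curve gives R ≈ 0.35, the -1 power curve ≈ 0.47, and the -0.5 power curve ≈ 0.55, which is exactly the long-overdue regime where the training data is thin.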
That would just make the metrics worse without really telling us much otherwise. I think we need a new dataset as described here: #166 (comment). Then we can test whether the exponential curve is better on that dataset.
Some time ago I plotted this for FSRS-4.5:
https://i.imgur.com/DSjaW5e.png
https://i.imgur.com/1JsW2Jy.png
(can't upload images directly for some reason)
While the exact numbers will vary depending on the dataset and the outlier filter and whatnot, the trend is clear: lower decay is better.
...as long as retention stays high.
https://discord.com/channels/368267295601983490/1282005522513530952/1332619385440964670
https://discord.com/channels/368267295601983490/1282005522513530952/1332670300013199411
If true R is always >50%, then there is no reason for the algorithm to learn to predict R lower than 50%. So the "optimal" value of decay depends on the training data. And most Anki users have retention around 80-95%, without too many insanely overdue cards. This means that any attempt to fit the curve better to the dataset is bound to cause issues in practice: the algorithm will vastly overestimate R for overdue cards, or worse, for cards that are scheduled with a low DR.
If someone had average retention of 95%, then switched to FSRS and is now using desired retention=70%, they are screwed.
More generally speaking, what shape of the curve is optimal depends on retention.
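To make the "95% retention user switches to DR = 70%" failure mode concrete, here is a toy calculation. It assumes the scheduler picks intervals by inverting the FSRS-4.5 power curve while the user's true forgetting is closer to exponential with the same stability; both curves here are illustrative assumptions, not measurements of any particular user:

```python
S = 10.0          # stability in days (assumed identical for both curves)
desired_r = 0.7

# Interval obtained by inverting the FSRS-4.5 curve R = (1 + 19/81 * t/S) ** -0.5
interval = S * 81 / 19 * (desired_r ** -2 - 1)

# Retention actually obtained if the true forgetting curve is exponential, R = 0.9 ** (t/S)
true_r = 0.9 ** (interval / S)

print(f"scheduled interval: {interval:.1f} days, true retention at review time: {true_r:.2f}")
# -> about 44 days and ~0.63: the user asked for 0.7 and gets noticeably less,
#    in line with the 0.80 -> 0.70 -> ~0.60 anecdote above.
```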
So, what do we do about it?
@L-M-Sherlock @1DWalker
EDIT: to avoid confusion, let me clarify: I believe this is a problem both on the benchmark level and on the level of individual users. On the benchmark level, the "optimal" value of decay depends on the average retention across all collections. On an individual level, I suspect that even with a fixed value of decay, FSRS still learns to never predict low values of R by predicting very large S. This matches my experience with r/Anki posts, where even people with good metrics, like 2-3% RMSE, report a >5% discrepancy between desired retention and true retention. The only explanation is that FSRS doesn't generalize well between different levels of retention.
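One way to check the individual-level part of this claim: bucket reviews by the R that FSRS predicted at review time and compare it with the observed recall rate in each bucket. This is a generic calibration check, not an FSRS API; `predicted_r` and `recalled` are assumed to be extracted from one's own review logs:

```python
def calibration_by_bucket(predicted_r, recalled, n_buckets=10):
    """predicted_r: model predictions in [0, 1] at review time.
    recalled: 0/1 outcomes for the same reviews.
    Returns (bucket midpoint, observed retention, review count) per non-empty bucket."""
    sums = [0] * n_buckets
    counts = [0] * n_buckets
    for p, y in zip(predicted_r, recalled):
        b = min(int(p * n_buckets), n_buckets - 1)
        sums[b] += y
        counts[b] += 1
    return [((b + 0.5) / n_buckets, sums[b] / counts[b], counts[b])
            for b in range(n_buckets) if counts[b]]
```

If nearly all reviews land in the top buckets and the sparse low-R buckets show observed retention well below the bucket midpoint, that is the out-of-distribution problem in miniature.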