Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GRU-P-secs to the table #149

Closed
1DWalker opened this issue Jan 7, 2025 · 24 comments
Closed

Add GRU-P-secs to the table #149

1DWalker opened this issue Jan 7, 2025 · 24 comments

Comments

@1DWalker
Copy link
Contributor

1DWalker commented Jan 7, 2025

It would be interesting to see how exactly -secs can benefit a model like GRU-P. With this information and combining with -short, it can be inferred how much potential there is for fsrs to incorporate same-day reviews with proper second-resolution data.

-secs is incomparable with non -secs due how the preprocessing is done, resulting in a different number of reviews that are considered. -secs also considers 1 more user than non -secs. A possible solution would be to evaluate -secs only on reviews that would be evaluated with a non -secs model.

`Model: GRU-P-short
Total number of users: 9999
Total number of reviews: 349923850
Weighted average by reviews:
GRU-P-short LogLoss (mean±std): 0.3195±0.1469
GRU-P-short RMSE(bins) (mean±std): 0.0421±0.0288
GRU-P-short AUC (mean±std): 0.7096±0.0815

Model: GRU-P-secs
Total number of users: 10000
Total number of reviews: 519296315
Weighted average by reviews:
GRU-P-secs LogLoss (mean±std): 0.3293±0.1430
GRU-P-secs RMSE(bins) (mean±std): 0.0495±0.0266
GRU-P-secs AUC (mean±std): 0.7409±0.0744

@Expertium
Copy link
Contributor

It doesn't perform better, as you can tell by logloss and RMSE (AUC is a bit better, though). In fact, you can try this with any other model and you will see that -secs models have worse metrics.
Considering that even GRU-P-secs doesn't outperform it's non-seqcs variant despite being very flexible, it seems like same-day reviews just suck.

@1DWalker
Copy link
Contributor Author

1DWalker commented Jan 7, 2025

I don't think that they are directly comparable since the review counts don't even match up. In theory, GRU-P-short uses short-term data but GRU-P-secs does not have access to this short-term data. As you have suggested elsewhere, for review steps of >= 1 days secs is useless, but it can be argued that when -secs and -short are combined the benefit will be greater than with -short alone. From my current understanding GRU-P-short is just given a tensor indicating short term data with no time information, such as [0, 3, 3, 0, 2, 3] which are just what the user presses on Anki. No information whatsoever is provided as to the time of this information.

@Expertium
Copy link
Contributor

@L-M-Sherlock how about a -shortsecs (or whatever you wanna call it) model?
-short:

  • Doesn't predict the probability of recall for same-day reviews
  • Doesn't have access to real interval lengths for same-day reviews

-secs:

  • Does predict the probability of recall for same-day reviews
  • Does have access to real interval lengths for same-day reviews

-shortsecs:

  • Doesn't predict the probability of recall (like -short)
  • Does have access to real interval lengths (like -secs)

If you do that, I'll work on the formulas. We'll, provided I figure out how to run the damn thing (I hate the benchmark code so much it's unreal)

@L-M-Sherlock
Copy link
Member

@Expertium
Copy link
Contributor

That's basically what I'm asking for, right? Don't measure the accuracy of predicted R for same-day reviews while still using real interval lengths, right?

@Expertium
Copy link
Contributor

@L-M-Sherlock am I right? (see above)

@L-M-Sherlock
Copy link
Member

Yep.

@Expertium
Copy link
Contributor

Expertium commented Jan 10, 2025

FSRS-5 --secs --no_test_same_day is much worse than just FSRS-5. I don't know why
image

I was expecting them to be roughly the same, since it's basically just FSRS-5: same formulas, nothing new, just fractional intervals. Also, the number of reviews isn't exactly the same.
@L-M-Sherlock any idea what might be going on?
Also, I'll run FSRS-5 --secs to see how much it differs from FSRS-5 --secs --no_test_same_day

@L-M-Sherlock
Copy link
Member

any idea what might be going on?

Obviously, FSRS-5-secs improves its ability to predict the short-term memory at the cost of the ability to predict the long-term memory.

@Expertium
Copy link
Contributor

But it's the same formulas. I didn't change the math. I don't understand why there is such a big difference.

@L-M-Sherlock
Copy link
Member

Oops, I get it. You need replace "delta_t" with "elapsed_days" here:

srs-benchmark/other.py

Lines 2498 to 2500 in e1a5ba4

if SHORT_TERM:
df = df[(df["delta_t"] != 0) | (df["i"] == 1)].copy()
df["i"] = df.groupby("card_id").cumcount() + 1

Then FSRS-5 will not predict the probability of recall for same-day reviews during traning.

@1DWalker
Copy link
Contributor Author

The issue still exists. With your suggested change, df will only contain values with elapsed_days > 0 which makes it so that --no_test_same_day does nothing.

srs-benchmark/other.py

Lines 2591 to 2592 in e1a5ba4

if NO_TEST_SAME_DAY:
test_set = test_set[test_set["elapsed_days"] > 0].copy()

I think it may have to do with this filtering step that I don't understand as this reduces the size of the dataframe

srs-benchmark/other.py

Lines 2501 to 2513 in e1a5ba4

if not SECS_IVL:
filtered_dataset = (
df[df["i"] == 2]
.groupby(by=["first_rating"], as_index=False, group_keys=False)[df.columns]
.apply(remove_outliers)
)
if filtered_dataset.empty:
return pd.DataFrame()
df[df["i"] == 2] = filtered_dataset
df.dropna(inplace=True)
df = df.groupby("card_id", as_index=False, group_keys=False)[df.columns].apply(
remove_non_continuous_rows
)

@Expertium
Copy link
Contributor

@L-M-Sherlock here's what I recommend:

  1. Make it so that --secs only affects intervals (integer vs float) and NOTHING else
  2. Rename no_test_same_day to predict_same_day_r and make it so that if it's enabled, predicted R for same-day reviews is used both during training and evaluation. If disabled, predicted R for same-day reviews is not used. Neither for training, nor for evaluation.

@L-M-Sherlock
Copy link
Member

L-M-Sherlock commented Jan 13, 2025

  • Make it so that --secs only affects intervals (integer vs float) and NOTHING else

It's impossible to apply the current remove_outliers to --secs case. So I don't know how to make it only affects intervals and NOTHING else, unless remove_outliers is removed from all benchmark.

@L-M-Sherlock
Copy link
Member

any idea what might be going on?

Another reason: FSRS treats reviews with elapsed days < 1 as short-term reviews. When the intervals are float, they are not rounded, so some long-term reviews (the elapsed days are shorter than 1 day but across a sleep time) are treated as short-term reviews.

@L-M-Sherlock
Copy link
Member

After 09d5c4a, the gap between FSRS-5-secs and FSRS-5 is not that huge:

Model: FSRS-5-secs
Total number of users: 1990
Total number of reviews: 69187535
Weighted average by reviews:
FSRS-5-secs LogLoss (mean±std): 0.3537±0.1622
FSRS-5-secs RMSE(bins) (mean±std): 0.0686±0.0393
FSRS-5-secs AUC (mean±std): 0.6968±0.0785

Weighted average by log(reviews):
FSRS-5-secs LogLoss (mean±std): 0.3760±0.1781
FSRS-5-secs RMSE(bins) (mean±std): 0.0887±0.0523
FSRS-5-secs AUC (mean±std): 0.6940±0.0864

Weighted average by users:
FSRS-5-secs LogLoss (mean±std): 0.3790±0.1809
FSRS-5-secs RMSE(bins) (mean±std): 0.0918±0.0539
FSRS-5-secs AUC (mean±std): 0.6936±0.0885

parameters: [0.43215, 1.36565, 3.37595, 19.0261, 7.15305, 0.5207, 1.71375, 0.01115, 1.5362, 0.1723, 1.01335, 1.9256, 0.10845, 0.28325, 2.27665, 0.2308, 2.9898, 0.4275, 0.8216]

Model: FSRS-5
Total number of users: 1990
Total number of reviews: 66177965
Weighted average by reviews:
FSRS-5 LogLoss (mean±std): 0.3415±0.1558
FSRS-5 RMSE(bins) (mean±std): 0.0546±0.0334
FSRS-5 AUC (mean±std): 0.7047±0.0813

Weighted average by log(reviews):
FSRS-5 LogLoss (mean±std): 0.3590±0.1681
FSRS-5 RMSE(bins) (mean±std): 0.0712±0.0445
FSRS-5 AUC (mean±std): 0.7011±0.0878

Weighted average by users:
FSRS-5 LogLoss (mean±std): 0.3614±0.1704
FSRS-5 RMSE(bins) (mean±std): 0.0737±0.0459
FSRS-5 AUC (mean±std): 0.7005±0.0898

parameters: [0.4208, 1.1358, 3.00825, 15.49955, 7.1794, 0.54405, 1.7145, 0.00635, 1.51585, 0.1256, 1.00245, 1.9359, 0.1069, 0.2932, 2.27565, 0.23055, 2.9898, 0.4591, 0.63255]

@Expertium
Copy link
Contributor

Expertium commented Jan 13, 2025

I still think that it would be best to make the number of reviews exactly equal. But ok, I'll see if I can improve short-term S formulas now. So do I need --secs --no_test_same_day?

@1DWalker
Copy link
Contributor Author

The -secs and non -secs use a different number of reviews so it's an apples to oranges comparison. I would rather go down the route of not removing outliers so that the reviews that are compared are actually the same and so that we properly benchmark algorithms that use all the available short-term information.

@Expertium
Copy link
Contributor

@L-M-Sherlock I agree with 1DWalker. Let's remove the outlier filter if that's the case. I really want to try FSRS that uses fractional interval lengths but doesn't predict R for same-day reviews, neither in training nor in evaluation.

@1DWalker
Copy link
Contributor Author

Each review can already be uniquely identified. Maybe we can just mark which reviews would be removed by the filter and exclude them when testing, similar to what was attempted for --no_test_same_day?

@L-M-Sherlock
Copy link
Member

I prefer to remove the filter entirely. It's more convenient to me. If you have another solution, PR is welcome.

@1DWalker
Copy link
Contributor Author

Removing the filter would mean we would need to rebuild the entire benchmark. I'll work on the issue.

@1DWalker
Copy link
Contributor Author

There is still a bug with --secs where rmse is not properly calculated due to the rmse calculation being dependent on delta_t which is modified by --secs.
https://github.com/open-spaced-repetition/fsrs-optimizer/blob/main/src/fsrs_optimizer/fsrs_optimizer.py#L2156

As an example I made a model that only predicts a constant 90% retention.

Normal:
{"metrics": {"RMSE": 0.487103, "LogLoss": 0.729563, "RMSE(bins)": 0.286898, "ICI": 0.1, "AUC": 0.5}, "user": 1, "size": 10620}
{"metrics": {"RMSE": 0.38849, "LogLoss": 0.492415, "RMSE(bins)": 0.125703, "ICI": 0.1, "AUC": 0.5}, "user": 2, "size": 35900}
{"metrics": {"RMSE": 0.269809, "LogLoss": 0.277834, "RMSE(bins)": 0.060215, "ICI": 0.1, "AUC": 0.5}, "user": 3, "size": 4255}
{"metrics": {"RMSE": 0.374166, "LogLoss": 0.462409, "RMSE(bins)": 0.131994, "ICI": 0.1, "AUC": 0.5}, "user": 4, "size": 7360}
{"metrics": {"RMSE": 0.423296, "LogLoss": 0.570017, "RMSE(bins)": 0.146756, "ICI": 0.1, "AUC": 0.5}, "user": 5, "size": 6275}

--secs --equalize_test_with_non_secs:
{"metrics": {"RMSE": 0.487103, "LogLoss": 0.729563, "RMSE(bins)": 0.286324, "ICI": 0.1, "AUC": 0.5}, "user": 1, "size": 10620}
{"metrics": {"RMSE": 0.38849, "LogLoss": 0.492415, "RMSE(bins)": 0.133124, "ICI": 0.1, "AUC": 0.5}, "user": 2, "size": 35900}
{"metrics": {"RMSE": 0.269809, "LogLoss": 0.277834, "RMSE(bins)": 0.067849, "ICI": 0.1, "AUC": 0.5}, "user": 3, "size": 4255}
{"metrics": {"RMSE": 0.374166, "LogLoss": 0.462409, "RMSE(bins)": 0.143521, "ICI": 0.1, "AUC": 0.5}, "user": 4, "size": 7360}
{"metrics": {"RMSE": 0.423296, "LogLoss": 0.570017, "RMSE(bins)": 0.17402, "ICI": 0.1, "AUC": 0.5}, "user": 5, "size": 6275}

See how while the LogLoss are exactly the same, RMSE(bins) are different. I'll work on a fix.

@1DWalker
Copy link
Contributor Author

With LSTM being merged there exists a --secs model on the benchmark, albeit one that doesn't use same-day testing, I think this issue is not important anymore; we now have many flags for doing short-term training, testing, fractional intervals.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants