Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gpu: nvidia: ip: adjust benchdnn error threshold #2479

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

sgeor255
Copy link
Contributor

Description

Currently the inner product error threshold in benchdnn is set to 0. In some cases for large shapes on nvidia backend there are some precision issues (e.g. the cases reported in MFDNN-12610). This PR adjusts the error threshold so that such cases are not reported as failures.

Fixes MFDNN-12610.

@sgeor255 sgeor255 requested a review from a team as a code owner January 22, 2025 10:46
@github-actions github-actions bot added the component:tests Codeowner: @oneapi-src/onednn-arch label Jan 22, 2025
@sgeor255 sgeor255 force-pushed the svet/nvdia-ip-precision branch from 0f0897a to 383b1a2 Compare January 22, 2025 11:59
cmp.set_threshold(trh);
} else {
cmp.set_threshold(0.f);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know why this difference ? Is sum post-op applied over f32 intermediate value or over f16 values for NV backend?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say this change can fly in only in case sum post-op is done through a native cuDNN fusion (single call) with f16 accumulation internally, otherwise, the issue is likely inside the implementation that doesn't convert the output to f32 and accumulate pieces in f32.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sum post-op is implemented through the beta parameter of cublas gemm (see here). The compute type for gemm is set to float32, but I couldn't find any details in the documentation how that affects alpha & beta scaling parameters, i.e. whether they are performed in f32 too or in f16 (which is the datatype used for the failing cases).

@dzarukin I investigated if there are any issues with the implementation but couldn't find any. Also, I noticed that changing the input values makes the test pass, e.g. when using whole numbers as the input (still in f16 datatype).

To me it seems to be some sort of a precision/rounding issue. The expected values computed by oneDNN are rounded down, while in the cuDNN case they are rounded up, e.g.

[107536][DST][336:16] exp_f32:      1038.5 exp:        1038 got:        1039 diff:       1 rdiff:0.000963391
[108178][DST][338:18] exp_f32:      1051.5 exp:        1052 got:        1051 diff:       1 rdiff:0.00095057
[108499][DST][339:19] exp_f32:      1064.5 exp:        1064 got:        1065 diff:       1 rdiff:0.00093985

The values in full precision in the above example are not representable as f16 (e.g. https://float.exposed/0x641c), which makes me think cublas is doing incorrect rounding?

Also I found this discussion where someone is asking about how the scaling parameters in cublas work, but there was no response.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sgeor255, thanks for looking into implementation details, that's a good start.
I suggest to conduct a little experiment - make cublasGemmEx sum-less and return f32 and append sum post-op (with data copied to a dedicated buffer) with f32 as well (through the cudnnAddTensor call) to simulate the proper behavior.
If this experiment goes fine, it would prove that there's something wrong indeed with cuda libraries, and I'll re-iterate on that check again. Probably the same one should exist (or appear) for matmul as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When changing data addresses the issue it always means rounding/accumulation mechanics stands on its way. Smaller ranges usually lead to situations when final numbers remain exact and conversion to f16/f32 and back don't change the number and the check passes.

When exp number if x.5, in the reality, it can be x.5002, which would be rounded towards x + 1, while the library might have x.4998 which would be rounded towards x. That library result can be a product of different effects (accumulation in lower data types, inexact division/multiplication, intermediate roundings, etc.) which can't be examined with cuda.

@@ -278,7 +278,16 @@ void skip_invalid_prb(const prb_t *prb, res_t *res) {}

void setup_cmp(compare::compare_t &cmp, const prb_t *prb, data_kind_t kind,
const args_t &ref_args) {
cmp.set_threshold(0.f);
// The nvidia implementation has precision issues in some cases
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't call it precision issues, wasn't our conclusion that this is differences in rounding modes that we can't change in cuDNN?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component:tests Codeowner: @oneapi-src/onednn-arch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants