Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change how diagnostics are ordered #4778

Open
wants to merge 5 commits into
base: trunk
Choose a base branch
from

Conversation

jonmeow
Copy link
Contributor

@jonmeow jonmeow commented Jan 9, 2025

This change deliberately breaks away from the line/column ordering, and instead focuses on a last byte offset corresponding to the final token processed as part of producing the message. Where that's equal, this maintains stable ordering in order to reflect the order that diagnostics were produced.

The intent of this approach is that lex, parse, and check diagnostics are interleaved based on where they are produced, but that subexpressions still have diagnostics emitted prior to containing expressions. In particular, the prior line/column sort essentially sorted on the start of where a diagnostic was associated, and this is closer to sorting based on the end. As a consequence, something like F(1 2) will have the error for 1 2 emitted before a diagnostic for F(1 2) not matching parameters, instead of after.

In check, we track the last handled node. This provides a last_byte_offset separate from where a diagnostic is associated. The intent is that this creates an ordering of diagnostics which may be associated with earlier code, to cause the diagnostics to be emitted later. An example consequence of this is the change in ordering of modifier diagnostics: we are diagnosing those from the same place, but they have the same last_byte_offset, so we print them out in the order produced.

I've added similar tracking to parse, but cannot identify any test which is affected by it (note the separate commit, I thought about this late). I'm not sure whether we have good out-of-order errors we could produce for this.

A significant number of tests have reordered diagnostics as a consequence of this change, so this change does not add further testing.

@github-actions github-actions bot requested a review from zygoloid January 9, 2025 00:06
@@ -396,7 +401,7 @@ class Context {
llvm::raw_ostream* vlog_stream_;

// The current position within the token buffer.
Lex::TokenIterator position_;
Lex::TokenIterator* const position_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This adds extra indirection to a lot of accesses to position_. Do we have an easy way to benchmark to see if this makes any difference to performance? (I think we could alternatively make Parse ask the Context for the address of its position and pass that into the converter, or we could make the Context own the converter, tracker, and emitter.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could check with compile_benchmark, but not bothering since I think it's clearer to make Context the owner.

@@ -9,7 +9,21 @@ namespace Carbon::Check {

auto SemIRDiagnosticConverter::ConvertLoc(SemIRLoc loc,
ContextFnT context_fn) const
-> DiagnosticLoc {
-> std::pair<DiagnosticLoc, int32_t> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered adding the offset as a field in DiagnosticLoc? I'm not sure whether that'd be reasonable or not, so if this doesn't work out for some reason I think it's fine as-is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A given Diagnostic can have multiple DiagnosticMessages, each with their own DiagnosticLoc. The DiagnosticLoc will often point at a line in a different file as compared to the file the last_byte_offset points into. I believe putting the last_byte_offset on the Diagnostic avoids the potential for confusion about what a last_byte_offset on a note in a different file means.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also wondering the same, since they are created and travel together so much. Might it help if the DiagnosticLoc field was an "ordering key" instead of an explicit "offset" which may be from a different file?

Copy link
Contributor Author

@jonmeow jonmeow Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a diagnostic message such as:

  // CHECK:STDERR: fail_private_field_access.carbon:[[@LINE+7]]:21: error: cannot access private member `radius` of type `Circle` [ClassInvalidMemberAccess]
  // CHECK:STDERR:   let radius: i32 = circle.radius;
  // CHECK:STDERR:                     ^~~~~~~~~~~~~

The portions provided by DiagnosticLoc are everything that relates to location:

  • filename (fail_private_field_access.carbon)
  • line number ([[@LINE+7]])
  • column number (21)
  • line (let radius: i32 = circle.radius;)
  • length (^~~~~~~~~~~~~)

I think I have a different perspective here. It's more than an "ordering key". We have a few functions returning DiagnosticLoc + last_byte_offset, but the code with DiagnosticLoc only I am not touching.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, if there's a strong preference, I can add a wrapper struct instead of using std::pair; but I do think DiagnosticLoc should remain separate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, one more note. DiagnosticLoc is part of the intended-to-be-public API for diagnostic processors. Per Hyrum's Law, I think we should be cautious about what we put on a public API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's fair. I also thought about SortedDiagonsticLoc in place of the pair, or whatever would make sense. A bare int32_t in a pair is fairly non-descriptive.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a wrapper struct sounds like a nice approach, as it'd give us a field name and a place to put documentation of the value separate from the ConvertLoc function.

@@ -23,7 +37,7 @@ auto SemIRDiagnosticConverter::ConvertLoc(SemIRLoc loc,
"If we get invalid locations here, we may need to more "
"thoroughly track ImportDecls.");

DiagnosticLoc in_import_loc;
std::pair<DiagnosticLoc, int32_t> in_import_loc;
auto import_loc_id = cursor_ir->insts().GetLocId(import_ir.decl_id);
if (import_loc_id.is_node_id()) {
// For imports in the current file, the location is simple.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the calls to ConvertLocInFile here can produce byte offsets into files other than the current source file. In that case maybe we should set the byte offset to -1 and allow ConvertLoc to pick a better one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in ConvertLoc

explicit TokenDiagnosticConverter(const TokenizedBuffer* buffer)
: buffer_(buffer) {}
explicit TokenDiagnosticConverter(const TokenizedBuffer* tokens)
: tokens_(tokens) {}

// Map the given token into a diagnostic location.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could also describe the int32_t in the output?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing comment to note it's just implementing a virtual function, to avoid duplicating/diverging documentation.

@@ -9,7 +9,21 @@ namespace Carbon::Check {

auto SemIRDiagnosticConverter::ConvertLoc(SemIRLoc loc,
ContextFnT context_fn) const
-> DiagnosticLoc {
-> std::pair<DiagnosticLoc, int32_t> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also wondering the same, since they are created and travel together so much. Might it help if the DiagnosticLoc field was an "ordering key" instead of an explicit "offset" which may be from a different file?

Copy link
Contributor

@zygoloid zygoloid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good. Not merging in case you want to add the wrapper struct for DiagnosticLoc + the offset.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants