Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Path to supporting 100% of grammars in the JS engine #876

Open
slevithan opened this issue Dec 27, 2024 · 3 comments
Open

Path to supporting 100% of grammars in the JS engine #876

slevithan opened this issue Dec 27, 2024 · 3 comments

Comments

@slevithan
Copy link
Collaborator

slevithan commented Dec 27, 2024

Shiki's JS engine uses Oniguruma-To-ES under the hood to emulate Oniguruma regexes. As of the latest version, the vast majority of grammars (which contain dozens to thousands of regexes each) are supported.

Following is what's needed to support the remaining grammars.

Mismatched

These are grammars that produce different results in Shiki's JS and WASM engines for the provided grammar samples.

  • Kusto
    • Includes a regex that triggers an Oniguruma bug, so in fact the JS engine (not the WASM engine) is producing the correct highlighting. The Oniguruma bug needs to be worked around in the Kusto grammar. Upstream issue here.
    • Fix landed in upstream grammar. Published in Shiki 1.25.0.
  • NGINX
    • Issue unknown. Any insight appreciated.

Unsupported

These are grammars that throw an error for at least one regex in Shiki's JS engine, which by default does not silence errors like the WASM engine. A few of these grammars are also showing mismatches, but those are likely to be resolved by resolving the errors (since errors can lead to different results either directly or for subsequent regex matches).

  • Ada
    • Includes an invalid Oniguruma regex. Upstream PR here.
    • Fix landed in upstream grammar. Published in Shiki 1.25.0.
  • Hack
    • Includes a regex that triggers an Oniguruma bug (that Oniguruma-To-ES chooses to throw for, rather than reproducing). Upstream PR here.
    • Fix landed in upstream grammar. Expected to be published in the next Shiki release.
  • Swift
    • Uses conditionals (?(…)…|…) (the only grammar that does so) in three regexes.
      • Conditionals aren't currently supported for emulation and aren't emulatable in all cases.
      • The regexes need to be refactored to remove the conditionals. Upstream issue here.
  • C#
    • Uses absence groups (?~…) (the only grammar that does so) in two regexes.
      • Absence groups aren't currently supported for emulation. However, they can be (see issue for adding support).
    • Uses multiple overlapping recursions (one of only two grammars that does so) in one regex.
      • This is technically supportable, but it's not feasible to do so for performance and other reasons. Instead, it would be better to file an issue upstream to refactor the overlapping recursions in the C# grammar.
  • Razor
    • Uses an invalid JS identifier as a group name (the only grammar that does so) in two regexes.
      • This is technically supportable (by e.g. renaming the group type-name as type_name$1 or just removing the name), but handling this comprehensively adds some complications. It's probably better to just file an issue upstream to rename the problematic groups in the Razor grammar.
    • Additionally shares all of C#'s errors because it embeds the C# grammar.
  • PureScript
    • Uses multiple overlapping recursions (one of only two grammars that does so) in one regex.
      • See related comments for C#. Should file an issue upstream to refactor the overlapping recursions in the PureScript grammar.

Resolving all of the above would result in 100% JS engine support for Shiki's grammar samples. Quite a remarkable feat if you're familiar with the challenges and complexity of getting to this point.

Any help with getting these grammars supported in the JS engine (e.g. liking or commenting on the existing issues I linked here, filing the needed issues described above, investigating the source of NGINX mismatches, submitting PRs for grammars that need refactoring, or contributions to Oniguruma-To-ES) would be very welcome.

@slevithan
Copy link
Collaborator Author

@antfu Would it be a good idea to add a requirement for new grammars added to https://github.com/shikijs/textmate-grammars-themes that they need to support the JS engine? As you can see above, this will rarely be a factor, and when it is, it might be because of a bug in the grammar or Oniguruma itself that the author didn't account for.

@antfu
Copy link
Member

antfu commented Dec 30, 2024

That's astonishing work you have done. Thanks a lot for all the effort!

added to shikijs/textmate-grammars-themes that they need to support the JS engine?

Oh yeah, that would be great! We could introduce verification scripts to that repo, so we could catch them up in the CI. We could whitelist the known incompatible ones, so it always covers the new grammar.

@slevithan
Copy link
Collaborator Author

slevithan commented Jan 8, 2025

oniguruma-to-es v1.0.0 added robust validation of lookbehind contents, causing it to identify and throw for one invalid Oniguruma regex in the CodeQL grammar (it uses lookahead within lookbehind, which is valid in JS but not Oniguruma). As a result, CodeQL has moved to Unsupported.

Here is where the the invalid regex is located: tm-grammars, upstream, upstream generated from here.

This is easy to fix, but needs to be fixed upstream. Since it's erroring in both the JS and WASM engines (but silently in the WASM engine), using the forgiving option to silence errors in the JS engine leads to it working correctly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants