perf(syntax): short-circuit if name matches `language_id` #12407

RoloEdits · 2025-01-04T23:44:06Z

Previously, there would always be a Regex::find that would take place, even if the name perfectly matches an id(name in the language.toml). This PR introduces a short-circuit opportunity, to check if the name matches the id, as regex is much more expensive and shouldn't be done unless as a fallback, especially when the most often case appears to be an exact match.

The workload here was to go into helix-term::commands::typed.rs, and at the start of the file, enter insert mode, and hold enter to insert newlines until it reaches line ~1000. This is artificial, but it shows the point well. In just this operation, it accounted for 16% of samples. After, 0.5%.

Before:

After:

pascalkuthe

good find. This code will probably change quite a bit soon with the new TS implementation (and with the new config system I want to change language detection also) but lets take the easy win for now and this will also remind us to be mindful/retest once it gets rewritten.

I wonder if we could add somekind of benchmarks for these kinds of thigns. Testing with flamegraphs is nice for discovery but a benchmark suite to test for regressions would be nice (for example based on brunch)

RoloEdits · 2025-01-05T01:10:22Z

I have started looking into setting up some benchmarking infrastructure to see what kind of approach is simplest. I was looking into divan, marketed as being simple to use and setup, as I had heard good things about it. this part is particularly enticing to me. Rather than having to move a bunch of code around to specific modules/paths, and have to deal with private functions that might be important to bench, and making wrappers for them, dealing with feature flags, like a bench, to improve the ergonomics. We can just put the benchmarks "next" to the code, like how tests can be done.

cargo tree for divan:

divan v0.1.17 (D:\source\divan)
├── cfg-if v1.0.0
├── clap v4.5.23
│   └── clap_builder v4.5.23
│       ├── anstyle v1.0.10
│       ├── clap_lex v0.7.4
│       └── terminal_size v0.4.1
│           └── windows-sys v0.59.0
│               └── windows-targets v0.52.6
│                   └── windows_x86_64_msvc v0.52.6
├── condtype v1.3.0
├── divan-macros v0.1.17 (proc-macro) (D:\source\divan\macros)
│   ├── proc-macro2 v1.0.92
│   │   └── unicode-ident v1.0.14
│   ├── quote v1.0.38
│   │   └── proc-macro2 v1.0.92 (*)
│   └── syn v2.0.94
│       ├── proc-macro2 v1.0.92 (*)
│       ├── quote v1.0.38 (*)
│       └── unicode-ident v1.0.14
└── regex-lite v0.1.6
|
[dev-dependencies]
└── mimalloc v0.1.43
    └── libmimalloc-sys v0.1.39
        └── libc v0.2.169
        [build-dependencies]
        └── cc v1.2.7
            └── shlex v1.3.0

I'll definitely look into this more, and open an issue about a path forward, if I can find an obvious one, that isn't going to be a hassle to maintain.

pascalkuthe · 2025-01-05T01:13:38Z

I love divans API but unfortunaetly it doens't do any statistical analyses at all so I don't quite trust the results.... Also no tracking between runs.

criterion is way too heavy but I have been having a preference for brunch so far (but divan definnitly has the nicest api by far)

RoloEdits · 2025-01-05T02:13:35Z

Ah, forgot to ask if this is intended to be used in CI or just locally, when making changes? Or perhaps we start with the easiest option first, local, to dip our toes, and then get more advanced as needed?

the-mikedavis

This is a behavior change, no? If a language doesn't have an injection-regex set then it currently won't be considered for injection and the injection regex might not necessarily accept the language's name as we have it written in languages.toml.

I think we should separate how we lookup languages when setting the injection.language property - i.e. (#set! injection.language "bash") - vs. capturing - i.e. (info_string (language) @injection.language). For capturing we should probably use a regex (maybe with this change too, I'm not sure) but for the property we should only accept language names (no regex). Then for the common case (property) we use the fast lookup and only resort to the slower regex for @injection.language captures.

RoloEdits · 2025-01-05T17:44:27Z

So you mean like there is a false positive, where it can find based on name, but there isn't actually anything to inject, and thus should return None, not Some(lang with no injection regex)?

RoloEdits · 2025-01-15T02:32:29Z

So I did some digging, and here is my general assessment.

First with divan, it seems the variation is similar to what I have noticed with brunch. But the magic promise of "write the benchmarks anywhere" has its own caveats. First is that because its in the same crate it can get access to inlining that might not be happening to external users outside of it, which could heavily skew results. Second is that it seems like just having the one benches/bench.rs file with its divan::main is not enough to make the magic work. If the crate is not brought into scope in some way then no symbols will be emitted for it and thus it cannot register the benchmarks in a test module, for example. The issue of testing the source code directly also needs to have feature flags to control, which divan itself uses to benchmark its own source code.

Brunch is easy enough to setup, and the last run saved is nice, but what I have used with criterion is setting a baseline and then comparing the results, with something like with critcmp. I'm not sure how much actual value a run to run gets when you might be branching off of main for a refactor and need to compare to master.

criterion is heavy, but it also seems to be the only one to offer a basic set of features to do more than once off runs. And its output also seems to be the main kind supported with 3rd parties like https://bencher.dev/.

It does seem like the cargo team is working towards solutions for all of these things, including but not limited to standardizing benchmark outputs, attribute macros like #[used] or #[retain], as well as general benchmark harness progress. Unfortunately this space is not as developed as I would have liked. Lots of patchy capabilities that mostly seem to stem from cargo stuff/choices that have/havent been made by the core team.

What seems to be the most common setup is to just have a new feature that would expose the functions wanted for benchmarks, like bench, and then set up a normal benches directory and set up like normal. This can be done with any of the libraries. I think we can start with brunch and if the one off way doesnt work we can change. Most of the work would already be done just setting up the feature over the functions to test. The benches work is just the setup. Most instances of a workflow I imagine would be to bench from master and then make a new branch and make changes and then just compare the time, not the percent change, and go from there. Which any of these would support.

I think we can start piecemeal, something I would be willing to get off the ground, and then just grow overtime. Not sure what areas we should start with though. Most of the hotspots I have seen are from syscalls that might not be avoidable or bleed over to other library implementations, like with tree-sitter.

One area could be improved would be using a different json deserializing library for the language server json_rpc stuff, as that has been a very large part of the samples. Something like simdjson or sonic-rs that can take advantage of simd instructions on modern hardware. For general use of the editor this even a 10% improvement could be noticeable overall improvement, as far as samples are concerned.

This is an example from a sampling I did earlier:

You can also see some samples from syntax highlighting next to it. But it might not be worth to touch this code as I believe its going be changed out anyways. Though I guess benchmarking could be useful for that too to see the difference.

RoloEdits · 2025-01-15T02:44:44Z

As for the name not matching up with a regex, the for_language_id only returns if the name is a direct match. If there needs to be an extra check for in there is anything to inject then there could be an additional is_some_and to check in injection_regex.is_some that would filter that. If its not a direct name match with injection regex to return, then it should function like before in the fallback, no?

the-mikedavis · 2025-01-16T16:54:32Z

I think the behavior change here is probably fine: we would always consider a [[language]]'s name to match in markdown codefences for example.

```<language-id>     # <- always matches [[language]] name field
.. injected code in here ..
```

The only thing to consider is whether there is a language where the injection_regex wouldn't accept the name.

I looked more into the change I was referring to above and it seems like an easy win and leads to other nice refactors so I'll post a patch for it. It probably steals most of the performance gains from this patch in the common case but this patch would still be valuable for the sake of injections that use @injection.language. (edit: #12561)

RoloEdits · 2025-01-16T19:11:50Z

The only thing to consider is whether there is a language where the injection_regex wouldn't accept the name.

Like exactly, or using the closest/(longest?) match logic now? For example, protobuf would be the name, but it wouldnt be an exact match in the injection regex, where thats just proto.

the-mikedavis · 2025-01-17T14:52:33Z

The "proto" regex would match "protobuf" I believe since, IIRC, we don't check that it's an exact match (i.e. ^proto$)

perf(syntax): short-circuit if name matches language_id

cb8aa0c

pascalkuthe approved these changes Jan 5, 2025

View reviewed changes

the-mikedavis reviewed Jan 5, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(syntax): short-circuit if name matches `language_id` #12407

perf(syntax): short-circuit if name matches `language_id` #12407

RoloEdits commented Jan 4, 2025 •

edited

Loading

pascalkuthe left a comment

RoloEdits commented Jan 5, 2025 •

edited

Loading

pascalkuthe commented Jan 5, 2025

RoloEdits commented Jan 5, 2025

the-mikedavis left a comment

RoloEdits commented Jan 5, 2025 •

edited

Loading

RoloEdits commented Jan 15, 2025

RoloEdits commented Jan 15, 2025

the-mikedavis commented Jan 16, 2025 •

edited

Loading

RoloEdits commented Jan 16, 2025

the-mikedavis commented Jan 17, 2025

perf(syntax): short-circuit if name matches language_id #12407

Are you sure you want to change the base?

perf(syntax): short-circuit if name matches language_id #12407

Conversation

RoloEdits commented Jan 4, 2025 • edited Loading

pascalkuthe left a comment

Choose a reason for hiding this comment

RoloEdits commented Jan 5, 2025 • edited Loading

pascalkuthe commented Jan 5, 2025

RoloEdits commented Jan 5, 2025

the-mikedavis left a comment

Choose a reason for hiding this comment

RoloEdits commented Jan 5, 2025 • edited Loading

RoloEdits commented Jan 15, 2025

RoloEdits commented Jan 15, 2025

the-mikedavis commented Jan 16, 2025 • edited Loading

RoloEdits commented Jan 16, 2025

the-mikedavis commented Jan 17, 2025

perf(syntax): short-circuit if name matches `language_id` #12407

perf(syntax): short-circuit if name matches `language_id` #12407

RoloEdits commented Jan 4, 2025 •

edited

Loading

RoloEdits commented Jan 5, 2025 •

edited

Loading

RoloEdits commented Jan 5, 2025 •

edited

Loading

the-mikedavis commented Jan 16, 2025 •

edited

Loading