Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(deps): bumping regex-syntax #320

Merged
merged 4 commits into from
Feb 7, 2024

Conversation

jeertmans
Copy link
Collaborator

@jeertmans jeertmans commented Jun 26, 2023

The regex-syntax introduces breaking changes between 0.6 and 0.8.

This PR tries to bump the regex-syntax dependency, to make Logos a bit more future-proof regarding that :-)

Major changes

Since regex-syntax seems to do some optimization on the HIR, some patterns see their priority be affected by this.

E.g.: "a|b" (priority = 2) is optimized into "[a-b]" (priority = 2).

Those optimizations also result in better performances for Logos, so I suggest to not go against them :-) See benchmark below.

As a result, this PR proposes to change the priority of classes, so that equivalent regex patterns have the same priority. This "rule" is tested in logos-codegen/src/mir.rs.

This is a breaking change, but will hopefully never change in the future, as regex HIRs cannot be optimized in something else than an equivalent pattern.

NOTE: this should be documented in the release notes, as well as in the book. Maybe after #319 is merged?

@jeertmans jeertmans marked this pull request as draft June 26, 2023 11:04
@jeertmans jeertmans changed the title chore(deps): chore(deps): bumping regex-syntax Jun 26, 2023
logos-codegen/src/mir.rs Outdated Show resolved Hide resolved
logos-codegen/src/mir.rs Outdated Show resolved Hide resolved
logos-codegen/src/mir.rs Outdated Show resolved Hide resolved
@hellow554
Copy link

hellow554 commented Jun 26, 2023

Funny enough: I also tried to update all dependencies in the logos eco system (not just regex-syntax).
Now only the test fail for me and this is two folded.

First, literals are not just one byte/char at a time, but instead a whole string/byte slice.

Therefore the following line should be changed as following:

Mir::Literal(_) => 2,

Mir::Literal(lit) => 2 * lit.0.len()

Second, regex-syntax does some crazy optimization, e.g. a|b is not an alternation, but instead a class, e.g. a..=b and this is correct. Even a|c is not an alternation, but a Class of a..=a, c..=c, which is all correct, but this means that some calculations are in the test cases are "wrong".

I'm not entirely sure, if they should be dropped or relaxed, but this works:

--- a/logos-codegen/src/graph/regex.rs
+++ b/logos-codegen/src/graph/regex.rs
@@ -207,7 +206,7 @@ mod tests {

         let mir = Mir::utf8("a|b").unwrap();

-        assert_eq!(mir.priority(), 2);
+        assert!((1..=2).contains(&mir.priority()));

diff --git a/logos-codegen/src/mir.rs b/logos-codegen/src/mir.rs
index 18a6307..5f258e8 100644
--- a/logos-codegen/src/mir.rs
+++ b/logos-codegen/src/mir.rs
@@ -207,17 +204,17 @@ mod tests {
     #[test]
     fn priorities() {
         let regexes = [
-            ("[a-z]+", 1),
-            ("a|b", 2),
-            ("a|[b-z]", 1),
-            ("(foo)+", 6),
-            ("foobar", 12),
-            ("(fooz|bar)+qux", 12),
+            ("[a-z]+", (1..=1)),
+            ("a|b", (1..=2)),
+            ("a|[b-z]", (1..=1)),
+            ("(foo)+", (6..=6)),
+            ("foobar", (12..=12)),
+            ("(fooz|bar)+qux", (12..=12)),
         ];

         for (regex, expected) in regexes.iter() {
             let mir = Mir::utf8(regex).unwrap();
-            assert_eq!(mir.priority(), *expected);
+            assert!(expected.contains(&mir.priority()));
         }
     }
 }

Comment on lines 72 to 76
Mir::Literal(lit) => {
match std::str::from_utf8(&lit.0) {
Ok(s) => 2 * s.len(),
Err(_) => 2 * lit.0.len(),
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you're doing this because of unicode strings?
Is it possible to mix string and byte pattern in logos?
If so, can you provide a testcase where this is covered?

If not, I think doing the simple solution, e.g. 2 * lit.0.len() is enough here. But I don't have the last word of course ;)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I should come with a test for that :-)

But I did that because Mir::Literal can be constructed either from a char or from an u8. But a single char, which should have priority 2, can be made of up to 4 bytes. So using 2 * lit.0.len() for such a char could make the priority much higher. I will try to write some tests for that :-)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh I made the dummy mistake, but str's length should be s.chars().count() :'-)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then we talk about grapheme clusters? I mean could either be just or e~.
Please take a look at: https://stackoverflow.com/questions/58770462/how-to-iterate-over-unicode-grapheme-clusters-in-rust

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talk about unicode chars, see regex#54.

I don't think regex-syntax or logos supports grapheme clusters, so I don't see why we should support them now?

@jeertmans
Copy link
Collaborator Author

Funny enough: I also tried to update all dependencies in the logos eco system (not just regex-syntax). Now only the test fail for me and this is two folded.

First, literals are not just one byte/char at a time, but instead a whole string/byte slice.

Therefore the following line should be changed as following:

Mir::Literal(_) => 2,

Mir::Literal(lit) => 2 * lit.0.len()

Second, regex-syntax does some crazy optimization, e.g. a|b is not an alternation, but instead a class, e.g. a..=b and this is correct. Even a|c is not an alternation, but a Class of a..=a, c..=c, which is all correct, but this means that some calculations are in the test cases are "wrong".

I'm not entirely sure, if they should be dropped or relaxed, but this works:

--- a/logos-codegen/src/graph/regex.rs
+++ b/logos-codegen/src/graph/regex.rs
@@ -207,7 +206,7 @@ mod tests {

         let mir = Mir::utf8("a|b").unwrap();

-        assert_eq!(mir.priority(), 2);
+        assert!((1..=2).contains(&mir.priority()));

diff --git a/logos-codegen/src/mir.rs b/logos-codegen/src/mir.rs
index 18a6307..5f258e8 100644
--- a/logos-codegen/src/mir.rs
+++ b/logos-codegen/src/mir.rs
@@ -207,17 +204,17 @@ mod tests {
     #[test]
     fn priorities() {
         let regexes = [
-            ("[a-z]+", 1),
-            ("a|b", 2),
-            ("a|[b-z]", 1),
-            ("(foo)+", 6),
-            ("foobar", 12),
-            ("(fooz|bar)+qux", 12),
+            ("[a-z]+", (1..=1)),
+            ("a|b", (1..=2)),
+            ("a|[b-z]", (1..=1)),
+            ("(foo)+", (6..=6)),
+            ("foobar", (12..=12)),
+            ("(fooz|bar)+qux", (12..=12)),
         ];

         for (regex, expected) in regexes.iter() {
             let mir = Mir::utf8(regex).unwrap();
-            assert_eq!(mir.priority(), *expected);
+            assert!(expected.contains(&mir.priority()));
         }
     }
 }

Oh yeah... this optimization is quite annoying, and seems to be caused by the AST to HIR translator, see https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=6e993bda9d1d375ea2dd0bd3927a38a6.

If think there are three possible solutions to this:

  • revert back to regex-syntax v0.6 and never upgrade;
  • update the class' priority to 2, which is a breaking change, but satisfy the rule that "equivalent regexes the same priority". I worked a bit on that, but a lot of tests must be updated;
  • use regex-syntax's AST, which seems to be not optimized, instead of the HIR, or use a custom AST to HIR translator (I am not sure that optimization can be turned off).

It is also important to check whether regex-syntax's optimizations are actually impacting (or not) Logos' performances :'-)

@jeertmans
Copy link
Collaborator Author

Ok so I did a small benchmark to compare equivalent patterns, and it seems that optimizing alternations into classes (at least a range) can improve performances, see this example:

image

@hellow554
Copy link

can improve performances

Yeah I thought so ;) Why should they do that, if not for performance :D

I mean, honestly: There's no reason for us so stick to "old" behavoir. We just need to make sure, that priorities do not conflict with each other. And I think we can do this with more tests, but IMHO we shouldn't block the upgrade just because old tests are not working.

@jeertmans
Copy link
Collaborator Author

can improve performances

Yeah I thought so ;) Why should they do that, if not for performance :D

True! But optimizations in the regex-syntax crate do not necessarily translate into performance gains with Logos, so testing is always good ;-)

I mean, honestly: There's no reason for us so stick to "old" behavoir. We just need to make sure, that priorities do not conflict with each other. And I think we can do this with more tests, but IMHO we shouldn't block the upgrade just because old tests are not working.

I agree, but this is a major change, and hopefully we should be able to produce a stable result such that "equivalent regex patterns have the same priority". Maybe it is as simple as multiplicating the class priority by 2 ^^'

Anyway, I'd like to have @maciejhirsz's opinion on that too.

@jeertmans jeertmans force-pushed the bump-regex-syntax branch from b055070 to 22a13f7 Compare June 27, 2023 08:53
@jeertmans jeertmans marked this pull request as ready for review June 27, 2023 09:58
logos/src/lib.rs Outdated
@@ -231,7 +231,7 @@ pub trait Logos<'source>: Sized {
/// enum Token<'a> {
/// // We will treat "abc" as if it was whitespace.
/// // This is identical to using `logos::skip`.
/// #[regex(" |abc", |_| Skip)]
/// #[regex(" |abc", |_| Skip, priority = 3)]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why this is necessary? Is this because the alternate form gets shortened to a class?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, this is because the next pattern has seen its priority go from 1 to 2. So we increase this priority to have the same behaviour as before.

@hellow554
Copy link

Your commits seem to be messed up. You have duplicate commits.

Could you rebase your commits, or maybe squash them (I don't like squashing, but it's okay to use).

@@ -71,8 +68,11 @@ impl Mir {
Mir::Empty | Mir::Loop(_) | Mir::Maybe(_) => 0,
Mir::Concat(concat) => concat.iter().map(Mir::priority).sum(),
Mir::Alternation(alt) => alt.iter().map(Mir::priority).min().unwrap_or(0),
Mir::Class(_) => 1,
Mir::Literal(_) => 2,
Mir::Class(_) => 2,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need to adjust Class to something different.
There is the iter function on ClassUnicode and those items have a len method.
Maybe that is what we want instead of a hard coded class?! I'm not really sure tbh.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it will have a different result: a non-empty class means that at it must match one element from the class. All "special" characters are escaped in a class, and you can only match one byte or one char → priority is 2

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm not really sure what you need to adjust all the priorities.
Can you explain that please?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because bumping Mir::Class's priority from 1 to 2 broke a lot of tests or regexes, and we now need to specify the priority in such cases

@jeertmans jeertmans merged commit a0653b7 into maciejhirsz:master Feb 7, 2024
18 checks passed
@jeertmans jeertmans deleted the bump-regex-syntax branch February 7, 2024 09:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants