Treat private-use characters like non-printable characters for escaping #4015

Closed
4 tasks
RoelN opened this issue Jan 8, 2025 · 11 comments
@RoelN

RoelN commented Jan 8, 2025


Example in playground

The following code:

$one: \31;
$pua: \e000;

div::before {
  content: unquote("\"#{$one}\"");
}

div::after {
  content: unquote("\"#{$pua}\"");
}

results in the following CSS:

div::before {
  content: "\31 ";
}

div::after {
  content: "\e000";
}

Note the space added after \31. It is added when using the Unicode escape for the digit 1, but not for escapes of characters in the Private Use Area (PUA). The latter is correct; there shouldn't be a trailing space.

@ntkme
Contributor

ntkme commented Jan 8, 2025

It seems to be a parser bug instead of a serializer bug: https://sass-lang.com/playground/#eJwzNHQoLU5VUCpOLC62Ki4pysxLV7LmUkm0UogxAAJjQ2suh5TUpNJ0BYikXk5qXnpJhoZKoiZMpLA0vyQVJADUlwTWh1tXErquJJCuZKAu3HqS0fUkg/SkAPWkAl2IS1cKuq4UTWsAkQNLNQ==?s=L1C1-L9C43

\31 is parsed as a string of length 4: an unquoted string consisting of \31 plus an extra space at the end.

The proper behavior here would be for \31 to be deserialized during the parse phase as the single-character string 1, and for it not to be escaped at all during serialization.

@ntkme
Contributor

ntkme commented Jan 8, 2025

Likely root cause is at this line: https://github.com/sass/dart-sass/blob/9e6e3bfbd28fa07bd0df63cfcb85d2db9ef9b6c2/lib/src/parse/parser.dart#L472

There is special logic so that, when a string is parsed as an identifier, an already-escaped ASCII digit 0-9 (\30 - \39) is explicitly parsed as the escape \30 - \39 with a trailing space.

@nex3 I wonder why this special treatment is done during the parsing phase. Maybe because an escaped 0-9 indicates that this string must always be an identifier? Shouldn't the special escape for digits at the beginning of an identifier be applied at serialization, based on whether an identifier is being output or not?

@nex3
Contributor

nex3 commented Jan 9, 2025

The issue here isn't that the space is being added after \31, it's that it's not being added after \e000. See Consuming an Escaped Code Point:

  • Otherwise, if codepoint is a non-printable code point, U+0009 CHARACTER TABULATION, U+000A LINE FEED, U+000D CARRIAGE RETURN, or U+000C FORM FEED; or if codepoint is a digit and the start flag is set:

    • Let code be the lowercase hexadecimal representation of codepoint, with no leading 0s.

    • Return "\" + code + " ".

The space itself is part of the CSS syntax for escape codes (see § 4.3.7. Consume an escaped code point). We want to include this consistently in the canonicalized format of parsed identifiers so that equivalent identifiers are always equal, while also ensuring that there isn't weird behavior like the identifier form of 1a being one character longer than the identifier form of 1x.

Edit: Sorry, I'm wrong in that the Sass spec does not mandate a space after \e000 because it's not considered a "non-printable code point". Technically, according to the spec, the canonical form of \e000 should be the unescaped character itself (that is, the literal U+E000 PRIVATE USE AREA code point). I think that's not a desirable behavior, though; we should define private-use characters to be considered "non-printable" for this purpose. I'm going to move this to the spec repo accordingly.
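
For readers following along, the quoted branch of the rule can be sketched in Python. This is only an illustration of the excerpt above, not dart-sass's implementation: the function names are made up for this sketch, the set of non-printable code points is taken from the CSS Syntax Level 3 definition, and the other branches of the algorithm (name code points, plain backslash escapes) are deliberately left out.

```python
def is_non_printable(cp: int) -> bool:
    """Non-printable code point as defined by CSS Syntax Level 3."""
    return 0x00 <= cp <= 0x08 or cp == 0x0B or 0x0E <= cp <= 0x1F or cp == 0x7F


def needs_hex_escape(cp: int, at_start: bool) -> bool:
    """True when the quoted branch applies (hypothetical helper name)."""
    # Non-printable code points plus tab, line feed, form feed, carriage return.
    if is_non_printable(cp) or cp in (0x09, 0x0A, 0x0C, 0x0D):
        return True
    # A digit only needs the escape when it would start the identifier.
    return at_start and 0x30 <= cp <= 0x39


def canonical_hex_escape(cp: int) -> str:
    """Backslash + lowercase hex without leading zeros + one trailing space."""
    return "\\" + format(cp, "x") + " "
```

Under these assumptions the digit 1 at the start of an identifier takes the escaped form with its trailing space (four characters in total), while U+E000 falls outside the set, which matches the correction in the edit above.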

@nex3 nex3 added the bug label Jan 9, 2025
@nex3 nex3 self-assigned this Jan 9, 2025
@nex3 nex3 removed the bug label Jan 9, 2025
@nex3 nex3 transferred this issue from sass/dart-sass Jan 9, 2025
@nex3 nex3 changed the title from "Trailing space added for Unicode values for numbers" to "Treat private-use characters like non-printable characters for escaping" Jan 9, 2025
@ntkme
Contributor

ntkme commented Jan 9, 2025

As far as I know the space is optional, and only required if the next character is a space character or a hex digit?

@nex3
Contributor

nex3 commented Jan 9, 2025

That's right, but because the way it's canonicalized is observable (the SassScript value of the identifier \31 is a four-character unquoted string containing [\, 3, 1, space]), we want to make the canonical form as consistent as possible.

This is a downstream effect of the way we use the "unquoted string" datatype to represent not just identifiers but any CSS value we don't have a dedicated type for, including things like plain-CSS functions and so on. Where quoted strings just store their semantic values, unquoted strings store their syntactic values. This means that identifiers are stored escaped, so we need to be consistent about how they're escaped so we don't have weird issues where, for example, \@x and \64 x , and \64x are all treated as different values despite being semantically identical. This is what the Identifier Escapes proposal was all about.

@ntkme
Contributor

ntkme commented Jan 9, 2025

It seems to me that there is a limitation that we cannot clearly tell the difference between an unquoted identifier string and an unquoted non-identifier string, and that’s why we are just parsing it without decoding the escape sequence to force it to be output as is.

My question is: why can't we always decode the escape sequence during the parsing stage (the example in question is not a Sass value from a JS script but a value directly in the Sass source input)? In other words, parse \31 as the unquoted string 1 during the parse stage, and later, during output serialization, print either the string 1, \31, or \31 followed by a space, based on the context in which we are writing the CSS output?

@RoelN
Author

RoelN commented Jan 9, 2025

Thanks for looking into this. Please note that the space seems to be removed in CSS when concatenated with text: https://codepen.io/RoelN/pen/zxORvxV

But then again I'd expect the output of

$one: \31;

div::before {
  content: unquote("\"#{$one}#{$one}#{$one}\"");
}

to be

div::before {
  content: "\31\31\31 ";
}

(with or without the trailing space)

and not

div::before {
  content: "\31 \31 \31 ";
}

@nex3
Contributor

nex3 commented Jan 11, 2025

I actually investigated this more on Wednesday and concluded that my proposed solution wasn't quite right, and in fact was based on a misunderstanding of the actual behavior being shown above. $pua: \e000 isn't being parsed as a five-code-point raw escape string; it's being parsed as an unquoted string containing the single code point U+E000. The fake-quoted-string is serialized as "\e000" because the output style is expanded; if you switch it to compressed, it's serialized as a quoted string containing the literal U+E000 character instead. As such, the current behavior is actually correct: private-use characters don't have any special parsing considerations, and that doesn't need any changes.

It seems to me that there is a limitation that we cannot clearly tell the difference between an unquoted identifier string and an unquoted non-identifier string, and that’s why we are just parsing it without decoding the escape sequence to force it to be output as is.

My question is: why can't we always decode the escape sequence during the parsing stage (the example in question is not a Sass value from a JS script but a value directly in the Sass source input)? In other words, parse \31 as the unquoted string 1 during the parse stage, and later, during output serialization, print either the string 1, \31, or \31 followed by a space, based on the context in which we are writing the CSS output?

Because we use unquoted strings to represent both identifiers and CSS values we otherwise don't parse. We need to be able to distinctly represent the value url\28 foo\29 (a valid CSS identifier) and url(foo) (a valid CSS URL expression), and in order to do that consistently we have to store them as the raw text that's used in CSS rather than as the semantic value of an identifier.

If I had the language to design over again, I might split up the "unquoted string" data type into separate "identifier string" and "raw string" types, where "identifier string" works like quoted strings and contains semantic code point values and "raw string" works like unquoted strings do today and is a purely syntactic representation. But I think making such a fundamental change at this point would be a lot more trouble than it would be worth.

Please note that the space seems to be removed in CSS when concatenated with text: https://codepen.io/RoelN/pen/zxORvxV

That's because the space is not part of the string. It's part of the escape code. See the railroad diagram for an escape code, which includes an optional trailing whitespace character.
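
To make the "space belongs to the escape" point concrete, here is a rough Python sketch of how a CSS consumer decodes escapes. It follows the general shape of the CSS Syntax escape rule (up to six hex digits, then one optional whitespace character), but it is a simplification and an assumption-laden illustration: `decode_css_escapes` is a name invented for this sketch, it skips CSS input preprocessing, and it does not treat backslash-newline specially.

```python
import string

HEX_DIGITS = set(string.hexdigits)
WHITESPACE = set(" \t\n\r\f")


def decode_css_escapes(text: str) -> str:
    """Return the semantic value of text containing CSS escapes (simplified)."""
    out = []
    i = 0
    while i < len(text):
        if text[i] != "\\":
            out.append(text[i])
            i += 1
            continue
        i += 1  # skip the backslash
        if i >= len(text):
            out.append("\ufffd")  # backslash at end of input
            break
        if text[i] in HEX_DIGITS:
            start = i
            # Consume at most six hex digits.
            while i < len(text) and i - start < 6 and text[i] in HEX_DIGITS:
                i += 1
            cp = int(text[start:i], 16)
            # One whitespace character right after the digits is part of the escape.
            if i < len(text) and text[i] in WHITESPACE:
                i += 1
            if cp == 0 or 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
                out.append("\ufffd")
            else:
                out.append(chr(cp))
        else:
            # Any other escaped character stands for itself.
            out.append(text[i])
            i += 1
    return "".join(out)
```

With this sketch, `\31 \31 \31 ` and `\31\31\31 ` both decode to `111`, and `url\28 foo\29` decodes to the same text as `url(foo)`, which is why the two can only be told apart in their raw, escaped form. It also shows why the space is sometimes required: `\31a` decodes to the single code point U+031A, while `\31 a` decodes to `1a`.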

Note that this unquote("\"#{...}\"") construct you're doing is very strange and almost certainly unnecessary for whatever your actual use-case is. It's likely that just building a string out of quoted strings will get you the result you want much more straightforwardly and with less confusion around the way unquoted identifiers are parsed.

@nex3 nex3 closed this as not planned Jan 11, 2025
@RoelN
Author

RoelN commented Jan 15, 2025

Thanks for the explanation!

Just to be absolutely 100% sure, there is no way to avoid variables being parsed? So \61 will always be turned into an a?

See this playground:

This

$aaa: "\61";

div::before {
  content: $aaa;
}

will always parse to

div::before {
  content: "a";
}

and we can never create the following from Sass:

div::before {
  content: "\61";
}

My use case, if it helps: I want to produce CSS where all Unicode values are readable, so developers can see the actual hex value, like \61, \1000, \1f603, etc. If these get turned into a, က and 😃 these values are obscured.

@nex3
Contributor

nex3 commented Jan 22, 2025

@RoelN tl;dr: sass/dart-sass#568 is probably the best solution to your issue.

Sass stores (quoted) string values as their semantic contents, so from Sass's perspective there's literally no difference between the value "\61" and the value "a", or between "\1f603" and "😃". Ultimately the decision of how to serialize these strings is purely presentational; it doesn't affect the semantics of the CSS at all, just how legible it is one way or another to a human reader. We made the decision to serialize all printable, non-private-use non-ASCII characters as real Unicode because serializing as escapes both results in a larger output size and completely illegible text in languages that use non-ASCII characters.

The ideal way to solve this is to add a flag that tells Sass to emit non-ASCII code points as escapes rather than literal characters, which is covered by sass/dart-sass#568. If you don't want to wait for (or contribute to) that issue, the next best way is probably to have a post-processing step that makes this change outside of Sass.
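
A post-processing step of that kind could look roughly like the Python sketch below. It is not part of Sass and not an existing tool; `escape_non_ascii` is a hypothetical helper that rewrites every non-ASCII code point in the compiled CSS text as a hex escape, adding the separating space only when the next character would otherwise be absorbed into the escape. It works on the raw text, so it also rewrites non-ASCII characters in comments, and it leaves ASCII characters such as a untouched.

```python
import re
import string

NON_ASCII = re.compile(r"[^\x00-\x7f]")
HEX_DIGITS = set(string.hexdigits)
WHITESPACE = set(" \t\n\r\f")


def escape_non_ascii(css: str) -> str:
    """Replace each non-ASCII code point in css with a CSS hex escape."""

    def replace(match: re.Match) -> str:
        escape = "\\" + format(ord(match.group()), "x")
        following = css[match.end():match.end() + 1]
        # A hex digit or whitespace right after the escape would be read as
        # part of it, so terminate the escape with an explicit space.
        if following and (following in HEX_DIGITS or following in WHITESPACE):
            escape += " "
        return escape

    return NON_ASCII.sub(replace, css)
```

For example, a declaration whose value is the quoted 😃 character comes out with `\1f603` inside the quotes, and U+1000 followed by a comes out as `\1000 a`.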

@RoelN
Author

RoelN commented Jan 22, 2025

@nex3 I appreciate the thorough responses! Thanks for helping me understand what's going on. I'll poll with the team on how we want to deal with this, but it looks like Sass' default approach is solid and we'll probably stick to that.
