Treat private-use characters like non-printable characters for escaping #4015

RoelN · 2025-01-08T13:42:35Z

Specification
Tests
Dart Sass
Website

The following code:

$one: \31;
$pua: \e000;

div::before {
  content: unquote("\"#{$one}\"");
}

div::after {
  content: unquote("\"#{$pua}\"");
}

results in the following CSS:

div::before {
  content: "\31 ";
}

div::after {
  content: "\e000";
}

Note the space added after \31. It is added for the Unicode escape of the digit 1, but not for the escape of a character in the Private Use Area (PUA). The latter is correct: there shouldn't be a trailing space.


ntkme · 2025-01-08T18:19:24Z

It seems to be a parser bug instead of a serializer bug: https://sass-lang.com/playground/#eJwzNHQoLU5VUCpOLC62Ki4pysxLV7LmUkm0UogxAAJjQ2suh5TUpNJ0BYikXk5qXnpJhoZKoiZMpLA0vyQVJADUlwTWh1tXErquJJCuZKAu3HqS0fUkg/SkAPWkAl2IS1cKuq4UTWsAkQNLNQ==?s=L1C1-L9C43

\31 is parsed to have a string length of 4: it is parsed into an unquoted string \31 with an extra space at the end.

The proper behavior here should be that, at the parse phase, \31 is deserialized as the single-character string 1, and it should not get escaped at all during serialization.

ntkme · 2025-01-08T18:56:23Z

Likely root cause is at this line: https://github.com/sass/dart-sass/blob/9e6e3bfbd28fa07bd0df63cfcb85d2db9ef9b6c2/lib/src/parse/parser.dart#L472

There is special logic such that, when parsing a string as an identifier, an already-escaped ASCII digit 0-9 (\30 - \39) is explicitly parsed as the escape sequence \30 - \39 followed by a space.

@nex3 I wonder why this special treatment is done during the parsing phase. Maybe because an escaped 0-9 indicates that this string must always be an identifier? Shouldn't the special escape for digits at the beginning of an identifier be applied at serialization, based on whether it's outputting an identifier or not?

nex3 · 2025-01-09T00:06:08Z

The issue here isn't that the space is being added after \31, it's that it's not being added after \e000. See Consuming an Escaped Code Point:

Otherwise, if codepoint is a non-printable code point, U+0009 CHARACTER TABULATION, U+000A LINE FEED, U+000D CARRIAGE RETURN, or U+000C FORM FEED; or if codepoint is a digit and the start flag is set:

Let code be the lowercase hexadecimal representation of codepoint, with no leading 0s.

Return "\" + code + " ".
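The quoted branch of the rule can be sketched in Python as follows. This is an illustration rather than Dart Sass's actual code: the function names are hypothetical, the definition of "non-printable code point" is the one from CSS Syntax Level 3, and only the branch quoted above is modeled.

```python
def is_non_printable(cp: int) -> bool:
    """CSS Syntax Level 3 definition of a non-printable code point."""
    return cp <= 0x08 or cp == 0x0B or 0x0E <= cp <= 0x1F or cp == 0x7F


def needs_hex_escape(cp: int, start: bool) -> bool:
    """True when the quoted branch of the rule applies to `cp`."""
    if is_non_printable(cp) or cp in (0x09, 0x0A, 0x0C, 0x0D):
        return True
    # A digit is only escaped when it would begin the identifier.
    return start and 0x30 <= cp <= 0x39


def hex_escape(cp: int) -> str:
    """Backslash, lowercase hex with no leading zeros, then one space."""
    return "\\" + format(cp, "x") + " "


print(repr(hex_escape(0x31)))                # '\\31 '
print(needs_hex_escape(0x31, start=True))    # True
print(needs_hex_escape(0xE000, start=True))  # False
```

Under this reading, U+E000 never reaches the hex-escape branch, which is why no trailing space appears after it.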

The space itself is part of the CSS syntax for escape codes (see § 4.3.7. Consume an escaped code point). We want to include this consistently in the canonicalized format of parsed identifiers so that equivalent identifiers are always equal, while also ensuring that there isn't weird behavior like the identifier form of 1a being one character longer than the identifier form of 1x.

Edit: Sorry, I'm wrong in that the Sass spec does not mandate a space after \e000 because it's not considered a "non-printable code point". Technically, according to the spec, the canonical form of \e000 should be  (that is, the literal U+E000 PRIVATE USE AREA code point). I think that's not a desirable behavior, though; we should define private-use characters to be considered "non-printable" for this purpose. I'm going to move this to the spec repo accordingly.

ntkme · 2025-01-09T00:19:38Z

As far as I know the space is optional, and is only required if the next token is a space character or hex character?

nex3 · 2025-01-09T00:26:12Z

That's right, but because the way it's canonicalized is observable—the SassScript value of the identifier \31 is a four-character unquoted string containing [\, 3, 1, ]—we want to make the canonical form as consistent as possible.
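A small Python sketch of the trade-off, under the assumption that only the leading character of the identifier is escaped. Both helper names are hypothetical. It compares a "minimal" form that adds the space only when CSS requires it with a "consistent" form that always adds it.

```python
HEX_OR_WHITESPACE = set("0123456789abcdefABCDEF \t\n\r\f")


def escape_first_char_minimal(ident: str) -> str:
    """Escape the first character, adding a space only when CSS requires it."""
    escape = "\\" + format(ord(ident[0]), "x")
    rest = ident[1:]
    if rest and rest[0] in HEX_OR_WHITESPACE:
        escape += " "
    return escape + rest


def escape_first_char_consistent(ident: str) -> str:
    """Escape the first character, always followed by a space."""
    return "\\" + format(ord(ident[0]), "x") + " " + ident[1:]


for ident in ("1a", "1x"):
    print(len(escape_first_char_minimal(ident)),
          len(escape_first_char_consistent(ident)))
# 5 5
# 4 5
```

With the minimal form, the stored text for 1a is one character longer than the stored text for 1x; with the consistent form, both have the same length.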

This is a downstream effect of the way we use the "unquoted string" datatype to represent not just identifiers but any CSS value we don't have a dedicated type for, including things like plain-CSS functions and so on. Where quoted strings just store their semantic values, unquoted strings store their syntactic values. This means that identifiers are stored escaped, so we need to be consistent about how they're escaped so we don't have weird issues where, for example, \@x, \40 x, and \40x are all treated as different values despite being semantically identical. This is what the Identifier Escapes proposal was all about.

ntkme · 2025-01-09T03:43:04Z

It seems to me that there is a limitation: we cannot clearly tell the difference between an unquoted identifier string and an unquoted non-identifier string, and that's why we parse it without decoding the escape sequence, to force it to be output as is.

My question is why we cannot always decode the escape sequence during the parsing stage (the example in question is not a Sass value from a JS script but a value directly in the Sass source input). In other words, parse \31 as the unquoted string 1 during the parse stage and later, during output serialization, print 1, \31, or \31 followed by a space, depending on the context in which we are writing the CSS output?

RoelN · 2025-01-09T14:27:32Z

Thanks for looking into this. Please note that the space seems to be removed in CSS when concatted with text: https://codepen.io/RoelN/pen/zxORvxV

But then again I'd expect the output of

$one: \31;

div::before {
  content: unquote("\"#{$one}#{$one}#{$one}\"");
}

to be

div::before {
  content: "\31\31\31 ";
}

(with or without the trailing space)

and not

div::before {
  content: "\31 \31 \31 ";
}

nex3 · 2025-01-11T00:23:57Z

I actually investigated this more on Wednesday and concluded that my proposed solution wasn't quite right, and in fact was based on a misunderstanding of the actual behavior being shown above. $pua: \e000 isn't being parsed as a five-code-point raw escape string; it's being parsed as an unquoted string containing the single code point U+E000. The fake-quoted-string is serialized as "\e000" because the output style is expanded; if you switch it to compressed, it's serialized as "" instead. As such, the current behavior is actually correct: private-use characters don't have any special parsing considerations, and that doesn't need any changes.

> It seems to me that there is a limitation: we cannot clearly tell the difference between an unquoted identifier string and an unquoted non-identifier string, and that's why we parse it without decoding the escape sequence, to force it to be output as is.
>
> My question is why we cannot always decode the escape sequence during the parsing stage (the example in question is not a Sass value from a JS script but a value directly in the Sass source input). In other words, parse \31 as the unquoted string 1 during the parse stage and later, during output serialization, print 1, \31, or \31 followed by a space, depending on the context in which we are writing the CSS output?

Because we use unquoted strings to represent both identifiers and CSS values we otherwise don't parse. We need to be able to distinctly represent the value url\28 foo\29 (a valid CSS identifier) and url(foo) (a valid CSS URL expression), and in order to do that consistently we have to store them as the raw text that's used in CSS rather than as the semantic value of an identifier.
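To make the collision concrete, here is a rough Python sketch of what eager decoding would do. The regex and the `decode_escapes` helper are assumptions made for illustration (they ignore edge cases such as out-of-range code points) and are not how Sass is implemented.

```python
import re

# A backslash followed by 1-6 hex digits and one optional whitespace
# character, or by any single non-newline character.
ESCAPE = re.compile(r"\\(?:([0-9a-fA-F]{1,6})[ \t\n\r\f]?|([^\n]))")


def decode_escapes(raw: str) -> str:
    """Replace each CSS escape in `raw` with the code point it denotes."""
    def replace(match):
        hex_digits, literal = match.groups()
        return chr(int(hex_digits, 16)) if hex_digits else literal
    return ESCAPE.sub(replace, raw)


raw_identifier = r"url\28 foo\29"  # a valid CSS identifier
raw_url = "url(foo)"               # a CSS URL expression

print(raw_identifier == raw_url)                                  # False
print(decode_escapes(raw_identifier) == decode_escapes(raw_url))  # True
```

Stored as raw text the two values stay distinct; stored as decoded code points they would become indistinguishable.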

If I had the language to design over again, I might split up the "unquoted string" data type into separate "identifier string" and "raw string" types, where "identifier string" works like quoted strings and contains semantic code point values and "raw string" works like unquoted strings do today and is a purely syntactic representation. But I think making such a fundamental change at this point would be a lot more trouble than it would be worth.

> Please note that the space seems to be removed in CSS when concatted with text: https://codepen.io/RoelN/pen/zxORvxV

That's because the space is not part of the string. It's part of the escape code. See the railroad diagram for an escape code, which includes an optional trailing whitespace character.
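A minimal Python scanner, written only to illustrate this point, shows how a CSS consumer reads such escapes. The function name is hypothetical, and the sketch deliberately ignores other string rules such as backslash-newline line continuations.

```python
HEX = "0123456789abcdefABCDEF"
WHITESPACE = " \t\n\r\f"


def decode_css_string(body: str) -> str:
    """Decode escapes in the contents of a CSS string (without the quotes)."""
    out = []
    i = 0
    while i < len(body):
        if body[i] != "\\" or i + 1 == len(body):
            out.append(body[i])
            i += 1
            continue
        i += 1  # skip the backslash
        start = i
        while i < len(body) and i - start < 6 and body[i] in HEX:
            i += 1
        if i == start:
            # Not a hex escape: the next character stands for itself.
            out.append(body[i])
            i += 1
            continue
        out.append(chr(int(body[start:i], 16)))
        # One whitespace character after the hex digits belongs to the escape.
        if i < len(body) and body[i] in WHITESPACE:
            i += 1
    return "".join(out)


print(decode_css_string(r"\31 \31 \31 "))  # 111
print(decode_css_string(r"\31\31\31 "))    # 111
```

Both spellings decode to the same three characters, so a browser renders them identically.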

Note that this unquote("\"#{...}\"") construct you're doing is very strange and almost certainly unnecessary for whatever your actual use-case is. It's likely that just building a string out of quoted strings will get you the result you want much more straightforwardly and with less confusion around the way unquoted identifiers are parsed.

RoelN · 2025-01-15T12:44:57Z

Thanks for the explanation!

Just to be absolutely 100% sure, there is no way to avoid variables being parsed? So \61 will always be turned into an a?

See this playground:

This

$aaa: "\61";

div::before {
  content: $aaa;
}

will always parse to

div::before {
  content: "a";
}

and we can never create the following from Sass:

div::before {
  content: "\61";
}

My use case, if it helps: I want to produce CSS where all Unicode values are readable, so developers can see the actual hex value, like \61, \1000, \1f603, etc. If these get turned into a, က and 😃 these values are obscured.

nex3 · 2025-01-22T02:07:47Z

@RoelN tl;dr: sass/dart-sass#568 is probably the best solution to your issue.

Sass stores (quoted) string values as their semantic contents, so from Sass's perspective there's literally no difference between the value "\61" and the value "a", or between "\1f603" and "😃". Ultimately the decision of how to serialize these strings is purely presentational; it doesn't affect the semantics of the CSS at all, just how legible it is one way or another to a human reader. We made the decision to serialize all printable, non-private-use non-ASCII characters as real Unicode because serializing as escapes both results in a larger output size and completely illegible text in languages that use non-ASCII characters.

The ideal way to solve this is to add a flag that tells Sass to emit non-ASCII code points as escapes rather than literal characters, which is covered by sass/dart-sass#568. If you don't want to wait for (or contribute to) that issue, the next best way is probably to have a post-processing step that makes this change outside of Sass.
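As a rough idea of what such a post-processing step could look like, here is a Python sketch. The `escape_non_ascii` name is hypothetical, and the sketch assumes it is acceptable to rewrite every non-ASCII code point in the compiled CSS text, including any that appear in comments or identifiers.

```python
HEX_OR_WHITESPACE = set("0123456789abcdefABCDEF \t\n\r\f")


def escape_non_ascii(css: str) -> str:
    """Replace every non-ASCII code point in `css` with a CSS hex escape."""
    out = []
    for i, ch in enumerate(css):
        if ord(ch) < 0x80:
            out.append(ch)
            continue
        escape = "\\" + format(ord(ch), "x")
        following = css[i + 1 : i + 2]
        # A separating space is needed only when the next character would
        # otherwise be read as part of the escape.
        if following and following in HEX_OR_WHITESPACE:
            escape += " "
        out.append(escape)
    return "".join(out)


print(escape_non_ascii('div::before { content: "😃"; }'))
# div::before { content: "\1f603"; }
```

Note that this cannot bring back escapes for ASCII characters such as \61, because by the time the CSS is emitted nothing records that the a was originally written as an escape.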

RoelN · 2025-01-22T08:29:12Z

@nex3 I appreciate the thorough responses! Thanks for helping me understand what's going on. I'll poll with the team on how we want to deal with this, but it looks like Sass' default approach is solid and we'll probably stick to that.

nex3 added the bug label Jan 9, 2025

nex3 self-assigned this Jan 9, 2025

nex3 removed the bug label Jan 9, 2025

nex3 transferred this issue from sass/dart-sass Jan 9, 2025

nex3 changed the title from "Trailing space added for Unicode values for numbers" to "Treat private-use characters like non-printable characters for escaping" Jan 9, 2025

This was referenced Jan 9, 2025

Treat private-use characters like non-printable characters for escaping sass/dart-sass#2481

Closed

Add tests for treating private-use characters like non-printable characters for escaping sass/sass-spec#2043

Closed

nex3 mentioned this issue Jan 9, 2025

Document private-use characters like non-printable characters for escaping sass/sass-site#1295

Closed

nex3 closed this as not planned Jan 11, 2025
