V3 file format #26

Open
Earthcomputer opened this issue Jan 17, 2025 · 11 comments

@Earthcomputer

Earthcomputer commented Jan 17, 2025

Improvements over V2:

  • Ability to uninline constants globally (e.g. Mth.PI).
    • Also the ability to restrict this to the scope of a package, class, or method (e.g. ClientboundCustomPayloadPacket.MAX_PAYLOAD_SIZE).
  • Ability to uninline mathematical expressions as well as constants (e.g. Mth.PI / 3, LENGTH - 1).
  • Ability to specify the radix of integers (decimal, hex, binary or octal).
    • This is not strictly constant uninlining but still fits within the broad category of cleaning up constant literals in decompiled code.
  • Ability to specify whether an integer should be formatted as a char in decompiled code.
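
As a rough illustration of the radix and char formatting options, here is a minimal Python sketch of how an implementation might render an integer as Java source text. The function name `format_int_literal`, the `bits` parameter, and the decimal fallback for values outside the char range are assumptions made for this example, not part of the proposal.

```python
def format_int_literal(value: int, fmt: str, bits: int = 32) -> str:
    """Render a Java int (or long when bits=64) as source text in the given format."""
    unsigned = value & ((1 << bits) - 1)  # two's-complement view for non-decimal radixes
    suffix = "L" if bits == 64 else ""
    if fmt == "decimal":
        return f"{value}{suffix}"
    if fmt == "hex":
        return f"0x{unsigned:x}{suffix}"
    if fmt == "binary":
        return f"0b{unsigned:b}{suffix}"
    if fmt == "octal":
        return ("0" if unsigned == 0 else f"0{unsigned:o}") + suffix
    if fmt == "char":
        if not 0 <= value <= 0xFFFF:
            return f"{value}{suffix}"  # assumption: fall back to decimal outside the char range
        ch = chr(value)
        escapes = {"'": "\\'", "\\": "\\\\", "\n": "\\n", "\t": "\\t", "\r": "\\r"}
        if ch in escapes:
            return f"'{escapes[ch]}'"
        if 0x20 <= value < 0x7F:
            return f"'{ch}'"  # printable ASCII is emitted as-is
        return f"'\\u{value:04x}'"  # everything else as a unicode escape
    raise ValueError(f"unknown format: {fmt}")


print(format_int_literal(-1, "hex"))   # 0xffffffff
print(format_int_literal(65, "char"))  # 'A'
```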

Example V3 file:

unpick v3

# apply to floats in the default group globally
const strict float
    3.1415927 = net.minecraft.util.Mth.PI
    1.0471976 = net.minecraft.util.Mth.PI / 3

# apply to strings in the default group globally
const String
    "1.21.4" = net.minecraft.SharedConstants.VERSION_STRING

# apply to ints in the default group within the scope of ClientboundCustomPayloadPacket
scoped class net.minecraft.network.protocol.common.ClientboundCustomPayloadPacket const int
    1048576 = net.minecraft.network.protocol.common.ClientboundCustomPayloadPacket.MAX_PAYLOAD_SIZE

# apply to ints with the group "ARGBColor"
const int ARGBColor
    format = hex # format unrecognized constants as hex
    0xffffffff = net.minecraft.util.CommonColors.WHITE
    0xff000000 = net.minecraft.util.CommonColors.BLACK

# apply the group "ARGBColor" to ints flowing into the 5th parameter of GuiGraphics.fill
target_method net.minecraft.client.gui.GuiGraphics fill (IIIII)V
    param 4 ARGBColor

# apply the group "ARGBColor" to ints flowing from the return of ARGB.color
target_method net.minecraft.util.ARGB color (IIII)I
    return ARGBColor

# apply to ints with the group "SetBlockFlag"
flag int SetBlockFlag
    3 = net.minecraft.world.level.block.Block.UPDATE_ALL
    1 = net.minecraft.world.level.block.Block.UPDATE_NEIGHBORS

target_field net.minecraft.core.particles.ColorParticleOption color I ARGBColor

Specification

General notes

  • Any character which is "ignored" may be removed from the file without changing the semantic meaning of the file:
    • Except for the first line, which is the version marker, any # character and all subsequent characters until the next new line or the end of the file are ignored.
    • Blank lines (including the new lines that may terminate them) are ignored.
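
A minimal Python sketch of this pre-processing step, assuming the rules are read literally (so a # inside a string literal would also start a comment, which a real lexer may want to treat differently). The function name `strip_ignored` is hypothetical.

```python
def strip_ignored(text: str) -> list[str]:
    """Drop comments and blank lines, keeping the version marker line untouched."""
    lines = text.splitlines()
    if not lines:
        return []
    kept = [lines[0]]  # the first line is the version marker and is never stripped
    for line in lines[1:]:
        code = line.split("#", 1)[0].rstrip()  # cut from the first '#' to end of line
        if code.strip():  # skip lines that are blank once the comment is gone
            kept.append(code)
    return kept


source = "unpick v3\n\n# a comment\nconst int ARGBColor\n    format = hex # trailing\n"
print(strip_ignored(source))
```

Leading whitespace is preserved because it is significant for the <Indent> token.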

Tokens

<Identifier> ::= [a-zA-Z_$] [a-zA-Z0-9_$]*
<Double> ::= [0-9]+ "." [0-9]+ ( ("e" | "E") ("+" | "-")? [0-9]+ )?
<Float> ::= <Double> ("f" | "F")
<Integer> ::= ([1-9] [0-9]*) | (("0x" | "0X") [0-9a-fA-F]+) | (("0b" | "0B") [01]+) | ("0" [0-7]*)
<Long> ::= <Integer> ("l" | "L")
<Char> ::= <JLS §3.10.4 Character Literals, with §3.3 Unicode Escapes being also considered as §3.10.6 Escape Sequences>
<String> ::= <JLS §3.10.5 String Literals, with §3.3 Unicode Escapes being also considered as §3.10.6 Escape Sequences>
<Indent> ::= <lookbehind NewLine> (" " | "\t")+
<NewLine> ::= "\n"   # or the platform new line sequence
<BitShiftLeft> ::= "<<"
<BitShiftRightUnsigned> ::= ">>>"
<BitShiftRight> ::= ">>"
<FieldDescriptor> ::= <JVMS §4.3.2 Field Descriptors>   # This token is a contextual token and is not parsed unless required by the grammar
<MethodDescriptor> ::= <JVMS §4.3.3 Method Descriptors>   # This token is a contextual token and is not parsed unless required by the grammar
<Other> ::= [^]

Whitespace is not part of the token stream unless it is part of the <Indent> token or <NewLine> token. Its only other purpose is to separate tokens that would otherwise be joined together as the same token. Otherwise it is ignored.
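
To illustrate the <Integer> token, here is a small Python sketch that recognises the four radixes and returns the numeric value. It assumes a bare 0 is a valid integer (it is needed for `param 0`), and it returns the unsigned value, so wrapping something like 0xffffffff to the Java int -1 is left to the caller. The names `INTEGER` and `parse_integer` are made up for this example.

```python
import re

# Hex, binary, octal (including a bare "0"), then decimal.
INTEGER = re.compile(r"0[xX][0-9a-fA-F]+|0[bB][01]+|0[0-7]*|[1-9][0-9]*")


def parse_integer(token: str) -> int:
    """Parse an <Integer> token into its (unsigned) numeric value."""
    if not INTEGER.fullmatch(token):
        raise ValueError(f"not an integer token: {token!r}")
    lower = token.lower()
    if lower.startswith("0x"):
        return int(lower[2:], 16)
    if lower.startswith("0b"):
        return int(lower[2:], 2)
    if len(token) > 1 and token[0] == "0":
        return int(token[1:], 8)  # a leading zero means octal, as in Java
    return int(token)


print(parse_integer("0xff000000"))  # 4278190080
print(parse_integer("017"))         # 15
```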

File structure

<UnpickV3File> ::= <VersionMarker> (<NewLine> <Item>)*
<VersionMarker> ::= "unpick" "v3"
<Item> ::= <GroupDefinition> | <TargetMethod> | <TargetField>

<GroupDefinition> ::= (<GroupScope>)? <GroupType> ("strict")? <DataType> (<Identifier>)? (<NewLine> <Indent> <GroupItem>)*
<GroupScope> ::= <GroupPackageScope> | <GroupClassScope> | <GroupMethodScope>
<GroupPackageScope> ::= "scoped" "package" <ClassName>
<GroupClassScope> ::= "scoped" "class" <ClassName>
<GroupMethodScope> ::= "scoped" "method" <ClassName> <MethodName> <MethodDescriptor>
<GroupType> ::= "const" | "flag"

<GroupItem> ::= <GroupFormat> | <GroupConst>
<GroupFormat> ::= "format" "=" <Format>
<Format> ::= "decimal" | "hex" | "binary" | "octal" | "char"

<GroupConst> ::= <GroupConstKey> "=" <Expression>
<GroupConstKey> ::= ("-"? (<Double> | <Integer>)) | <Char> | <String>

# Operator precedence is handled by the lexical structure, which results in the same precedence as in the Java language.
<Expression> ::= <BitOrExpression>
<BitOrExpression> ::= (<BitOrExpression> "|")? <BitXorExpression>
<BitXorExpression> ::= (<BitXorExpression> "^")? <BitAndExpression>
<BitAndExpression> ::= (<BitAndExpression> "&")? <BitShiftExpression>
<BitShiftExpression> ::= (<BitShiftExpression> (<BitShiftLeft> | <BitShiftRight> | <BitShiftRightUnsigned>))? <AdditiveExpression>
<AdditiveExpression> ::= (<AdditiveExpression> ("+" | "-"))? <MultiplicativeExpression>
<MultiplicativeExpression> ::= (<MultiplicativeExpression> ("*" | "/" | "%"))? <UnaryExpression>
<UnaryExpression> ::= <CastExpression> | (("-" | "~")? <PrimaryExpression>)
<CastExpression> ::= "(" <DataType> ")" <UnaryExpression>
<PrimaryExpression> ::= <ParenExpression> | <FieldExpression> | <LiteralExpression>
<ParenExpression> ::= "(" <Expression> ")"
<FieldExpression> ::= <ClassName> "." <Identifier> (":" "instance")? (":" <DataType>)?
<LiteralExpression> ::= <Double> | <Float> | <Integer> | <Long> | <Char> | <String>

<TargetMethod> ::= "target_method" <ClassName> <MethodName> <MethodDescriptor> (<NewLine> <Indent> <TargetMethodItem>)*
<TargetMethodItem> ::= <TargetMethodParam> | <TargetMethodReturn>
<TargetMethodParam> ::= "param" <Integer> <Identifier>
<TargetMethodReturn> ::= "return" <Identifier>

<TargetField> ::= "target_field" <ClassName> <Identifier> <FieldDescriptor> <Identifier>

<DataType> ::= "byte" | "short" | "int" | "long" | "float" | "double" | "char" | "String"
<ClassName> ::= (<Identifier> ".")* <Identifier>
<MethodName> ::= ("<" "init" ">") | ("<" "clinit" ">") | <Identifier>
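
As a sketch of how the <GroupDefinition> header line could be read, the following Python splits the line on whitespace and walks the optional scope, group type, `strict` marker, data type, and group name. It assumes comments have already been stripped and that `<init>` and method descriptors are written without internal spaces. `parse_group_header` and the returned dictionary shape are hypothetical.

```python
DATA_TYPES = {"byte", "short", "int", "long", "float", "double", "char", "String"}
SCOPE_OPERANDS = {"package": 1, "class": 1, "method": 3}  # operands after the scope kind


def parse_group_header(line: str) -> dict:
    """Parse the first line of a group definition into its parts."""
    tokens = line.split()
    pos = 0

    def take() -> str:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError(f"unexpected end of line: {line!r}")
        pos += 1
        return tokens[pos - 1]

    scope = None
    if tokens and tokens[0] == "scoped":
        take()
        kind = take()
        if kind not in SCOPE_OPERANDS:
            raise ValueError(f"unknown scope kind: {kind}")
        scope = (kind, *(take() for _ in range(SCOPE_OPERANDS[kind])))

    group_type = take()
    if group_type not in ("const", "flag"):
        raise ValueError(f"unknown group type: {group_type}")
    strict = pos < len(tokens) and tokens[pos] == "strict"
    if strict:
        take()
    data_type = take()
    if data_type not in DATA_TYPES:
        raise ValueError(f"unknown data type: {data_type}")
    name = take() if pos < len(tokens) else None  # missing name means the default group
    if pos != len(tokens):
        raise ValueError(f"trailing tokens in: {line!r}")
    return {"scope": scope, "group_type": group_type, "strict": strict,
            "data_type": data_type, "name": name}


print(parse_group_header("const strict float"))
print(parse_group_header("flag int SetBlockFlag"))
```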

Semantics

Group Definitions

  • A group definition defines a group, or adds to its definition if the group is already defined.
  • If the group name is missing, then the group definition adds to the definition of the default group.
  • If the data type is int or long, then the group type may be const or flag. If the data type is float, double, or String, then the group type may only be const. The data type cannot be byte, short, or char.
  • The default group must have const type.
  • It is an error for multiple group definitions of different types to have the same group name.
  • It is an error to define multiple different formats for the same group (name and scope pair).
  • It is an error for the same group (name and scope pair) to specify the same constant more than once.
    • If the same group specifies the same constant but at different scopes, the most specific scope takes precedence.
  • Group const keys must be strings if the data type of the group is String, integers or doubles if the data type of the group is float or double, and integers if the data type of the group is int or long.
  • If a group definition is strict, then the constant replacements contained within the group definition will only be applied to literals of that type. Otherwise, constant replacement will occur for all compatible literals.
    • A literal type is compatible with a group definition type if the group type is either the same type as the literal type, or could be converted to the literal type via a widening primitive conversion (JLS §5.1.2).
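
The strict and compatibility rules can be sketched in Python as follows. The table lists the widening primitive conversions from JLS §5.1.2 for the data types a group may have; the function name `applies_to_literal` is an assumption for this example.

```python
# Widening primitive conversions (JLS 5.1.2), restricted to possible group data types.
WIDENS_TO = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
    "double": set(),
    "String": set(),
}


def applies_to_literal(group_type: str, literal_type: str, strict: bool) -> bool:
    """Return True if a group of group_type may replace a literal of literal_type."""
    if group_type == literal_type:
        return True
    if strict:
        return False  # strict groups only match literals of exactly their own type
    return literal_type in WIDENS_TO[group_type]


print(applies_to_literal("float", "double", strict=False))  # True
print(applies_to_literal("float", "double", strict=True))   # False
```

With this reading, the `const strict float` group in the example file would replace 3.1415927f but leave a double literal with the same digits alone.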

Expressions

  • The semantics of all operators are the same as in the Java language. The + operator handles both addition and string concatenation, as in Java. The cast operator cannot be used to convert between strings and other types.
  • Operators perform widening primitive conversion (JLS §5.1.2) even when casts aren't explicitly specified, like they do in Java. This includes unary operators on the byte, short, and char types.
  • If a widening conversion is applied to a literal expression, then the type of the literal expression is changed to the widened type during constant substitution.
  • It is an error to use an operator on a data type that it would be incompatible with in Java.
  • If the data type of a field expression is unspecified, it defaults to the data type of the group (not the data type of the field in reality). This is to allow the unpick file to be applied without reference to the actual Java field.
  • For non-static constant fields, an :instance suffix can be added to a field expression. The instance of a constant expression is retained in the bytecode and the field access is replaced with a null check. A bytecode implementation of unpick could implement non-static field uninlining by looking for the following pattern in bytecode:
    • In Java 8 and below:
      • invokevirtual java/lang/Object.getClass()Ljava/lang/Class; pop; <load constant>
    • In Java 9 and above:
      • invokestatic java/util/Objects.requireNonNull(Ljava/lang/Object;)Ljava/lang/Object; pop; <load constant>
    • and replacing it with a getfield instruction, which should preserve the instance which should have been loaded onto the stack already before the null check.
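
A very rough Python sketch of that pattern match, operating on a made-up instruction representation (tuples of opcode and operands) rather than any real bytecode library. Here `ldc` stands in for every kind of `<load constant>` instruction, and the `fields` mapping from constant value to field reference is an assumption for illustration.

```python
GET_CLASS = ("invokevirtual", "java/lang/Object.getClass()Ljava/lang/Class;")
REQUIRE_NON_NULL = (
    "invokestatic",
    "java/util/Objects.requireNonNull(Ljava/lang/Object;)Ljava/lang/Object;",
)


def uninline_instance_fields(insns: list, fields: dict) -> list:
    """Replace '<null check>; pop; <load constant>' with a getfield instruction.

    insns:  list of instruction tuples, e.g. ("ldc", 42) or ("pop",)
    fields: constant value -> (owner, name, descriptor) of the instance field
    """
    out = []
    i = 0
    while i < len(insns):
        window = insns[i:i + 3]
        if (len(window) == 3
                and window[0] in (GET_CLASS, REQUIRE_NON_NULL)
                and window[1] == ("pop",)
                and window[2][0] == "ldc"
                and window[2][1] in fields):
            # The instance is already on the stack, so getfield can consume it.
            out.append(("getfield", *fields[window[2][1]]))
            i += 3
        else:
            out.append(insns[i])
            i += 1
    return out


# Hypothetical field com/example/Foo.ANSWER with the constant value 42.
code = [("aload", 1), GET_CLASS, ("pop",), ("ldc", 42), ("ireturn",)]
print(uninline_instance_fields(code, {42: ("com/example/Foo", "ANSWER", "I")}))
```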

Targets

  • Target method parameters are indexed starting from 0.
    • Parameters are not indexed by their bytecode local variable index as with some mapping formats. All parameters take 1 index.
  • Targets must be a compatible type with the type of the group they specify.
    • byte, short, int, long, float, and double are all compatible with each other.
    • char is compatible with byte, short, int, and long.
    • String is only compatible with itself.
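
Since parameters are indexed by position, a bytecode implementation has to translate them to local variable slots itself. A minimal Python sketch of that translation, with hypothetical helper names `parameter_types` and `local_variable_slot`:

```python
import re

# One parameter type inside a method descriptor: optional array prefix, then a
# primitive type letter or an object type of the form Lsome/Class;
_PARAM = re.compile(r"\[*(?:[BCDFIJSZ]|L[^;]+;)")


def parameter_types(descriptor: str) -> list[str]:
    """Split the parameter part of a method descriptor into individual types."""
    return _PARAM.findall(descriptor[1:descriptor.index(")")])


def local_variable_slot(descriptor: str, param_index: int, is_static: bool) -> int:
    """Convert a 0-based parameter index to a bytecode local variable slot."""
    types = parameter_types(descriptor)
    if not 0 <= param_index < len(types):
        raise IndexError(f"no parameter {param_index} in {descriptor}")
    slot = 0 if is_static else 1  # slot 0 holds 'this' for instance methods
    for t in types[:param_index]:
        slot += 2 if t in ("J", "D") else 1  # long and double take two slots
    return slot


# 'param 4' of GuiGraphics.fill (IIIII)V, an instance method, lives in slot 5.
print(local_variable_slot("(IIIII)V", 4, is_static=False))
```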

Class Names

  • Class names use a class' binary name (JLS §13.1), which uses . to separate package elements and $ to separate inner class names from outer class names. This format was chosen over the internal name to avoid potential confusion with the / division operator.
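
For implementations that work on bytecode, converting a binary name to the internal form is a simple replacement, sketched here in Python. The function names are made up for this example.

```python
def to_internal_name(binary_name: str) -> str:
    """Convert a binary name (JLS 13.1) to a JVM internal name."""
    return binary_name.replace(".", "/")  # '$' separators are left untouched


def to_field_descriptor(binary_name: str) -> str:
    """Build the field descriptor of a class type from its binary name."""
    return "L" + to_internal_name(binary_name) + ";"


print(to_internal_name("java.util.Map$Entry"))  # java/util/Map$Entry
```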

Link-time checking

Even though the V3 format is designed to be run without reference to the constant fields on the classpath, it may be useful to validate that unpick files will produce the expected outcome:

  • For each field expression, validate that it references an existing field of the (possibly implicit) type and staticness specified in the unpick file.
    • Also check that this field is final and is initialized to a compile time constant (JLS §15.28).
  • Evaluate each expression according to the rules of the Java language and check that they equal the group constant key.
    • For each occurrence of the / and % operators, validate that if both sides of the operator are integers, the right hand side does not evaluate to 0.
  • Even though target methods and fields are required to be present to apply an unpick file, it may be worth verifying their existence during these checks.
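
A small Python sketch of two of these checks: comparing an evaluated float expression against its key with float (32-bit) precision, and guarding integer division. Rounding through `struct` is used here to emulate Java's float type. The helper names are hypothetical, and 32-bit overflow in the integer case is not modelled.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (a double) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def float_key_matches(key_text: str, evaluated: float) -> bool:
    """Check that an evaluated float expression equals the group constant key."""
    return f32(float(key_text)) == f32(evaluated)


def java_int_div(lhs: int, rhs: int) -> int:
    """Integer division as in Java: truncates toward zero, rejects a zero divisor."""
    if rhs == 0:
        raise ZeroDivisionError("integer division by zero in unpick expression")
    quotient = abs(lhs) // abs(rhs)
    return quotient if (lhs < 0) == (rhs < 0) else -quotient


# "1.0471976 = net.minecraft.util.Mth.PI / 3", assuming Mth.PI is 3.1415927f.
mth_pi = f32(3.1415927)
print(float_key_matches("1.0471976", mth_pi / 3))  # True
```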

Application

Note that this section is only one example of how constant uninlining could be implemented with these files. Implementations are free to uninline in fewer or more places than specified here or to use different techniques. This section is meant to give a feel for how the file format is supposed to be interpreted and reasoned about, not to dictate how the implementation is supposed to look.

Following is a description of an algorithm to uninline constants in a method at the source code level. In practice it may be undesirable to only uninline a single method in isolation, due to the existence of inner methods (via anonymous or local classes).

Identify targets

Every expression and sub-expression in the syntax tree of the method body is associated with a group. In addition, every parameter and local variable is associated with a group. All expressions and variables are initially assigned to the default group. Then:

Enclosing method parameters

Check whether the enclosing method, or any method it overrides or implements, is a target method in the unpick file. For each target parameter, assign the group of that parameter in the method to the group specified in the unpick file.

Enclosing method return

Check whether the enclosing method, or any method it overrides or implements, is a target method in the unpick file. If the group of the return value of the target method is specified, then assign the group of every expression which is the argument of a return statement to the group specified in the unpick file.

Referenced fields

For every field reference in the method, check whether the referenced field is a target field in the unpick file. If it is, assign the group of the field reference expression to the group specified in the unpick file.

Referenced method parameters

For every method call expression and new expression in the method, check whether the referenced method or constructor is a target method in the unpick file, or whether the referenced method overrides or implements a target method in the unpick file. For each parameter of that method call, assign the group of the expression passed as that parameter to the group specified in the unpick file, if present.
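
A minimal Python sketch of this lookup for a single call site. The `targets` table shape, the `super_owners` argument (the classes declaring methods that the called method overrides or implements, supplied by the caller), and the function name are all assumptions for illustration. `None` stands for the default group.

```python
# Target methods from the example file, keyed by (class, method name, descriptor).
TARGETS = {
    ("net.minecraft.client.gui.GuiGraphics", "fill", "(IIIII)V"):
        {"params": {4: "ARGBColor"}, "return": None},
    ("net.minecraft.util.ARGB", "color", "(IIII)I"):
        {"params": {}, "return": "ARGBColor"},
}


def argument_groups(targets, owner, name, descriptor, arg_count, super_owners=()):
    """Return the group assigned to each argument of a call (None = default group)."""
    groups = [None] * arg_count
    for candidate in (owner, *super_owners):
        target = targets.get((candidate, name, descriptor))
        if target is None:
            continue
        for index, group in target["params"].items():
            if index < arg_count and groups[index] is None:
                groups[index] = group
    return groups


print(argument_groups(TARGETS, "net.minecraft.client.gui.GuiGraphics",
                      "fill", "(IIIII)V", 5))
# [None, None, None, None, 'ARGBColor']
```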

Referenced method returns

For every method call expression in the method, check whether the referenced method, or any method it overrides or implements, is a target method in the unpick file. If it is, and the group of the return value of the target method is specified, then assign the group of the method call expression to the group specified in the unpick file.

Pattern variables

For every pattern variable declared by destructuring a record, check whether the corresponding field or accessor method is a target field or method in the unpick file. If it is, assign the group of the variable to the group specified in the unpick file.

Propagate groups

In this step, the non-default groups of expressions are repeatedly propagated to other expressions until there are no further propagations to apply. If propagation assigns a group to an expression, but that expression is already assigned to a different non-default group, this is a group conflict. How to handle group conflicts is up to the implementation, but throwing an error or warning or assigning an arbitrary group are possible implementations.

Variable propagation

If a variable is assigned a group, then that group is propagated to all references to that variable. If a variable reference is assigned a group, then that group is assigned to all other references to that variable and the variable itself.

Variable declaration propagation

The group of a variable and the group of the expression assigned to it as part of its declaration are propagated to each other.

Operator propagation

  • For the following binary operators, the group of the left hand side, right hand side, and the whole expression are propagated to the other two expressions: =, +=, -=, *=, &=, |=, ^=, +, -, *, &, |, ^.
  • For the following binary operators, the group of the left hand side and the whole expression are propagated to each other: /=, %=, >>=, >>>=, <<=, /, %.
  • For the following binary operators, the group of the left hand side and right hand side are propagated to each other: ==, !=, >, <, >=, <=.
  • For the following unary operators, the group of the operand and the whole expression are propagated to each other: casts, +, -, ~, ++ (prefix and postfix), -- (prefix and postfix).
  • For the ternary operator, the group of the true expression, false expression, and the whole expression are propagated to the other two expressions.

Switch propagation

  • The group of the subject of a switch statement or expression is propagated to all the case labels of that switch statement or expression.
  • In switch expressions, the group of each expression value of a label, the expression argument to each yield statement, and the switch expression itself are propagated to each other.
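
One way to picture the propagation rules is as undirected edges between expressions and variables, with a worklist spreading non-default groups until nothing changes. The Python sketch below does this and collects group conflicts instead of deciding how to resolve them. The node names and the `propagate` function are made up for illustration.

```python
from collections import deque


def propagate(groups: dict, edges: list) -> tuple:
    """Spread non-default groups along edges until a fixed point is reached.

    groups: node -> group name, or None for the default group
    edges:  pairs of nodes whose groups are propagated to each other
    Returns the final assignment and a sorted list of conflicting node pairs.
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    result = dict(groups)
    conflicts = set()
    queue = deque(node for node, group in result.items() if group is not None)
    while queue:
        node = queue.popleft()
        for other in neighbours.get(node, []):
            if result.get(other) is None:
                result[other] = result[node]
                queue.append(other)
            elif result[other] != result[node]:
                conflicts.add(tuple(sorted((node, other))))
    return result, sorted(conflicts)


# int c = cond ? color : 0xff000000; graphics.fill(x1, y1, x2, y2, c);
groups = {"arg4": "ARGBColor", "var c": None, "ternary": None,
          "color ref": None, "literal": None}
edges = [("arg4", "var c"),        # variable propagation
         ("var c", "ternary"),     # variable declaration propagation
         ("ternary", "color ref"), # ternary: true branch
         ("ternary", "literal")]   # ternary: false branch
print(propagate(groups, edges)[0]["literal"])  # ARGBColor
```

Operators such as / only contribute an edge between the left hand side and the whole expression, so a divisor literal keeps the default group.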

Substitute constants

For each literal expression in the method body:

  • If the literal expression is directly inside a unary - expression, then apply the following steps on the unary expression rather than on the literal itself.
  • If the group type is const, or if the group type is flag and the literal is 0 or -1: if there are any substitutions (matching the scope of the current method) matching the value of the literal, then replace that literal expression with the expression specified by that substitution.
  • If the group type is flag and the literal is not 0 or -1:
    • Find the minimal set of substitutions (matching the scope of the current method) where the constant keys of those substitutions, when bitwise ORed together, produce as many bits of the literal as possible. Flags may cover more than one bit, so the "minimal set" part is important here. Let this set be called the "positive set" and the leftover bits not produced be called the "residual".
    • Find the minimal set of substitutions (matching the scope of the current method) where the constant keys of those substitutions, when bitwise ORed together, produce the bitwise inverse of the literal. Let this set be called the "negative set"; there may be no such set.
    • If the negative set exists, and either the residual is not 0 or the negative set is smaller than the positive set, then replace the literal expression with ~(expr1 | expr2 | ...) where expr1, expr2, etc are the expressions specified by the substitutions in the negative set.
    • Otherwise, replace the literal expression with expr1 | expr2 | ... | residual, where expr1, expr2, etc are the expressions specified by the substitutions in the positive set. Do not include the residual if it is 0, otherwise apply the format of the group to the residual if specified.
  • If the literal remains unsubstituted, apply the format of the group to the literal.
  • If the substituted expression is directly inside a cast expression, and the type (JLS §15) of the substituted expression is equal to the type of the cast expression, or is convertible to it via widening primitive conversion (JLS §5.1.2), then the cast may be replaced by the substituted expression. The cast cannot be replaced if it changes semantics, for example by changing the overload of a method call. A cast may also need to be added for the same reason.
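
The flag substitution steps above can be sketched in Python as a brute-force search over subsets, which is fine for small groups but not tuned for large ones. It assumes 32-bit ints, only considers keys whose bits are all contained in the value being covered, ignores scopes and formats, and uses made-up function names.

```python
from itertools import combinations

MASK = 0xFFFFFFFF  # values are treated as 32-bit ints


def _minimal_cover(keys, target, exact):
    """Smallest list of keys (each a subset of target) covering the most bits.

    With exact=True, return None unless the keys can produce target exactly.
    """
    usable = [k for k in keys if k != 0 and k & ~target & MASK == 0]
    reachable = 0
    for k in usable:
        reachable |= k
    if exact and reachable != target:
        return None
    for size in range(len(usable) + 1):
        for combo in combinations(usable, size):
            covered = 0
            for k in combo:
                covered |= k
            if covered == reachable:
                return list(combo)
    return usable


def _signed(value):
    return value - (1 << 32) if value & 0x80000000 else value


def substitute_flag(literal: int, constants: dict) -> str:
    """Return replacement source text for a literal in a flag group."""
    table = {k & MASK: name for k, name in constants.items()}
    value = literal & MASK
    if value in (0, MASK):  # 0 and -1 are matched like plain constants
        return table.get(value, str(literal))
    positive = _minimal_cover(table, value, exact=False)
    covered = 0
    for k in positive:
        covered |= k
    residual = value & ~covered
    negative = _minimal_cover(table, ~value & MASK, exact=True)
    if negative is not None and (residual != 0 or len(negative) < len(positive)):
        return "~(" + " | ".join(table[k] for k in negative) + ")"
    parts = [table[k] for k in positive]
    if residual != 0:
        parts.append(str(_signed(residual)))
    return " | ".join(parts)


flags = {3: "UPDATE_ALL", 1: "UPDATE_NEIGHBORS"}
print(substitute_flag(3, flags))   # UPDATE_ALL
print(substitute_flag(5, flags))   # UPDATE_NEIGHBORS | 4
print(substitute_flag(-2, flags))  # ~(UPDATE_NEIGHBORS)
```
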
@OroArmor

I really like the scoped option! However it seems to be limited to just classes. Would it be possible to also extend this to methods?

I also don't see a way to attach a group to a field. That would be a nice addition, either to support formatting or for a case where a constant is inlined into a field (i.e. static final int SET_FLAGS = 3; in some class).

@apple502j

Ability to mark all constants in a class (e.g. WorldEvents) as unpickable would be nice.

@Earthcomputer
Author

Earthcomputer commented Jan 17, 2025

Ability to mark all constants in a class (e.g. WorldEvents) as unpickable would be nice.

Hmm, I'll think about it. The problem with this is it breaks the ability for the unpick implementation to be run without reference to the class the constants are in. Which we may decide to not be a big deal tbh but it's a nice to have.

An alternative would be to generate some of the unpick mappings in yarn, like you already kind of do for name suggestions. I was already considering this for SharedConstants.VERSION_STRING, which of course changes every version.

@Daomephsta
Collaborator

Switch cases, field initialisers, and perhaps field assignments in general should also be possible to unpick.

  • Ability to specify the radix of integers (decimal, hex, binary or octal).
  • Ability to specify whether an integer should be formatted as a char in decompiled code.

While these are in-scope for Unpick, the current implementation of uninlining is bytecode level, so it cannot perform transformations that can only be done at the source level like this. Having features in the Unpick format that Unpick cannot uninline would definitely be confusing, so we'll need to figure out some resolution to this.

Ability to uninline string constants.
String constants have been supported since v1.

The data type cannot be byte, short, or char.

v2 supports this since #11 was merged, v3 should as well

Parameters are indexed by their bytecode local variable index.

If the target method is static, parameter indexes start at 0. If the target method is non-static, parameter indexes start at 1.
Parameter indexes increment by 2 for each long and double parameter, and by 1 for all other parameters.

Why? This is different from v2, making it harder to port v2 mappings to v3. It also makes manually writing unpick definitions harder.

@Earthcomputer
Author

Switch cases, field initialisers, and perhaps field assignments in general should also be possible to unpick.

True, but this is not the responsibility of the file format.

Having features in the Unpick format that Unpick cannot uninline would definitely be confusing

Maybe, but the resolution could simply be that some (or even all) types of uninlining are optional. I am anticipating there to be multiple uninliner implementations, not just unpick itself; we are planning one in vineflower too. Btw, it's also impossible to uninline switch cases and annotation values at the bytecode level.

String constants have been supported since v1.

Ah, I wasn't aware. Scopes and concatenation support will definitely make it more useful though.

The data type cannot be byte, short, or char.

v2 supports this since #11 was merged, v3 should as well

This is what I get for submitting this issue before I had fully fleshed it out. Tomorrow I am going to add much more detail about how unpick mappings are applied, but basically a group definition specifies how to transform a Java literal, and only a Java literal, into another expression. There are only int, long, float, double, and String literals at the bytecode level, and in addition only char at the source code level. It is still perfectly legal to specify a byte, short, or char method parameter, return type, or field (once I have amended the specification to include fields) to assign the group of another integer type.

Why? This is different from v2, making it harder to port v2 mappings to v3. It also makes manually writing unpick definitions harder.

Good point, I mistakenly assumed this was already how it worked in unpick and that I wasn't changing anything here, but it is just worse than the simpler way, and these complexities should be left to the unpick bytecode implementation.

@Daomephsta
Collaborator

True, but this is not the responsibility of the file format.

They're kinds of uninlining that are useful, so ideally the file format should have some way to describe them. One option would be a "dumb" uninlining that replaces every use of a literal within the target method. I don't think this is a good way, but it clarifies that I'm not necessarily proposing specific support for switches in the file format.

Maybe, but the resolution could simply be that some (or even all) types of uninlining are optional. I am anticipating there to be multiple uninliner implementations, not just unpick itself, but we are planning one in vineflower too.

We're on the same page then. There are at least 3 potential projects that might implement uninliners; Unpick itself, Vineflower, and NeoForge; they all have different needs, and I'd like to meet them if we can. Benefits Unpick too, as it likely gets Unpick more contributors.

@Earthcomputer
Author

Switch statements and expressions can be uninlined as part of propagation, see the updated description for an example of how I would expect an unpick file to be applied

@Daomephsta
Collaborator

Updated spec looks good to me! 👍

Only other thing I can think of is that when documenting flag constants, it's worth explicitly mentioning that they can have multiple bits set (e.g. Block.NOTIFY_ALL). I'm not sure whether the substitution behaviour for such constants should be specified, but it's something I think is worth either specifying, or documenting as implementation dependent.

@Earthcomputer
Author

This was implicitly covered by the description of how flags are applied (with the "minimal set" ORed together), but I added in a note explicitly specifying that they can have multiple bits set.

@Earthcomputer
Author

I have added a syntax for non-static constants:

  • For non-static constant fields, an :instance suffix can be added to a field expression. The instance of a constant expression is retained in the bytecode and the field access is replaced with a null check. A bytecode implementation of unpick could implement non-static field uninlining by looking for the following pattern in bytecode:
    • In Java 8 and below:
      • invokevirtual java/lang/Object.getClass()Ljava/lang/Class; pop; <load constant>
    • In Java 9 and above:
      • invokestatic java/util/Objects.requireNonNull(Ljava/lang/Object;)Ljava/lang/Object; pop; <load constant>
    • and replacing it with a getfield instruction, which should preserve the instance which should have been loaded onto the stack already before the null check.

@Earthcomputer
Author

The V3 file format implementation will solve issues #9, #10, #12 and #14. I'm not sure if #8 was already solved but if not this should solve that too.
