- 2022-04-27: First draft
- 2024-07-21: Second draft
DRAFT
In order to move the Cosmos SDK to a system of decoupled semantically versioned modules which can be composed in different combinations (ex. staking v3 with bank v1 and distribution v2), we need to reassess how we organize the API surface of modules to avoid problems with go semantic import versioning and circular dependencies. This ADR explores various approaches we can take to addressing these issues.
There has been a fair amount of desire in the community for semantic versioning in the SDK and there has been significant movement to splitting SDK modules into standalone go modules. Both of these will ideally allow the ecosystem to move faster because we won't be waiting for all dependencies to update synchronously. For instance, we could have 3 versions of the core SDK compatible with the latest 2 releases of CosmWasm as well as 4 different versions of staking . This sort of setup would allow early adopters to aggressively integrate new versions, while allowing more conservative users to be selective about which versions they're ready for.
In order to achieve this, we need to solve the following problems:
- because of the way go semantic import versioning (SIV) works, moving to SIV naively will actually make it harder to achieve these goals
- circular dependencies between modules need to be broken to actually release many modules in the SDK independently
- pernicious minor version incompatibilities introduced through correctly evolving protobuf schemas without correct unknown field filtering
Note that all the following discussion assumes that the proto file versioning and state machine versioning of a module are distinct in that:
- proto files are maintained in a non-breaking way (using something like buf breaking to ensure all changes are backwards compatible)
- proto file versions get bumped much less frequently, i.e. we might maintain
cosmos.bank.v1
through many versions of the bank module state machine - state machine breaking changes are more common and ideally this is what we'd want to semantically version with
go modules, ex.
x/bank/v2
,x/bank/v3
, etc.
Consider we have a module foo
which defines the following MsgDoSomething
and that we've released its state
machine in go module example.com/foo
:
package foo.v1;
message MsgDoSomething {
string sender = 1;
uint64 amount = 2;
}
service Msg {
DoSomething(MsgDoSomething) returns (MsgDoSomethingResponse);
}
Now consider that we make a revision to this module and add a new condition
field to MsgDoSomething
and also
add a new validation rule on amount
requiring it to be non-zero, and that following go semantic versioning we
release the next state machine version of foo
as example.com/foo/v2
.
// Revision 1
package foo.v1;
message MsgDoSomething {
string sender = 1;
// amount must be a non-zero integer.
uint64 amount = 2;
// condition is an optional condition on doing the thing.
//
// Since: Revision 1
Condition condition = 3;
}
Approaching this naively, we would generate the protobuf types for the initial
version of foo
in example.com/foo/types
and we would generate the protobuf
types for the second version in example.com/foo/v2/types
.
Now let's say we have a module bar
which talks to foo
using this keeper
interface which foo
provides:
type FooKeeper interface {
DoSomething(MsgDoSomething) error
}
Imagine we have a chain which uses both foo
and bar
and wants to upgrade to
foo/v2
, but the bar
module has not upgraded to foo/v2
.
In this case, the chain will not be able to upgrade to foo/v2
until bar
has upgraded its references to example.com/foo/types.MsgDoSomething
to
example.com/foo/v2/types.MsgDoSomething
.
Even if bar
's usage of MsgDoSomething
has not changed at all, the upgrade
will be impossible without this change because example.com/foo/types.MsgDoSomething
and example.com/foo/v2/types.MsgDoSomething
are fundamentally different
incompatible structs in the go type system.
Now let's consider the reverse scenario, where bar
upgrades to foo/v2
by changing the MsgDoSomething
reference to example.com/foo/v2/types.MsgDoSomething
and releases that as bar/v2
with some other changes that a chain wants.
The chain, however, has decided that it thinks the changes in foo/v2
are too
risky and that it'd prefer to stay on the initial version of foo
.
In this scenario, it is impossible to upgrade to bar/v2
without upgrading
to foo/v2
even if bar/v2
would have worked 100% fine with foo
other
than changing the import path to MsgDoSomething
(meaning that bar/v2
doesn't actually use any new features of foo/v2
).
Now because of the way go semantic import versioning works, we are locked
into either using foo
and bar
OR foo/v2
and bar/v2
. We cannot have
foo
+ bar/v2
OR foo/v2
+ bar
. The go type system doesn't allow this
even if both versions of these modules are otherwise compatible with each
other.
A naive approach to fixing this would be to not regenerate the protobuf types
in example.com/foo/v2/types
but instead just update example.com/foo/types
to reflect the changes needed for v2
(adding condition
and requiring
amount
to be non-zero). Then we could release a patch of example.com/foo/types
with this update and use that for foo/v2
. But this change is state machine
breaking for v1
. It requires changing the ValidateBasic
method to reject
the case where amount
is zero, and it adds the condition
field which
should be rejected based
on ADR 020 unknown field filtering.
So adding these changes as a patch on v1
is actually incorrect based on semantic
versioning. Chains that want to stay on v1
of foo
should not
be importing these changes because they are incorrect for v1.
None of the above approaches allow foo
and bar
to be separate modules
if for some reason foo
and bar
depend on each other in different ways.
For instance, we can't have foo
import bar/types
while bar
imports
foo/types
.
We have several cases of circular module dependencies in the SDK (ex. staking, distribution and slashing) that are legitimate from a state machine perspective. Without separating the API types out somehow, there would be no way to independently semantically version these modules without some other mitigation.
Imagine that we solve the first two problems but now have a scenario where
bar/v2
wants the option to use MsgDoSomething.condition
which only foo/v2
supports. If bar/v2
works with foo
v1
and sets condition
to some non-nil
value, then foo
will silently ignore this field resulting in a silent logic
possibly dangerous logic error. If bar/v2
were able to check whether foo
was
on v1
or v2
and dynamically, it could choose to only use condition
when
foo/v2
is available. Even if bar/v2
were able to perform this check, however,
how do we know that it is always performing the check properly. Without
some sort of
framework-level unknown field filtering,
it is hard to know whether these pernicious hard to detect bugs are getting into
our app and a client-server layer such as ADR 033: Inter-Module Communication
may be needed to do this.
One solution (first proposed in cosmos#10582) is to isolate all protobuf
generated code into a separate module
from the state machine module. This would mean that we could have state machine
go modules foo
and foo/v2
which could use a types or API go module say
foo/api
. This foo/api
go module would be perpetually on v1.x
and only
accept non-breaking changes. This would then allow other modules to be
compatible with either foo
or foo/v2
as long as the inter-module API only
depends on the types in foo/api
. It would also allow modules foo
and bar
to depend on each other in that both of them could depend on foo/api
and
bar/api
without foo
directly depending on bar
and vice versa.
This is similar to the naive mitigation described above except that it separates the types into separate go modules which in and of itself could be used to break circular module dependencies. It has the same problems as the naive solution, otherwise, which we could rectify by:
- removing all state machine breaking code from the API module (ex.
ValidateBasic
and any other interface methods) - embedding the correct file descriptors for unknown field filtering in the binary
To solve 1), we need to remove all interface implementations from generated
types and instead use a handler approach which essentially means that given
a type X
, we have some sort of resolver which allows us to resolve interface
implementations for that type (ex. sdk.Msg
or authz.Authorization
). For
example:
func (k Keeper) DoSomething(msg MsgDoSomething) error {
var validateBasicHandler ValidateBasicHandler
err := k.resolver.Resolve(&validateBasicHandler, msg)
if err != nil {
return err
}
err = validateBasicHandler.ValidateBasic()
...
}
In the case of some methods on sdk.Msg
, we could replace them with declarative
annotations. For instance, GetSigners
can already be replaced by the protobuf
annotation cosmos.msg.v1.signer
. In the future, we may consider some sort
of protobuf validation framework (like https://github.com/bufbuild/protoc-gen-validate
but more Cosmos-specific) to replace ValidateBasic
.
To solve 2), state machine modules must be able to specify what the version of
the protobuf files was that they were built against. For instance if the API
module for foo
upgrades to foo/v2
, the original foo
module still needs
a copy of the original protobuf files it was built with so that ADR 020
unknown field filtering will reject MsgDoSomething
when condition
is
set.
The simplest way to do this may be to embed the protobuf FileDescriptor
s into
the module itself so that these FileDescriptor
s are used at runtime rather
than the ones that are built into the foo/api
which may be different. Using
buf build, go embed,
and a build script we can probably come up with a solution for embedding
FileDescriptor
s into modules that is fairly straightforward.
One challenge with this approach is that it places heavy restrictions on what can go in API modules and requires that most of this is state machine breaking. All or most of the code in the API module would be generated from protobuf files, so we can probably control this with how code generation is done, but it is a risk to be aware of.
For instance, we do code generation for the ORM that in the future could contain optimizations that are state machine breaking. We would either need to ensure very carefully that the optimizations aren't actually state machine breaking in generated code or separate this generated code out from the API module into the state machine module. Both of these mitigations are potentially viable but the API module approach does require an extra level of care to avoid these sorts of issues.
This approach in and of itself does little to address any potential minor
version incompatibilities and the
requisite unknown field filtering.
Likely some sort of client-server routing layer which does this check such as
ADR 033: Inter-Module communication
is required to make sure that this is done properly. We could then allow
modules to perform a runtime check given a MsgClient
, ex:
func (k Keeper) CallFoo() error {
if k.interModuleClient.MinorRevision(k.fooMsgClient) >= 2 {
k.fooMsgClient.DoSomething(&MsgDoSomething{Condition: ...})
} else {
...
}
}
To do the unknown field filtering itself, the ADR 033 router would need to use the protoreflect API to ensure that no fields unknown to the receiving module are set. This could result in an undesirable performance hit depending on how complex this logic is.
An alternative to addressing minor version incompatibilities as described above is disallowing new fields in existing protobuf messages. While this is more restrictive, it simplifies versioning and eliminates the need for runtime unknown field checking. In addition, this approach would simplify cross language communication with the proposed RFC 002: Zero Copy Encoding. So, while it is rather restrictive, it has gained a fair amount of support.
Although disallowing new fields may seem overly restrictive, there is a straightforward way to work around it using protobuf oneof
s. Because oneof
and enum
cases must get processed through a switch
statement, adding new cases is not problematic because any unknown cases can be handled by a default
clause. The router layer wouldn't need to do unknown field filtering for these because the switch
statement is a native way to do this. If we needed to add new fields to MsgDoSomething
from above and retain the possibility of adding more new fields in the future, we could do something like this:
message MsgDoSomethingWithOptions {
string sender = 1;
uint64 amount = 2;
repeated MsgDoSomethingOption options = 3;
}
message MsgDoSomethingOption {
oneof option {
Condition condition = 1;
}
}
New oneof
cases can be added to MsgDoSomethingOption
and this has a similar effect as adding new fields to MsgDoSomethingWithOptions
but no new fields are needed. A similar strategy is recommended for adding variadic options to golang functions in https://go.dev/blog/module-compatibility and expanded upon in https://commandcenter.blogspot.com/2014/01/self-referential-functions-and-design.html.
An alternate approach to solving the versioning problem is to change how protobuf code is generated and move modules
mostly or completely in the direction of inter-module communication as described
in ADR 033.
In this paradigm, a module could generate all the types it needs internally - including the API types of other modules -
and talk to other modules via a client-server boundary. For instance, if bar
needs to talk to foo
, it could
generate its own version of MsgDoSomething
as bar/internal/foo/v1.MsgDoSomething
and just pass this to the
inter-module router which would somehow convert it to the version which foo needs (ex. foo/internal.MsgDoSomething
).
Currently, two generated structs for the same protobuf type cannot exist in the same go binary without special
build flags (see https://developers.google.com/protocol-buffers/docs/reference/go/faq#fix-namespace-conflict).
A relatively simple mitigation to this issue would be to set up the protobuf code to not register protobuf types
globally if they are generated in an internal/
package. This will require modules to register their types manually
with the app-level level protobuf registry, this is similar to what modules already do with the InterfaceRegistry
and amino codec.
If modules only do ADR 033 message passing then a naive and non-performant solution for
converting bar/internal/foo/v1.MsgDoSomething
to foo/internal.MsgDoSomething
would be marshaling and unmarshaling in the ADR 033 router. This would break down if
we needed to expose protobuf types in Keeper
interfaces because the whole point is to try to keep these types
internal/
so that we don't end up with all the import version incompatibilities we've described above. However,
because of the issue with minor version incompatibilities and the need
for unknown field filtering,
sticking with the Keeper
paradigm instead of ADR 033 may be unviable to begin with.
A more performant solution (that could maybe be adapted to work with Keeper
interfaces) would be to only expose
getters and setters for generated types and internally store data in memory buffers which could be passed from
one implementation to another in a zero-copy way.
For example, imagine this protobuf API with only getters and setters is exposed for MsgSend
:
type MsgSend interface {
proto.Message
GetFromAddress() string
GetToAddress() string
GetAmount() []v1beta1.Coin
SetFromAddress(string)
SetToAddress(string)
SetAmount([]v1beta1.Coin)
}
func NewMsgSend() MsgSend { return &msgSendImpl{memoryBuffers: ...} }
Under the hood, MsgSend
could be implemented based on some raw memory buffer in the same way
that Cap'n Proto
and FlatBuffers so that we could convert between one version of MsgSend
and another without serialization (i.e. zero-copy). This approach would have the added benefits of allowing zero-copy
message passing to modules written in other languages such as Rust and accessed through a VM or FFI. It could also make
unknown field filtering in inter-module communication simpler if we require that all new fields are added in sequential
order, ex. just checking that no field > 5
is set.
Also, we wouldn't have any issues with state machine breaking code on generated types because all the generated
code used in the state machine would actually live in the state machine module itself. Depending on how interface
types and protobuf Any
s are used in other languages, however, it may still be desirable to take the handler
approach described in approach A. Either way, types implementing interfaces would still need to be registered
with an InterfaceRegistry
as they are now because there would be no way to retrieve them via the global registry.
In order to simplify access to other modules using ADR 033, a public API module (maybe even one remotely generated by Buf) could be used by client modules instead of requiring to generate all client types internally.
The big downsides of this approach are that it requires big changes to how people use protobuf types and would be a
substantial rewrite of the protobuf code generator. This new generated code, however, could still be made compatible
with
the google.golang.org/protobuf/reflect/protoreflect
API in order to work with all standard golang protobuf tooling.
It is possible that the naive approach of marshaling/unmarshaling in the ADR 033 router is an acceptable intermediate solution if the changes to the code generator are seen as too complex. However, since all modules would likely need to migrate to ADR 033 anyway with this approach, it might be better to do this all at once.
If the above solutions are seen as too complex, we can also decide not to do anything explicit to enable better module version compatibility, and break circular dependencies.
In this case, when developers are confronted with the issues described above they can require dependencies to update in sync (what we do now) or attempt some ad-hoc potentially hacky solution.
One approach is to ditch go semantic import versioning (SIV) altogether. Some people have commented that go's SIV
(i.e. changing the import path to foo/v2
, foo/v3
, etc.) is too restrictive and that it should be optional. The
golang maintainers disagree and only officially support semantic import versioning. We could, however, take the
contrarian perspective and get more flexibility by using 0.x-based versioning basically forever.
Module version compatibility could then be achieved using go.mod replace directives to pin dependencies to specific
compatible 0.x versions. For instance if we knew foo
0.2 and 0.3 were both compatible with bar
0.3 and 0.4, we
could use replace directives in our go.mod to stick to the versions of foo
and bar
we want. This would work as
long as the authors of foo
and bar
avoid incompatible breaking changes between these modules.
Or, if developers choose to use semantic import versioning, they can attempt the naive solution described above and would also need to use special tags and replace directives to make sure that modules are pinned to the correct versions.
Note, however, that all of these ad-hoc approaches, would be vulnerable to the minor version compatibility issues described above unless unknown field filtering is properly addressed.
An alternative approach would be to avoid protobuf generated code in public module APIs. This would help avoid the discrepancy between state machine versions and client API versions at the module to module boundaries. It would mean that we wouldn't do inter-module message passing based on ADR 033, but rather stick to the existing keeper approach and take it one step further by avoiding any protobuf generated code in the keeper interface methods.
Using this approach, our foo.Keeper.DoSomething
method wouldn't have the generated MsgDoSomething
struct (which
comes from the protobuf API), but instead positional parameters. Then in order for foo/v2
to support the foo/v1
keeper it would simply need to implement both the v1 and v2 keeper APIs. The DoSomething
method in v2 could have the
additional condition
parameter, but this wouldn't be present in v1 at all so there would be no danger of a client
accidentally setting this when it isn't available.
So this approach would avoid the challenge around minor version incompatibilities because the existing module keeper API would not get new fields when they are added to protobuf files.
Taking this approach, however, would likely require making all protobuf generated code internal in order to prevent
it from leaking into the keeper API. This means we would still need to modify the protobuf code generator to not
register internal/
code with the global registry, and we would still need to manually register protobuf
FileDescriptor
s (this is probably true in all scenarios). It may, however, be possible to avoid needing to refactor
interface methods on generated types to handlers.
Also, this approach doesn't address what would be done in scenarios where modules still want to use the message router.
Either way, we probably still want a way to pass messages from one module to another router safely even if it's just for
use cases like x/gov
, x/authz
, CosmWasm, etc. That would still require most of the things outlined in approach (B),
although we could advise modules to prefer keepers for communicating with other modules.
The biggest downside of this approach is probably that it requires a strict refactoring of keeper interfaces to avoid generated code leaking into the API. This may result in cases where we need to duplicate types that are already defined in proto files and then write methods for converting between the golang and protobuf version. This may end up in a lot of unnecessary boilerplate and that may discourage modules from actually adopting it and achieving effective version compatibility. Approaches (A) and (B), although heavy handed initially, aim to provide a system which once adopted more or less gives the developer version compatibility for free with minimal boilerplate. Approach (D) may not be able to provide such a straightforward system since it requires a golang API to be defined alongside a protobuf API in a way that requires duplication and differing sets of design principles (protobuf APIs encourage additive changes while golang APIs would forbid it).
Other downsides to this approach are:
- no clear roadmap to supporting modules in other languages like Rust
- doesn't get us any closer to proper object capability security (one of the goals of ADR 033)
- ADR 033 needs to be done properly anyway for the set of use cases which do need it
The current non-router based approach for inter-module communication is for a module to define a Keeper
interface and for a consumer module to define an expected keeper interface with a subset of the keeper's methods. Such an interface can allow one module to avoid a direct dependency on another module if no concrete types need to be imported from the other module. For instance, if we had a method DoSomething(context.Context, string, uint64)
as in foo.v1
, then a module calling DoSomething
would not need to import foo
directly. If, however, we had a struct parameter such as Condition
and the new method were DoSomethingV2(context.Context, string, uint64, foo.Condition)
then a calling module would generally need to import foo just to get a reference to the foo.Condition
struct.
Golang, however, supports both structural and nominal typing. Nominal typing means that two types equivalent if and only if they have the same name. Structural typing in golang means that two types are equivalent if they have the same structure and are unnamed. So if we defined Condition
nominally it might look like type Condition struct { Field1 string }
and if we defined it structurally it would look like type Condition = struct { Field1 string }
. If Condition
were defined structurally we could use the expected keeper approach and a calling module would not need to import foo
at all to define the DoSomethingV2
method in its expected keeper interface. Structural typing avoids the dependency problems described above.
We could actually extend this structural typing paradigm to protobuf generated code if we disallow adding new fields to existing protobuf messages. This would be required because two struct types are only identical if their fields are identical. If even the order of fields in a struct or the struct tags change, then golang considers the structs as different types. While this is fairly restrictive, it is under consideration for approach A) and RFC 002 as well and has gained a fair amount of support.
Small modifications to the existing pulsar code generator could potentially generate code that uses structural typing and moves the implementation of protobuf interfaces to wrapper types because unnamed structs can't define methods. Here's an example of what this might look like:
type MsgDoSomething = struct {
Sender string
Amount uint64
}
type MsgDoSomething_Message(*MsgDoSomething)
// MsgDoSomething_Message would actually implement protobuf methods
var _ proto.Message = (*MsgDoSomething_Message)(nil)
At least at the message layer, such an API wouldn't pose a problem because the transaction decoder does message decoding and modules wouldn't need to interact with the proto.Message
interface directly.
In order to avoid problems with the global protobuf registry, the structural typing generated code would only register message descriptors with the global registry but not register message types. This would allow two modules to generate the same protobuf types in different packages without causing a conflict. Because the types are defined structurally, they would actually be the same types but no direct import would be required. In order to ensure compatibility, when message descriptors are registered at startup, a check would be required to ensure that messages are identical (i.e. no new fields).
With this approach, a module bar
calling module foo
could either import foo
directly to get its types or generate its own set of compatible types for foo
s API. The dependency problem is essentially solved with this approach without needing any sort of special discipline around a separate API module. The main discipline would be around versioning protobuf APIs correctly and not adding new fields.
Taking this approach one step further, we could potentially even define APIs that unwrap the request and response types, i.e. DoSomething(context.Context, string, uint64) error
vs DoSomething(context.Context, MsgDoSomething) (MsgDoSomethingResponse, error)
. Or, APIs could even be defined directly in golang and the .proto
files plus marshaling code could be generated via go generate
. Ex:
package foo
import "context"
//go:generate go run github.com/cosmos/cosmos-proto/cmd/structproto
type Msg interface {
DoSomething(context.Context, string, uint64) error
}
While having the limitation of not allowing new fields to be added to existing structs, approach E) has the following benefits:
- unlike approach A), api types can be generated in the same go module, but direct imports can always be avoided
- generated client/server code could look more like regular go interfaces (without needing a set of intermediate structs)
- keeper interface defined in
.go
files could be turned into protobuf APIs (rather than needing to write.proto
files) - SDK modules could adopt go semantic versioning without any of the issues described above, achieving the initially stated goals of this ADR
There has been no decision yet, and the SDK has more or less been following approach C) and official adoption of 0ver as a policy has been discussed. The issue of decoupling modules, properly versioning protobuf types, avoiding breakage, and adopting semver continue to arise from time to time. The most serious alternatives under consideration currently are approaches A) and E). The remainder of this ADR has been left blank and will be filled in when and if there is further convergence on a solution.