Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(sequencer)!: rewrite check_tx to be more efficient and fix regression #1515

Merged
merged 17 commits into from
Oct 4, 2024

Conversation

Lilyjjo
Copy link
Contributor

@Lilyjjo Lilyjjo commented Sep 18, 2024

Summary

Previously handle_check_tx()'s would always run all stateless checks and attempt insertion. This PR modified the function to only run the checks on transactions that have not been removed or are not contained inside of the app side mempool.

This PR additionally adds a HashSet of transaction hashes that are contained in the app's Mempool to the state of the app Mempool. This is used to enable seeing if a transaction is contained in the Mempool without having to decode it into a SignedTransaction. This will be kept in sync with unit tests which check if the length of the mempool (which is now calculated with the HashSet) matches the expected state of mutating the underlying containers.

We considered using just a toggle on CheckTX::New and CheckTX::Recheck instead of the HashSet, but there is the possibility of CometBFT persisting transactions on restart which could make use of toggling on the app mempool knowing the transaction instead. TBD, but it's also useful to be able to query the mempool if it contains a transaction.

This PR also restores the original behavior of CheckTx to not kick out transaction needlessly on rechecks, this regression was added when upgrading the mempool in #1323. Pre #1323, re-inserting the same transaction into the mempool would not cause an error. Post #1323, the re-insertion would cause an error. We now check to make sure that the transaction is not inside of the mempool before attempting to re-insert it.

Background

A lot of the functionality from the original handle_check_tx() has moved into the app side mempool itself, including the balance and nonce checks. All of the remaining checks will not change on CheckTx::Recheck and only need to be ran when considering a transaction for insertion into the appside mempool.

Testing

Ran locally and added unit tests to ensure future nonces can be added and rechecks do not remove transactions needlessly.

Metrics

  • Deleted check_tx_removed_stale_nonce
  • Added check_tx_duration_seconds_check_tracked, mempool_tx_logic_error
  • Changed check_tx_duration_seconds_check_nonce to check_tx_duration_seconds_fetch_nonce

Breaking Changes

Added new ABCI error codes for transaction insertion errors that would be useful to surface to the end user: ALREADY_PRESENT, NONCE_TAKEN, ACCOUNT_SIZE_LIMIT.

Related Issues

closes #1481

@github-actions github-actions bot added the sequencer pertaining to the astria-sequencer crate label Sep 18, 2024
@Lilyjjo Lilyjjo changed the title separate out checks sequencer, mempool: rewrite check_tx to be more efficient Sep 18, 2024
@Lilyjjo Lilyjjo changed the title sequencer, mempool: rewrite check_tx to be more efficient chore: sequencer, mempool: rewrite check_tx to be more efficient Sep 18, 2024
@Lilyjjo Lilyjjo changed the title chore: sequencer, mempool: rewrite check_tx to be more efficient chore: sequencer: rewrite check_tx to be more efficient Sep 18, 2024
@Lilyjjo Lilyjjo changed the title chore: sequencer: rewrite check_tx to be more efficient fix(sequencer): rewrite check_tx to be more efficient and fix regression Sep 18, 2024
@Lilyjjo Lilyjjo force-pushed the lilyjjo/check-tx-rewrite branch 3 times, most recently from 577d0c4 to 0de80ec Compare September 24, 2024 18:19
@Lilyjjo Lilyjjo marked this pull request as ready for review September 24, 2024 19:17
@Lilyjjo Lilyjjo requested a review from a team as a code owner September 24, 2024 19:17
@Lilyjjo Lilyjjo requested a review from Fraser999 September 24, 2024 19:17
@Lilyjjo Lilyjjo force-pushed the lilyjjo/check-tx-rewrite branch from 0a4f435 to 56ff638 Compare September 26, 2024 12:57
@Lilyjjo Lilyjjo requested review from noot and SuperFluffy September 26, 2024 13:06
parked.add(
timemarked_tx,
match parked.add(
timemarked_tx.clone(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think you can do let id = timemarked_tx.id() before this, then use that later on in the function and remove the clone()

Comment on lines 153 to 155
if let Some(rsp) = insert_into_mempool(mempool, &state, signed_tx, metrics).await {
return rsp;
}
Copy link
Collaborator

@noot noot Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function returning an Option that's only set if there's an error response is somewhat confusing, would prefer for this function to return a Result<(), response::CheckTx> so the error variant is the error response

same for check_removed_comet_bft

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it, feels much better that way

Comment on lines 259 to 238
/// Returns a [`response::CheckTx`] if the transaction fails any of the checks.
/// Otherwise, it returns the [`SignedTransaction`] to be inserted into the mempool.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use a Result<SignedTransaction, response::CheckTx> as this function's return

@noot
Copy link
Collaborator

noot commented Sep 26, 2024

I think this PR looks good, just a few smaller comments!

This PR additionally adds a HashSet of transaction hashes that are contained in the app's Mempool to the state of the app Mempool. This is used in the scenario where the sequencer restarts-- CometBFT persists transactions but our app mempool does not. The cache is used to allow transactions to re-enter the mempool on restarts.

how does this work? with cometbft recheck?

@Lilyjjo
Copy link
Contributor Author

Lilyjjo commented Sep 26, 2024

This PR additionally adds a HashSet of transaction hashes that are contained in the app's Mempool to the state of the app Mempool. This is used in the scenario where the sequencer restarts-- CometBFT persists transactions but our app mempool does not. The cache is used to allow transactions to re-enter the mempool on restarts.

how does this work? with cometbft recheck?

Hmmm. I think I misheard why we wanted to allow rechecks to add to the appside mempool. I checked out the CometBFT code and didn't see it persisting the transactions anywhere that was accessible, (I wonder if I missed it). In the scenario where CometBFT doesn't persist, then the allowing recheck would only be useful if CometBFT doesn't crash if the sequencer app crashses. I'll do some more research

crates/astria-sequencer/src/mempool/mod.rs Outdated Show resolved Hide resolved
crates/astria-sequencer/src/mempool/mod.rs Outdated Show resolved Hide resolved
crates/astria-sequencer/src/mempool/mod.rs Show resolved Hide resolved
crates/astria-sequencer/src/mempool/mod.rs Outdated Show resolved Hide resolved
crates/astria-sequencer/src/service/mempool/mod.rs Outdated Show resolved Hide resolved
crates/astria-sequencer/src/service/mempool/mod.rs Outdated Show resolved Hide resolved
crates/astria-sequencer/src/service/mempool/mod.rs Outdated Show resolved Hide resolved
crates/astria-sequencer/src/service/mempool/mod.rs Outdated Show resolved Hide resolved
@Lilyjjo Lilyjjo requested a review from Fraser999 September 28, 2024 17:38
@Lilyjjo Lilyjjo force-pushed the lilyjjo/check-tx-rewrite branch from 345fcab to af24538 Compare September 30, 2024 15:06
crates/astria-sequencer/src/mempool/mod.rs Outdated Show resolved Hide resolved
crates/astria-sequencer/src/mempool/mod.rs Outdated Show resolved Hide resolved
crates/astria-sequencer/src/mempool/mod.rs Outdated Show resolved Hide resolved
crates/astria-sequencer/src/mempool/mod.rs Outdated Show resolved Hide resolved
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 2, 2024
@Lilyjjo Lilyjjo added this pull request to the merge queue Oct 2, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 2, 2024
@Lilyjjo Lilyjjo changed the title fix(sequencer): rewrite check_tx to be more efficient and fix regression fix(sequencer)!: rewrite check_tx to be more efficient and fix regression Oct 2, 2024
@Lilyjjo Lilyjjo changed the title fix(sequencer)!: rewrite check_tx to be more efficient and fix regression fix(sequencer): rewrite check_tx to be more efficient and fix regression Oct 2, 2024
Copy link
Member

@SuperFluffy SuperFluffy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm blocking this so I can have a second pass over this.

Copy link
Member

@SuperFluffy SuperFluffy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have gone over the PR and left some comments. Reading the PR description this patch attempts to tackle 3 things at the same time:

  1. avoid rerunning expensive checks that have already been performed once. I believe this new functionality is covered by the changes in crate::service::mempool. It is enabled by the functions check_removed_comet_bft and is_tracked. The is_tracked makes use of the new contained_txs hashset. Question: is this an optimization to avoid using pending and parked mempool fields? Clearly the mempool needed to track the transactions contained therein before - so what's the reason to have an extra index of transactions? It would be nice to see that explained in the patch description. It would furthermore be useful to understand how this new index and the previous caches will be kept in sync.

  2. fix a regression for CheckTxKind::Recheck. Grepping this PR I only find recheck mention in tests. Looking at the fn handle_check_tx function the request::CheckTx is immediately deconstructed and the CheckTxKind::Recheck never actively considered. It hence looks to me as if we don't discriminate between checks of new transactions and rechecked of transactions existing in the mempool. Given that this is a regression called out in both the patch title and text, it would be nice to have this explained.

  3. introduce a way to maintain state on sequencer restarts. This part seems to be completely orthogonal to the goals 1. and 2. All related code should be removed from this patch and moved into a new PR. Having said that, it seems that the index of checked transactions (the hashset mentioned in the PR description) was introduced to enable the main goal of the PR (namely skipping unnecessary checks summarized under 1.). Enabling restart logic thus seems to be a useful side effect? Either way, it is confusing to the reviewer in how far this should be made part of the patch, and why it wasn't moved to a new patch. The patch description should be amended accordingly and better explained.


The semantics of adding or removing transactions from the mempool-tracked txs is weird. Looking at the code that makes use of add_from_contained_txs and remove_from_contained_txs it looks like you would be better served by creating a custom guard like this (I have intentionally left out the metrics stuff to keep the example minimal; add them as necessary):

struct TxLock<'a> {
    mempool: &'a Mempool,
    txs: RwLockWriteGuard<'a, HashSet<[u8; 32]>>,
}

impl<'a> TxLock<'a> {
    fn add(&mut self, id: [u8; 32]) {
        self.tx.insert(id);
    }
    fn remove(&mut self, id: [u8; 32]) {
        self.tx.remove(id);
    }
}

impl Mempool {
    async fn lock_txs(&self) -> TxLock<'_> {
        TxLock {
            mempool: self,
            txs: mempool.contained_txs.write().await,
        }
    }
}

crates/astria-sequencer/src/mempool/mod.rs Outdated Show resolved Hide resolved

/// Removes a transaction from the mempool's tracked transactions.
/// Will increment logic error metrics and log error if transaction is not present.
fn remove_from_contained_txs(&self, tx_hash: [u8; 32], contained_txs: &mut HashSet<[u8; 32]>) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My arguments for the mempool bits above also apply here: this is not removing a transactions from the mempool tracked transactions. It's mutating an argument. This method seems to be a vehicle to access the metrics.

Change or remove please.

}

#[tokio::test]
async fn receck_adds_non_tracked_tx() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is still a spelling issue here.

@@ -1,3 +1,6 @@
#[cfg(test)]
mod tests;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually move this below the imports please. We are following the official Rust style guide: https://doc.rust-lang.org/stable/style-guide/items.html


// tx is valid, push to mempool with current state
// generate address for the signed transaction
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This address construction is not necessary I believe (but fixing it is the scope of this PR).

Please remove this comment and create a followup issue to just us the address bytes to fetch all relevant information.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked: #1620

}
RemovalReason::LowerNonceInvalidated => {
return Err(response::CheckTx {
code: Code::Err(AbciErrorCode::LOWER_NONCE_INVALIDATED.value()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent behind info was to give a human readable description of the abci code. So please use AbciErrorCode::LOWER_NONCE_INVALIDATED.to_string() or, if you need a more granular reason, add a new abci code.

Alternatively, provide more background information to the log line.

info: "transaction removed from app mempool due to lower nonce being \
invalidated"
.into(),
log: "Transaction removed from app mempool due to lower nonce being \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lower case. Also, log and info seem to provide the same information. That doesn't seem useful.

return Err(response::CheckTx {
code: Code::Err(AbciErrorCode::INVALID_NONCE.value()),
info: "transaction removed from app mempool due to stale nonce".into(),
log: "Transaction from app mempool due to stale nonce".into(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lower case - also info and log seem to have the same text. Doesn't seem useful.

if let Err(error) =
pending.add(ttx, current_account_nonce, &current_account_balances)
{
// remove from tracked
// note: this branch is not expected to be hit so grabbing the lock inside
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Convention is to use upper case XXX, TODO, FIXME, NOTE etc for comments like this.

Suggested change
// note: this branch is not expected to be hit so grabbing the lock inside
// NOTE: this branch is not expected to be hit so grabbing the lock inside

if let Err(error) =
pending.add(ttx, current_account_nonce, &current_account_balances)
{
// remove from tracked
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If code doing something as simple as this needs to be commented, then its either not self evident (leave a comment tagged with XXX or NOTE, see below), or adjust the called methods to make it evident.

tx_hash: [u8; 32],
mempool: &AppMempool,
metrics: &Metrics,
) -> Result<(), response::CheckTx> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could just return a RemovalReason here and instead implement a new trait IntoCheckTxResponse on it so that you don't need to do the conversion inline.

state: &S,
signed_tx: SignedTransaction,
metrics: &'static Metrics,
) -> Result<(), response::CheckTx> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have noted it for RemovalReason, but it also applies here: all the error-to-response conversions could be made a lot neater by adding a trait IntoCheckTxResponse and moving the conversions there instead of having them inlined in the function body.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Wrote the trait and defined if for RemovalReason and InsertionError

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, neat.

@Lilyjjo Lilyjjo changed the title fix(sequencer): rewrite check_tx to be more efficient and fix regression fix(sequencer)!: rewrite check_tx to be more efficient and fix regression Oct 4, 2024
Copy link
Member

@SuperFluffy SuperFluffy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks neat, thank you for your work.

I have made some notes on how you can now use your newly defined ContainedTxLock to avoid accidentally holding on to the lock for too long (which is an actual benefit of the design!).

Please also have a look at how composer (and bridge-withdrawer) might be affected by the nonce-related ABCI codes. Maybe create issues for followup tasks.

Happy to merge this after updating the PR description with respect to my last review.

@@ -52,6 +55,13 @@ impl AbciErrorCode {
}
Self::LOWER_NONCE_INVALIDATED => "lower nonce was invalidated in mempool".into(),
Self::BAD_REQUEST => "the request payload was malformed".into(),
Self::ALREADY_PRESENT => "the transaction is already present in the mempool".into(),
Self::NONCE_TAKEN => "there is already a transaction with the same nonce for the \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know if NONCE_TAKEN and LOWER_NONCE_INVALIDATED would have any repercussions on composer? Right now composer acts on INVALID_NONCE, refetching the latest nonce and then resubmitting at the latest nonce + 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. The NONCE_TAKEN error case would case the bundle factory to not resend when it should. The ALREADY_PRESENT one isn't an error, it means that the bundle is already inside of the mempool. I should change those to reflect these changes probably

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked issue: #1633

Comment on lines 427 to 428
let mut contained_lock = self.lock_contained_txs().await;
contained_lock.remove(tx_id);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With your new ContainedTxLock type, you can now do something very neat that avoids holding on to the lock at all:

self.lock_contained_txs()
    .await
    .remove(tx_id);

This means that the lock is dropped at the end of the statement since you never bind it anywhere.

@@ -1000,4 +1090,65 @@ mod tests {
"first removal reason should be presenved"
);
}

#[tokio::test]
async fn tx_tracked_edge_cases() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I now already have a better understanding what this is for. Ideally each edge case gets its own test (even at the cost of more code) so that it's immediately clear what you are testing.

state: &S,
signed_tx: SignedTransaction,
metrics: &'static Metrics,
) -> Result<(), response::CheckTx> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, neat.

app::StateReadExt as _,
bridge::StateReadExt as _,
ibc::StateReadExt as _,
};

#[instrument(skip_all)]
pub(crate) async fn check_nonce_mempool<S: StateRead>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like for all of these to migrate over to the mempool at some point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created an issue #1632

@Lilyjjo Lilyjjo added this pull request to the merge queue Oct 4, 2024
github-merge-queue bot pushed a commit that referenced this pull request Oct 4, 2024
…sion (#1515)

## Summary
Previously `handle_check_tx()`'s would always run all stateless checks
and attempt insertion. This PR modified the function to only run the
checks on transactions that have not been removed or are not contained
inside of the app side mempool.

This PR additionally adds a HashSet of transaction hashes that are
contained in the app's Mempool to the state of the app Mempool. This is
used to enable seeing if a transaction is contained in the Mempool
without having to decode it into a `SignedTransaction`. This will be
kept in sync with unit tests which check if the length of the mempool
(which is now calculated with the HashSet) matches the expected state of
mutating the underlying containers.

We considered using just a toggle on `CheckTX::New` and
`CheckTX::Recheck` instead of the HashSet, but there is the possibility
of CometBFT persisting transactions on restart which could make use of
toggling on the app mempool knowing the transaction instead. TBD, but
it's also useful to be able to query the mempool if it contains a
transaction.

This PR also restores the original behavior of `CheckTx` to not kick out
transaction needlessly on rechecks, this regression was added when
upgrading the mempool in #1323. Pre #1323, re-inserting the same
transaction into the mempool would not cause an error. Post #1323, the
re-insertion would cause an error. We now check to make sure that the
transaction is not inside of the mempool before attempting to re-insert
it.

## Background
A lot of the functionality from the original `handle_check_tx()` has
moved into the app side mempool itself, including the balance and nonce
checks. All of the remaining checks will not change on
`CheckTx::Recheck` and only need to be ran when considering a
transaction for insertion into the appside mempool.

## Testing
Ran locally and added unit tests to ensure future nonces can be added
and rechecks do not remove transactions needlessly.

## Metrics
- Deleted `check_tx_removed_stale_nonce`
- Added `check_tx_duration_seconds_check_tracked`,
`mempool_tx_logic_error`
- Changed `check_tx_duration_seconds_check_nonce` to
`check_tx_duration_seconds_fetch_nonce`

## Breaking Changes
Added new ABCI error codes for transaction insertion errors that would
be useful to surface to the end user: `ALREADY_PRESENT`, `NONCE_TAKEN`,
`ACCOUNT_SIZE_LIMIT`.

## Related Issues
closes #1481

---------

Co-authored-by: Fraser Hutchison <[email protected]>
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Oct 4, 2024
@Lilyjjo Lilyjjo added this pull request to the merge queue Oct 4, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Oct 4, 2024
@Lilyjjo Lilyjjo added this pull request to the merge queue Oct 4, 2024
Merged via the queue into main with commit 0238e67 Oct 4, 2024
44 checks passed
@Lilyjjo Lilyjjo deleted the lilyjjo/check-tx-rewrite branch October 4, 2024 20:41
steezeburger added a commit that referenced this pull request Oct 7, 2024
* main:
  refactor(sequencer): generate `SequencerBlock` after transaction execution in proposal phase (#1562)
  fix(sequencer)!: rewrite check_tx to be more efficient and fix regression (#1515)
  fix(charts): hermes monitoring and prometheus rules  (#1564)
github-merge-queue bot pushed a commit that referenced this pull request Oct 14, 2024
## Summary
Updates composer to work with the new appside mempool. New ABCI error
codes were added and now nonces can be stacked, so the submission logic
should use the GRPC `pending_nonce()` instead of the RPC
`get_latest_nonce()`.

## Background
New ABCI error codes were added in PR
#1515 which need to be handled
in the composer's submission logic.

New Error codes:
- `NONCE_TAKEN` -> try resubmitting with new top-of-pending. If this is
happening because the composer is out of funds, that is okay as the
composer has it's own fund monitoring logic
- `PARKED_FULL` -> normal error case ok
- `ACCOUNT_SIZE_LIMIT` -> normal error case ok 
- `ALREADY_PRESENT` -> normal error case ok, shouldn't be returned 

## Background
We changed the internals of the mempool which resulted in the users of
the mempool to have changes as well.

## Testing
Smoke tests

## Related Issues
closes #1633

---------

Co-authored-by: quasystaty <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sequencer pertaining to the astria-sequencer crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sequencer/mempool: CheckTX fails all rechecks
4 participants