
The Checklist #1

Closed
89 tasks done
nfagerlund opened this issue Mar 16, 2024 · 4 comments


nfagerlund commented Mar 16, 2024

  • Re-implement whole database layer
  • Port existing DB tests
  • Refine user extension/extractor type
    • (I didn't like the AuthUser abstraction from the prototype.)
  • Implement Session/Token auth middlewares
  • Port remaining templates
  • Bookmarklet templater+cruncher
  • Generalize a convenient error response type
  • Config file and CLI opts parser
  • Change some DB access behaviors to follow some of the best practices from https://kerkour.com/sqlite-for-servers
    • explicit 5s busy_timeout
    • Transaction mode BEGIN IMMEDIATE...
      • Uh... oh. sqlx doesn't support this.
      • OK, so based on my reading, this matters when a transaction does a write after a read. The DEFERRED transaction mode starts out as a read transaction, attempts to upgrade itself to a write transaction, and the whole thing fails with SQLITE_BUSY if there's another writer active at that point, bypassing the requested busy_timeout behavior. Nasty!
      • But, I'm not actually using multi-statement transactions atm.
      • I should be, though! In a couple places!
      • Those read-only ones pretty much don't matter (they're counts)... The write ones actually might not matter either, because they're just bumping the expiry date or the last-used date. I think I'm actually fine with doing those on separate DB connections instead of hogging the write connection for the duration.
      • But if I was doing something more complex in transactions, I'd be in a pickle 🤨
    • Split writer/reader pools in the Db struct:
      • max 1 conn for writers,
      • max (cpu count) conns for readers? or should it be more in case you hold one across an await point?
    • synchronous::normal
    • tweak cache size?? I can't find this in the pool or db opts, and I don't have evidence yet that it's needed.
    • PRAGMA temp_store = memory; (set it in the connect opts)
  • Migration handling
    • Build migrations exec into own CLI, so we don't need sqlx CLI on deployment target
    • Check for up-to-date migrations before serving, and bail with error if DB is wrong
  • Fix db routes that return mixture of sqlx error and user error, to disambiguate 4xx and 5xx problems
    • Users::create (unique constraint violation)
    • Dogears::create (")
    • Dogears::update (url parse)
    • Dogears::list (offset calculation)
    • Tokens::list (")

Implement routes

✅ all done!
  • /api/v1 -- API json routes:
    • POST /create
      • scope: write or manage
      • body: {prefix, current, display_name}
      • 201+content or 400+{error: "text"}
    • POST /update
      • scope: write or manage
      • body: {current}
      • Only CORS-enabled endpoint!
        • if cors: validate Origin header matches requested update.
      • 200+content or 404+empty
    • GET /list
      • scope: manage
      • query: page,size (optional)
      • 200+content or 400+{error: "text"}
    • DELETE /dogear/:id
      • scope: manage
      • 204 or 404, empties
      • also, fix the client delete button to say something useful on 404?
  • POST /login
    • yeah. this will all work slightly differently.
    • redirect to returnTo
  • POST /logout
    • yeah
    • user required
    • redirect to /
  • POST /signup
    • body: {new_username, new_password, new_password_again, email}
    • user forbidden (403 if logged in)
    • 400 if passwords didn't match, 400 if create fails (?? idk)
    • create login session if success,
    • redirect to /
  • POST /changepassword
    • user required (401+text)
    • body: {password, new_password, new_password_again}
    • validate length and matching new passwords
    • re-authenticate old password
    • 400 (bad fields) or 403 (bad old pass) or 500 (on db fail)
  • GET /account
    • user required (401+text)
    • query: page,size (optional)
      • probably have a standard query deserialize type here.
    • list tokens
  • GET /fragments/tokens
    • user required (404)
    • query: page,size (optional)
  • DELETE /tokens/:id
    • user required (401)
    • 204+empty or 400+text
      • should probably switch on error types (not found vs 500 db error)
  • GET /status
    • 204+empty
  • GET /
    • heyyy
    • user OR serve login page.
    • query: page,size optional
  • GET /fragments/dogears
    • user required or 404
    • query: page,size optional
  • GET /install
    • user optional, but always serve page
    • bookmarklets!
  • POST /fragments/personalmark
    • user required (404)
  • GET /faq
  • GET /mark/:url
    • user OR serve login page with returnTo.
    • One of:
      • Update existing dogear in slowmode
      • Error from failing to update existing dogear (or check its existence)
      • Create new dogear (render create form)
  • POST /mark
    • user required (redirect to /)
    • body: {prefix, current, display_name}
    • Oh. This one's a "plain" form vulnerable to CSRF, actually. Better get that.
  • GET /resume/:url
    • user OR serve login page with returnTo
    • one of:
      • find, then redirect to current
      • not found: render create
      • db error: error

Port route tests

✅ all done!
  • Port existing application/route tests (once some routes are on-line)
    • API routes: test json return formats and status codes
      • api: list
      • api: delete
      • api: create
      • api: update
    • CORS: most API routes are nope, except update.
    • Unauthenticated:
      • most things 401, except login and signup and some info pages.
      • Some pages turn into login page!
      • Some things 404 instead of 401? maybe.
    • post signup
    • post login
    • post logout
    • /public
    • /account
      • and /fragments/tokens
    • index (pagination, expected content)
    • /faq
    • /install
    • /fragments/dogears
    • post /fragments/personalmark
    • /mark/:url (marked on success, create on fail)
    • post /mark
    • /resume/:url
    • post /changepassword
    • post /change_email
    • post /delete_account
    • delete /tokens/:id (never tested in old)

Downhill

  • Background task to delete stale sessions
  • Graceful shutdown wired through the whole multi-prong runtime
  • Multi-modal server
    • Fix graceful shutdown in busride-rs
  • DB backup cron job (to B2?) and retention policy
    • After considering the nature of the data, I think 30 days of daily backups and 12 months of monthly backups should be sufficient. Losing a day of reading progress is just not that catastrophic, in the context of the total infrastructure meltdown that would be necessary to make it occur.
    • It looks like bucket-level file retention policies can do this rotation for me automatically, if I just repeatedly upload files to the same filenames (eardogger-prod-db-daily.db.gz, e.g.). 🤔
    • The part I'm still not sure about is validation. I think we want to periodically check that the backup is fresher than a certain age, download and unpack it, check the format validity, and select a piece of signal data out of it. ...and then maybe ping a dead-man's switch? That last part has to touch something that's off-server and is able to email me. Hmmmmmmm.
  • Deploy to https subdomain in scrub-prod mode, perform long soak test
  • Logfile appender for tracing subscriber
  • Build improvements for sqlx macros (pre-baked data, build script)
  • Release scripting (tar up the binary, public, migrations, example config, readme, example htaccess) (github releases??)
  • Look into making busriders debug events less spammy
    • Call was coming from inside the house. (instrument on route handlers without skip_all was cramming the dogstate into span attrs, and we're several spans deep when a sqlx::query debug event occurs.)
  • Data import scripting from eardogger1 postgres DB
    • (make it bi-directional in case we need to bail out and revive the old one.)
  • Do local fix-and-patch for tracing-appender's "never rollover logs for short-lived apps" bug.
    • This is tokio-rs/tracing bug 2937 (not linking, to reduce cross-traffic noise in foreign repos). There's a PR, and I commented on it to see if the person will make improvements. But I don't feel like waiting on that.
      • It only prunes log files over the maximum count if the app 1. lives through a log rotation boundary and 2. subsequently tries to emit a log. When running in Scavenger Mode, we'll only meet that requirement if someone happens to make requests to the site during the 5m window around midnight UTC. 😆
  • Improve the README until it's at least as good as v1's was
    • Needs infrastructure / operation / deployment notes, especially. Existing stuff's in a jumble bc I was in a hurry.

nfagerlund commented Apr 28, 2024

OK, as for graceful shutdown, it looks like the story goes something like this, courtesy of the tokio docs topic:

  • Make both a cancellation token and a task tracker.
  • Anything that might need to spawn background tasks needs a clone of both of those.
    • The task itself needs a clone of the token, and needs to select! on it in its core loop so it can bail early instead of running to completion if needs be.
    • The task should get spawned via the tracker.
  • The server needs to stop accepting connections and finish out any WIP ones.
    • If using axum::serve, call .with_graceful_shutdown() on it and pass a clone of the cancel token as the shutdown future.
    • If using busride-rs, ...ah, okay, actually I need to refactor this, because it's currently just grabbing ownership of the cancellation logic and I want it to accept an arbitrary shutdown future instead, like axum::serve does. But anyway, thejokr/fastcgi-server's async runner has built-in facility for this and busride-rs already calls it.
      • It looks like we don't need to propagate the cancellation token into the tasks busride-rs spawns to handle requests, since the runner's built-in shutdown facility covers those.
  • In order to decide to cancel, something in the outermost program loop (probably a spawned task outside the tracker) needs to do a select! on signal handlers for SIGINT (ctrl-c) and SIGTERM (the plain kill cmd, also sent by mod_fcgid as its warning shot before falling back to SIGKILL), and call token.cancel() in response. You'll also want to respond with a cancel to anything else you want to interpret as a shutdown.
  • Once you decide to cancel you need to tracker.close()... then you need to wait. Specifically, you need to tracker.wait().await, and you need to await some kind of future that represents the server coming to a conclusion (it'll conclude if it has cancellation wired through).

@nfagerlund

Hmm, so then I sort of need to decide who's allowed to spawn tasks, in terms of separation of concerns. I want the token and session DB helpers to be able to spawn fire-and-forget writes in their authenticate paths, because that unblocks the read portion from waiting on the writer thread... but I don't think I want the DB helpers to own the long-lived periodic session purging job.

Yeah, moving the async-last-used-update part out of the db helpers would make them way too annoying, so db definitely needs access to the task tracker. I think it doesn't need cancellation though?

nfagerlund added a commit that referenced this issue Apr 29, 2024
As described in the comments to #1,
and following the guidance from https://tokio.rs/tokio/topics/shutdown:

- Use a CancellationToken to signal long-running tasks to wrap up their business.
- Use a central TaskTracker to spawn any tasks that might need time to finish up.
- Listen to external signals in one spot, then translate them to a `.cancel()`
  that everyone else can listen for.
- Await all relevant futures before exiting. In our case, that's 1. the main
  serve loop, and 2. `tracker.close(); tracker.wait().await;`.

nfagerlund commented May 5, 2024

An unanswered question: Hey, why does the "real" stdout of the fastcgi server get routed into apache's error_log on my local playground, but NOT on dreamhost?

Ok, so this has always been the case:

    if ((rv = apr_procattr_create(&procattr, procnode->proc_pool)) != APR_SUCCESS
        || (rv = apr_procattr_child_err_set(procattr,
                                            procinfo->main_server->error_log,
                                            NULL)) != APR_SUCCESS
        || (rv = apr_procattr_child_out_set(procattr,
                                            procinfo->main_server->error_log,
                                            NULL)) != APR_SUCCESS
// ...

OK yup, gdi, the answer's right in front of me: main_server->error_log. If I switch my local VirtualHost to have its own ErrorLog, it stays empty and the fastcgi app keeps spamming the main server error log. Well! That's! Quite bad!!!! I'd better silence the fmt subscriber when doing fcgi mode.

And, that confirms that I need to roll my own app log appender rather than figuring out how to log to apache logs.

lol, at least I wasn't setting RUST_LOG=trace on my soak test.

@nfagerlund

shit dude, I think we've done it
