Support cancellation of timer operations #110

Merged
merged 3 commits into from
Nov 10, 2024

Conversation

civodul
Collaborator

@civodul civodul commented Oct 1, 2024

This patch series fixes #109 by supporting a "cancel" function for base operations and using it to remove timer wheel entries when a timer "loses" in a choice operation.
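For readers who have not followed #109, the pattern at stake is a receive-with-timeout built from choice-operation. The following is a minimal sketch, not code from this patch series; receive-with-timeout is a hypothetical helper name, and the sketch assumes the public (fibers channels), (fibers operations), and (fibers timers) modules:

```scheme
(use-modules (fibers)
             (fibers channels)
             (fibers operations)
             (fibers timers))

;; Hypothetical helper: wait for a message on CHANNEL, or give up after
;; SECONDS and return the symbol 'timeout instead.
(define (receive-with-timeout channel seconds)
  (perform-operation
   (choice-operation
    (get-operation channel)
    ;; The wrapper ignores whatever values the timer yields.
    (wrap-operation (sleep-operation seconds)
                    (lambda _ 'timeout)))))
```

Each time get-operation wins, the sleep operation is the one that "loses"; that is the case in which its timer wheel entry should be removed.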

I'd like feedback in particular on the last commit: can someone confirm that set! is good enough, or should we use an atomic box instead?

I ran the test suite of the Shepherd and that of Cuirass against this branch: the former has good coverage though it uses a single POSIX thread, the latter has not-so-good coverage but uses multiple POSIX threads. Both passed.

Thoughts?

@civodul civodul requested a review from wingo October 1, 2024 20:41
@civodul civodul self-assigned this Oct 1, 2024
@wingo
Owner

wingo commented Oct 10, 2024

Looking good, I do have a question though and it's a long one.

Basically, I want to ensure that Concurrent ML's withNack combinator is implementable in a composable way.

  1. If withNack is already implementable with what we have, then what does this patch offer in addition?
  2. If withNack is not currently possible, will it be possible with this work?

To explain withNack, first I should mention CML's guard combinator, which is essentially an operation generator: when you go to perform-operation, that operation may have a guard function, which if it is present will be called to return a (probably) fresh operation ("event", in CML language). guard lets CML's sync (our perform-operation) spawn threads at sync time, send messages, whatever.

In fibers, we have considered guard to be not primitive: fibers provides primitive CML, and if you want full CML, you can layer on top (for example by having a wrapper to perform-operation that might generate additional events).
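One way to picture the layering described above, as a sketch only: represent a guarded operation as a thunk that builds the real operation, and call that thunk from a wrapper around perform-operation. Both perform-guarded-operation and one-sec-timeout are hypothetical names, not part of Fibers:

```scheme
(use-modules (fibers)
             (fibers operations)
             (fibers timers))

;; Hypothetical wrapper: THUNK runs at synchronization time and returns
;; the operation that is actually performed.
(define (perform-guarded-operation thunk)
  (perform-operation (thunk)))

;; The timeout is built when the thunk is called, so its deadline is
;; computed at synchronization time rather than at definition time.
(define one-sec-timeout
  (lambda () (sleep-operation 1)))
```

Called as (perform-guarded-operation one-sec-timeout) inside run-fibers, this waits about a second from the moment of the call. The limitation of this encoding is that the thunk is not itself an operation, so it cannot be handed to choice-operation or wrap-operation.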

Now back to withNack. Quoting the CML book §4.2.5:

> This combinator behaves like guard, in that it takes a function whose evaluation is delayed until synchronization time, and returns it as an event value. The main difference is that at synchronization time, the guard function is applied to an abort event, which is enabled only if some other event is selected in the synchronization.

And then the example goes like this (translated):

(define (annotate-operation op nack-fn)
  (withNack
    (lambda (nack-op)
      (spawn-fiber (lambda ()
                     (nack-fn (perform-operation nack-op))))
      op)))

The idea is that when you make a nack-event, your nack guard function will be called on a fresh event (operation), as if from a wait-operation on a fresh nack condition (see (fibers conditions)). The nack guard function should return a "positive" event (operation). If some other op synchronizes instead of the positive event, the nack condition is signalled.

The fiber spawned by the nack guard function could then, say, send a message back to a remote server to release some kind of resource.

The corner cases come in for withNack on choice operations:

(choice-operation op-a (withNack (choice-operation op-b op-c) ...))

Here we want the nack condition to signal if op-a synchronizes, but not to signal if either of op-b or op-c synchronize.

In https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=9393e5ba6fa5cdcd981fee71c3bdfee8841c047d §5.2, Donnelly and Fluet show how to implement withNack in terms of lower-level primitives. But, I do not fully understand it yet.

So, again, my questions:

  1. If withNack is already implementable with what we have, then what encoding would it have? What does this patch offer in addition?
  2. If withNack is not currently possible, will it be possible with this work?

Also, when we settle on the answer, can I request some documentation, please? Thank you :)

@civodul
Collaborator Author

civodul commented Oct 10, 2024

Hey @wingo,

I'm afraid I cannot answer your questions, this being my first time hearing about guard and withNack, which I'm not sure I fully understand.

What I can say is that this change preserves choice-operation semantics, which is that only one sub-operation succeeds. If withNack were built on top of those semantics, I don't see how explicit cancellation could affect it.

If we cannot answer the question of whether cancellation functions would prevent withNack from being implemented, perhaps what we could do is to not expose them in make-base-operation. Instead we'd make them available through a make-base-operation/internal binding or similar, to avoid committing to the interface change. WDYT?

Regarding documentation, the commit updates the @defun entry for make-base-operation with the new argument, but note that make-base-operation is not actually documented, only mentioned. What would you suggest?

Thanks for your detailed review!

(put-message channel 'hello)
(loop (+ i 1))))))

(let ((initial-heap-size (heap-size)))
Collaborator

If you're checking heap sizes, best do a gc beforehand, otherwise some heavy activity could lead to false negatives (as in 'no bug detected even though it exists').

Would it be possible (and sufficiently informative & meaningful) to instead check the length of the timer wheel? Seems less finicky to me (e.g. what if in the future tests are run in parallel in a single process).
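The first suggestion could look roughly like this. It is a sketch: heap-size-after-gc and heap-growth-ratio are hypothetical helpers, not the test's actual code, and they assume Guile's gc and gc-stats procedures with the heap-size entry of the returned alist:

```scheme
;; Hypothetical helper: force a collection first so that garbage which
;; simply has not been collected yet does not inflate the measurement.
(define (heap-size-after-gc)
  (gc)
  (assoc-ref (gc-stats) 'heap-size))

;; Hypothetical helper: run THUNK and return the ratio of the heap size
;; after the run to the heap size before it.
(define (heap-growth-ratio thunk)
  (let ((before (heap-size-after-gc)))
    (thunk)
    (/ (heap-size-after-gc) before)))
```

A check such as (< (heap-growth-ratio thunk) 2) would then bound heap growth across the loop, though it still measures the whole process heap rather than the timer wheel itself.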

Collaborator

Also, if something finicky like heap sizes is avoided, I imagine the number of iterations could be reduced a lot (good for test performance).

Collaborator Author

I found that when #109 is present, the heap would grow way beyond the 2x limit that's tested here; it would not go unnoticed. (Maybe we could make the test faster but it was already reasonably fast in my experience.)

Collaborator

How fast is 'reasonably fast'?

Collaborator

> I found that when #109 is present, the heap would grow way beyond the 2x limit that's tested here; it would not go unnoticed. (Maybe we could make the test faster but it was already reasonably fast in my experience.)

That's going to lead to false positives (*) in case of concurrent tests, or a hypothetical future 'Guile OS' where all Guile is run in a single process (with appropriate isolation, but also with a shared heap and GC).

(*) where positive = "there is a bug"

Collaborator

This isn't answered yet

Collaborator Author

Sorry I missed this comment as well. Again, I really don't think there's going to be false positives here; please judge for yourself by commenting out the "cancel" function of timers to see where it goes.

That said we can always add a comment in the test to clarify that.

Collaborator

@emixa-d emixa-d Nov 13, 2024

I refuse to [rewrite the foundations of operating systems, Guile processes or Fiber's test suite just to test this elementary logical conclusion]. (brackets added for clarity). And why clarify things when you can just fix things? Surely a length check of the timer wheel would be straightforward, and less noisy than heap size information - it should even be feasible to check the exact length (two iterations should be sufficient, could be increased a little 'just in case').

Also, the question on speed remains unanswered.

;; (sched) -> ()
(cancel-fn base-op-cancel-fn))

(define* (make-base-operation wrap-fn try-fn block-fn
Collaborator

For ABI reasons, there needs to be a way to tell Guile 'don't inline this' (not blocking this PR).

@LiberalArtist

> In https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=9393e5ba6fa5cdcd981fee71c3bdfee8841c047d §5.2, Donnelly and Fluet show how to implement withNack in terms of lower-level primitives. But, I do not fully understand it yet.
>
> So, again, my questions:
>
>   1. If withNack is already implementable with what we have, then what encoding would it have? What does this patch offer in addition?
>   2. If withNack is not currently possible, will it be possible with this work?
>
> Also, when we settle on the answer, can I request some documentation, please? Thank you :)

I have just seen this issue and have only skimmed the paper you linked so far, but I wrote about the lack of withNack in #97.

In particular, I'd highlight § 7 of Flatt and Findler, “Kill-Safe Synchronization Abstractions” (PLDI 2004):

> Recall that the event provided to a guard procedure by nack-guard-evt becomes ready if the guard-generated event is not chosen. MzScheme extends the Concurrent ML definition of “not chosen” so that it includes all of the following cases, which cover all of the ways that a thread can abandon an event:
>
>   • The sync call chooses an event other than the one returned by the guard.
>   • Control escapes from the sync call through an exception or continuation jump. The exception or jump may have been triggered through a break signal (discussed further in Section 8.2), by another guard involved in the same sync, or even by the guard procedure that received the NACK event. Continuation jumps back into a guard are always blocked by our definition of nack-guard-evt, so multiple escapes are not possible.
>   • The syncing thread terminates (i.e., it is suspended and unreachable).
>
> In the code from Figure 9, the event produced by msg-queue-recv-evt can be used in an arbitrary client context, so all of the above cases are possible.
>
> MzScheme’s nack-guard-evt corresponds to Concurrent ML’s withNack. An earlier version [19] of Concurrent ML offered wrapAbort, instead, and a later presentation [21] explains how withNack can be implemented with wrapAbort. Our definition of “not chosen” does not allow such an implementation, and thus strengthens the argument that withNack is the right operation to designate as primitive.

I haven't read Donnelly and Fluet closely enough (or re-read the implementation of wrapAbort from Concurrent Programming in ML) to know if it works with the Racket definition of “not chosen”, but in my experience Racket's definition of “not chosen” is useful, so it's something I'd want to know.

I wrote in #97 that I don't think make-base-operation is sufficient even to implement the guard combinator, and from a brief look I don't see how this PR would change that. (I hadn't considered what @wingo wrote about expecting guard to be implemented by a client abstraction that doesn't expose the underlying perform-operation, but that strikes me as unappealing in a similar way to providing an un-delimited call/cc and requiring client libraries to wrap it to implement delimited continuations. Part of the benefit of having a Concurrent ML library is that other libraries should be able to have operations in their APIs, and I think you lose expressive power if using guard or withNack would require a different abstraction.)

The mention of the fact in Flatt and Findler that “continuation jumps back into a guard are always blocked by our definition of nack-guard-evt” also reminded me of a bug report on the semantics of Guile's continuation barriers that I mostly wrote two years ago but never quite sent: I'll try to do that soon.

@civodul
Collaborator Author

civodul commented Nov 3, 2024

Hey @LiberalArtist,

Thanks for the explanation and for the reference. It would seem to me that what I called "cancellation" here could actually help implement nack-guard-evt because it lets you know that an option in a choice operation was "not chosen".

At any rate, the goal of this PR is to fix #109, which I consider serious. As long as it doesn't prevent future work such as implementing nack-guard-evt (which seems to be the case in my understanding) and in the absence of alternate proposals to fix #109, I would suggest pushing this.

Would you object, @LiberalArtist, @wingo, or @emixa-d?

@LiberalArtist

> At any rate, the goal of this PR is to fix #109, which I consider serious. As long as it doesn't prevent future work such as implementing nack-guard-evt (which seems to be the case in my understanding)

Good point to avoid blocking this bug fix, if possible, while thinking through deeper questions.

If you went with this suggestion:

> If we cannot answer the question of whether cancellation functions would prevent withNack from being implemented, perhaps what we could do is to not expose them in make-base-operation. Instead we'd make them available through a make-base-operation/internal binding or similar, to avoid committing to the interface change. WDYT?

do I understand correctly that this fix would not add any new API, just fix a problem with the current implementation? Or is (fibers timer-wheel) also public?

If there's no new API, I don't see how this could possibly pose a (new) problem.

One subtlety for cancellation and withNack is that being “not chosen” has to be atomic: if the guarded event is chosen, the NACK event must never become ready for synchronization. Here's a short (and therefore contrived) example:

#lang racket
(define saved-nack #f)
(sync (nack-guard-evt (λ (gave-up-evt)
                        (set! saved-nack gave-up-evt)
                        always-evt)))
(sync saved-nack) ; must never return

(define wheel-entry
;; If true, this is the currently active timer entry for this operation.
#f)

Collaborator

This looks like a problem. Previously, operation objects were reusable, and nothing suggested you need to re-create the operation object (convenient for loops!). Now they aren't, and worse it's undocumented.

If state is needed, it needs to be moved into arguments of one of the 'lambdas' below (is API change, but you could define separate 'make-base-operation' and 'make-base-operation/stateful'). Another option is to rename 'flag' to 'state', make it a pair of (atomic box with flag . wheel-entry) and allow overriding the default construction of the state ((fibers operations) initialises to an atomic box, but it doesn't actually use its contents -- doing something with it is left entirely to the individual implementations).

Collaborator Author

I agree that passing state around would be more elegant. In practice though, the current approach is okay IMO because the variable is closed over by the closures of the operation.

In this case, I fixed the problem you mentioned (being able to reuse a timer operation after it's been "canceled") simply by resetting the timer-wheel variable upon cancellation.

@LiberalArtist

> In practice though, the current approach is okay IMO because the variable is closed over by the closures of the operation.

AFAICT, there is still a problem if a single timer operation value is used concurrently from multiple fibers. (Having guard-operation would make this pattern work.)

Collaborator Author

> AFAICT, there is still a problem if a single timer operation value is used concurrently from multiple fibers. (Having guard-operation would make this pattern work.)

Apologies @LiberalArtist, I missed this comment of yours.

Would using an atomic box (as I suggested in the first message) solve the problem?

Collaborator

I suppose it might be possible to adjust the timer wheel implementation to use atomic boxes everywhere + compare and swap, but that seems inefficient, and because two pointers need to be replaced, difficult to do correctly and verify for correctness. It seems simpler & less error-prone to me to simply add a state argument.

In case of 'replace #f' by 'atomic box containing #f', no. We need to assign things to the right thread (in particular, the right scheduler, because of work stealing(?)) and atomic boxes don't do such things, they impose ordering constraints.

Collaborator

(Note: IIUC, if/when guard is created, an explicit 'state' argument could be eliminated, although a 'state' operation-construction API could still be provided for convenience.)

Collaborator Author

Oh, re atomicity, I was thinking about the (set! timer-wheel ...) bit, but you're saying the timer wheel implementation itself should be made atomic?

I must say I'm unclear on that, though my understanding is that there's one timer wheel per scheduler and one scheduler per thread, no?

@wingo?

Collaborator

I'm not saying that it should be made atomic. The state version is also an option.

> though my understanding is that there's one timer wheel per scheduler and one scheduler per thread, no?

And that's why, if you don't go for the state version, it needs to be atomic -- you are saving a timer wheel entry of the current scheduler in the closure, and due to work stealing the fiber might migrate to another scheduler (IIUC), so the cancellation can be run from another scheduler, and fiddling with another thread's data structures leads to trouble unless designed for that.

@emixa-d
Collaborator

emixa-d commented Nov 4, 2024

> At any rate, the goal of this PR is to fix #109, which I consider serious. As long as it doesn't prevent future work such as implementing nack-guard-evt (which seems to be the case in my understanding) and in the absence of alternate proposals to fix #109, I would suggest pushing this.
>
> Would you object, @LiberalArtist, @wingo, or @emixa-d?

I don't think questions on whether it is sufficient for withNack are particularly essential yet. It is necessary for perform-operation, so it also is necessary for withNack (although possibly insufficient). It might perhaps be the case that it is insufficient and a different API might be needed in the future, but I don't expect anyone to bring an implementation for withNack in Fibers anytime soon. As long as we are honest and upfront about not being certain what the cancel API should be / will be, a future API break is fine.

Also see my comments on code and tests.

@emixa-d
Collaborator

emixa-d commented Nov 4, 2024

> do I understand correctly that this fix would not add any new API, just fix a problem with the current implementation? Or is (fibers timer-wheel) also public?

No. It adds API to make-base-operation, which is public (though lacking in documentation -- its only documentation is that it exists, what its function is, and what arguments it has). It also makes an incompatible change to semantics: it makes timer operations non-reusable, you need to make a copy each time to re-perform it. (At least, it appears that it might get confused somewhere w.r.t. the timer wheel thingie.)

Where it is necessary (or not strictly necessary, but very convenient for implementation or performance), I'm not against making operations single-use, but then it should actually be documented that the particular operation is single-use (and it should also be defined what single-use means: only once per call to perform-operation? or only once performed (not the same thing in case of wrap-operation)?).

> If there's no new API, I don't see how this could possibly pose a (new) problem.
>
> One subtlety for cancellation and withNack is that being “not chosen” has to be atomic: if the guarded event is chosen, the NACK event must never become ready for synchronization. Here's a short (and therefore contrived) example:
>
> #lang racket
> (define saved-nack #f)
> (sync (nack-guard-evt (λ (gave-up-evt)
>                         (set! saved-nack gave-up-evt)
>                         always-evt)))
> (sync saved-nack) ; must never return

Why? I gather that this is Racket semantics, but I don't see why this limitation should be included in Guile -- the second sync (perform-operation?) is, well, a second sync, not the first sync. Some kind of state might perhaps be needed for nack stuff, but that state doesn't need to be part of the operation object itself (or maybe it does for that 'perform an operation from within another operation stuff', idk). And if the state is part of the object itself, then see what I wrote previously about operation reuse.

Edit: right, I imagine this might perhaps be useful for complex combinations of operations and nesting, but this still is a serious incompatible change that shouldn't be swept under the rug (in a previous version of Scheme-GNUnet, I constructed a complex operation before a loop and then reused it as an optimisation). So, should be a v2.0 version thing.

@LiberalArtist

> In https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=9393e5ba6fa5cdcd981fee71c3bdfee8841c047d §5.2, Donnelly and Fluet show how to implement withNack in terms of lower-level primitives. But, I do not fully understand it yet.

I've read (part of) this again, and I think (but I could definitely be wrong!) that the "lower-level primitive" needed is thenEvt from their proposed TE Haskell system: that is, withNack is still primitive to CML. In particular, note that their TE Haskell Evt type is different than their CMLEvt type:

type CMLEvt a = IO ([AckVar], Evt ([AckVar], IO a))

I'm also not sure to what extent Donnelly and Fluet's system makes sense in a language with impure side-effects. (It might indeed make sense, I just haven't figured it out for myself one way or the other yet.)

If you're interested in transactions and three-way rendezvous, I remember someone (maybe Sam Tobin-Hochstadt?) saying Aaron Turon's “reagents” would be a good place to look (but I haven't read it in any kind of detail).

> I don't expect anyone to bring an implementation for withNack in Fibers anytime soon

It's not for me to say what the goal of Fibers should be.

But if the goal is to provide an implementation of CML primitives—and I do think that is a useful goal!—then I would argue that Fibers should plan to provide both missing CML primitives, guard-operation and nack-guard-operation, especially since Reppy, as I cited in #97, says that the withNack combinator is a unique contribution of CML.

> Why? I gather that this is Racket semantics, but I don't see why this limitation should be included in Guile -- the second sync (perform-operation?) is, well, a second sync, not the first sync. Some kind of state might perhaps be needed for nack stuff, but that state doesn't need to be part of the operation object itself (or maybe it does for that 'perform an operation from within another operation stuff', idk). And if the state is part of the object itself, then see what I wrote previously about operation reuse.

This is the Concurrent ML semantics of guard and withNack: the callback function is called each time the guarded event is passed to sync (aka perform-operation). The NACK event is thus specific to that particular sync. Semantically, you could define guard-operation in terms of nack-guard-operation (but not creating unused NACKs is obviously more efficient):

(define (guard-operation thunk)
  (nack-guard-operation (lambda (ignored-nack)
                          (thunk))))

This is why I wrote in #97:

> I don't think make-base-operation is sufficient to implement either the guard or the withNack combinator. A naïve implementation of guard-operation could use try and block functions that close over some shared mutable state, but there doesn't seem to be any way for an implementation to distinguish each call to perform-operation, which it needs to do in order to get a fresh underlying operation from the thunk for each call.

As far as reuse:

> This looks like a problem. Previously, operation objects were reusable, and nothing suggested you need to re-create the operation object (convenient for loops!). Now they aren't, and worse it's undocumented.

Yes, operations should be reusable, but maybe in a subtler sense than one might initially guess.

I think the Donnelly and Fluet paper explained this well on p. 653, in particular the sentence I've put in bold here:

> In general, one often wants to implement a protocol consisting of a sequence of communications c_1 ; c_2 ; … ; c_n. To use such a protocol in CML, one of the c_i must be designated as the commit point, the communication by which this protocol is chosen over others in a choose. The entire protocol may be packaged as an event value by using guard to prefix the communications c_1 ; … ; c_i−1 and using wrap to postfix the communications c_i+1 ; … ; c_n. Note all of the pre-synchronous communications must terminate in order for guard to yield the commit point communication; likewise, all of the post-synchronous communications must terminate in order for wrap to yield the synchronization result. One must be careful to maintain program invariants after pre-synchronous actions, since the corresponding commit point communication and post-synchronous action may not be chosen. This motivates the need for withNack, as a mean to signal compensating actions in non-chosen protocols.

(I'd have to look more closely to say whether the implementation in this PR maintains the needed invariants.)

For a concrete example, I strongly recommend the paper “Kill-Safe Synchronization Abstractions”, which walks through the implementation of the racket/async-channel library. Given some definitions:

(define ach
  (make-async-channel))
(define e
  (async-channel-put-evt ach "message"))

The e operation is completely reusable: you can use it to try to put "message" on ach as many times as you want, and, indeed, it can successfully put "message" on ach many times. In contrast, the implementation of async-channel-put-evt may use a guard or withNack callback to create some operations for each attempt to put "message" on ach. Those operations, NACKs or otherwise, are also reusable (in that the manager fiber for ach may supply such operations to sync many times, e.g. if it services requests from other fibers while our initial fiber remains blocked), but they are specific to one particular attempt to communicate. The operation returned by a guard/withNack callback is only accessible to the implementation of perform-operation.

Of course, it is possible to implement an operation that won't become ready more than once:

(define (once-operation val)
  (define ch (make-channel))
  (spawn-fiber (lambda ()
                 (put-message ch val)))
  (get-operation ch))

> > do I understand correctly that this fix would not add any new API, just fix a problem with the current implementation? Or is (fibers timer-wheel) also public?
>
> No. It adds API to make-base-operation, which is public (though lacking in documentation -- its only documentation is that it exists, what its function is, and what arguments it has).

Sorry, that may not have been totally clear: I was assuming at that point taking @civodul's suggestion in #110 (comment) to not change make-base-operation and instead add cancellation to make-base-operation/internal or something.

But is (fibers timer-wheel) public?

> It also makes an incompatible change to semantics: it makes timer operations non-reusable, you need to make a copy each time to re-perform it. (At least, it appears that it might get confused somewhere w.r.t. the timer wheel thingie.)

This is a separate, important question that, as I said, I'd have to look again more closely to try to answer.

@emixa-d
Collaborator

emixa-d commented Nov 5, 2024

> > I don't expect anyone to bring an implementation for withNack in Fibers anytime soon
>
> It's not for me to say what the goal of Fibers should be.
>
> But if the goal is to provide an implementation of CML primitives—and I do think that is a useful goal!—then I would argue that Fibers should plan to provide both missing CML primitives, guard-operation and nack-guard-operation, especially since Reppy, as I cited in #97, says that the withNack combinator is a unique contribution of CML.

Note that this reply and what you replied to don't have to do with each other - fixing cancellation shouldn't hinder implementation of extra operations, and if those extra operations need changes to cancellation then that can simply be done, as mentioned previously ... Also, it is for you to say what the goal should be as much as it is for anyone else here.

Also, I don't consider that the goal of Fibers. For me the goal of Fibers is to provide a useful concurrency library based on fibers and composable operations, not to reimplement CML the exact same way (neither is it to not do that, and neither is the goal to reinvent the wheel).

> > Why? I gather that this is Racket semantics, but I don't see why this limitation should be included in Guile -- the second sync (perform-operation?) is, well, a second sync, not the first sync. Some kind of state might perhaps be needed for nack stuff, but that state doesn't need to be part of the operation object itself (or maybe it does for that 'perform an operation from within another operation stuff', idk). And if the state is part of the object itself, then see what I wrote previously about operation reuse.
>
> This is the Concurrent ML semantics of guard and withNack: the callback function is called each time the guarded event is passed to sync (aka perform-operation). The NACK event is thus specific to that particular sync. Semantically, you could define guard-operation in terms of nack-guard-operation (but not creating unused NACKs is obviously more efficient):
>
> (define (guard-operation thunk)
>   (nack-guard-operation (lambda (ignored-nack)
>                           (thunk))))

"Q: Why do this semantics of P for R? A: because that's the semantics of Q too."

Your description of this semantics doesn't explain why these semantics (to people who don't share your goal of implementing Concurrent ML). 'X being particular to a Y' could perhaps also be handled by ... what I mentioned in my comment about state (and elsewhere about extra arguments to procedures in the make-base-operation). Or possibly it can't (going by the spawn-fiber stuff and synchronisation point stuff, it likely can't, but your explanation doesn't work).

This is why I wrote in #97:

I don't think make-base-operation is sufficient to implement either the guard or the withNack combinator. A naïve implementation of guard-operation could use try and block functions that close over some shared mutable state, but there doesn't seem to be any way for an implementation to distinguish each call to perform-operation, which it needs to do in order to get a fresh underlying operation from the thunk for each call.

This does not follow. Why would they need to distinguish between each call, and why would it need to be fresh? This does not at all seem necessary to get the underlying operation of the 'guard-evt' thunk (which does not actually need to be fresh -- because of its nature it probably is fresh, but there is no actual requirement for freshness, and in case of 'only perform once' operations, it sometimes might perhaps be useful to allow for returning the same operation multiple times -- in short, it's none of guard-evt's business whether the operations generated by the thunk are fresh).

Going by the documentation of guard-evt and nack-guard-evt, it shouldn't distinguish between calls of perform-operation (emphasis added):

[guard-evt] Event generation is important for one-sec-timeout, which must construct an alarm
time based on the time that one-sec-timeout is used, not when
one-sec-timeout is created.

-- all the perform-operation calls are treated the same way: call the thunk to get an operation, then perform that operation, and don't add state on top of this beyond what is done by the (user-provided) thunk.
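To make the "call the thunk, then perform the result" reading concrete, here is a minimal sketch of guard layered on top of perform-operation. It is only an illustration, not something Fibers provides: perform-guarded and one-sec-timeout are hypothetical names, and the sketch assumes that an operation returned by sleep-operation has its expiry fixed when it is created, which is why the thunk builds a new one on each use.

```scheme
(use-modules (fibers)
             (fibers operations)
             (fibers timers))

;; Hypothetical helper, not part of Fibers: call THUNK at synchronization
;; time and perform whatever operation it returns. No state is kept
;; between calls.
(define (perform-guarded thunk)
  (perform-operation (thunk)))

;; Hypothetical thunk: builds a timeout operation each time it is called,
;; so the alarm time is relative to the moment of use, not of definition.
(define (one-sec-timeout)
  (wrap-operation (sleep-operation 1) (const 'timeout)))

(define results
  (run-fibers
   (lambda ()
     ;; Two independent synchronizations; each one calls the thunk again.
     (let* ((a (perform-guarded one-sec-timeout))
            (b (perform-guarded one-sec-timeout)))
       (list a b)))))
```

A wrapper like this does not compose inside choice-operation, since the thunk is only called by the wrapper itself; that part would need support in the operations layer.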

I can imagine that re-use of the void-argument (in nack) is much more limited (and perhaps changes in make-base-operation might come in there, but you haven't proven that part), but that's a much more limited area.

As far as reuse:

This looks like a problem. Previously, operation objects are reusable, and nothing suggested you need to re-create the operation object (convenient for loops!). Now they aren't, and worse it's undocumented.

Yes, operations should be reusable, but maybe in a subtler sense than one might initially guess.

I think the Donnelly and Fluet paper explained this well on p. 653, in particular the sentence I've put in bold here:

In general, one often wants to implement a protocol consisting of a sequence of communications c_1 ; c_2 ; … ; c_n. To use such a protocol in CML, one of the c_i must be designated as the commit point, the communication by which this protocol is chosen over others in a choose. The entire protocol may be packaged as an event value by using guard to prefix the communications c_1 ; … ; c_i−1 and using wrap to postfix the communications c_i+1 ; … ; c_n. Note all of the pre-synchronous communications must terminate in order for guard to yield the commit point communication; likewise, all of the post-synchronous communications must terminate in order for wrap to yield the synchronization result. One must be careful to maintain program invariants after pre-synchronous actions, since the corresponding commit point communication and post-synchronous action may not be chosen. This motivates the need for withNack,
as a mean to signal compensating actions in non-chosen protocols.

timer-operation is not one of those. If for some protocol you want a particular operation to only do its thing once (parts in sequence, ...) and otherwise act as a 'do-never-succeed' operation, that's fine, but that's the exception. It should be limited to particular operations where it is the (only) sensible behavior, and to `(make-only-once-operation [base operation])` wrappers, not something implicit, as it's usually treated in descriptions.

For a concrete example, I strongly recommend the paper “Kill-Safe Synchronization Abstractions”, which walks through the implementation of the racket/async-channel library. Given some definitions:
[...]

I recommend against that paper, because the following paragraph from the paper is untrue in Fibers:

If a queue becomes unreachable, its manager thread is garbage
collected. More generally, when a thread becomes permanently
blocked because all objects that can unblock it become unreachable, the thread itself becomes unreachable, and its resources can
be reclaimed by the garbage collector.

(There is no such GC behaviour in Fibers yet -- fibers are strongly referenced, IIRC.) Also, that behaviour is wrong -- if the fiber is inside a dynamic-wind* for clean-up-by-unwinding, such unwinding and cleanup should be performed. It is also wrong in another sense -- fibers are also used for purposes other than writing complex operations, and usually, when a fiber can't ever unblock because the references are gone, it is preferable to let it error out so people know there is a bug. (For 'spawn a fiber and just GC it if it can't unblock because of GC reachability reasons', there can be a separate spawn-fiber/gc or keyword argument.)

Of course, it is possible to implement an operation that won't become ready more than once:

(define (once-operation val)
  (define ch (make-channel))
  (spawn-fiber (lambda ()
                 (put-message ch val)))
  (get-operation ch))

This is not an implementation of the thing, see my previous paragraph.

do I understand correctly that this fix would not add any new API, just fix a problem with the current implementation? Or is (fibers timer-wheel) also public?

No. It adds API to make-base-operation, which is public (though lacking in documentation -- its only documentation is that it exists, what its function is, and what arguments it has).

Sorry, that may not have been totally clear: I was assuming at that point taking @civodul's suggestion in #110 (comment) to not change make-base-operation and instead add cancellation to make-base-operation/internal or something.

But is (fibers timer-wheel) public?

There is no list of public/non-public in Fibers. I would treat it as non-public.

It also makes an incompatible change to semantics: it makes timer operations non-reusable, you need to make a copy each time to re-perform it. (At least, it appears that it might get confused somewhere w.r.t. the timer wheel thingie.)

This is a separate, important question that, as I said, I'd have to look again more closely to try to answer.

It's the other way around -- this is not the separate question, all the questions about 'guard' and 'nack' are the separate questions (see: title of this PR).

@civodul civodul force-pushed the wip-cancel-timer-operations branch from 027411a to 19e992b Compare November 6, 2024 18:16
Collaborator Author

civodul commented Nov 6, 2024

do I understand correctly that this fix would not add any new API, just fix a problem with the current implementation? Or is (fibers timer-wheel) also public?

If there's no new API, I don't see how this could possibly pose a (new) problem.

The update I just pushed introduces make-base-operation/internal (with a docstring that explains what cancel-fn is about and stresses that this is internal-use-only) and uses that in (fibers timers).

* fibers/operations.scm (<base-op>): Rename constructor to
‘%make-base-operation’.
[cancel-fn]: New field.
(make-base-operation/internal, cancel-other-operations): New procedures.
(perform-operation)[block]: Define ‘resume’ to call
‘cancel-other-operations’ before calling the real ‘resume’.
* fibers/timer-wheel.scm (timer-wheel-remove!): New procedure.
Fixes wingo#109.

Previously, an operation like:

  (choice-operation (sleep-operation 1234) (get-operation channel))

would accumulate timer wheel entries every time the ‘get’ operation wins
over the ‘sleep’ operation, potentially leading to unbounded memory
usage (each ‘sleep’ timer and its associated continuation would remain
on the wheel for 1234 seconds in this case).

This commit fixes it by removing the timer wheel entry as soon as the
timer operation is canceled.

* fibers/timers.scm (timer-operation)[wheel-entry]: New variable.
Set it in block function.  Use ‘make-base-operation/internal’ and add
cancel function.
* fibers/scheduler.scm (scheduler-timers): Export.
* tests/cancel-timer.scm: New file.
* Makefile.am (TESTS): Add it.
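
For illustration, the pattern described in the commit message above can be exercised with a loop like the following. This is a minimal sketch, not the test added in tests/cancel-timer.scm; message-count, received-count and the loop structure are assumptions made for the example. It only runs the choice repeatedly and does not inspect the timer wheel. Going by the commit message, before the fix each iteration in which the ‘get’ wins would leave one entry on the wheel for 1234 seconds.

```scheme
(use-modules (fibers)
             (fibers channels)
             (fibers operations)
             (fibers timers))

;; Arbitrary number of iterations chosen for this sketch.
(define message-count 1000)

(define received-count
  (run-fibers
   (lambda ()
     (define channel (make-channel))

     ;; Sender: keeps the 'get' side of the choice winning.
     (spawn-fiber
      (lambda ()
        (let loop ((i 0))
          (when (< i message-count)
            (put-message channel i)
            (loop (+ i 1))))))

     ;; Receiver: builds a new choice between a long sleep and a 'get' on
     ;; every iteration. Each time 'get' wins, the sleep timer has lost.
     (let loop ((received 0))
       (if (< received message-count)
           (let ((result (perform-operation
                          (choice-operation
                           (wrap-operation (sleep-operation 1234)
                                           (const 'timeout))
                           (get-operation channel)))))
             (if (eq? result 'timeout)
                 received
                 (loop (+ received 1))))
           received)))))
```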
@civodul civodul force-pushed the wip-cancel-timer-operations branch from 19e992b to 6facb6a Compare November 10, 2024 15:05
Collaborator Author

civodul commented Nov 10, 2024

Oops, I messed up with the default value of the cancel-fn field in that second version. Should be fixed now, waiting for CI...

@civodul civodul merged commit 01f475f into wingo:master Nov 10, 2024
4 checks passed
@civodul civodul deleted the wip-cancel-timer-operations branch November 10, 2024 15:19
@@ -141,6 +142,18 @@
(else
(timer-wheel-add! (or outer (add-outer-wheel! wheel)) t obj)))))))

(define (timer-wheel-remove! wheel entry)
Collaborator

The argument wheel is unused

Successfully merging this pull request may close these issues.

Memory leak on choice operation of 'get' and 'sleep'
4 participants