Account for LLM sampling uncertainty via hierarchical bootstrap #1118
Conversation
Force-pushed from 5ffc69c to c715bd7.
Hi @jjallaire @dragonstyle, this functionality is important to us and is one of the last things holding up the release of our new Inspect-based benchmarking hub. Is there anything I can do to make it easier for you to say yes? For example, I'm open to making this a new metric (or modifying the existing one).

It's not sufficient for us to define a new metric outside Inspect, because (as explained under "Changes to types") we need access to the unreduced scores.
A couple of comments:
Curious about your thoughts.
I think it's arguable either way. The current As I said I'm happy to leave
Yes. The only argument against would be that it could be confusing that
Force-pushed from c715bd7 to 9911938.
Done. What do you think of this option, @dragonstyle?
I'd also like to get the take of @evanmiller-anthropic here. His feedback was what caused us to move away from bootstrapping.

@tadamcz Realize you are pressed to get something merged, but we also want to step carefully and make sure we get the appropriate feedback.
I think bootstrapping in that situation is not wrong, just inefficient. You can check for yourself that they give very close answers so it can't really be fundamentally wrong, can it? Evan's paper said as much:
Though I'm happy to wait a bit to see if he wants to weigh in.
I don't know about "fundamentally wrong", but I recall from discussion with Evan that he was quite unhappy it had fallen into common use. This is not my area of expertise so I'll defer to the subsequent discussion here.
@tadamcz I think as you discovered in #1113, computing the variance of the reduced scores will take into account both sampling error and measurement error, and will result in an accurate standard error for the population mean. Essentially the sampling variance that you're worried about will manifest as noise in the question-level mean – mathematically, the total variance can be decomposed into sampling error and measurement error (in the paper, Var[x] and E[σ^2]), but in practice we don't have to decompose them – we can just compute the overall variance and that will neatly take both into account.

This is easiest to think about with the epochs=1 case. It could either be the case that we observe each X with absolute certainty (σ^2=0), or it could be the case that the X's are all equal, and all the observed variance is attributed to model-sampling (measurement) variance. But it doesn't matter which case is "true"; both situations will result in the exact same variance on our estimate of the population mean.

Generally we choose clustering observations over treating each cluster as an observation when we think there is a degree of independence of observations within each cluster and that we can "increase N" by looking inside the clusters. But in this case, the question-meaned standard error is equivalent to the most extreme version of the clustered standard error (i.e. assuming perfect correlation, so N = # questions rather than N = # questions x # epochs), so any attempt to "look inside the cluster" will result in a smaller standard error – which is probably just a recipe for Type I error. The only way we'd expect to get a larger standard error is if the within-cluster scores are negatively correlated, which I think we can safely rule out when using questions as clusters. For that reason I suspect your bootstrapping result, where the clustered SE is larger than the built-in SE, is spurious.

So in sum, this would be a welcome improvement if we weren't already doing question-level averaging (which was the case a few months ago!) – but as things stand I think we're already taking the thing that you're worried about into account.

As a final note, it's somewhat misleading to say in this context that the clustered standard error reduces to the heteroskedasticity-robust standard error with epochs=1 (a claim I saw in the other PR); because there's no independent variable in the "regression", the "robust" SE is mathematically equivalent to the one we're already computing. If you set X to a vector of 1's in https://en.wikipedia.org/wiki/Heteroskedasticity-consistent_standard_errors#Solution you can see that it reduces to the CLT SE. That might be the explanation you were looking for in the other PR when your numbers were coming out the same as the existing implementation.
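To spell out the decomposition described above (a sketch only; the symbols K for the number of epochs per question and N for the number of questions are mine, while Var[X] and E[σ^2] follow the notation quoted from the paper):

$$
\bar{x}_i = \frac{1}{K}\sum_{k=1}^{K} x_{ik},
\qquad
\operatorname{Var}[\bar{x}_i] = \operatorname{Var}[X] + \frac{\operatorname{E}[\sigma^2]}{K},
\qquad
\operatorname{SE}\!\left[\frac{1}{N}\sum_{i=1}^{N}\bar{x}_i\right]
= \sqrt{\frac{\operatorname{Var}[X] + \operatorname{E}[\sigma^2]/K}{N}}
$$

Here the middle identity is the law of total variance applied to an observed question mean. The sample variance of the reduced (question-level) scores estimates Var[X] + E[σ^2]/K directly, so dividing it by N, which is what the existing stderr computation already does, reflects both terms without ever separating them.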
Thank you Evan, so glad you took the time to engage with this! It's good to have an actual expert on the case. I'm sure you're correct, but I'm still quite confused; could you help me understand the intuition here? It seems very strange to just ignore the number of epochs. If we increase the number of epochs, we reduce one source of uncertainty. It seems like this should be reflected somewhere?
Is it correct to say that
Do you think it's wrong to have a smaller error estimate in this case?
Thanks @evanmiller-anthropic for the detailed explanation. We discussed this internally and now agree that the existing stderr implementation already correctly accounts for model sampling variance; we'll use it moving forward.

As the one advising @tadamcz on the statistical aspects here (I manage @tadamcz at Epoch), that's my bad for not catching this earlier. Sorry for the review overhead; we can close this PR. Really appreciate everyone taking the time to engage with the technical details here.
This PR modifies the `bootstrap_stderr` metric by introducing a hierarchical bootstrap.

Motivation and previous behaviour

The motivation stems from how Inspect can sample an LLM multiple times (via the `epochs` parameter) for each question in a benchmark. Previously, we aggregated (reduced) each question's samples into a single observation, and then computed the standard error of the mean of these question-level observations, as if they were known with certainty.

This approach ignored LLM sampling uncertainty — LLM outputs can vary from one epoch to another for the same question.

Instead, we now use a 2-level bootstrap: first sample from questions with replacement, then sample from epochs with replacement. A bootstrap approach has the virtue of being very simple (there may be better ones).
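As a rough illustration of the 2-level scheme (a minimal sketch using plain NumPy; the function name and signature are illustrative, not the implementation in this diff):

```python
import numpy as np


def hierarchical_bootstrap_stderr(
    scores: list[list[float]],
    num_samples: int = 1000,
    seed: int | None = None,
) -> float:
    """Two-level bootstrap stderr: scores[i] holds the per-epoch scores for question i."""
    rng = np.random.default_rng(seed)
    n_questions = len(scores)
    boot_means = np.empty(num_samples)
    for b in range(num_samples):
        # Level 1: resample questions with replacement.
        question_idx = rng.integers(0, n_questions, size=n_questions)
        question_means = []
        for i in question_idx:
            epochs = np.asarray(scores[i], dtype=float)
            # Level 2: resample that question's epochs with replacement.
            resampled = rng.choice(epochs, size=len(epochs), replace=True)
            question_means.append(resampled.mean())
        boot_means[b] = np.mean(question_means)
    # The stderr estimate is the spread of the bootstrap distribution of the mean.
    return float(np.std(boot_means, ddof=1))


# e.g. two questions with three epochs each (hypothetical 0/1 scores):
# hierarchical_bootstrap_stderr([[1.0, 0.0, 1.0], [0.0, 0.0, 1.0]], num_samples=2000, seed=0)
```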
Changes to types
I introduced a `ReducedScore` class as a subclass of `Score`. A `ReducedScore` keeps a reference to its “children” (a `list[Score]`), i.e. the scores that were used to calculate the reduced score. This is because the `stderr` metric needs access to the unreduced scores to calculate the bootstrap standard error.

This is not a breaking change to other metrics (e.g. `accuracy` still works without code changes). Metrics can continue to use existing `Score` functionality without using the `children` field of a `ReducedScore`.

A `Reducer` still takes in a `Score`, but now returns a `ReducedScore` instead of a `Score`. The `Metric` protocol (e.g. stderr, std, mean) now operates on lists of `ReducedScore` rather than lists of `Score`.
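For orientation, here is a minimal sketch of what these types could look like, assuming simplified dataclasses; field names other than `children` and the exact `Reducer`/`Metric` signatures are illustrative, not taken from the diff:

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Score:
    value: float  # simplified: the real Score type carries more fields


@dataclass
class ReducedScore(Score):
    # The unreduced per-epoch scores this reduced score was computed from.
    children: list[Score] = field(default_factory=list)


class Reducer(Protocol):
    def __call__(self, scores: list[Score]) -> ReducedScore: ...


class Metric(Protocol):
    def __call__(self, scores: list[ReducedScore]) -> float: ...


def mean_reducer(scores: list[Score]) -> ReducedScore:
    """Example reducer: average the per-epoch scores while keeping the originals."""
    return ReducedScore(
        value=sum(s.value for s in scores) / len(scores),
        children=list(scores),
    )
```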
Tests
Tests check that my fast implementation of the hierarchical bootstrap (which makes greater use of numpy vectorized operations) gets the same result as a readable version. You may find it easier to look at the readable version and persuade yourself that it does the right thing.
I also check that the behaviour is generally reasonable: e.g. if there is greater within-cluster variance, this increases the bootstrap stderr.
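For example, the within-cluster-variance property could be exercised roughly like this (a sketch reusing the hypothetical `hierarchical_bootstrap_stderr` helper from the earlier illustration, not a test from this PR):

```python
def test_greater_within_cluster_variance_increases_stderr():
    # Assumes the hierarchical_bootstrap_stderr sketch defined above.
    # Same question-level means (0.5 everywhere), but the second dataset has
    # much larger epoch-to-epoch spread within each question.
    low_var = [[0.5, 0.5, 0.5, 0.5] for _ in range(20)]
    high_var = [[0.0, 1.0, 0.0, 1.0] for _ in range(20)]

    se_low = hierarchical_bootstrap_stderr(low_var, num_samples=2000, seed=0)
    se_high = hierarchical_bootstrap_stderr(high_var, num_samples=2000, seed=0)

    assert se_high > se_low
```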
Future work
Potential future work: