
Account for LLM sampling uncertainty via hierarchical bootstrap #1118

Open · wants to merge 7 commits into main from hierarchical-bootstrap-stderr
Conversation

@tadamcz (Contributor) commented Jan 14, 2025

This PR modifies the bootstrap_stderr metric by introducing a hierarchical bootstrap.

Motivation and previous behaviour

The motivation stems from how Inspect can sample an LLM multiple times (via the epochs parameter) for each question in a benchmark.

Previously, we aggregated (reduced) each question’s samples into a single observation, and then computed the standard error of the mean of these question-level observations, as if they were known with certainty.

This approach ignored LLM sampling uncertainty — LLM outputs can vary from one epoch to another for the same question.

Instead, we now use a 2-level bootstrap: first sample questions with replacement, then, within each sampled question, sample its epochs with replacement. A bootstrap approach has the virtue of being very simple (there may be better approaches).
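For concreteness, here is a minimal, readable sketch of the 2-level procedure (the function name and signature are illustrative, not the PR's actual API):

```python
import numpy as np

def hierarchical_bootstrap_stderr(
    scores: list[list[float]],  # one inner list of per-epoch scores per question
    num_samples: int = 1000,
    seed: int | None = None,
) -> float:
    """Standard error of the mean via a 2-level (questions, then epochs) bootstrap."""
    rng = np.random.default_rng(seed)
    n_questions = len(scores)
    bootstrap_means = []
    for _ in range(num_samples):
        # Level 1: resample questions with replacement
        question_idx = rng.integers(0, n_questions, size=n_questions)
        question_means = []
        for q in question_idx:
            epochs = scores[q]
            # Level 2: resample that question's epochs with replacement
            epoch_idx = rng.integers(0, len(epochs), size=len(epochs))
            question_means.append(np.mean([epochs[e] for e in epoch_idx]))
        bootstrap_means.append(np.mean(question_means))
    return float(np.std(bootstrap_means, ddof=1))
```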

Changes to types

I introduced a ReducedScore class as a subclass of Score. A ReducedScore keeps a reference to its “children” (a list[Score]), i.e. the scores that were used to calculate the reduced score. This is because the stderr metric needs to have access to the unreduced scores to calculate the bootstrap standard error.

This is not a breaking change to other metrics (e.g. accuracy still works without code changes). Metrics can continue to use existing Score functionality, without using the children field of a ReducedScore.

A Reducer still takes Scores as input, but now returns a ReducedScore instead of a Score. The Metric protocol (e.g., stderr, std, mean, etc.) now operates on lists of ReducedScore rather than lists of Score.
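Schematically, the shape of the change is something like the following (a simplified sketch; the real Score and reducer types in Inspect carry more fields and metadata):

```python
from dataclasses import dataclass, field

@dataclass
class Score:
    value: float

@dataclass
class ReducedScore(Score):
    # The unreduced per-epoch scores this reduced score was computed from;
    # stderr can bootstrap over these, and other metrics can simply ignore them.
    children: list[Score] = field(default_factory=list)

def mean_reducer(scores: list[Score]) -> ReducedScore:
    """Example reducer: average one question's per-epoch scores while
    keeping a reference to the originals."""
    value = sum(s.value for s in scores) / len(scores)
    return ReducedScore(value=value, children=list(scores))
```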

Tests

Tests check that my fast implementation of the hierarchical bootstrap (which makes greater use of numpy vectorized operations) gets the same result as a readable version. You may find it easier to look at the readable version and persuade yourself that it does the right thing.

I also check that the behaviour is generally reasonable: e.g. if there is greater within-cluster variance, this increases the bootstrap stderr.
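As an illustration of the second kind of check, using the hierarchical_bootstrap_stderr sketch above (a hypothetical test, not the PR's actual test code):

```python
def test_more_within_question_variance_increases_stderr():
    # Same question-level means (0.5) in both datasets, but the second has
    # per-epoch (within-cluster) variance while the first has none.
    low_var = [[0.5, 0.5, 0.5, 0.5] for _ in range(20)]
    high_var = [[0.0, 1.0, 0.0, 1.0] for _ in range(20)]
    se_low = hierarchical_bootstrap_stderr(low_var, num_samples=2000, seed=0)
    se_high = hierarchical_bootstrap_stderr(high_var, num_samples=2000, seed=0)
    assert se_high > se_low  # within-cluster noise shows up in the 2-level bootstrap
```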

Future work

Potential future work:

  • make it work with inhomogeneous clusters
  • implement more sophisticated bootstrap variants
  • expose the breakdown between within- and across-cluster variation to the user

@tadamcz force-pushed the hierarchical-bootstrap-stderr branch from 5ffc69c to c715bd7 on January 14, 2025 17:26
@tadamcz marked this pull request as ready for review on January 14, 2025 17:31
@tadamcz (Contributor, Author) commented Jan 16, 2025

Hi @jjallaire @dragonstyle, this functionality is important to us and is one of the last things holding up the release of our new Inspect-based benchmarking hub. Is there anything I can do to make it easier for you to say yes? For example, I'm open to making this a new metric (or modifying the bootstrap_stderr metric) instead of a modification to stderr, if that would make a difference to you.

It's not sufficient for us to define a new metric outside Inspect, because (as explained under "Changes to types") we need access to the unreduced scores.

@dragonstyle (Collaborator)

A couple of comments:

  • It doesn't seem to me this replaces stderr since it is a bootstrapping implementation. I think the stderr metric is still a reasonable metric to support.

  • If the above is correct, I think the decision is whether to replace our current bootstrap implementation with this 2-level bootstrap or introduce a new metric. It seems to me that replacement might make the most sense, assuming I'm correct in thinking this is strictly an improvement in the bootstrapping approach (which will now account for within-sample variance).

Curious about your thoughts.

@tadamcz (Contributor, Author) commented Jan 16, 2025

> It doesn't seem to me this replaces stderr since it is a bootstrapping implementation. I think the stderr metric is still a reasonable metric to support.

I think it's arguable either way. The current stderr is arguably incorrect in that it ignores LLM sampling variation. On the other hand one might think it's nice to have a metric that's calculated in closed form.

As I said, I'm happy to leave stderr alone here. I'd just probably change its docstring to clarify its shortcomings. Let me push a commit for that and you can see what you think of it?

> It seems to me that replacement might make the most sense assuming I'm correct in thinking this is strictly an improvement in the bootstrapping approach

Yes. The only argument against would be that it could be confusing that stderr and bootstrap_stderr no longer estimate the same thing. But I think this is OK if the docstrings are clear.

@tadamcz force-pushed the hierarchical-bootstrap-stderr branch from c715bd7 to 9911938 on January 16, 2025 17:30
@tadamcz (Contributor, Author) commented Jan 16, 2025

Done. What do you think of this option, @dragonstyle?

@jjallaire (Collaborator)

I'd also like to get the take of @evanmiller-anthropic here. His feedback was what caused us to move away from bootstrap_std to stderr (his argument being that the original use of bootstrap_std was fundamentally wrong for sample sizes > 30).

@tadamcz I realize you are pressed to get something merged, but we also want to step carefully and make sure we get the appropriate feedback.

@tadamcz (Contributor, Author) commented Jan 16, 2025

I think bootstrapping in that situation is not wrong, just inefficient. You can check for yourself that the two give very close answers, so it can't really be fundamentally wrong, can it?

Evan's paper said as much:

> We note that it is common practice to compute the standard error of the mean by bootstrapping; see, for instance, the OpenAI evals[16] frameworks. But the Central Limit Theorem is applicable to any evals having scores with finite variance and a large number of questions, and so we regard bootstrapping as unnecessary unless a complicated sampling scheme or estimator is being used.

Though I'm happy to wait a bit to see if he wants to weigh in.

@jjallaire (Collaborator)

I don't know about "fundamentally wrong", but I recall from discussion with Evan that he was quite unhappy it had fallen into common use. This is not my area of expertise so I'll defer to the subsequent discussion here.

@evanmiller-anthropic (Contributor)

@tadamcz I think as you discovered in #1113, computing the variance of the reduced scores will take into account both sampling error and measurement error, and will result in an accurate standard error for the population mean. Essentially the sampling variance that you're worried about will manifest as noise in the question-level mean – mathematically, the total variance can be decomposed into sampling error and measurement error (in the paper, Var[x] and E[σ^2]), but in practice we don't have to decompose them – we can just compute the overall variance and that will neatly take both into account.
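A quick simulation (illustrative only, not from the PR or the paper) makes this concrete: with genuine question-to-question variation plus per-epoch Bernoulli noise, the plain standard error over the question-level means already gives roughly nominal 95% coverage of the true mean:

```python
import numpy as np

rng = np.random.default_rng(0)
n_questions, n_epochs, n_reps = 100, 4, 2000
true_mean = 0.6  # E[p] for Beta(3, 2)
covered = 0

for _ in range(n_reps):
    p = rng.beta(3, 2, size=n_questions)                                # question-level variance
    scores = rng.binomial(1, p[:, None], size=(n_questions, n_epochs))  # epoch-level sampling noise
    question_means = scores.mean(axis=1)                                # the "reduced" scores
    se = question_means.std(ddof=1) / np.sqrt(n_questions)              # plain stderr over reduced scores
    covered += abs(question_means.mean() - true_mean) < 1.96 * se

print(covered / n_reps)  # ~0.95: both variance components are already reflected
```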

This is easiest to think about with the epochs=1 case. It could either be the case that we observe each X with absolute certainty (σ^2=0), or it could be the case that the X's are all equal, and all the observed variance is attributed to model-sampling (measurement) variance. But it doesn't matter which case is "true"; both situations will result in the exact same variance on our estimate of the population mean.

Generally we choose clustering observations over treating each cluster as an observation when we think there is a degree of independence of observations within each cluster and that we can "increase N" by looking inside the clusters. But in this case, the question-meaned standard error is equivalent to the most extreme version of the clustered standard error (i.e. assuming perfect correlation, so N=# questions rather than N = # questions x # epochs) so any attempt to "look inside the cluster" will result in a smaller standard error – which is probably just a recipe for Type I error. The only way we'd expect to get a larger standard error is if the within-cluster scores are negatively correlated, which I think we can safely rule out when using questions as clusters. For that reason I suspect your bootstrapping result, where the clustered SE is larger than the built-in SE, is spurious.

So in sum, this would be a welcome improvement if we weren't already doing question-level averaging (which was the case a few months ago!) – but as things stand I think we're already taking the thing that you're worried about into account. As a final note, it's somewhat misleading to say in this context that the clustered standard error reduces to the heteroskedasticity-robust standard error with epochs=1 (a claim I saw in the other PR); because there's no independent variable in the "regression", the "robust" SE is mathematically equivalent to the one we're already computing. If you set X to a vector of 1's in https://en.wikipedia.org/wiki/Heteroskedasticity-consistent_standard_errors#Solution you can see that it reduces to the CLT SE. That might be the explanation you were looking for in the other PR when your numbers were coming out the same as the existing implementation.
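Spelling out that last step (a derivation sketch, following the notation of the linked article): with X set to a column of 1's, the fitted "regression" coefficient is just the sample mean, the residuals are ê_i = x_i − x̄, and the HC0 sandwich estimator collapses to

$$
(X^\top X)^{-1} X^\top \operatorname{diag}(\hat e_i^2)\, X \,(X^\top X)^{-1}
= \frac{1}{n}\left(\sum_{i=1}^n (x_i - \bar x)^2\right)\frac{1}{n}
= \frac{1}{n^2}\sum_{i=1}^n (x_i - \bar x)^2,
$$

which matches the usual CLT variance estimate $s^2/n$ up to the $n$ versus $n-1$ finite-sample factor.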

@tadamcz (Contributor, Author) commented Jan 16, 2025

Thank you Evan, so glad you took the time to engage with this! It's good to have an actual expert on the case.

I'm sure you're correct, but I'm still quite confused. Could you help me understand the intuition here? It seems very strange to just ignore the number of epochs.

If we increase the number of epochs, we reduce one source of uncertainty. It seems like this should be reflected somewhere?

@tadamcz (Contributor, Author) commented Jan 16, 2025

Is it correct to say that stderr essentially assumes that epochs=1 even when epochs>1? So, it does reflect the existence of uncertainty due to LLM sampling noise, but it fails to reflect the fact that this uncertainty is reduced when we add more epochs?

> so any attempt to "look inside the cluster" will result in a smaller standard error – which is probably just a recipe for Type I error

Do you think it's wrong to have a smaller error estimate in this case?

@js-d commented Jan 17, 2025

Thanks @evanmiller-anthropic for the detailed explanation. We discussed this internally and now agree that the existing stderr implementation already correctly accounts for model sampling variance; we'll use it moving forward. As the one advising @tadamcz on the statistical aspects here (I manage @tadamcz at Epoch), that's my bad for not catching this earlier.

Sorry for the review overhead - we can close this PR. Really appreciate everyone taking the time to engage with the technical details here.
