
Introduce clustered (and heteroskedasticity-robust) standard errors #1113

Closed
wants to merge 7 commits

Conversation

tadamcz commented Jan 13, 2025

This PR corrects the stderr metric by introducing clustered standard errors [0].

Motivation and previous behaviour

The motivation stems from how Inspect can sample an LLM multiple times (via the epochs parameter) for each question in a benchmark.

Previously, we aggregated (reduced) each question’s samples into a single observation, and then computed standard errors from these question-level observations as if they were known with certainty.

This approach ignored LLM sampling uncertainty — LLM outputs can vary from one epoch to another for the same question.

With clustered standard errors, we instead think of each benchmark question as a cluster. For epochs=k, each cluster has k observations. Each cluster’s internal variance contributes to the overall standard error.

The previous approach assumed that within-cluster variation was zero. Just to be clear, this assumption is also incorrect if epochs=1. If epochs=1 we cannot estimate the within-cluster variance, but that does not mean we can assume it is zero.

The clustered standard error reduces to the heteroskedasticity-robust [1] standard error when epochs=1.
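
To make the computation concrete, here is a rough standalone sketch (not the code in this PR; the function name and the list-of-lists input are just for illustration):

import numpy as np

def clustered_stderr(clusters: list[list[float]]) -> float:
    """Cluster-robust standard error of the overall mean.

    Each inner list holds the per-epoch scores for one benchmark question
    (one cluster). With a single score per cluster this reduces to the
    heteroskedasticity-robust standard error.
    """
    values = np.concatenate([np.asarray(c, dtype=float) for c in clusters])
    n = len(values)
    g = len(clusters)
    mean = values.mean()

    # Heteroskedasticity-robust part: sum of squared deviations / n^2
    unclustered_var = np.sum((values - mean) ** 2) / n**2

    # Within-cluster cross terms (i != j):
    # (sum of deviations)^2 minus the sum of squared deviations
    cluster_covar = 0.0
    for cluster in clusters:
        dev = np.asarray(cluster, dtype=float) - mean
        cluster_covar += np.sum(dev) ** 2 - np.sum(dev**2)

    clustered_var = unclustered_var + cluster_covar / n**2
    # Small-sample correction g/(g-1), where g is the number of clusters
    clustered_var *= g / (g - 1)
    return float(np.sqrt(clustered_var))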

Changes to types

I introduced a ReducedScore class as a subclass of Score. A ReducedScore keeps a reference to its “children” (a list[Score]), i.e. the scores that were used to calculate the reduced score. This is because the stderr metric needs to have access to the unreduced scores to calculate the clustered standard error.

This is not a breaking change to other metrics (e.g. accuracy still works without code changes). Metrics can continue to use existing Score functionality, without using the children field of a ReducedScore.

A Reducer still takes Scores as input, but now returns a ReducedScore instead of a Score. The Metric protocol (e.g. stderr, std, mean) now operates on lists of ReducedScore rather than lists of Score.
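
In rough terms, the shape is something like this (a simplified sketch, not the actual definitions; the real Score carries more than a single numeric value):

from dataclasses import dataclass, field

@dataclass
class Score:
    value: float  # simplified; the real Score has additional fields

@dataclass
class ReducedScore(Score):
    # The unreduced per-epoch scores this value was computed from.
    # Metrics that don't need them (e.g. accuracy) can simply ignore this field.
    children: list[Score] = field(default_factory=list)

# A mean reducer would then look roughly like:
def mean_reducer(scores: list[Score]) -> ReducedScore:
    return ReducedScore(
        value=sum(s.value for s in scores) / len(scores),
        children=scores,
    )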

A ReducedScore could also be used recursively, e.g. when there are thematically related clusters of questions. This was suggested by Miller (Adding Error Bars to Evals, https://arxiv.org/abs/2411.00640):

We next consider eval questions that are drawn in groups, or clusters. For instance, DROP[6], QuAC[5], RACE[13], and SQuAD[17] are reading-comprehension evals having multiple related questions about independently selected passages of text, and multilingual evals such as MGSM[18] consist of the same question translated into many languages. ... We note that the cluster adjustment in our real-world example is far from trivial (up to 3X).

Tests

I have added tests, including checks that this implementation produces the same numerical results as clustered standard errors in statsmodels [2].
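
For illustration, the comparison works roughly like this (a sketch with hypothetical helper names, not the actual test code; it reuses the clustered_stderr sketch from above):

import numpy as np
import pandas as pd
import statsmodels.api as sm

def statsmodels_clustered_se(values: list[float], cluster_ids: list[int]) -> float:
    # Intercept-only OLS with a cluster-robust covariance matrix: the standard
    # error of the intercept is the clustered standard error of the mean.
    data = pd.DataFrame({"y": values, "constant": 1.0})
    fit = sm.OLS(data["y"], data[["constant"]]).fit().get_robustcov_results(
        cov_type="cluster", groups=np.asarray(cluster_ids)
    )
    return float(fit.bse[0])

def test_matches_statsmodels():
    clusters = [[1.0, 4.0], [11.0, 6.0], [13.0, 8.0]]
    flat = [v for cluster in clusters for v in cluster]
    ids = [i for i, cluster in enumerate(clusters) for _ in cluster]
    assert np.isclose(clustered_stderr(clusters), statsmodels_clustered_se(flat, ids))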

[0] : https://en.wikipedia.org/wiki/Clustered_standard_errors
[1] : https://en.wikipedia.org/wiki/Heteroskedasticity-consistent_standard_errors
[2] : https://github.com/statsmodels/statsmodels

cluster_dev[i] * cluster_dev[j]
for i in range(cluster_size)
for j in range(cluster_size)
if i != j
Contributor
You could simplify this code by removing the i != j condition as well as the separate unclustered_var term. I separated them in the main text of the paper for clarity, but you can see in the appendix that there's really just one (triple) summation.
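
Something like this, I think (a sketch: cluster_devs here stands for the list of per-cluster deviation lists, and n for the total number of observations, mirroring the snippet above):

# One triple summation: keeping the i == j terms folds the separate
# unclustered_var term into the same sum.
clustered_var = sum(
    cluster_dev[i] * cluster_dev[j]
    for cluster_dev in cluster_devs
    for i in range(len(cluster_dev))
    for j in range(len(cluster_dev))
) / n**2

# Equivalently, the inner double sum is just (sum of the cluster's deviations) squared:
# clustered_var = sum(sum(dev) ** 2 for dev in cluster_devs) / n**2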

clustered_var = unclustered_var + (cluster_covar / (n**2))

# Apply small sample correction g/(g-1) to variance
clustered_var *= g / (g - 1)
Contributor
This wasn't in the paper but I think this is right!

@@ -53,6 +53,7 @@
"bootstrap_std",
"std",
"stderr",
"clustered_stderr",
Collaborator
I think this is extraneous.

Contributor Author
Yes, good catch!

tadamcz commented Jan 13, 2025

Thanks for the comments! However, I'm currently wondering if this whole thing might be superfluous because Inspect was already (possibly unwittingly?) implementing something mathematically equivalent 🤦. In which case I apologise for wasting your time. I realised this while trying to create an example where the two give different answers.

I'm still a little confused but I think:

  • The previous implementation, which operates directly on reduced scores, computes the standard error of the mean of cluster means
  • And this might turn out to be equivalent to the clustered standard error, where clusters are benchmark questions (this assumes equally sized clusters, but that condition is satisfied here, since every question is sampled for the same number of epochs)

This is still feeling counter-intuitive to me. They give the same answer even though the first one seems to ignore the within-cluster variation?

But look:

import numpy as np
import pandas as pd
import statsmodels.api as sm

def cluster_se(data: pd.DataFrame) -> float:
    # Add constant column of 1s.
    # We are running a trivial regression where we only estimate the y-intercept
    data['constant'] = 1

    model = sm.OLS(data['y'], data[['constant']])

    model = model.fit().get_robustcov_results(
        cov_type='cluster',
        groups=data['cluster_ids'],
    )
    assert len(model.bse) == 1
    return model.bse[0]

def sem(numbers: list[float]) -> float:
    """Standard error of the mean (with small sample correction)"""
    squared_deviations = [(x - np.mean(numbers)) ** 2 for x in numbers]
    small_sample_correction = len(numbers) / (len(numbers) - 1)
    v = np.mean(squared_deviations)
    v *= small_sample_correction
    return np.sqrt(v / len(numbers))

def sem_of_cluster_means(data: pd.DataFrame) -> float:
    """Standard error of the mean of cluster means"""
    cluster_means = data.groupby('cluster_ids')['y'].mean()
    return sem(cluster_means)

# Create the dataset
data = pd.DataFrame({
    "y": [1.0, 4.0, 11.0, 6.0, 13.0, 8.0],
    "cluster_ids": [1, 1, 2, 2, 3, 3]
})

print(cluster_se(data))
print(sem_of_cluster_means(data))
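
If I write out the algebra, I think I see why (assuming equal cluster sizes m, so n = G*m where G is the number of clusters): the deviations within a cluster sum to m*(cluster_mean - overall_mean), so

clustered_var = (1/n^2) * sum over clusters of (sum of deviations in the cluster)^2
              = (1/n^2) * sum over clusters of m^2 * (cluster_mean - overall_mean)^2
              = (1/G^2) * sum over clusters of (cluster_mean - overall_mean)^2

and after the G/(G-1) small-sample correction this becomes (1/(G*(G-1))) * sum over clusters of (cluster_mean - overall_mean)^2, i.e. exactly the (corrected) squared standard error of the mean of the cluster means. With equal cluster sizes the overall mean is also the mean of the cluster means, so the two estimators coincide.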

tadamcz commented Jan 13, 2025

@dragonstyle since you're the author of the previous stderr implementation, can you comment on whether this was actually intentional?

tadamcz commented Jan 13, 2025

I think the two expressions are equivalent, but this means I am once again confused about the correct approach for modelling this problem, i.e. what overall variance metric we should report. It seems clearly wrong to ignore LLM sampling uncertainty, but it looks like we are doing exactly that here.

This is not my area of expertise, but the question seems quite important. If statisticians want to weigh in that would be appreciated.

tadamcz commented Jan 14, 2025

Closing in favour of: #1118
