
Introduce clustered (and heteroskedasticity-robust) standard errors #1113

Closed
wants to merge 7 commits

Conversation

tadamcz commented Jan 13, 2025

This PR corrects the stderr metric by introducing clustered standard errors [0].

Motivation and previous behaviour

The motivation stems from how Inspect can sample an LLM multiple times (via the epochs parameter) for each question in a benchmark.

Previously, we aggregated (reduced) each question’s samples into a single observation, and then computed standard errors from these question-level observations as if they were known with certainty.

This approach ignored LLM sampling uncertainty — LLM outputs can vary from one epoch to another for the same question.

With clustered standard errors, we instead think of each benchmark question as a cluster. For epochs=k, each cluster has k observations. Each cluster’s internal variance contributes to the overall standard error.

The previous approach assumed that within-cluster variation was zero. Just to be clear, this assumption is also incorrect if epochs=1. If epochs=1 we cannot estimate the within-cluster variance, but that does not mean we can assume it is zero.

The clustered standard error reduces to the heteroskedasticity-robust [1] standard error when epochs=1.
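
To make the computation concrete, here is a rough standalone sketch (not the code in this PR; the function name and the list-of-lists input are just for illustration):

import numpy as np

def clustered_stderr(clusters: list[list[float]]) -> float:
    """Cluster-robust standard error of the overall mean.

    Each inner list holds the per-epoch scores for one benchmark question
    (one cluster). With a single score per cluster this reduces to the
    heteroskedasticity-robust standard error.
    """
    values = np.concatenate([np.asarray(c, dtype=float) for c in clusters])
    n = len(values)
    g = len(clusters)
    mean = values.mean()

    # Heteroskedasticity-robust part: sum of squared deviations / n^2
    unclustered_var = np.sum((values - mean) ** 2) / n**2

    # Within-cluster cross terms (i != j):
    # (sum of deviations)^2 minus the sum of squared deviations
    cluster_covar = 0.0
    for cluster in clusters:
        dev = np.asarray(cluster, dtype=float) - mean
        cluster_covar += np.sum(dev) ** 2 - np.sum(dev**2)

    clustered_var = unclustered_var + cluster_covar / n**2
    # Small-sample correction g/(g-1), where g is the number of clusters
    clustered_var *= g / (g - 1)
    return float(np.sqrt(clustered_var))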

Changes to types

I introduced a ReducedScore class as a subclass of Score. A ReducedScore keeps a reference to its “children” (a list[Score]), i.e. the scores that were used to calculate the reduced score. This is because the stderr metric needs to have access to the unreduced scores to calculate the clustered standard error.

This is not a breaking change to other metrics (e.g. accuracy still works without code changes). Metrics can continue to use existing Score functionality, without using the children field of a ReducedScore.

A Reducer still takes Scores as input, but now returns a ReducedScore instead of a Score. The Metric protocol (e.g. stderr, std, mean) now operates on lists of ReducedScore rather than lists of Score.
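
In rough terms, the shape is something like this (a simplified sketch, not the actual definitions; the real Score carries more than a single numeric value):

from dataclasses import dataclass, field

@dataclass
class Score:
    value: float  # simplified; the real Score has additional fields

@dataclass
class ReducedScore(Score):
    # The unreduced per-epoch scores this value was computed from.
    # Metrics that don't need them (e.g. accuracy) can simply ignore this field.
    children: list[Score] = field(default_factory=list)

# A mean reducer would then look roughly like:
def mean_reducer(scores: list[Score]) -> ReducedScore:
    return ReducedScore(
        value=sum(s.value for s in scores) / len(scores),
        children=scores,
    )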

A ReducedScore could also be used recursively, e.g. when there are thematically related clusters of questions. This was suggested by Miller (Adding Error Bars to Evals, https://arxiv.org/abs/2411.00640):

We next consider eval questions that are drawn in groups, or clusters. For instance, DROP[6], QuAC[5], RACE[13], and SQuAD[17] are reading-comprehension evals having multiple related questions about independently selected passages of text, and multilingual evals such as MGSM[18] consist of the same question translated into many languages. ... We note that the cluster adjustment in our real-world example is far from trivial (up to 3X).

Tests

I have added tests, including checks that this implementation produces the same numerical results as clustered standard errors in statsmodels [2].
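
For illustration, the comparison works roughly like this (a sketch with hypothetical helper names, not the actual test code; it reuses the clustered_stderr sketch from above):

import numpy as np
import pandas as pd
import statsmodels.api as sm

def statsmodels_clustered_se(values: list[float], cluster_ids: list[int]) -> float:
    # Intercept-only OLS with a cluster-robust covariance matrix: the standard
    # error of the intercept is the clustered standard error of the mean.
    data = pd.DataFrame({"y": values, "constant": 1.0})
    fit = sm.OLS(data["y"], data[["constant"]]).fit().get_robustcov_results(
        cov_type="cluster", groups=np.asarray(cluster_ids)
    )
    return float(fit.bse[0])

def test_matches_statsmodels():
    clusters = [[1.0, 4.0], [11.0, 6.0], [13.0, 8.0]]
    flat = [v for cluster in clusters for v in cluster]
    ids = [i for i, cluster in enumerate(clusters) for _ in cluster]
    assert np.isclose(clustered_stderr(clusters), statsmodels_clustered_se(flat, ids))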

[0] : https://en.wikipedia.org/wiki/Clustered_standard_errors
[1] : https://en.wikipedia.org/wiki/Heteroskedasticity-consistent_standard_errors
[2] : https://github.com/statsmodels/statsmodels

cluster_dev[i] * cluster_dev[j]
for i in range(cluster_size)
for j in range(cluster_size)
if i != j
Contributor
You could simplify this code by removing the i != j condition as well as the separate unclustered_var term. I separated them in the main text of the paper for clarity, but you can see in the appendix that there's really just one (triple) summation.
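
Something like this, I think (a sketch: cluster_devs here stands for the list of per-cluster deviation lists, and n for the total number of observations, mirroring the snippet above):

# One triple summation: keeping the i == j terms folds the separate
# unclustered_var term into the same sum.
clustered_var = sum(
    cluster_dev[i] * cluster_dev[j]
    for cluster_dev in cluster_devs
    for i in range(len(cluster_dev))
    for j in range(len(cluster_dev))
) / n**2

# Equivalently, the inner double sum is just (sum of the cluster's deviations) squared:
# clustered_var = sum(sum(dev) ** 2 for dev in cluster_devs) / n**2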

clustered_var = unclustered_var + (cluster_covar / (n**2))

# Apply small sample correction g/(g-1) to variance
clustered_var *= g / (g - 1)
Contributor
This wasn't in the paper but I think this is right!

@@ -53,6 +53,7 @@
"bootstrap_std",
"std",
"stderr",
"clustered_stderr",
Collaborator
I think this is extraneous.

Contributor Author
Yes, good catch!

tadamcz commented Jan 13, 2025

Thanks for the comments! However, I'm currently wondering if this whole thing might be superfluous because Inspect was already (possibly unwittingly?) implementing something mathematically equivalent 🤦. In which case I apologise for wasting your time. I realised this while trying to create an example where the two give different answers.

I'm still a little confused but I think:

  • The previous implementation, which operates directly on reduced scores, computes the standard error of the mean of cluster means
  • And this might turn out to be equivalent to the clustered standard error, where clusters are benchmark questions (this assumes equally sized clusters, but that condition is satisfied here, since every question is sampled for the same number of epochs)

This is still feeling counter-intuitive to me. They give the same answer even though the first one seems to ignore the within-cluster variation?

But look:

import numpy as np
import pandas as pd
import statsmodels.api as sm

def cluster_se(data: pd.DataFrame) -> float:
    # Add constant column of 1s.
    # We are running a trivial regression where we only estimate the y-intercept
    data['constant'] = 1

    model = sm.OLS(data['y'], data[['constant']])

    model = model.fit().get_robustcov_results(
        cov_type='cluster',
        groups=data['cluster_ids'],
    )
    assert len(model.bse) == 1
    return model.bse[0]

def sem(numbers: list[float]) -> float:
    """Standard error of the mean (with small sample correction)"""
    squared_deviations = [(x - np.mean(numbers)) ** 2 for x in numbers]
    small_sample_correction = len(numbers) / (len(numbers) - 1)
    v = np.mean(squared_deviations)
    v *= small_sample_correction
    return np.sqrt(v / len(numbers))

def sem_of_cluster_means(data: pd.DataFrame) -> float:
    """Standard error of the mean of cluster means"""
    cluster_means = data.groupby('cluster_ids')['y'].mean()
    return sem(cluster_means)

# Create the dataset
data = pd.DataFrame({
    "y": [1.0, 4.0, 11.0, 6.0, 13.0, 8.0],
    "cluster_ids": [1, 1, 2, 2, 3, 3]
})

print(cluster_se(data))
print(sem_of_cluster_means(data))
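
If I write out the algebra, I think I see why (assuming equal cluster sizes m, so n = G*m where G is the number of clusters): the deviations within a cluster sum to m*(cluster_mean - overall_mean), so

clustered_var = (1/n^2) * sum over clusters of (sum of deviations in the cluster)^2
              = (1/n^2) * sum over clusters of m^2 * (cluster_mean - overall_mean)^2
              = (1/G^2) * sum over clusters of (cluster_mean - overall_mean)^2

and after the G/(G-1) small-sample correction this becomes (1/(G*(G-1))) * sum over clusters of (cluster_mean - overall_mean)^2, i.e. exactly the (corrected) squared standard error of the mean of the cluster means. With equal cluster sizes the overall mean is also the mean of the cluster means, so the two estimators coincide.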

tadamcz commented Jan 13, 2025

@dragonstyle since you're the author of the previous stderr implementation, can you comment on whether this was actually intentional?

tadamcz commented Jan 13, 2025

I think the two expressions are equivalent, but this means I am once again confused about the correct approach for modelling this problem, i.e. what overall variance metric we should report. It seems clearly wrong to ignore LLM sampling uncertainty, but it looks like we are doing exactly that here.

This is not my area of expertise, but the question seems quite important. If statisticians want to weigh in that would be appreciated.

tadamcz commented Jan 14, 2025

Closing in favour of: #1118
