Introduce clustered (and heteroskedasticity-robust) standard errors #1113
Conversation
```python
cluster_dev[i] * cluster_dev[j]
for i in range(cluster_size)
for j in range(cluster_size)
if i != j
```
You could simplify this code by removing the `i != j` condition as well as the separate `unclustered_var` term. I separated them in the main text of the paper for clarity, but you can see in the appendix that there's really just one (triple) summation.
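As a sketch of that simplification (using the names from the diff above, with illustrative values, and assuming `unclustered_var` was the sum of squared deviations over n²): once the `i == j` terms are kept, the diagonal of the double sum contributes exactly the squared deviations that `unclustered_var` used to add, so both the filter and the separate term drop out.

```python
# Illustrative values only; in the real code these come from the scores.
cluster_dev = [0.5, -0.25, -0.25]  # one cluster's deviations from the mean
cluster_size = len(cluster_dev)
n = 6  # total number of observations across all clusters

# Without the `if i != j` filter, the i == j terms supply the squared
# deviations, i.e. this cluster's share of the former unclustered_var.
cluster_covar = sum(
    cluster_dev[i] * cluster_dev[j]
    for i in range(cluster_size)
    for j in range(cluster_size)
)
clustered_var = cluster_covar / (n**2)
```

Equivalently, the double sum for one cluster collapses to `sum(cluster_dev) ** 2`.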
```python
clustered_var = unclustered_var + (cluster_covar / (n**2))

# Apply small sample correction g/(g-1) to variance
clustered_var *= g / (g - 1)
```
This wasn't in the paper but I think this is right!
```diff
@@ -53,6 +53,7 @@
     "bootstrap_std",
     "std",
     "stderr",
+    "clustered_stderr",
```
I think this is extraneous.
Yes, good catch!
Thanks for the comments! However, I'm currently wondering if this whole thing might be superfluous because Inspect was already (possibly unwittingly?) implementing something mathematically equivalent 🤦. In which case I apologise for wasting your time. I realised this while trying to create an example where the two give different answers. I'm still a little confused but I think:
This is still feeling counter-intuitive to me. They give the same answer even though the first one seems to ignore the within-cluster variation? But look:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm


def cluster_se(data: pd.DataFrame) -> float:
    # Add constant column of 1s.
    # We are running a trivial regression where we only estimate the y-intercept
    data['constant'] = 1
    model = sm.OLS(data['y'], data[['constant']])
    model = model.fit().get_robustcov_results(
        cov_type='cluster',
        groups=data['cluster_ids'],
    )
    assert len(model.bse) == 1
    return model.bse[0]


def sem(numbers: list[float]) -> float:
    """Standard error of the mean (with small sample correction)"""
    squared_deviations = [(x - np.mean(numbers)) ** 2 for x in numbers]
    small_sample_correction = len(numbers) / (len(numbers) - 1)
    v = np.mean(squared_deviations)
    v *= small_sample_correction
    return np.sqrt(v / len(numbers))


def sem_of_cluster_means(data: pd.DataFrame) -> float:
    """Standard error of the mean of cluster means"""
    cluster_means = data.groupby('cluster_ids')['y'].mean()
    return sem(cluster_means)


# Create the dataset
data = pd.DataFrame({
    "y": [1.0, 4.0, 11.0, 6.0, 13.0, 8.0],
    "cluster_ids": [1, 1, 2, 2, 3, 3]
})
print(cluster_se(data))
print(sem_of_cluster_means(data))
```
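I think the equivalence can be shown directly for the balanced case ($g$ clusters of equal size $k$, $n = gk$, cluster means $\bar{y}_c$, overall mean $\bar{y}$):

$$
\frac{g}{g-1}\cdot\frac{1}{n^{2}}\sum_{c=1}^{g}\Bigl(\sum_{i\in c}(y_i-\bar{y})\Bigr)^{2}
=\frac{g}{g-1}\cdot\frac{k^{2}}{n^{2}}\sum_{c=1}^{g}(\bar{y}_c-\bar{y})^{2}
=\frac{1}{g}\cdot\frac{\sum_{c=1}^{g}(\bar{y}_c-\bar{y})^{2}}{g-1}
$$

The left-hand side is the clustered variance of the mean with the $g/(g-1)$ correction; the right-hand side is the squared SEM of the $g$ cluster means with its own small-sample correction. With unbalanced clusters the two estimators would generally differ.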
@dragonstyle since you're the author of the previous
I think the two expressions are equivalent, but this means I am once again confused as to what the correct approach for modelling this problem is, i.e. what is the correct overall variance metric to report? It seems clearly wrong to ignore LLM sampling uncertainty, but it looks like we are doing that here. This is not my area of expertise, but the question seems quite important. If statisticians want to weigh in, that would be appreciated.
Closing in favour of: #1118
This PR corrects the `stderr` metric by introducing clustered standard errors [0].

Motivation and previous behaviour
The motivation stems from how Inspect can sample an LLM multiple times (via the `epochs` parameter) for each question in a benchmark. Previously, we aggregated (reduced) each question's samples into a single observation, and then computed standard errors from these question-level observations as if they were known with certainty.
This approach ignored LLM sampling uncertainty — LLM outputs can vary from one epoch to another for the same question.
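In code, the previous behaviour amounted to something like this simplified sketch (not the exact Inspect implementation):

```python
import statistics


def old_stderr(reduced_values: list[float]) -> float:
    # One reduced value per question; per-epoch variation has already been
    # averaged away before this point, so it never enters the estimate.
    return statistics.stdev(reduced_values) / len(reduced_values) ** 0.5
```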
With clustered standard errors, we instead think of each benchmark question as a cluster. For `epochs=k`, each cluster has `k` observations. Each cluster's internal variance contributes to the overall standard error.

The previous approach assumed that within-cluster variation was zero. Just to be clear, this assumption is also incorrect if `epochs=1`. If `epochs=1` we cannot estimate the within-cluster variance, but that does not mean we can assume it is zero. The clustered standard error reduces to the heteroskedasticity-robust [1] standard error when `epochs=1`.
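To make that concrete, here is a minimal sketch of the estimator (my own illustration with hypothetical names, not the PR's exact code):

```python
import numpy as np


def clustered_stderr(scores_by_question: dict[str, list[float]]) -> float:
    """Cluster-robust standard error of the overall mean score.

    Each key is a question (cluster); each value is that question's
    per-epoch scores.
    """
    all_scores = [s for cluster in scores_by_question.values() for s in cluster]
    n = len(all_scores)
    g = len(scores_by_question)  # number of clusters (questions)
    mean = np.mean(all_scores)
    # (1/n^2) * sum over clusters of (sum of within-cluster deviations)^2.
    # Expanding the square yields the diagonal (heteroskedasticity-robust)
    # terms plus the off-diagonal (within-cluster covariance) terms.
    var = sum(
        sum(x - mean for x in cluster) ** 2
        for cluster in scores_by_question.values()
    ) / n**2
    var *= g / (g - 1)  # small sample correction
    return float(np.sqrt(var))
```

With `epochs=1` each cluster is a single observation, so the squared cluster sums collapse to squared deviations and this reduces to the heteroskedasticity-robust standard error.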
Changes to types
I introduced a `ReducedScore` class as a subclass of `Score`. A `ReducedScore` keeps a reference to its “children” (a `list[Score]`), i.e. the scores that were used to calculate the reduced score. This is because the `stderr` metric needs to have access to the unreduced scores to calculate the clustered standard error.

This is not a breaking change to other metrics (e.g. `accuracy` still works without code changes). Metrics can continue to use existing `Score` functionality, without using the `children` field of a `ReducedScore`.
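Roughly, the new type looks like this (a simplified sketch; the real `Score` class in Inspect carries more fields than shown here):

```python
from dataclasses import dataclass, field


@dataclass
class Score:
    value: float


@dataclass
class ReducedScore(Score):
    # The unreduced per-epoch scores this score was reduced from. Metrics
    # that don't need them (e.g. accuracy) can simply ignore this field.
    children: list[Score] = field(default_factory=list)
```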
A `Reducer` still takes in a `Score`, but now returns a `ReducedScore` instead of a `Score`. The `Metric` protocol (e.g. `stderr`, `std`, `mean`) now operates on lists of `ReducedScore` rather than lists of `Score`.
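Sketching the updated signatures (simplified from the actual Inspect protocols, reusing the `Score`/`ReducedScore` sketch above):

```python
from typing import Callable

# A Reducer collapses one question's per-epoch scores into a single score,
# now keeping the originals as children:
Reducer = Callable[[list[Score]], ReducedScore]

# Metrics now receive the reduced scores, with children attached:
Metric = Callable[[list[ReducedScore]], float]


def mean_reducer(scores: list[Score]) -> ReducedScore:
    value = sum(s.value for s in scores) / len(scores)
    return ReducedScore(value=value, children=list(scores))
```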
A `ReducedScore` could also be used recursively, e.g. when there are thematically related clusters of questions. This was suggested by Miller (Adding Error Bars to Evals, https://arxiv.org/abs/2411.00640).

Tests
I have added tests, including a check that this implementation gets the same numerical results as the clustered standard errors in `statsmodels` [2].

[0] : https://en.wikipedia.org/wiki/Clustered_standard_errors
[1] : https://en.wikipedia.org/wiki/Heteroskedasticity-consistent_standard_errors
[2] : https://github.com/statsmodels/statsmodels