Interaction expansion causes major slowdown #35
Comments
After some profiling and debugging, I think I've identified the culprit section of code (part of the internal [...]).

The TL;DR version is that this part of the code replaces reference levels for the FE factors with NA. However, when two or more factors are fed through the subsequent [...]

Example: Say we have two factor-based vectors f1 = [1,2,3,4,5] and f2 = [a,b,c,d,e] that have reference levels "1" and "a", respectively. If we interact them, then we'd ideally want only a single reference case, e.g. "1.a". But what's happening at the moment is that "1.b", "1.c", ..., "2.a", "3.a", etc. are all getting coded as the reference level too, because they contain either "1" or "a".

The bottom line, as far as I can tell, is that we end up with a lot of "false" reference cases that later cause a bottleneck when passed to the key [...]

PR coming shortly.
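To make the example concrete, here is a minimal base-R sketch of the counting argument. It is not the package's internal code: it only assumes that "reference" means the first level of each factor, and the names `grid`, `intended_ref` and `false_ref` are made up for illustration.

```r
# Two factors with reference levels "1" and "a", as in the example above
f1 <- factor(1:5)
f2 <- factor(letters[1:5])

# All 25 combinations, so each interacted level appears exactly once
grid <- expand.grid(f1 = f1, f2 = f2)
f12  <- interaction(grid$f1, grid$f2)  # levels such as "1.a", "2.a", ..., "5.e"

ref1 <- levels(f1)[1]  # "1"
ref2 <- levels(f2)[1]  # "a"

# Intended: only the combination of BOTH reference levels is the reference case
intended_ref <- grid$f1 == ref1 & grid$f2 == ref2

# Behaviour described above: a combination is treated as reference
# if it involves EITHER reference level
false_ref <- grid$f1 == ref1 | grid$f2 == ref2

as.character(f12[intended_ref])  # the single intended reference case
sum(false_ref)                   # how many cases get flagged instead
```

With five levels per factor, the "either" rule flags 9 of the 25 combinations rather than 1, and the gap widens as the number of levels grows.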
Rerunning the above example with my PR branch:
Created on 2020-10-07 by the reprex package (v0.3.0)

Jumps around a bit, but I'm generally seeing a 15-25x improvement for this small(ish) example. FWIW, I've also checked the output and it's the same, both among the three models and across my branch and the CRAN version. I'd appreciate others kicking the tires, though.
Here's a gist to test the functionality with a real-world data set. Things look fine (although admittedly this is my very first time using
I can confirm that @grantmcdermott's fix works for me. I ran a regression with 270,000 observations, 600 clusters, and about 10,000 fixed effects on my Windows computer. Most recent version of
Just came across this issue (thanks to @reifjulian for the prompt).
The TL;DR version is that using interaction term expansion (i.e. `f1*f2`, or even `f1:f2`) in the FE slot causes a major slowdown. The latter is faster than the former, but still significantly slower than creating the interaction outside of the `felm()` call.

In the reprex below, I'm using an IV regression (adapted from the docs) since that's the use-case we've been troubleshooting. But I've tested a non-IV example and the effect is the same. From my limited testing, the relative disparities also appear to increase as the data get bigger.

PS. The `felm()` documentation warns users not to use `*` expansion in the FE slot. But AFAIK this only applies in cases where both variables have not been specified as factors.

Again, the first two cases with internal expansion (especially est1) are much slower than est3, which creates the interaction outside of the `felm()` call.

And just to confirm that they're yielding the same output:
Created on 2020-10-05 by the reprex package (v0.3.0)
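For anyone who wants to try the workaround, here is a rough non-IV sketch of "creating the interaction outside of the `felm()` call". It is not the original reprex: the simulated data and the names `d`, `f12`, `est_inside` and `est_outside` are all hypothetical, and the `felm()` calls only run if `lfe` is installed.

```r
set.seed(123)
n <- 1000

# Simulated data (hypothetical; stands in for a real data set)
d <- data.frame(
  x  = rnorm(n),
  f1 = factor(sample(1:5, n, replace = TRUE)),
  f2 = factor(sample(letters[1:5], n, replace = TRUE))
)
d$y <- 2 * d$x + rnorm(n)

# Workaround: build the interacted factor BEFORE calling felm()
d$f12 <- interaction(d$f1, d$f2, drop = TRUE)

if (requireNamespace("lfe", quietly = TRUE)) {
  # Interaction expanded inside the FE slot (the slow path reported here)
  est_inside  <- lfe::felm(y ~ x | f1:f2, data = d)
  # Interaction created beforehand and passed as a single factor
  est_outside <- lfe::felm(y ~ x | f12, data = d)

  # The coefficient on x should agree between the two fits
  print(coef(est_inside))
  print(coef(est_outside))
}
```

Wrapping each `felm()` call in `system.time()` is one simple way to compare the two paths. A data set this small is unlikely to show much of a gap, since the disparity appears to grow with the data.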