Pre-filtering by number of events #53
Very good question. Whatever you do, it introduces its own specific little bias. I think pre-filtering is a very common practice, and I don't think people are aware of how biasing it is. To have to think about it as little as possible, think about who you want to predict for, and when. For example, would you like to predict the time to next event today for all users who have ever been active? If you train on users who have had at least 2 events in the past 60 days, then at t=0 the initial prediction the model should learn to make is that a user has probability=1 of having an event within 60 days, i.e. it'll learn
So through your data munging you're actually conditioning on the future instead of predicting the future :D If it's impossible to let the dataset represent all sequences since we started recording (which is best), I think it makes more sense to use a look-back query, something like this:
The query you would be training for is then
Which is fine in the sense that you'll make a prediction after their first event, so the filtering of the data doesn't reveal the future. It may raise its own peculiar questions[0], but the problems are much less apparent. I haven't codified a catch-all solution to the problem of, say, 99% of users arriving just once (which causes very high sparsity and a lot of data). To calm your worries, it's not a WTTE-specific problem. I think these types of biases are present in many machine learning systems, and they work anyway through blissful ignorance.
[0] E.g., those that had events in the first days of this query are probably the sequences with many events in general. What kinds of biases does this induce? Also, there may be entrants into this dataset who haven't been active for more than 60 days. Does this cause problems?
TL;DR: try not to, but it's complex.
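To make the "conditioning on the future" point concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration, not something taken from WTTE-RNN or from any real dataset: the Poisson event model, the three per-day rates, the population size, and the helper names. It compares the fraction of users with at least one event in a 60-day window before and after keeping only users with at least 2 events in that same window.

```python
import random


def simulate_event_days(rate_per_day, horizon_days, rng):
    """Event times of a homogeneous Poisson process on [0, horizon_days)."""
    times = []
    t = rng.expovariate(rate_per_day)
    while t < horizon_days:
        times.append(t)
        t += rng.expovariate(rate_per_day)
    return times


def frac_with_event(population):
    """Fraction of users with at least one event in the window."""
    return sum(1 for events in population if events) / len(population)


rng = random.Random(0)
HORIZON = 60  # days in the observation window (assumed)
RATES = [0.005, 0.01, 0.2]  # hypothetical per-day event rates

# Hypothetical population: most users are rarely active, some are very active.
users = [
    simulate_event_days(rng.choice(RATES), HORIZON, rng)
    for _ in range(10_000)
]

# Pre-filter: keep only users with at least 2 events inside the window.
filtered = [events for events in users if len(events) >= 2]

p_all = frac_with_event(users)
p_filtered = frac_with_event(filtered)

print(f"users kept after filtering: {len(filtered)} of {len(users)}")
print(f"P(event within {HORIZON} days), all users: {p_all:.3f}")
print(f"P(event within {HORIZON} days), filtered:  {p_filtered:.3f}")
```

On the filtered set the fraction is 1.0 by construction, so a model trained only on those sequences has nothing to learn about users who never return. The unfiltered fraction depends on the assumed rates.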
Thanks for the detailed response! It was really helpful :)
Hi, I'm fairly new to this area, and I just wanted a sanity check to see if it makes sense to pre-filter a dataset based on number of events. For example, remove all users with fewer than k events in the observation period. I can see this making sense with k=1 since we tend to drop the first event for all sequences anyway (#37 (comment)). Of course this might depend on the dataset, and I plan to play around with it. However, I just wanted to know whether it is common practice to drop records like this. Or do we favor keeping all users so we can learn user features correlated with single event -> churn? Thanks