Pre-filtering by number of events #53

Closed
JasonTam opened this issue Feb 1, 2019 · 2 comments

Comments

JasonTam commented Feb 1, 2019

Hi, I'm fairly new to this area, and I just wanted a sanity check on whether it makes sense to pre-filter a dataset based on the number of events. For example, remove all users with fewer than k events in the observation period.

I can see this making sense with k=1 since we tend to drop the first event for all sequences anyway (#37 (comment)). Of course this might depend on the dataset, and I plan to play around with it. However, I just wanted to know whether it is common practice to drop records like this. Or do we favor keeping all users so we can learn user features correlated with single event -> churn?

Thanks

ragulpr (Owner) commented Feb 13, 2019

Very good question. Whatever you do introduces its own particular bias. I think pre-filtering is very common practice, and I don't think people are aware of how biasing it is.

To minimize how much you have to think about it, start from who you want to predict for and when.
Keeping the training set identical to the prediction dataset (i.e. keeping all users) will save you a lot of headaches. That's my general advice.

For example, say you would like to predict, today, the time to the next event for every user who has ever been active.
Then each day (even empty days) should probably be represented in the training dataset.
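
As a minimal sketch of what "representing empty days" could look like, the hypothetical helper below (`daily_counts`, not part of any library) turns one user's event dates into a dense per-day count vector with explicit zeros. The dates are made up for illustration.

```python
from datetime import date

def daily_counts(event_dates, first_day, last_day):
    """Return one event count per calendar day in [first_day, last_day], zeros included."""
    n_days = (last_day - first_day).days + 1
    counts = [0] * n_days
    for d in event_dates:
        if first_day <= d <= last_day:
            counts[(d - first_day).days] += 1
    return counts

# Toy user: two events on Feb 1 and one on Feb 4.
events = [date(2019, 2, 1), date(2019, 2, 1), date(2019, 2, 4)]
counts = daily_counts(events, date(2019, 2, 1), date(2019, 2, 5))
print(counts)  # [2, 0, 0, 1, 0]
```

The days without events stay in the sequence as zeros instead of being dropped, so the model also sees the gaps.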

If you train on users who have had at least 2 events in the past 60 days, then at t=0 the initial prediction the model should learn to make is that a user has probability 1 of having an event within 60 days, i.e. it'll learn Pr(Y_0 < 60)=1. As you get closer to the end of the dataset, this should hold for smaller lookaheads too,
i.e. Pr(Y_30 < 30)=1 if there were fewer than 2 events in the first 30 days. In other words, you have learned a different query than you intended:

Pr(Y_t<y) = probability of having an event within y days given that they will have had at least 2 events in 60-t days,

So through your data munging you're actually conditioning on the future instead of predicting the future :D
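
To make the effect concrete, here is a small simulation sketch using only the Python standard library. The event model (each user has an independent 2% chance of an event on each of 60 days) and all function names are assumptions chosen for illustration, not anything from WTTE itself.

```python
import random

def simulate_event_days(n_users, horizon=60, daily_rate=0.02, seed=0):
    """Return, per user, the list of days in [0, horizon) on which an event occurred."""
    rng = random.Random(seed)
    return [
        [day for day in range(horizon) if rng.random() < daily_rate]
        for _ in range(n_users)
    ]

def frac_with_event_within(users, t, lookahead):
    """Empirical Pr(Y_t < lookahead): share of users with an event in [t, t + lookahead)."""
    hits = sum(any(t <= d < t + lookahead for d in days) for days in users)
    return hits / len(users)

all_users = simulate_event_days(n_users=5000)
filtered = [days for days in all_users if len(days) >= 2]  # pre-filter: at least 2 events

p_all = frac_with_event_within(all_users, t=0, lookahead=60)
p_filtered = frac_with_event_within(filtered, t=0, lookahead=60)

# Filtered users with fewer than 2 events in the first 30 days must have one later.
late = [days for days in filtered if sum(d < 30 for d in days) < 2]
p_late = frac_with_event_within(late, t=30, lookahead=30)

print(f"all users:      Pr(Y_0 < 60)  = {p_all:.2f}")
print(f"filtered users: Pr(Y_0 < 60)  = {p_filtered:.2f}")
print(f"filtered, <2 early events: Pr(Y_30 < 30) = {p_late:.2f}")
```

On the filtered set both probabilities are exactly 1 by construction, whatever the underlying event rate is, while the unfiltered estimate stays below 1. That gap is the information about the future that the filter leaks into training.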

If it's impossible to let the dataset represent all sequences since recording started (which is the best option), I think it makes more sense to use a look-back query, something like this:

SELECT
	id,
	DATE(timestamp) AS date,
	COUNT(*) AS n_events,
	...
FROM
	PAYMENTS
WHERE
	DATE(timestamp) > today - 60  -- pseudocode: date arithmetic syntax depends on the SQL dialect
GROUP BY
	id, date

The query you would be training for is then

Pr(Y_t<y) = probability of having an event within y days given that we've seen them in the past 60-t days

This is fine in the sense that you'll make a prediction after their first event, so the filtering of the data doesn't reveal the future. It may raise its own peculiar questions[0], but the problems are much less apparent.
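
The look-back query can be tried out on a toy table. The sketch below runs a SQLite version of it through Python's `sqlite3` module; the table contents, the fixed value of `today`, and the alias `day` are assumptions made for illustration, and the elided columns are left out.

```python
import sqlite3

# Toy PAYMENTS table; ids and timestamps are made up for illustration.
rows = [
    (1, "2019-02-01 10:00:00"),
    (1, "2019-02-01 18:30:00"),
    (1, "2019-02-10 09:00:00"),
    (2, "2019-01-05 12:00:00"),  # a single event: still kept
    (3, "2018-11-01 08:00:00"),  # more than 60 days before `today`: dropped
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PAYMENTS (id INTEGER, timestamp TEXT)")
conn.executemany("INSERT INTO PAYMENTS VALUES (?, ?)", rows)

# SQLite spelling of the look-back filter: keep only the last 60 days.
query = """
SELECT
    id,
    DATE(timestamp) AS day,
    COUNT(*) AS n_events
FROM PAYMENTS
WHERE DATE(timestamp) > DATE(:today, '-60 days')
GROUP BY id, DATE(timestamp)
ORDER BY id, day
"""
result = conn.execute(query, {"today": "2019-02-13"}).fetchall()
conn.close()

for row in result:
    print(row)
```

User 2 has only one event but still appears, because the filter looks only at the past. User 3 drops out because their only event is older than the window.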

I haven't codified a catch-all solution for the problem of, say, 99% of users arriving just once (which causes very high sparsity and a lot of data).

To calm your worries: it's not a WTTE-specific problem. I think these types of biases are present in many machine learning systems, and they work anyway through blissful ignorance.

[0] E.g., those that had events in the first days of this query window are probably the sequences with many events in general. What kinds of biases does this induce? Also, there may be entrants into this dataset who haven't been active for more than 60 days. Does this cause problems?

TL;DR: try not to, but it's complex.

@JasonTam
Author

Thanks for the detailed response! It was really helpful :)
