Pre-filtering by number of events #53

JasonTam · 2019-02-01T18:33:00Z

Hi, I'm fairly new to this area, and I wanted a sanity check on whether it makes sense to pre-filter a dataset based on the number of events. For example, removing all users with fewer than k events in the observation period.

I can see this making sense with k=1, since we tend to drop the first event for all sequences anyway (#37 (comment)). Of course this might depend on the dataset, and I plan to play around with it. However, I wanted to know whether it is common practice to drop records like this. Or do we favor keeping all users so we can learn user features correlated with single event -> churn?

Thanks


ragulpr · 2019-02-13T14:39:25Z

Very good question. Whatever you do, it introduces its own specific bias. I think pre-filtering is a very common practice, and I don't think people are aware of how biasing it is.

To minimize how much you have to think about it, think about who you want to predict for and when.
Keeping the training set identical to the prediction dataset (i.e. keeping all users) will save you a lot of headaches. That's my general advice.

For example, say you would like to predict, today, the time to the next event for all users who have ever been active.
Then each day (even empty days) should probably be represented in the training dataset.
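As a minimal sketch of what "representing empty days" could look like, the snippet below expands one user's event dates into a daily count series that keeps zero-event days. The function name `daily_event_counts` and the sample dates are hypothetical and only for illustration; they are not part of WTTE-RNN.

```python
from collections import Counter
from datetime import date, timedelta

def daily_event_counts(event_dates, start, end):
    """Return one (day, n_events) pair per calendar day in [start, end].

    Days without events are kept with a count of 0, so the sequence
    looks the same at training time and at prediction time.
    """
    counts = Counter(event_dates)  # missing days read as 0
    n_days = (end - start).days + 1
    days = [start + timedelta(days=i) for i in range(n_days)]
    return [(day, counts[day]) for day in days]

# Hypothetical user with events on two days of a one-week window.
events = [date(2019, 1, 1), date(2019, 1, 1), date(2019, 1, 4)]
series = daily_event_counts(events, date(2019, 1, 1), date(2019, 1, 7))
```

Here `series` has seven entries, five of which carry a count of 0 rather than being dropped.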

If you train on users who have had at least 2 events in the past 60 days, then at t=0 the initial prediction the model should learn to make is that a user has probability 1 of having an event within 60 days, i.e. it'll learn Pr(Y_0 < 60)=1. As you get closer to the end of the dataset, this should hold for smaller lookaheads too,
i.e. Pr(Y_30 < 30)=1 if there were fewer than 2 events in the first 30 days. In other words, you have learnt a different query than you intended:

Pr(Y_t<y) = probability of having an event within y days given that they will have had at least 2 events in 60-t days

So through your data munging you're actually conditioning on the future instead of predicting the future :D
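The effect can be illustrated with a small simulation. This is only a sketch under made-up assumptions: hypothetical users who each have an independent 1% chance of an event on each of 60 days. It compares the empirical probability of "at least one event within 60 days, seen from t=0" before and after keeping only users with at least 2 events.

```python
import random

def simulate_users(n_users, horizon=60, daily_rate=0.01, seed=0):
    """Draw, for each hypothetical user, the days (0..horizon-1) with an event."""
    rng = random.Random(seed)
    return [[d for d in range(horizon) if rng.random() < daily_rate]
            for _ in range(n_users)]

def frac_with_event_in_window(users):
    """Empirical Pr(at least one event within the window), seen from t=0."""
    return sum(1 for days in users if days) / len(users)

users = simulate_users(n_users=5000)
kept = [days for days in users if len(days) >= 2]  # the pre-filter

# In expectation 1 - 0.99**60, i.e. about 0.45, for the full population.
p_all = frac_with_event_in_window(users)
# Exactly 1.0 by construction: the filter already used the future.
p_kept = frac_with_event_in_window(kept)
```

A model trained only on `kept` would see every sequence end up with an event inside the window, which is the Pr(Y_0 < 60)=1 situation described above.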

If it's impossible to let the dataset represent all sequences since we started recording (which is the best option), I think it makes more sense to use a look-back query, something like this:

SELECT 
	id,
	DATE(timestamp) as date,
	count(*) as n_events,
	...
FROM
	PAYMENTS
WHERE 
	date>today-60
GROUP BY
	id,date

The query you would be training for is then

Pr(Y_t<y) = probability of having an event within y days given that we've seen them in the past 60-t days

This is fine in the sense that you'll make a prediction after their first event, so the filtering of the data doesn't reveal the future. It may raise its own peculiar questions[0], but the problems are much less apparent.
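For readers who prepare data outside the database, the look-back query above can be mirrored in plain Python. This is a rough sketch that assumes the rows arrive as `(id, timestamp)` tuples; the function name `lookback_counts` and the sample rows are hypothetical.

```python
from collections import Counter
from datetime import date, datetime, timedelta

def lookback_counts(payments, today, lookback_days=60):
    """Count events per (id, date) for rows newer than today - lookback_days.

    Selection only looks backwards from `today`, so a user enters the
    dataset at their first observed event and no future information is used.
    """
    cutoff = today - timedelta(days=lookback_days)
    counts = Counter(
        (user_id, ts.date())
        for user_id, ts in payments
        if ts.date() > cutoff
    )
    return sorted((uid, day, n) for (uid, day), n in counts.items())

# Hypothetical rows: user "c" falls outside the 60-day look-back window.
payments = [
    ("a", datetime(2019, 2, 1, 9, 0)),
    ("a", datetime(2019, 2, 1, 17, 30)),
    ("b", datetime(2019, 1, 20, 12, 0)),
    ("c", datetime(2018, 11, 1, 8, 0)),
]
rows = lookback_counts(payments, today=date(2019, 2, 13))
```

Note that a user with a single event ("b" here) is kept, which is the difference from filtering on a minimum number of events.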

I haven't codified a catch-all solution to the problem of, say, 99% of users arriving just once (which causes very high sparsity and a lot of data).

To calm your worries: it's not a WTTE-specific problem. I think these types of biases are present in many machine learning systems, and they work anyway through blissful ignorance.

[0] E.g., those that had events in the first days of this query window are probably the sequences with many events in general. What kinds of biases does this induce? Also, there may be entrants into this dataset who haven't been active for more than 60 days. Does this cause problems?

TL;DR: try not to, but it's complex.

JasonTam · 2019-02-14T21:44:49Z

Thanks for the detailed response! It was really helpful :)

JasonTam closed this as completed Feb 14, 2019
