Meta-labelling #27

Open
PeterAVG opened this issue Mar 18, 2019 · 24 comments

Comments

@PeterAVG

If one uses e.g. an RF model to learn the side (instead of a simple moving average model), then how is meta-labelling carried out? Because we train the primary model (RF) on X_train, do we also predict on this same X_train to obtain the sides used to train the 2nd model? This strategy is employed by Jacques Joubert (@Jackal08) in his notebook on Quantopian on the MNIST data set (as well as on a validation and hold-out set). However, the 1st model's accuracy is then heavily inflated, which I think would propagate to the 2nd model.

Further, Prado suggests that one can change the horizontal barriers when meta-labelling, but I can't see why that makes sense: the 1st model is trained on one particular configuration of barriers, and if we meta-label according to a different configuration, then the 1st model is not evaluated properly. Does anyone have any idea why Prado suggests this?

@rspadim

rspadim commented Mar 18, 2019

P(Y1|X) - first model

P(Y2|X) - second model

No problem. You shouldn't do P(Y2 | X, Y1), because Y1 is fitted to the same real Y values that Y2 will be fitted to; in that case you could contaminate the predicted value through the Ŷ1 input feature.

@PeterAVG
Author

So does that mean that you can't use the predictions from the 1st model as a new feature to X as Jackal08 did? Because otherwise, the problem is exactly the same, just with {0,1} as targets instead of {-1,1}, right?

@rspadim

rspadim commented Mar 18, 2019

no, what i saw:

model1.fit(X,Y1)

model2.fit([X, model1.predict(X)], Y2) <- you can't do this without caution

model2.fit(X, Y2) <- you can do this without problems

@PeterAVG
Author

PeterAVG commented Mar 18, 2019

Which means that one should not use model1 predictions as a feature for the 2nd model, which is contrary to what I see in papers and blogs. But I guess it makes sense, because if the 1st model gets 90% accuracy on its own training data (which isn't unrealistic), then that feature will have high predictive power that propagates to the 2nd model.

@rspadim

rspadim commented Mar 18, 2019

The problem with this:
model2.fit([X, model1.predict(X)], Y2) <- you can't do this without caution

is that model1.predict(X) can become nearly equal to Y2, and then you have data contamination. That's the main problem. If model1.predict(X) doesn't predict Y2 "easily" (I'm trying to remember the right formal word here), you can use it without problems; it's like the idea of stacking models.
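
One common way to apply the "caution" mentioned here, borrowed from stacking, is to feed model 2 out-of-fold predictions instead of in-sample ones. The sketch below is only an illustration under stated assumptions: the data is synthetic, the model settings are arbitrary, and a plain `KFold` is used where financial data would probably need a purged or embargoed split. It is not the method from the book or the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 5 features, side weakly driven by feature 0.
X = rng.normal(size=(400, 5))
y_side = np.where(X[:, 0] + rng.normal(scale=2.0, size=400) > 0, 1, -1)

model1 = RandomForestClassifier(n_estimators=100, random_state=0)

# In-sample predictions: model1 has already seen these labels.
in_sample_side = model1.fit(X, y_side).predict(X)

# Out-of-fold predictions: each row is predicted by a clone that never saw it.
cv = KFold(n_splits=5, shuffle=False)
oof_side = cross_val_predict(model1, X, y_side, cv=cv)

in_sample_acc = float(np.mean(in_sample_side == y_side))
oof_acc = float(np.mean(oof_side == y_side))
print(f"in-sample accuracy:   {in_sample_acc:.2f}")
print(f"out-of-fold accuracy: {oof_acc:.2f}")

# Meta-labels: 1 where the primary model's side was right, 0 otherwise.
y_meta = (oof_side == y_side).astype(int)

# Feature matrix for model 2: original features plus the out-of-fold side.
X_meta = np.column_stack([X, oof_side])
```

The gap between the two printed accuracies gives a rough idea of how much the in-sample feature would flatter model 1.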

@PeterAVG
Author

I agree: when the predictions from the 1st model are not inflated too much, it is possible to use them, since the meta-labels are then not too imbalanced. However, I am still very confused about Prado's intentions regarding meta-labelling.

@rspadim

rspadim commented Mar 18, 2019

There are two predictions, size and side, and you can model each one with any X values. The point is to avoid data contamination.

@PeterAVG
Author

Yes, I am completely aware of that. It is just difficult for me to understand conceptually what data the 2nd model should be trained on in relation to the 1st model's predictions. As I see it, one cannot get around the fact that we have to predict on X_train, and those predictions may or may not be used as an additional feature. The whole problem for me is that the 1st model's predictions are highly inflated, and I simply can't see how that is useful in a 2nd model.

@Jackal08

Jackal08 commented Mar 18, 2019

Ah ok I think I see where the confusion comes in.

So first of all, the main idea behind meta labeling is that there is probably a set of features for which the primary model is likely to be wrong. In the example of a trading strategy, let's say you are using a trending strategy like a moving average crossover. If markets stop trending, then the strategy is likely to not perform. This can be picked up by using features like momentum, serial correlation, and volatility.

The meta model would learn which features lead to the primary model being wrong.

Now with a technical or discretionary PM strategy it's easy, because we have the sides provided upfront. We can easily just train a model on those signals and voilà. However, let's say we have a primary model based on machine learning. This model would need to first be trained on some training and validation data, and then we would need to run it out of sample to get a few scores so that we can build the meta model.

[See my edit at the bottom in reference to this paragraph.]
This is where the confusion comes in. In the MNIST notebook I published I trained the primary model (logistic regression) on the training data but then I used the same data to fit the meta model. I feel that was a mistake. I would not do that in practice. However I am surprised that it worked so well out of sample. Maybe there is something to be said about using training data for both... (again, I think it's better to fit the meta model on data not used for training in the primary model.)
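
The preference stated here (fit the meta model on data the primary model was not trained on) can be sketched as a three-way split in time order. The function name and the 50/25/25 proportions below are hypothetical choices for illustration, not something taken from the notebook.

```python
def chronological_three_way_split(n_samples, primary_frac=0.5, meta_frac=0.25):
    """Return index ranges for primary-train, meta-train and test, in time order."""
    primary_end = int(n_samples * primary_frac)
    meta_end = primary_end + int(n_samples * meta_frac)
    return (
        range(0, primary_end),         # fit the primary (side) model here
        range(primary_end, meta_end),  # predict sides here, then fit the meta model
        range(meta_end, n_samples),    # score both models together here
    )


primary_idx, meta_idx, test_idx = chronological_three_way_split(1000)
print(len(primary_idx), len(meta_idx), len(test_idx))  # 500 250 250
```

The cost of this design is that each model sees less data, which is part of the trade-off discussed in this thread.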

As I am writing this I also realize that if your primary model is an online learning model then it might not work well for meta labeling as its weights are adjusted through time and the meta model may not be able to learn the feature set for which the primary model performs poorly.

I did an extra two notebooks on a trend following and a mean reversion strategy and fit meta labeling and scored out-of-sample. The strategy works as advertised. In both strategies the risk adjusted metrics are better.

There are 5 papers which I think are important to read to get more intuition behind some of the rationale of meta labeling:
* Wang, J. Chan, S. 2006. Stock market trading rule discovery using two-layer bias decision tree. Expert Systems with Applications 30 (2006) 605–611
* Qin, Q. Wang, Q. Li, J. Sam Ge, S. 2013. Linear and Nonlinear Trading Models with Gradient Boosted Random Forests and Application to Singapore Stock Market. Journal of Intelligent Learning Systems and Applications, 2013, 5, 1-10
* Tsai, C.F. and Wang, S.P., 2009, March. Stock price forecasting by hybrid machine learning techniques. In Proceedings of the International MultiConference of Engineers and Computer Scientists (Vol. 1, No. 755, p. 60).
* Patel, J., Shah, S., Thakkar, P. and Kotecha, K., 2015. Predicting stock market index using fusion of machine learning techniques. Expert Systems with Applications, 42(4), pp.2162-2172.
* Zhu, M., Philpotts, D., Sparks, R. and Stevenson, M.J., 2011. A hybrid approach to combining CART and logistic regression for stock ranking. Journal of Portfolio Management, 38(1), p.100.

I hope this helps.

[Edit]
Oh wait I see page 54 says the following:

  1. Use your forecasts from the primary model, and generate meta-labels. Remember, horizontal barriers do not need to be symmetric in this case.
  2. Fit your model again on the same training set, but this time using the meta-labels you just generated.
  3. Combine the “sides” from the first ML model with the “sizes” from the second ML model.
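
As a rough illustration of steps 1 and 3 only, here is a minimal sketch that turns primary-model sides and realised returns into meta-labels and then combines sides with sizes. The helper names and all numbers are made up for the example; step 2 (fitting the second model) is left out, so the `sizes` list is a hypothetical model-2 output.

```python
def make_meta_labels(sides, returns):
    """1 if the primary model's side made money over the label horizon, else 0."""
    return [1 if side * ret > 0 else 0 for side, ret in zip(sides, returns)]


def combine(sides, sizes):
    """Signed position: side in {-1, 1} times size in [0, 1]."""
    return [side * size for side, size in zip(sides, sizes)]


sides = [1, -1, 1, -1]                # primary model forecasts (made up)
returns = [0.02, 0.01, -0.03, -0.01]  # realised returns at first barrier touch (made up)

meta = make_meta_labels(sides, returns)
print(meta)  # [1, 0, 0, 1]

sizes = [0.8, 0.2, 0.1, 0.6]          # hypothetical output of the second model
positions = combine(sides, sizes)
print(positions)  # [0.8, -0.2, 0.1, -0.6]
```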

I'd love to open up a dialog about this! What are your thoughts?

@Jackal08

> If one uses e.g. an RF model to learn the side (instead of a simple moving average model), then how is meta-labelling carried out? Because we train the primary model (RF) on X_train, do we also predict on this same X_train to obtain the sides used to train the 2nd model? This strategy is employed by Jacques Joubert (@Jackal08) in his notebook on Quantopian on the MNIST data set (as well as on a validation and hold-out set). However, the 1st model's accuracy is then heavily inflated, which I think would propagate to the 2nd model.
>
> Further, Prado suggests that one can change the horizontal barriers when meta-labelling, but I can't see why that makes sense: the 1st model is trained on one particular configuration of barriers, and if we meta-label according to a different configuration, then the 1st model is not evaluated properly. Does anyone have any idea why Prado suggests this?

Sorry I can't find this reference. On what page does it say that you can change the barriers after a primary model has determined the side?

@rspadim

rspadim commented Mar 18, 2019

> Yes, I am completely aware of that. It is just difficult for me to understand conceptually what data the 2nd model should be trained on in relation to the 1st model's predictions. As I see it, one cannot get around the fact that we have to predict on X_train, and those predictions may or may not be used as an additional feature. The whole problem for me is that the 1st model's predictions are highly inflated, and I simply can't see how that is useful in a 2nd model.

The point is: Model1 relies more on some features and Model2 on others. The features are the same, but their importances are different.

@lionelyoung
Contributor

Hi all,

I'd like to add to the dialog with my interpretation of the reading, which shows that we can use different features:

  • We may use the same X_train for both side and size (applies to all four reasons below)
  • We do not have to use the same X_train for both side and size (see reason 3 below)

MLdP gives four reasons about the benefits of meta-labeling which I will paraphrase below, comparing with and without meta-labeling:

  1. Meta-labeling gives explainability by separating the environment analysis and the execution into two steps:
    • With meta-labeling: Provides a "filter and a trigger" for the trade
    • Without meta-labeling (side prediction): Forces an answer ("must trade")
  2. Meta-labeling reduces the effects of overfitting, because the "trigger" decides only the size of the bet:
    • Without meta-labeling: Overfit because you'd get a side-prediction and it implies a trade on every prediction
    • With meta-labeling: Lets you improve the F1 score of a "profitable trade" by giving a size after reducing the false positives from predicting the side
  3. Meta-labeling decouples the side {-1,1} prediction from the size {0,1} prediction, which lets you recombine strategies in different ways. For example:
    • Without meta-labeling:
      • Side: ModelA+ModelB features all together to predict the side {-1, 1}
    • With meta-labeling:
      • Side: ModelA predicts the longs (side {-1,1} and then drop the -1's)
      • Side: ModelB predicts the shorts (side {-1, 1} and then drop the 1's)
      • Size: Meta-labeling can combine them for a trade sizing
  4. Meta-labeling emphasizes position sizing as a best practice, because it forces a classification algorithm for the size. To elaborate:
    • Without meta-labeling: Side-prediction: the "filter" that breaks down the environment to predict outcomes
    • With meta-labeling: the "trigger" that tells you whether it's worth it or not to take the trade
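
To make the second reason concrete, the toy example below compares the precision of taking every primary-model trade with the precision after a meta-model filter. All numbers are invented, and the 0.5 threshold and the `precision` helper are assumptions for illustration only.

```python
def precision(taken, profitable):
    """Share of taken trades that turned out profitable."""
    hits = sum(1 for t, p in zip(taken, profitable) if t and p)
    n_taken = sum(taken)
    return hits / n_taken if n_taken else 0.0


# One entry per primary-model signal (made-up data).
profitable = [1, 0, 1, 0, 1, 0, 0, 1]                   # did the trade make money?
meta_prob = [0.9, 0.3, 0.7, 0.6, 0.8, 0.2, 0.4, 0.55]   # hypothetical model-2 probabilities

take_all = [1] * len(profitable)                        # no meta-labeling: always trade
filtered = [1 if p >= 0.5 else 0 for p in meta_prob]    # meta-labeling: trade if p >= 0.5

print(precision(take_all, profitable))  # 0.5
print(precision(filtered, profitable))  # 0.8
```

In this toy case the filter drops three losing trades and keeps every winner, so recall is unchanged; real data will rarely be that clean.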

@PeterAVG
Author

@Jackal08 You referenced the critical part yourself:

  1. Use your forecasts from the primary model, and generate meta-labels. Remember, horizontal barriers do not need to be symmetric in this case.
  2. Fit your model again on the same training set, but this time using the meta-labels you just generated.
  3. Combine the “sides” from the first ML model with the “sizes” from the second ML model.

Thus, I see two issues that are still unclear to me: if one chooses an ML model to determine sides, then why use different barriers for meta-labelling? And what is best practice for training that model in relation to the 2nd ML model?

I agree that a MAvg model to learn sides simplifies the problem completely but I am sure there is a lot to be gained from having ML models as both the 1st and 2nd model. I am going to re-read the papers more thoroughly as well but my initial implementation is as follows:

  1. Train ML model 1 to learn sides on X_train. Predict on X_train but make sure the forecasts are not too good, otherwise imbalance will be present in the meta-labels (more 1s than 0s).
  2. Train ML model 2 to learn meta-labels (bet size) on X_train + predictions from the 1st model.
  3. Predict on the test set, apply the trading strategy etc., backtest through CPCV etc.
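
The imbalance concern in step 1 can be checked directly before fitting the second model. This is a small sketch with a hypothetical helper name and made-up labels.

```python
from collections import Counter


def meta_label_balance(meta_labels):
    """Fraction of 0s and 1s among the meta-labels."""
    counts = Counter(meta_labels)
    n = len(meta_labels)
    return {label: counts.get(label, 0) / n for label in (0, 1)}


# Made-up meta-labels from a primary model that is right 9 times out of 10 in-sample.
balance = meta_label_balance([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])
print(balance)  # {0: 0.1, 1: 0.9}
```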

@PeterAVG
Author

@lionelyoung Yes, I agree with all your points. The discussion is more about how to apply meta-labelling. Because you already use 'normal' labelling with barriers to train ML model 1 to learn the side, in my mind it does not make sense to apply meta-labelling while changing the barriers (as Prado suggests).

Instead, I would simply construct the set {0,1} directly from the predictions from the 1st model.

@lionelyoung
Contributor

lionelyoung commented Mar 19, 2019

Hi @PeterAVG,

I believe we're talking about this line, correct?

Use your forecasts from the primary model, and generate meta-labels. Remember, horizontal barriers do not need to be symmetric in this case.

My interpretation is that this is only for Model1 (side-learning), specifically that the horizontal barriers for long & short (upper and lower) can be different. For example, the horizontal barrier for long (upper barrier) is 10 points away, and the horizontal barrier for short (lower barrier) is 5 points away.

Subsequently, Model2 (size-learning) will use Model1's barriers after meta-labeling (keeping only the "sides" in the correct direction, i.e. positive values of signed side * signed returns)

Edit: I'd like to clarify that the "signed returns" is a function of the upper and lower barriers, and since signed returns are used in the meta-labeling, Model2 after meta-labeling effectively is using Model1's barriers
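
A minimal sketch of this interpretation, using the 10-point upper and 5-point lower barriers from the example above. The function names, the price path, and the choice to measure the return in points at the first touch are assumptions made for illustration; this is not the book's implementation.

```python
def first_touch_return(path, entry, upper, lower):
    """Signed price change at the first horizontal-barrier touch, or at the
    end of the path (the vertical barrier) if neither barrier is hit."""
    for price in path:
        move = price - entry
        if move >= upper or move <= -lower:
            return move
    return path[-1] - entry


def meta_label(side, signed_return):
    """1 if the side agreed with the sign of the return, else 0."""
    return 1 if side * signed_return > 0 else 0


entry = 100.0
path = [101.0, 97.0, 94.0, 103.0, 111.0]  # made-up prices after entry

# Asymmetric barriers: upper barrier 10 points away, lower barrier 5 points away.
ret = first_touch_return(path, entry, upper=10.0, lower=5.0)
print(ret)                   # -6.0 (lower barrier hit at 94.0)
print(meta_label(+1, ret))   # 0: a long forecast was wrong
print(meta_label(-1, ret))   # 1: a short forecast was right
```

Because the meta-label is derived from the same signed return, the second model inherits the first model's barriers, which is the point made in the edit above.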

@Jackal08

Jackal08 commented Mar 19, 2019

I agree with Lionel, this is how I understand it.

@rspadim

rspadim commented Mar 19, 2019

I don't know if everybody here likes derivatives, but the first idea about bet sizing I read was about using a binary option to model the size (and side).

The idea is the same: bet size + side, nothing different from this. In an option you have the strike difference and the right (payoff and call/put) to calculate the side and the bet size. It's not a new idea; it's an idea better explained by Prado using ML models (which really makes sense and works).

@Jackal08

@rspadim I would be very interested in reading more about this. Is there a book or a paper you would recommend that covers the topic?

@rspadim

rspadim commented Mar 19, 2019

The idea is good, but the model isn't: you cannot predict the future with volatility and Brownian motion as the underlying model (the basic inputs for a derivative pricing formula); it simply doesn't work. The fundamental idea is old; Browne did it. I don't know if you want to lose time on it, but anyway: Browne's bet-sizing idea with digital options.

@rspadim

rspadim commented Mar 19, 2019

@PeterAVG
Author

Great discussion. I might have misunderstood the code and comments from Prado; what you wrote, @lionelyoung, makes sense. It is an interesting topic, and I am trying to apply a 2nd ML model on top of a 1st one instead of just using e.g. a simple MAvg.

@mehmetdilek

How I have interpreted and implemented meta labels is as follows.

Meta labels are the PnL of the predictions of the first model. It is important to understand that this is the market's reaction to the primary model, and hence carries new information.

Hence, in the following setup, model 2 reveals what model 1 was not able to learn in X.

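from sklearn.model_selection import train_test_split

X_meta = X  # same feature matrix that model 1 was given
y_meta = data['meta']  # meta-labels: PnL outcome of model 1's predictions
X_train, X_test, y_train, y_test = train_test_split(X_meta, y_meta, test_size=0.5)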
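
One caveat on the split above: `train_test_split` shuffles rows by default, which for time-ordered market data can put later observations in the training half. A possible variant, sketched here with placeholder arrays rather than the real `X` and `data['meta']`, passes `shuffle=False` so the first half is used for training and the second half for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): rows are in time order.
X_meta = np.arange(20).reshape(10, 2)
y_meta = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])

# shuffle=False keeps the original order: first 50% train, last 50% test.
X_train, X_test, y_train, y_test = train_test_split(
    X_meta, y_meta, test_size=0.5, shuffle=False
)
print(X_train[:, 0].max() < X_test[:, 0].min())  # True
```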

@rspadim

rspadim commented Mar 20, 2019

@Jackal08 did you like it? =)

@younghoon020

Here's my interpretation after reading the book.

In meta-labeling, Y2 is a function of Y1. I keep in mind that machine learning is basically maximum likelihood estimation. In other words, it's a greedy optimization approach. Since Y2 is a function of Y1, with a greedy (quick and dirty) mindset, the machine learning model would optimize towards simply mimicking Y1's statistics in order to infer Y2. Therefore, training a model with objective P(Y2 | X, Y1) would be redundant in my opinion, as the model would simply rank Y1 highest in its feature importance. In other words, it's not going to learn new information; it's going to simply follow Y1. This happens all the time in machine learning. Then again, it's always better to test it out before accepting my guess as reasonable.
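
One way to "test it out" is to fit the secondary model on P(Y2 | X, Y1) and look at where the Y1 forecast lands in the feature importances. The sketch below uses synthetic data, arbitrary random forest settings, and out-of-fold primary forecasts, all of which are assumptions; it shows the mechanics of the check and makes no claim about what the ranking will be on real data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(1)

# Synthetic stand-in data (assumption): 4 features, true side weakly tied to feature 0.
X = rng.normal(size=(500, 4))
true_side = np.where(X[:, 0] + rng.normal(size=500) > 0, 1, -1)

# Out-of-fold primary forecasts (Y1), so they are not inflated by in-sample fitting.
primary = RandomForestClassifier(n_estimators=100, random_state=0)
y1 = cross_val_predict(primary, X, true_side, cv=KFold(n_splits=5))

# Meta-labels (Y2): 1 where the primary forecast matched the true side.
y2 = (y1 == true_side).astype(int)

# Secondary model for P(Y2 | X, Y1): Y1 is appended as the last feature.
X_with_y1 = np.column_stack([X, y1])
secondary = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_with_y1, y2)

names = ["x0", "x1", "x2", "x3", "y1_forecast"]
importances = dict(zip(names, secondary.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.3f}")
```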

The only concern here is that the Y2 distribution is going to be more imbalanced than Y1, since we're going to remove false positives by PM. But that's a good thing for us, because the secondary model is going to predict more negatives than positives: better safe than sorry in finance, right?

Also, in my experience, we want to include as much data as possible when fitting models, for the same reason I stated above: better safe than sorry. So, I would not stack the two models. However, stacking is definitely interesting to test out, because it allows for more aggressive investing. The reason is that stacking limits the secondary model's data to what lies within the domain of discourse of the primary model. The secondary model is going to fit aggressively, assuming that the primary model's domain of discourse produces alpha. If the primary model does produce significant alpha, the secondary model won't be risky. But we don't know for sure whether the primary model produces alpha in real life. So, again, better safe than sorry.
