Meta-labelling #27
Comments
P(Y1|X) - first model, P(Y2|X) - second model: no problem. You shouldn't do P(Y2|X, Y1), because Y1 is fitted to the real Y values that Y2 will also fit; in that case you could contaminate the prediction by using Ŷ1 as an input feature. |
So does that mean that you can't use the predictions from the 1st model as a new feature to X as Jackal08 did? Because otherwise, the problem is exactly the same, just with {0,1} as targets instead of {-1,1}, right? |
No, what I saw was: model1.fit(X, Y1), then model2.fit([X, model1.predict(X)], Y2) <- you can't do this without caution, whereas model2.fit(X, Y2) <- you can do this without problems. |
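A minimal sketch of the contrast above, assuming scikit-learn; the arrays X, y_side (Y1) and y_meta (Y2) are illustrative placeholders, not names from the thread:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Primary model: learns the side from the features.
model1 = RandomForestClassifier(n_estimators=100, random_state=0)
model1.fit(X, y_side)

# Risky: appending model1's in-sample predictions as a feature.
# These predictions are overfit to y_side, so they leak information
# into the meta model's training set.
X_contaminated = np.column_stack([X, model1.predict(X)])
model2_risky = RandomForestClassifier(n_estimators=100, random_state=0)
model2_risky.fit(X_contaminated, y_meta)

# Safe: the meta model sees only the original features.
model2_safe = RandomForestClassifier(n_estimators=100, random_state=0)
model2_safe.fit(X, y_meta)
```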
Which means that one should not use model1's predictions as a feature for the 2nd model - which is contrary to what I see in papers and blogs. But I guess it makes sense, because if the 1st model gets 90% accuracy on its own training data (which isn't unrealistic), then that feature will have artificially high predictive power that propagates to the 2nd model. |
The problem with this is that model1.predict(X) becomes essentially equal to Y2, and you have data contamination; that's the main problem. If model1.predict(X) doesn't predict Y2 "easily" (I'm trying to remember the right formal word here), you can use it without problems. It's like the idea of stacking models.
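If you do want the first model's prediction as a feature, the usual stacking trick is to feed the meta model out-of-fold predictions rather than in-sample ones. A sketch under the same assumed names as above (note that for financial time series a purged/embargoed split would be preferable to plain K-fold):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

model1 = RandomForestClassifier(n_estimators=100, random_state=0)

# Each out-of-fold prediction comes from a model that never saw that
# row during training, which limits the contamination described above.
oof_side = cross_val_predict(model1, X, y_side, cv=5, method="predict")

X_stacked = np.column_stack([X, oof_side])
model2 = RandomForestClassifier(n_estimators=100, random_state=0)
model2.fit(X_stacked, y_meta)
```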
I agree; when the predictions from the 1st model are not inflated too much, it is possible to use them, as the meta-labels are not imbalanced. However, I am still very confused about Prado's intentions regarding meta-labelling. |
There are two predictions, size and side, and each one you can model with any X values. The point is to avoid data contamination. |
Yes, I am completely aware of that. It is just difficult for me to understand conceptually what data the 2nd model should be trained on in relation to the 1st model's predictions. As I see it, one cannot get around the fact that we have to predict on X_train, which may or may not be used as an additional feature. The whole problem for me is that the 1st model's predictions on its own training data are highly inflated, which I simply can't see being useful to a 2nd model. |
Ah ok, I think I see where the confusion comes in. First of all, the main idea behind meta labeling is that there is probably a set of features for which the primary model is likely to be wrong. In the example of a trading strategy, let's say you are using a trending strategy like a moving average crossover. If markets stop trending then the strategy is likely to not perform. This can be picked up by using features like momentum, serial correlation, and volatility. The meta model would learn which features lead to the primary model being wrong.

Now, with a technical or discretionary PM strategy it's easy because we have the sides provided upfront. We can easily just train a model on those signals and voilà. However, let's say we have a primary model based on machine learning. This model would need to first be trained on some training and validation data, and then we would need to run it out of sample to get a few scores so that we can build the meta model (see the sketch below this comment). [See my edit at the bottom in reference to this paragraph.]

As I am writing this I also realize that if your primary model is an online learning model, then it might not work well for meta labeling, as its weights are adjusted through time and the meta model may not be able to learn the feature set for which the primary model performs poorly.

I did an extra two notebooks, on a trend-following and a mean-reversion strategy, fit meta labeling, and scored out-of-sample. The strategy works as advertised. In both strategies the risk-adjusted metrics are better.
There are 5 papers which I think are important to read to get more intuition behind some of the rationale of meta labeling. I hope this helps. [Edit]
I'd love to open up a dialog about this! What are your thoughts? |
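One way to read the workflow described in the comment above, sketched with scikit-learn; the arrays X, y_side, returns and the 60/40 split are my own illustrative assumptions, not from the comment:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

n = len(X)
split = int(n * 0.6)
X_train, X_val = X[:split], X[split:]
y_train = y_side[:split]
ret_val = returns[split:]  # realised returns over the validation span

# 1) Primary model learns the side on the training window only.
primary = RandomForestClassifier(n_estimators=100, random_state=0)
primary.fit(X_train, y_train)

# 2) Score it out of sample to see where it is right or wrong.
side_pred = primary.predict(X_val)  # values in {-1, 1}

# 3) Meta label: 1 if acting on the predicted side would have paid off.
y_meta = (side_pred * ret_val > 0).astype(int)

# 4) Meta model learns the feature regimes in which the primary model works.
meta = RandomForestClassifier(n_estimators=100, random_state=0)
meta.fit(X_val, y_meta)
```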
Sorry I can't find this reference. On what page does it say that you can change the barriers after a primary model has determined the side? |
The point is: Model1 relies more on some features, Model2 on others. The features are the same, but the importances are different. |
Hi all, I'd like to add to the dialog with my interpretation of the reading, which shows that we can use different features:
MLdP gives four reasons for the benefits of meta-labeling, which I will paraphrase below, comparing with and without meta-labeling:
|
@Jackal08 You referenced the critical part yourself:
Thus, I see two issues that are still not clear to me: if one chooses an ML model to determine sides, then why use different barriers for meta-labelling? And what is best practice for training that model in relation to the 2nd ML model? I agree that a MAvg model to learn sides simplifies the problem completely, but I am sure there is a lot to be gained from having ML models as both the 1st and 2nd model. I am going to re-read the papers more thoroughly as well, but my initial implementation is as follows:
|
@lionelyoung Yes, I agree with all your points. The discussion is more about how to apply meta-labelling. Because you already use 'normal' labelling with barriers to train ML model 1 to learn the side, then - in my mind - it doesn't make sense to apply meta-labelling while changing the barriers (as Prado suggests). Instead, I would simply construct the set {0,1} directly from the predictions of the 1st model. |
Hi @PeterAVG, I believe we're talking about this line, correct?
My interpretation is that this is only for Model1 (side-learning) -- specifically that the horizontal barriers for long & short (upper and lower) can be different. For example, the horizontal barrier for long (upper barrier) is 10 points away, and the horizontal barrier for short (lower barrier) is 5 points away. Subsequently, Model2 (size-learning) will use Model1's barriers after meta-labeling (selecting only the "side" in the correct direction -- i.e. for positive returns, by multiplying the signed side * signed returns). Edit: I'd like to clarify that the "signed returns" are a function of the upper and lower barriers, and since signed returns are used in the meta-labeling, Model2 after meta-labeling is effectively using Model1's barriers. |
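A toy illustration of the asymmetric horizontal barriers in this example (profit-taking 10 points away, stop-loss 5 points away for a long position); the function and price series are purely illustrative, not code from the thread:

```python
import numpy as np

def first_touch_label(prices, entry_idx, side, pt=10.0, sl=5.0):
    """Return +1 if the profit-taking barrier is touched first, -1 if the
    stop-loss barrier is touched first, 0 if neither is reached."""
    entry = prices[entry_idx]
    for price in prices[entry_idx + 1:]:
        favourable_move = side * (price - entry)  # signed move in the trade's favour
        if favourable_move >= pt:
            return 1
        if favourable_move <= -sl:
            return -1
    return 0

prices = np.array([100.0, 103.0, 99.0, 108.0, 111.0, 104.0])
label = first_touch_label(prices, entry_idx=0, side=1)  # long: +10 barrier hit first -> 1
meta_label = int(label == 1)  # 1 only if the signed return at the touch is positive
```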
I agree with Lionel, this is how I understand it. |
I don't know if everybody here likes derivatives, but the first idea about bet sizing I read was about using a binary option to model the size (and side). The idea is the same, bet size + side; there's nothing different from this. With an option you have the strike difference and the right (payoff and call/put) to calculate the side and bet size. It's not a new idea, it's a better-explained idea by Prado using ML models (and it really makes sense and works). |
@rspadim I would be very interested in reading more about this. Is there a book or a paper you would recommend that covers the topic? |
The idea is good, but the model isn't: you cannot predict the future with volatility and an underlying model based on Brownian motion (the basic ingredients for deriving an option formula); it simply doesn't work. The fundamental idea is old, Browne did it. I don't know if you want to lose time with it; anyway, look up Browne's bet-sizing idea with digital options. |
Great discussion. I might have misunderstood the code and comments from Prado - what you wrote @lionelyoung makes sense. It is an interesting topic and I am trying to apply a 2nd ML model on top of a 1st one instead of just using e.g. a simple MAvg. |
How I have interpreted and implemented meta labels is as follows. Meta labels are the PnL of the predictions of the first model. It is important to understand that this is the market's reaction to the primary model, hence it carries new information. Hence, in the following setup, model 2 reveals what model 1 was not able to learn from X. X_meta = X |
@Jackal08 did you like it? =) |
Here's my interpretation after reading the book. In meta-labeling, Y2 is a function of Y1. I keep in mind that machine learning is basically maximum likelihood estimation; in other words, it's a greedy optimization approach. Since Y2 is a function of Y1, with a greedy (quick and dirty) mindset, the machine learning model would optimize towards simply mimicking Y1's statistics in order to infer Y2. Therefore, training a model with objective P(Y2 | X, Y1) would be redundant in my opinion, as the model would simply rank Y1 highest in its feature importance. In other words, it's not going to learn new information; it's going to simply follow Y1. This happens all the time in machine learning. Yet again, it's always better to test it out before accepting my guess as reasonable.

The only concern here is that the Y2 distribution is going to be more imbalanced than Y1, since we're going to remove the false positives of the PM. But that's a good thing for us, because the secondary model is going to predict more negatives than positives: better safe than sorry in finance, right? Also, in my experience, we want to include as much data as possible when fitting models, for the same reason I stated above: better safe than sorry. So, I would not stack the two models.

However, stacking is definitely interesting to test out, because it allows for more aggressive investing. The reason is that stacking limits the secondary model's data to what lies within the domain of discourse of the primary model. The secondary model is going to fit aggressively, assuming that the primary model's domain of discourse produces alpha. If the primary model does produce significant alpha, the secondary model won't be risky. But we don't know for sure whether the primary model produces alpha in real life. So, again, better safe than sorry. |
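A quick way to test the "model would rank Y1 highest" claim above empirically: fit the secondary model with the primary model's in-sample prediction appended as the last feature column and inspect where it lands in the feature importances. The names X, y1, y2 are illustrative placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

model1 = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y1)
X_aug = np.column_stack([X, model1.predict(X)])  # last column is the in-sample Y1-hat

model2 = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_aug, y2)
order = np.argsort(model2.feature_importances_)[::-1]
print("rank of the Y1-hat column:", list(order).index(X_aug.shape[1] - 1))
```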
If one uses e.g. an RF model to learn the side (instead of a simple moving average model), then how is meta-labelling carried out? Because we train the 1st (primary) model (RF) on X_train, do we also want to predict on this same X_train to know the side in order to train a 2nd model? This strategy is employed by Jacques Joubert (@Jackal08) in his notebook on Quantopian on the MNIST data set (as well as on a validation and hold-out set). However, the 1st model's accuracy is then heavily inflated, which I think would propagate to the 2nd model.
Further, Prado suggests that one can change the horizontal barriers when meta-labelling, but I can't see why that makes sense at all, since the 1st model is trained on one particular configuration of barriers, and if we meta-label according to a different configuration, then the 1st model is not evaluated properly. Does anyone have any ideas as to why Prado suggests this?