-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A - Feat/machine learning #422
base: dev
Are you sure you want to change the base?
Conversation
Tanguylo
commented
Nov 8, 2022
- What kind of changes does this PR introduce?
- new features
- Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)
- No
- Other information:
- Implements sklearn objects for modeling
- RandomForest, MLP, Ridge, SVM
- Scalers
This PR needs some doc writing but code is ready for review :) |
dessia_common/datatools/modeling.py
Outdated
return scaler | ||
|
||
@classmethod | ||
def instantiate_dessia(cls, scaler, name: str = ''): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you wish to type scaler with the BaseScaler type, consider using the forward reference typing :
def method(cls, scaler : 'BaseScaler', name: str = ""):
This works perfectly fine and is supported by dessia_common. I don't know if this is what you intend to do, though.
Just a comment anyway, if the method is not meant to be ran on platform that's purely optional and for documentation purpose (for example, it would have help understand this bit of code faster :) )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunatly, the scaler here is the scikit-learn object from sklearn.preprocessing.StandardScaler
, which cannot be serialized.
Maybe I should write it ?
def method(cls, scaler : preprocessing.StandardScaler, name: str = ""):
And the same for models below ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you could. Is this function meant to be called from the platform ? (and thus, be opened in a form) In this case, preprocessing.StandardScaler cannot work as you need typings supported by dessia_common.
In the other case, its just for doc purpose and not mandatory at all. Actually sometimes I even advise against it, so I let you do whatever you prefer on this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
methods instantiate are not supposed to be used in platform, so the typing is for doc purpose only I guess.
I'm ok with setting types from sklearn, for the sake of readability and simplicity, so i'll do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some reflexion and a try it appears that the sklearn scaler expected here has not a predefined type. It can be StandardScaler, LinearScaler, 'HardName',...
So I decide not to set it, except if i can write this :
def instantiate_dessia(cls, scaler: preprocessing, name: str = '')
which i guess is not possible since preprocessing is a module...
Are you ok with it ?
dessia_common/datatools/modeling.py
Outdated
self.tree_state = tree_state | ||
BaseModel.__init__(self, name=name) | ||
|
||
def _data_hash(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Red light is turning on here ! data_hash is overwritten but not data_eq. Why is that ?
It might be perfectly all right, but I feel obliged to ask because that might be "dangerous" as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, i did not want to rewrite anything of these methods but i have encountered some performance problems when calling check_platform method for RandomForest.
RandomForest is a list of DecisionTree which contain BaseTree.
So, when running the check_platform, RandomForest could take some 45 seconds to be checked (n_estimators_ x DecisionTree, each containing a BaseTree). I think it is too long. So I made a profiling, which sent me to the data_hash method for BaseTree.
After some investigations, it seems like the data_hash method does not like so much dict of list of list, from the tree_state attribute.
So I just changed it dumbly without questioning anything else, counting on you to correct me if i'm wrong.
Do you suggest me to also rewrite the data_eq method for BaseTree ?
Why ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you got the right call ! That is exactly the point of it.
Simply put, the hash and eq function are a bit alike. Hash is a "quick and efficient" eq that does not guarantee equality, but does garantee non-equality.
So basically, what you are doing when you a == b
, is that you are evaluating the hash function of both and compare the result, which are integers. If they are different, it is sufficient to guarantee that a and b are different. On the other, if they are equal, it is necessary for a and b to be equal. In this case, you evaluate the __eq__
function which can be costly.
It becomes handy when you want to check if a in massive_sequence:
, and only compute the hashes of the elements without even caring running the __eq__
function on every sequence element. You can gain a massive amount of time.
Formerly, we overwrote the __hash__
and __eq__
function of python, but now we have data_hash
and data_eq
, which is basically the same. Actually, we do overwrite the eq/hash functions just to call data_eq
and data_hash
, if ever the eq_is_data_eq
class variable is True.
All of this means that the __eq__
and __hash__
functions must be consistent and that is what I am trying to figure out. This is "dangerous" because you could miss equal objects if ever the hashes computed for two "considered-equal" objects are different. For example, this is a really bad hash/eq combination :
def __init__(self, a, b):
self.a = a
self b = b
def __hash__(self):
return hash(self.a) + hash(self.b)
def __eq__(self, other):
return self.a == other.a
Two object will be equal or not regarding a but the hash uses b to compute. So we would miss the equality if b attributes give different hashes.
If hash and eq are inconsistent, at best you'll introduce performance issues, at worst you'll miss equalities
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if i have a._data_hash() == b._data_hash()
the next step is necessarily to check a.__eq__(b)
, isn't it ?
If yes, what i wrote is ok for your warning
But I could maybe improve the eq method to get it faster.
But i find it dangerous to do it for objects i did not build myself.
…_common into feat/machine_learning
…le error in kmeans
…be modified in another PR
…l to scaler in it
…nd write or improve docstrings for these
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really solid ! A pleasure to read
dessia_common/datatools/modeling.py
Outdated
self.tree_state = tree_state | ||
BaseModel.__init__(self, name=name) | ||
|
||
def _data_hash(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you got the right call ! That is exactly the point of it.
Simply put, the hash and eq function are a bit alike. Hash is a "quick and efficient" eq that does not guarantee equality, but does garantee non-equality.
So basically, what you are doing when you a == b
, is that you are evaluating the hash function of both and compare the result, which are integers. If they are different, it is sufficient to guarantee that a and b are different. On the other, if they are equal, it is necessary for a and b to be equal. In this case, you evaluate the __eq__
function which can be costly.
It becomes handy when you want to check if a in massive_sequence:
, and only compute the hashes of the elements without even caring running the __eq__
function on every sequence element. You can gain a massive amount of time.
Formerly, we overwrote the __hash__
and __eq__
function of python, but now we have data_hash
and data_eq
, which is basically the same. Actually, we do overwrite the eq/hash functions just to call data_eq
and data_hash
, if ever the eq_is_data_eq
class variable is True.
All of this means that the __eq__
and __hash__
functions must be consistent and that is what I am trying to figure out. This is "dangerous" because you could miss equal objects if ever the hashes computed for two "considered-equal" objects are different. For example, this is a really bad hash/eq combination :
def __init__(self, a, b):
self.a = a
self b = b
def __hash__(self):
return hash(self.a) + hash(self.b)
def __eq__(self, other):
return self.a == other.a
Two object will be equal or not regarding a but the hash uses b to compute. So we would miss the equality if b attributes give different hashes.
If hash and eq are inconsistent, at best you'll introduce performance issues, at worst you'll miss equalities
dessia_common/datatools/modeling.py
Outdated
@staticmethod | ||
def _getstate_dessia(model): | ||
state = model.__getstate__() | ||
dessia_state = {'max_depth': int(state['max_depth'])} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these functions are costly, consider affecting in one-go instead of defining and "appending" other key-values to dessia_state :
dessia_state = {"max_depth": int(state["max_depth"],
"node_count": int(state["node_count"],
"values": state['values'].tolist(),
....}
This way, you would already "take all the space on a disk" instead of having to change your object and eventually have to relocate the object in memory, which hurts performance on the large scale (same goes form list appending, it is even worse).
You could also write a DessiaState (maybe with another name, why not juste State ?) class that would instantiate itself from this state dict object.
This is True also for the next _setstate_dessia, and you could have an import/export behavior of some sort
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, now that I think about it, I am not that sure about this last comment about State being set as a class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is true for all dict. Should i do for every other dict (kwargs ones) in this module ?
About the State, i'm not sure it is useful since it is a requirement from sklearn CTree (the C version of the DecisionTree) and has just been a painful operation and moment to make it work with our constraints...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed all dicts and i keep state as it is now.
…gin with linearregression docstring
…ollowing review comment
…is covered at least 95%
Lowered coverage note because developments of this PR are widely enough covered and this PR should not be blocked by coverage problems of other scripts. |
tests/dataset.py
Outdated
@@ -30,8 +30,6 @@ def prop1(self): | |||
assert(bidon_hlist.common_attributes == ['attr1']) | |||
|
|||
# Tests on common_attributes | |||
|
|||
|
|||
class Bidon(DessiaObject): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bidon is a bit weird as an english name. Would "Dummy" be a better wording ?
In addition to that, is the class defined twice in the module ? its defined on lines 14 and 33
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed in #561 (dataset deep_attribute)
tests/datatools_modeler.py
Outdated
@@ -0,0 +1,150 @@ | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a lot of commented code in this file. Are these necessary ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The script file has been fully reviewed for better reading and UX => small refactor implied
dessia_common/datatools/dataset.py
Outdated
|
||
def train_test_split(self, ratio: float = 0.8, shuffled: bool = True) -> List[Matrix]: | ||
""" Generate train and test Datasets from current Dataset. """ | ||
ind_train, ind_test = models.get_split_indexes(len(self), ratio=ratio, shuffled=shuffled) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If ind
denotes index
, we should either call it i
or index
to avoid truncated variable names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
dessia_common/datatools/dataset.py
Outdated
def train_test_split(self, ratio: float = 0.8, shuffled: bool = True) -> List[Matrix]: | ||
""" Generate train and test Datasets from current Dataset. """ | ||
ind_train, ind_test = models.get_split_indexes(len(self), ratio=ratio, shuffled=shuffled) | ||
return Dataset(self[ind_train], name=self.name + '_train'), Dataset(self[ind_test], name=self.name + '_test') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Function return is typed as a List[Matrix], but function returns a tuple.
In addition to that, consider naming all arguments when a function/method/class needs more than one :
foo = OneArgClass(whatever)
bar = SeveralArgsClass(first_arg=something, second_arg=something_else, name="stuff")
This is incredibily useful when refactoring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
dessia_common/utils/helpers.py
Outdated
@@ -47,6 +48,17 @@ def prettyname(name: str) -> str: | |||
pretty_name += ' ' | |||
return pretty_name | |||
|
|||
def maximums(matrix: List[List[float]]) -> List[float]: | |||
""" Compute maximum values and store it in a list of length `len(matrix[0])`. """ | |||
if not isinstance(matrix[0], list): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually dangerous specific isinstance list check here. Is it the only exact type you want to handle or you also want to capture other iterables (Tuple, ...) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove input typing but did not move it into datatools
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finally moved in math.py in datatool which is the former metrics.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And put union type of vector and matrix, that could be objects with inheritance rules, if time was like in the magic room of dragon ball
|
||
def _data_hash(self): | ||
hash_ = npy.linalg.norm(self.tree_state['values'][0]) | ||
hash_ += sum(self.n_classes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the sum of the list an efficient-enough hash for your case ? Don't you have too much collision and dispersion issues ?
Ideas for improving if this is not the case :
- Multiply it by its length
- Weight some values by a prime number.
- Check the generic dessia sequence hash to see if it suits your needs (Uses first and last elements for hash generation, soon to be improved with controlled randomness)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, i don't really care. The purpose is to quickly eliminate non equal trees, because their == is long (because they are list of list of list).
Are you ok with my answer ?
|
||
@classmethod | ||
def _skl_class(cls): | ||
return tree._tree.Tree |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Insert wide thinking emoji
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not my stuff ;)
return self._skl_class()(self.n_features, npy.array(self.n_classes), self.n_outputs) | ||
|
||
@staticmethod | ||
def _getstate_dessia(model): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these functions be explicitly named after dessia ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed dessia
|
||
@classmethod | ||
def _check_criterion(cls, criterion: str): | ||
if 'egressor' not in cls.__name__ and criterion == 'squared_error': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"egressor" check looks unsafe (in the case of a refactor, for ex). Can't we use Subclass stuff ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
'parameters': parameters} | ||
|
||
@classmethod | ||
def _check_criterion(cls, criterion: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate bit of code. Could it be set in Model class to mutualize ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
…for training people and test scripts