
A - Feat/machine learning #422

Open · wants to merge 315 commits into base: dev

Conversation

@Tanguylo (Contributor) commented Nov 8, 2022

  • What kind of changes does this PR introduce?
      • New features
  • Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)
      • No
  • Other information:
      • Implements sklearn objects for modeling: RandomForest, MLP, Ridge, SVM
      • Scalers

@GhislainJ GhislainJ added the Status: In progress Dev team is currently working on this label Nov 10, 2022
@Tanguylo (Contributor Author):

This PR still needs some doc writing, but the code is ready for review :)

@Tanguylo Tanguylo marked this pull request as ready for review November 30, 2022 14:58
@Tanguylo Tanguylo added Status: Ready for review PR is ready to be reviewed. Should pass CI and removed Status: In progress Dev team is currently working on this labels Nov 30, 2022
dessia_common/datatools/modeling.py (outdated)
return scaler

@classmethod
def instantiate_dessia(cls, scaler, name: str = ''):
Collaborator:

If you wish to type scaler with the BaseScaler type, consider using a forward reference:
def method(cls, scaler: 'BaseScaler', name: str = ""):
This works perfectly fine and is supported by dessia_common. I don't know if this is what you intend to do, though.
Just a comment anyway: if the method is not meant to be run on the platform, the typing is purely optional and for documentation purposes (for example, it would have helped me understand this bit of code faster :) )
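A minimal, self-contained sketch of the forward-reference syntax mentioned above (the class body here is illustrative only, not the actual dessia_common code):

class BaseScaler:
    """Illustrative stand-in; the real class lives in dessia_common."""

    def __init__(self, name: str = ''):
        self.name = name

    @classmethod
    def instantiate_dessia(cls, scaler: 'BaseScaler', name: str = '') -> 'BaseScaler':
        # 'BaseScaler' is written as a string because the class is still being
        # defined when this signature is evaluated; the annotation is resolved later.
        return cls(name=name)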

Contributor Author:

Unfortunately, the scaler here is the scikit-learn object sklearn.preprocessing.StandardScaler, which cannot be serialized.
Maybe I should write it as:
def method(cls, scaler: preprocessing.StandardScaler, name: str = ""):

And the same for the models below?

Collaborator:

I guess you could. Is this function meant to be called from the platform (and thus opened in a form)? In that case, preprocessing.StandardScaler cannot work, as you need typings supported by dessia_common.
Otherwise, it's just for documentation purposes and not mandatory at all. Actually, sometimes I even advise against it, so I'll let you do whatever you prefer here.

Contributor Author:

The instantiate methods are not supposed to be used on the platform, so the typing is for documentation purposes only, I guess.

I'm OK with setting types from sklearn for the sake of readability and simplicity, so I'll do it.

Contributor Author (@Tanguylo, Dec 2, 2022):

After some reflection and a try, it appears that the sklearn scaler expected here has no predefined type. It can be StandardScaler, LinearScaler, 'HardName', ...

So I decided not to set it, unless I can write this:
def instantiate_dessia(cls, scaler: preprocessing, name: str = '')

which I guess is not possible, since preprocessing is a module...

Are you OK with that?
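If some typing is still wanted purely for documentation, one possible compromise is a Union of the concrete sklearn scaler classes actually used. A sketch only; the alias name and the list of scalers are assumptions, and this is not claimed to be a dessia_common-serializable type:

from typing import Union

from sklearn import preprocessing

# Hypothetical alias listing the scalers actually supported; documentation only.
SklearnScaler = Union[preprocessing.StandardScaler, preprocessing.MinMaxScaler]


class Scaler:
    """Illustrative stand-in for the dessia_common scaler wrapper."""

    @classmethod
    def instantiate_dessia(cls, scaler: SklearnScaler, name: str = ''):
        # The annotation documents the accepted sklearn scalers without
        # requiring a dessia_common-supported typing.
        return cls()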

dessia_common/datatools/modeling.py (outdated)
self.tree_state = tree_state
BaseModel.__init__(self, name=name)

def _data_hash(self):
Collaborator:

A red light is turning on here! data_hash is overridden but not data_eq. Why is that?

It might be perfectly all right, but I feel obliged to ask because it might also be "dangerous".

Contributor Author:

Actually, I did not want to rewrite any of these methods, but I encountered performance problems when calling the check_platform method for RandomForest.

RandomForest is a list of DecisionTree objects, each of which contains a BaseTree.
So, when running check_platform, RandomForest could take around 45 seconds to be checked (n_estimators_ x DecisionTree, each containing a BaseTree). I think that is too long. So I did some profiling, which pointed me to the data_hash method for BaseTree.

After some investigation, it seems the data_hash method does not cope well with the dict of lists of lists coming from the tree_state attribute.
So I just changed it naively without questioning anything else, counting on you to correct me if I'm wrong.

Do you suggest I also rewrite the data_eq method for BaseTree?
Why?

Collaborator:

I think you made the right call! That is exactly the point of it.

Simply put, the hash and eq functions are a bit alike. Hash is a "quick and efficient" eq that does not guarantee equality, but does guarantee non-equality.
So basically, when you do a == b, you first evaluate the hash function of both objects and compare the results, which are integers. If they are different, that is sufficient to guarantee that a and b are different. On the other hand, if they are equal, that is only a necessary (not sufficient) condition for a and b to be equal. In that case, you evaluate the __eq__ function, which can be costly.

It becomes handy when you want to check if a in massive_sequence: you only compute the hashes of the elements, without even running the __eq__ function on every sequence element. You can gain a massive amount of time.

Formerly, we overrode Python's __hash__ and __eq__ functions, but now we have data_hash and data_eq, which are basically the same. Actually, we still override the eq/hash functions just to call data_eq and data_hash, if the eq_is_data_eq class variable is True.

All of this means that the __eq__ and __hash__ functions must be consistent, and that is what I am trying to figure out. This is "dangerous" because you could miss equal objects if the hashes computed for two "considered-equal" objects are different. For example, this is a really bad hash/eq combination:

class BadExample:
    def __init__(self, a, b):
        self.a = a
        self.b = b

    def __hash__(self):
        return hash(self.a) + hash(self.b)

    def __eq__(self, other):
        return self.a == other.a

Two objects will be equal or not depending on a, but the hash uses b as well. So we would miss the equality if the b attributes give different hashes.

If hash and eq are inconsistent, at best you'll introduce performance issues, at worst you'll miss equalities.
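For contrast, a consistent pair could look like this toy sketch (not dessia_common code): the hash is built only from the attributes that also drive equality, so two equal objects are guaranteed to share the same hash.

class ConsistentExample:
    def __init__(self, a, b):
        self.a = a
        self.b = b

    def __hash__(self):
        # Hash built only from the attributes compared in __eq__.
        return hash((self.a, self.b))

    def __eq__(self, other):
        return self.a == other.a and self.b == other.b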

Contributor Author (@Tanguylo, Dec 2, 2022):

So if I have a._data_hash() == b._data_hash(), the next step is necessarily to check a.__eq__(b), isn't it?

If yes, what I wrote should be fine with respect to your warning.

I could maybe improve the eq method to make it faster, but I find it dangerous to do that for objects I did not build myself.

Collaborator @GhislainJ left a comment:

Really solid! A pleasure to read.

    @staticmethod
    def _getstate_dessia(model):
        state = model.__getstate__()
        dessia_state = {'max_depth': int(state['max_depth'])}
Collaborator:

If these functions are costly, consider building the dict in one go instead of defining dessia_state and then "appending" other key-value pairs to it:

dessia_state = {"max_depth": int(state["max_depth"]),
                "node_count": int(state["node_count"]),
                "values": state["values"].tolist(),
                ...}

This way, you "take all the space on disk" at once instead of having to grow the object and eventually relocate it in memory, which hurts performance at a large scale (the same goes for list appending, which is even worse).

You could also write a DessiaState class (maybe with another name, why not just State?) that would instantiate itself from this state dict object.

This also holds for the next _setstate_dessia, and you could have some kind of import/export behavior.

Collaborator:

Well, now that I think about it, I am not so sure about that last suggestion of making State a class.

Contributor Author:

This is true for all dicts. Should I do it for every other dict (the kwargs ones) in this module?

About the State, I'm not sure it is useful, since it is a requirement of sklearn's CTree (the C version of the DecisionTree), and it was already a painful job to make it work with our constraints...

Contributor Author:

I've changed all the dicts and I'm keeping the state as it is.

@Tanguylo (Contributor Author) commented Mar 1, 2023:

Lowered the required coverage score, because the developments in this PR are sufficiently covered and this PR should not be blocked by coverage problems in other scripts.

tests/dataset.py Outdated
@@ -30,8 +30,6 @@ def prop1(self):
assert(bidon_hlist.common_attributes == ['attr1'])

# Tests on common_attributes


class Bidon(DessiaObject):
Collaborator:

Bidon is a bit weird as an English name. Would "Dummy" be better wording?

In addition to that, is the class defined twice in the module? It's defined on lines 14 and 33.

Contributor Author:

Changed in #561 (dataset deep_attribute)

@@ -0,0 +1,150 @@
"""
Collaborator:

There is a lot of commented-out code in this file. Is it necessary?

Contributor Author:

The script file has been fully reworked for better readability and UX, which implied a small refactor.


def train_test_split(self, ratio: float = 0.8, shuffled: bool = True) -> List[Matrix]:
    """ Generate train and test Datasets from current Dataset. """
    ind_train, ind_test = models.get_split_indexes(len(self), ratio=ratio, shuffled=shuffled)
Collaborator:

If ind denotes index, we should call it either i or index to avoid truncated variable names.

Contributor Author:

Done

def train_test_split(self, ratio: float = 0.8, shuffled: bool = True) -> List[Matrix]:
    """ Generate train and test Datasets from current Dataset. """
    ind_train, ind_test = models.get_split_indexes(len(self), ratio=ratio, shuffled=shuffled)
    return Dataset(self[ind_train], name=self.name + '_train'), Dataset(self[ind_test], name=self.name + '_test')
Collaborator:

The return is typed as List[Matrix], but the function returns a tuple.

In addition to that, consider naming all arguments when a function/method/class takes more than one:

foo = OneArgClass(whatever)

bar = SeveralArgsClass(first_arg=something, second_arg=something_else, name="stuff")

This is incredibly useful when refactoring.
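Applied to the snippet above, both remarks could translate into something like this sketch, reusing the names from the diff; the Tuple return type and the explicit keyword arguments are the suggestion, not necessarily the final code:

from typing import Tuple

def train_test_split(self, ratio: float = 0.8, shuffled: bool = True) -> Tuple['Dataset', 'Dataset']:
    """ Generate train and test Datasets from current Dataset. """
    index_train, index_test = models.get_split_indexes(len(self), ratio=ratio, shuffled=shuffled)
    # Keyword arguments make the call sites robust to signature changes.
    train = Dataset(self[index_train], name=self.name + '_train')
    test = Dataset(self[index_test], name=self.name + '_test')
    return train, test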

Contributor Author:

Done

@@ -47,6 +48,17 @@ def prettyname(name: str) -> str:
pretty_name += ' '
return pretty_name

def maximums(matrix: List[List[float]]) -> List[float]:
    """ Compute maximum values and store it in a list of length `len(matrix[0])`. """
    if not isinstance(matrix[0], list):
Collaborator:

Potentially dangerous isinstance check against list here. Is it the only exact type you want to handle, or do you also want to capture other iterables (tuple, ...)?
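One way to capture other iterables would be to check against an abstract sequence type instead of the concrete list. A sketch only; the one-dimensional branch is a guess at the original intent:

from collections.abc import Sequence
from typing import List

def maximums(matrix: List[List[float]]) -> List[float]:
    """ Compute column-wise maximum values of a matrix. """
    if not isinstance(matrix[0], Sequence) or isinstance(matrix[0], str):
        # Flat vector: treat it as a single row, each value is its own column maximum.
        return list(matrix)
    # Accept any sequence of sequences (list, tuple, ...), not only list of list.
    return [max(column) for column in zip(*matrix)]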

Contributor Author:

Removed the input typing, but did not move it into datatools.

Contributor Author:

Finally moved it into math.py in datatools, which is the former metrics.py.

Contributor Author (@Tanguylo, May 31, 2023):

And used a union type of vector and matrix, which could become objects with inheritance rules if time were as plentiful as in the Hyperbolic Time Chamber from Dragon Ball.


def _data_hash(self):
    hash_ = npy.linalg.norm(self.tree_state['values'][0])
    hash_ += sum(self.n_classes)
Collaborator:

Is the sum of the list an efficient-enough hash for your case? Don't you get too many collision and dispersion issues?

Ideas for improvement, if it is not:

  • Multiply it by its length.
  • Weight some values by a prime number (see the sketch below).
  • Check the generic dessia sequence hash to see if it suits your needs (it uses the first and last elements for hash generation, soon to be improved with controlled randomness).
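As an illustration of the length and prime-weighting ideas (a sketch only; the constants are arbitrary and the method is written as it would appear inside the class, with the attributes visible in the diff above):

import numpy as npy

def _data_hash(self) -> int:
    # Weight the terms with small primes and the list length so that
    # compensating values are less likely to collide.
    hash_ = int(npy.linalg.norm(self.tree_state['values'][0]))
    hash_ += 31 * sum(self.n_classes) + 17 * len(self.n_classes)
    return hash_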

Contributor Author (@Tanguylo, Jun 2, 2023):

Actually, I don't really care. The purpose is to quickly eliminate non-equal trees, because their == is slow (they are lists of lists of lists).

Are you OK with my answer?


@classmethod
def _skl_class(cls):
    return tree._tree.Tree
Collaborator:

Insert wide thinking emoji

Contributor Author:

This is not my stuff ;)

return self._skl_class()(self.n_features, npy.array(self.n_classes), self.n_outputs)

@staticmethod
def _getstate_dessia(model):
Collaborator:

Should these functions be explicitly named after dessia?

Contributor Author:

Removed dessia


@classmethod
def _check_criterion(cls, criterion: str):
    if 'egressor' not in cls.__name__ and criterion == 'squared_error':
Collaborator:

"egressor" check looks unsafe (in the case of a refactor, for ex). Can't we use Subclass stuff ?

Contributor Author:

Done

'parameters': parameters}

@classmethod
def _check_criterion(cls, criterion: str):
Collaborator:

Duplicated bit of code. Could it be moved to the Model class so it is shared?

Contributor Author:

Done

@GhislainJ GhislainJ added Status: Commented Draw attention to comments written by the dev team and removed Status: Stand by Issue or PR is not evolving labels Mar 6, 2023
@Tanguylo Tanguylo added the Status: To be validated Issue or PR should be validated by dev team label Jun 5, 2023