A - Feat/machine learning #422

Tanguylo · 2022-11-08T14:59:19Z

What kind of changes does this PR introduce?

new features

Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)

No

Other information:
Implements sklearn objects for modeling
RandomForest, MLP, Ridge, SVM
Scalers

Tanguylo · 2022-11-30T14:58:34Z

This PR needs some doc writing but code is ready for review :)

dessia_common/datatools/modeling.py

GhislainJ · 2022-12-02T11:10:10Z

dessia_common/datatools/modeling.py

+        return scaler
+
+    @classmethod
+    def instantiate_dessia(cls, scaler, name: str = ''):


If you wish to type scaler with the BaseScaler type, consider using a forward reference:
def method(cls, scaler: 'BaseScaler', name: str = ""):
This works perfectly fine and is supported by dessia_common. I don't know if this is what you intend to do, though.
Just a comment anyway: if the method is not meant to be run on the platform, this is purely optional and for documentation purposes (for example, it would have helped me understand this bit of code faster :) )
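For readers unfamiliar with forward references, here is a minimal sketch. The `BaseScaler` below is a simplified stand-in, and `from_other` is a hypothetical method name used only for illustration; neither reproduces the code under review.

```python
class BaseScaler:
    """Simplified stand-in for the scaler class under review (hypothetical)."""

    def __init__(self, name: str = ""):
        self.name = name

    @classmethod
    def from_other(cls, scaler: "BaseScaler", name: str = "") -> "BaseScaler":
        # "BaseScaler" is quoted because the class is not fully defined yet
        # when this annotation is evaluated.
        return cls(name=name or scaler.name)


# The annotation is stored as a plain string until a tool resolves it.
stored_annotation = BaseScaler.from_other.__annotations__["scaler"]
copy = BaseScaler.from_other(BaseScaler(name="standard"))
```

The quoted name lets the annotation document the expected type without requiring the class to exist at definition time.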

Unfortunately, the scaler here is the scikit-learn object from sklearn.preprocessing.StandardScaler, which cannot be serialized.
Maybe I should write it like this?
def method(cls, scaler: preprocessing.StandardScaler, name: str = ""):

And the same for the models below?

I guess you could. Is this function meant to be called from the platform (and thus be opened in a form)? In that case, preprocessing.StandardScaler cannot work, as you need typings supported by dessia_common.
Otherwise, it's just for documentation purposes and not mandatory at all. Actually, sometimes I even advise against it, so I'll let you do whatever you prefer here.

The instantiate methods are not supposed to be used on the platform, so the typing is for documentation purposes only, I guess.

I'm OK with setting types from sklearn, for the sake of readability and simplicity, so I'll do it.

After some thought and a try, it appears that the sklearn scaler expected here does not have a single predefined type. It can be StandardScaler, LinearScaler, 'HardName',...

So I decided not to set it, unless I can write this:
def instantiate_dessia(cls, scaler: preprocessing, name: str = '')

which I guess is not possible since preprocessing is a module...

Are you OK with that?
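One option that was not discussed in this thread is a structural type: a module cannot be used as an annotation, but a `typing.Protocol` can describe "any object with fit and transform". The sketch below is an assumption-laden illustration: `ScalerLike`, `FakeStandardScaler` and `describe` are hypothetical names, and the fake scaler only exists so the example runs without scikit-learn.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ScalerLike(Protocol):
    """Hypothetical structural type: anything exposing fit and transform."""

    def fit(self, X: Any, y: Any = None) -> Any: ...

    def transform(self, X: Any) -> Any: ...


class FakeStandardScaler:
    """Stand-in for an sklearn scaler (centering only, no scaling)."""

    def fit(self, X, y=None):
        self.mean_ = [sum(column) / len(column) for column in zip(*X)]
        return self

    def transform(self, X):
        return [[value - mean for value, mean in zip(row, self.mean_)] for row in X]


def describe(scaler: ScalerLike) -> str:
    # The annotation documents the expected interface, not a concrete class.
    return type(scaler).__name__


matrix = [[1.0, 2.0], [3.0, 4.0]]
scaler = FakeStandardScaler().fit(matrix)
centered = scaler.transform(matrix)
```

Since scikit-learn scalers also share base classes such as `sklearn.base.TransformerMixin`, annotating with one of those would be another way to document the argument.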

dessia_common/datatools/modeling.py

GhislainJ · 2022-12-02T11:15:40Z

dessia_common/datatools/modeling.py

+        self.tree_state = tree_state
+        BaseModel.__init__(self, name=name)
+
+    def _data_hash(self):


A red light is turning on here! data_hash is overridden but not data_eq. Why is that?

It might be perfectly all right, but I feel obliged to ask because it might be "dangerous" as well.

Actually, I did not want to rewrite any of these methods, but I ran into performance problems when calling the check_platform method on RandomForest.

RandomForest is a list of DecisionTree, each of which contains a BaseTree.
So, when running check_platform, a RandomForest could take some 45 seconds to be checked (n_estimators_ x DecisionTree, each containing a BaseTree). I think that is too long. So I profiled it, which pointed me to the data_hash method of BaseTree.

After some investigation, it seems that the data_hash method does not cope well with the dict of lists of lists held in the tree_state attribute.
So I just changed it naively without questioning anything else, counting on you to correct me if I'm wrong.

Do you suggest that I also rewrite the data_eq method for BaseTree?
Why?

I think you made the right call! That is exactly the point of it.

Simply put, the hash and eq functions are a bit alike. Hash is a "quick and efficient" eq that does not guarantee equality, but does guarantee non-equality.
So basically, what happens when you evaluate a == b is that the hash function of both is evaluated and the results, which are integers, are compared. If they are different, that is sufficient to guarantee that a and b are different. On the other hand, if they are equal, that is necessary but not sufficient for a and b to be equal. In this case, the __eq__ function, which can be costly, is evaluated.

It becomes handy when you want to check if a in massive_sequence:, since only the hashes of the elements are computed, without even running the __eq__ function on every sequence element. You can save a massive amount of time.

Formerly, we overrode the __hash__ and __eq__ functions of Python, but now we have data_hash and data_eq, which are basically the same. Actually, we still override the eq/hash functions, just to call data_eq and data_hash whenever the eq_is_data_eq class variable is True.

All of this means that the __eq__ and __hash__ functions must be consistent, and that is what I am trying to figure out. This is "dangerous" because you could miss equal objects if the hashes computed for two "considered-equal" objects are different. For example, this is a really bad hash/eq combination:

    class BadExample:
        def __init__(self, a, b):
            self.a = a
            self.b = b

        def __hash__(self):
            return hash(self.a) + hash(self.b)

        def __eq__(self, other):
            return self.a == other.a

Two objects are equal or not depending only on a, but the hash also uses b. So we would miss the equality whenever the b attributes give different hashes.

If hash and eq are inconsistent, at best you'll introduce performance issues, at worst you'll miss equalities.
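The missed equality described above can be reproduced in plain Python, where sets and dicts rely on the same hash/eq contract. This is only a sketch: the class names `Inconsistent` and `Consistent` are made up for the illustration, and integers are used for `b` so the hashes are deterministic.

```python
class Inconsistent:
    """Bad combination: equality ignores b, but the hash depends on it."""

    def __init__(self, a, b):
        self.a = a
        self.b = b

    def __hash__(self):
        return hash(self.a) + hash(self.b)

    def __eq__(self, other):
        return self.a == other.a


class Consistent(Inconsistent):
    """Same equality rule, but the hash only uses what __eq__ compares."""

    def __hash__(self):
        return hash(self.a)


# Both pairs are equal according to __eq__ ...
equal_by_eq = Inconsistent(1, 2) == Inconsistent(1, 3)
# ... but the set lookup compares hashes first and misses the bad one.
bad_found = Inconsistent(1, 2) in {Inconsistent(1, 3)}
good_found = Consistent(1, 2) in {Consistent(1, 3)}
```

Here `equal_by_eq` is True while `bad_found` is False, which is exactly the kind of silent miss the comment warns about; the consistent variant is found.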

So if I have a._data_hash() == b._data_hash(), the next step is necessarily to check a.__eq__(b), isn't it?

If so, what I wrote is fine with regard to your warning.

I could maybe improve the eq method to make it faster, but I find it dangerous to do that for objects I did not build myself.

dessia_common/datatools/modeling.py


GhislainJ

Really solid ! A pleasure to read

dessia_common/datatools/modeling.py

GhislainJ · 2022-12-02T14:41:07Z

dessia_common/datatools/modeling.py

+        self.tree_state = tree_state
+        BaseModel.__init__(self, name=name)
+
+    def _data_hash(self):



GhislainJ · 2022-12-02T14:54:35Z

dessia_common/datatools/modeling.py

+    @staticmethod
+    def _getstate_dessia(model):
+        state = model.__getstate__()
+        dessia_state = {'max_depth': int(state['max_depth'])}


If these functions are costly, consider assigning everything in one go instead of defining dessia_state and then adding other key-value pairs to it:

dessia_state = {"max_depth": int(state["max_depth"]), "node_count": int(state["node_count"]), "values": state['values'].tolist(), ....}

This way, the dict is built at its final size right away, instead of being grown step by step and possibly relocated in memory, which hurts performance at large scale (the same goes for list appending, where it is even worse).

You could also write a DessiaState class (maybe with another name, why not just State?) that would instantiate itself from this state dict object.

This is also true for the next _setstate_dessia, and you could have an import/export behavior of some sort.

Well, now that I think about it, I am not that sure about this last comment about State being set as a class.

This is true for all dicts. Should I do the same for every other dict (the kwargs ones) in this module?

About the State class, I'm not sure it is useful: the state is a requirement of the sklearn CTree (the C version of the DecisionTree), and making it work with our constraints has been a painful enough operation...

I've changed all the dicts and I'm keeping state as it is for now.
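The "one go" construction discussed in this thread could look like the sketch below. `tree_state_to_dict` is a hypothetical helper name, only three keys from the suggestion are shown, and plain tuples stand in for the numpy arrays of the real tree state so the example runs without numpy.

```python
from typing import Any, Dict


def tree_state_to_dict(state: Dict[str, Any]) -> Dict[str, Any]:
    """Build the serializable state in a single dict literal."""
    return {
        "max_depth": int(state["max_depth"]),
        "node_count": int(state["node_count"]),
        # The code under review calls .tolist() on a numpy array here;
        # nested tuples are converted to lists in this sketch instead.
        "values": [list(row) for row in state["values"]],
    }


raw_state = {"max_depth": 3.0, "node_count": 5.0, "values": ((0.1, 0.9), (0.7, 0.3))}
dessia_state = tree_state_to_dict(raw_state)
```

Besides any performance effect, a single literal keeps every exported key visible in one place, which makes the matching setstate easier to check.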

dessia_common/datatools/modeling.py


Tanguylo · 2023-03-01T16:24:56Z

Lowered the required coverage score because the developments of this PR are covered widely enough, and this PR should not be blocked by coverage problems in other scripts.

GhislainJ · 2023-02-15T11:14:47Z

tests/dataset.py

@@ -30,8 +30,6 @@ def prop1(self):
 assert(bidon_hlist.common_attributes == ['attr1'])

 # Tests on common_attributes
-
-
 class Bidon(DessiaObject):


Bidon is a bit weird as an English name. Would "Dummy" be better wording?

In addition to that, is the class defined twice in the module? It's defined on lines 14 and 33.

Changed in #561 (dataset deep_attribute)

GhislainJ · 2023-02-15T11:17:25Z

tests/datatools_modeler.py

@@ -0,0 +1,150 @@
+"""


There is a lot of commented-out code in this file. Is it necessary?

The script file has been fully reworked for better readability and UX, which implied a small refactor.

GhislainJ · 2023-03-06T10:40:55Z

dessia_common/datatools/dataset.py

+
+    def train_test_split(self, ratio: float = 0.8, shuffled: bool = True) -> List[Matrix]:
+        """ Generate train and test Datasets from current Dataset. """
+        ind_train, ind_test = models.get_split_indexes(len(self), ratio=ratio, shuffled=shuffled)


If ind denotes index, we should either call it i or index to avoid truncated variable names

GhislainJ · 2023-03-06T10:41:23Z

dessia_common/datatools/dataset.py

+    def train_test_split(self, ratio: float = 0.8, shuffled: bool = True) -> List[Matrix]:
+        """ Generate train and test Datasets from current Dataset. """
+        ind_train, ind_test = models.get_split_indexes(len(self), ratio=ratio, shuffled=shuffled)
+        return Dataset(self[ind_train], name=self.name + '_train'), Dataset(self[ind_test], name=self.name + '_test')


The function's return is typed as List[Matrix], but the function returns a tuple.

In addition to that, consider naming all arguments when a function/method/class takes more than one:

foo = OneArgClass(whatever)
bar = SeveralArgsClass(first_arg=something, second_arg=something_else, name="stuff")

This is incredibly useful when refactoring.
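Both remarks can be illustrated with a small stand-alone sketch. It works on plain lists of rows rather than on a Dataset, and `get_split_indexes` below is a hypothetical stand-in whose parameter names (`len_matrix`, `seed`) are assumptions, not the signature of the real helper.

```python
import random
from typing import List, Tuple

Rows = List[List[float]]


def get_split_indexes(len_matrix: int, ratio: float = 0.8, shuffled: bool = True,
                      seed: int = 0) -> Tuple[List[int], List[int]]:
    """Hypothetical stand-in for the helper called in the diff."""
    indexes = list(range(len_matrix))
    if shuffled:
        random.Random(seed).shuffle(indexes)
    cut = int(ratio * len_matrix)
    return indexes[:cut], indexes[cut:]


def train_test_split(rows: Rows, ratio: float = 0.8, shuffled: bool = True) -> Tuple[Rows, Rows]:
    """Return (train, test): two values are returned, hence the Tuple annotation."""
    # Every argument is named, so reordering parameters later stays safe.
    index_train, index_test = get_split_indexes(len_matrix=len(rows), ratio=ratio, shuffled=shuffled)
    return [rows[i] for i in index_train], [rows[i] for i in index_test]


rows = [[float(i)] for i in range(10)]
train, test = train_test_split(rows=rows, ratio=0.8, shuffled=False)
```

With `shuffled=False`, the first eight rows go to the training part and the last two to the test part.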

GhislainJ · 2023-03-06T10:48:33Z

dessia_common/utils/helpers.py

@@ -47,6 +48,17 @@ def prettyname(name: str) -> str:
                pretty_name += ' '
    return pretty_name

+def maximums(matrix: List[List[float]]) -> List[float]:
+    """ Compute maximum values and store it in a list of length `len(matrix[0])`. """
+    if not isinstance(matrix[0], list):


Potentially dangerous exact isinstance list check here. Is list the only type you want to handle, or do you also want to accept other iterables (Tuple, ...)?

Removed the input typing but did not move it into datatools.

Finally moved it to math.py in datatools, which is the former metrics.py.

And typed the input as a union of vector and matrix, which could be objects with inheritance rules if time flowed like in the magic room of Dragon Ball.
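A looser check along the lines of the review comment might look as follows. This is a sketch under assumptions: the body of the real `maximums` is not shown in the diff, so the handling of a flat vector (returning its single maximum) is a guess, and the `Vector` and `Matrix` aliases are illustrative names.

```python
from typing import List, Sequence, Union

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def maximums(matrix: Union[Vector, Matrix]) -> List[float]:
    """Column-wise maximums of a matrix; a flat vector yields one maximum."""
    # Sequence matches lists and tuples alike, unlike an exact list check.
    # Caveat: strings are sequences too, so this assumes numeric input.
    if not isinstance(matrix[0], Sequence):
        return [max(matrix)]
    return [max(column) for column in zip(*matrix)]


from_lists = maximums([[1, 5], [3, 2]])
from_tuples = maximums(((1, 5), (3, 2)))
from_vector = maximums([4, 9, 2])
```

The tuple input, which an `isinstance(matrix[0], list)` test would have treated as a flat vector, now gives the same result as the list input.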

GhislainJ · 2023-03-06T11:57:59Z

dessia_common/datatools/learning_models.py

+
+    def _data_hash(self):
+        hash_ = npy.linalg.norm(self.tree_state['values'][0])
+        hash_ += sum(self.n_classes)


Is the sum of the list an efficient enough hash for your case? Don't you have too many collision and dispersion issues?

Ideas for improving it if this is not the case:

Multiply it by its length.

Weight some values by a prime number.

Check the generic dessia sequence hash to see if it suits your needs (it uses the first and last elements for hash generation, soon to be improved with controlled randomness).

Actually, I don't really mind. The purpose is to quickly eliminate non-equal trees, because their == is slow (they are lists of lists of lists).

Are you OK with my answer?
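The trade-off in this exchange can be sketched with a toy class. It assumes, as described earlier in the thread, an equality that checks a cheap hash first and only then runs the deep comparison; `Tree`, its hash formula and the `deep_comparisons` counter are simplified illustrations, not the BaseTree implementation.

```python
class Tree:
    """Toy stand-in for BaseTree: cheap hash, expensive deep equality."""

    deep_comparisons = 0  # counts how often the costly comparison runs

    def __init__(self, n_classes, values):
        self.n_classes = n_classes
        self.values = values  # nested lists, slow to compare in full

    def _data_hash(self) -> int:
        # Deliberately coarse: collisions only cost one deep comparison.
        return int(sum(self.n_classes) + sum(self.values[0]))

    def _data_eq(self, other) -> bool:
        Tree.deep_comparisons += 1
        return self.n_classes == other.n_classes and self.values == other.values

    def __hash__(self):
        return self._data_hash()

    def __eq__(self, other):
        # Different hashes prove the trees differ, so the deep check is skipped.
        if self._data_hash() != other._data_hash():
            return False
        return self._data_eq(other)


forest = [Tree([2], [[float(i), 1.0], [0.5, 0.5]]) for i in range(100)]
reference = Tree([2], [[3.0, 1.0], [0.5, 0.5]])
matches = [tree for tree in forest if tree == reference]
```

In this toy data every tree has a distinct hash, so the hundred comparisons trigger a single deep comparison; a weaker hash would still be correct, only slower.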

GhislainJ · 2023-03-06T11:58:44Z

dessia_common/datatools/learning_models.py

+
+    @classmethod
+    def _skl_class(cls):
+        return tree._tree.Tree


Insert wide thinking emoji

This is not my stuff ;)

GhislainJ · 2023-03-06T12:00:07Z

dessia_common/datatools/learning_models.py

+        return self._skl_class()(self.n_features, npy.array(self.n_classes), self.n_outputs)
+
+    @staticmethod
+    def _getstate_dessia(model):


Should these functions be explicitly named after dessia ?

Removed dessia

GhislainJ · 2023-03-06T12:02:17Z

dessia_common/datatools/learning_models.py

+
+    @classmethod
+    def _check_criterion(cls, criterion: str):
+        if 'egressor' not in cls.__name__ and criterion == 'squared_error':


The "egressor" check looks unsafe (in case of a refactor, for example). Can't we use a subclass check instead?

GhislainJ · 2023-03-06T12:06:28Z

dessia_common/datatools/learning_models.py

+                'parameters': parameters}
+
+    @classmethod
+    def _check_criterion(cls, criterion: str):


Duplicated bit of code. Could it be defined once in the Model class and shared?
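Both remarks on _check_criterion (the name-based test and the duplication) could be addressed with marker base classes and a single shared classmethod. The sketch below is hypothetical throughout: the class names are invented, and the fallback behavior (switching to 'gini' or 'squared_error') is an assumption, since the body of the real method is not visible in the diff.

```python
class Regressor:
    """Hypothetical marker base class for regression models."""


class Classifier:
    """Hypothetical marker base class for classification models."""


class TreeModel:
    """Hypothetical shared base: the check is written once for all models."""

    @classmethod
    def _check_criterion(cls, criterion: str) -> str:
        # issubclass survives class renames, unlike a substring test on
        # cls.__name__. The fallback values are assumed, not from the PR.
        if issubclass(cls, Classifier) and criterion == "squared_error":
            return "gini"
        if issubclass(cls, Regressor) and criterion == "gini":
            return "squared_error"
        return criterion


class TreeRegressor(TreeModel, Regressor):
    pass


class TreeClassifier(TreeModel, Classifier):
    pass


classifier_criterion = TreeClassifier._check_criterion("squared_error")
regressor_criterion = TreeRegressor._check_criterion("squared_error")
```

Renaming `TreeRegressor` to something without "egressor" in it would not change the outcome here, which is the point of the subclass test.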


GhislainJ added the Status: In progress Dev team is currently working on this label Nov 10, 2022

Tanguylo marked this pull request as ready for review November 30, 2022 14:58

Tanguylo added Status: Ready for review PR is ready to be reviewed. Should pass CI and removed Status: In progress Dev team is currently working on this labels Nov 30, 2022

Tanguylo and others added 6 commits November 30, 2022 16:07

feat(machine_learning): set none default values in objects

5361874

feat(machine_learning): code cosmetics

a4984d0

feat(machine_learning): better code organisation in test file

7a07a08

Merge remote-tracking branch 'origin/dev' into feat/machine_learning

fbba1d0

feat(machine_learning): changelog and datatools.rst

06a51b2

Merge branch 'dev' into chore/cluster_coverage

c016c4e

GhislainJ reviewed Dec 2, 2022


Tanguylo and others added 12 commits December 2, 2022 12:24

feat(machine_learning): kwargs_dict becomes kwargs

ac4effa

Merge branch 'dev' into feat/machine_learning

e524c92

Merge branch 'feat/machine_learning' of github.com:Dessia-tech/dessia_common into feat/machine_learning

c39f883

feat(machine_learning): remove BaseTree fit and fit_predict method which are useless

80d58c4

feat(machine_learning): docstring in basescaler and test for pydocstyle error in kmeans

b166b1a

feat(machine_learning): remove blank line in cluster, does this push an error

a0a306c

feat(machine_learning): new test for blank line period

474676a

feat(machine_learning): went back to initial cluster file which will be modified in another PR

8552314

feat(machine_learning): docstrings for basescaler

44ea215

feat(machine_learning): finish with docstring in scalers

02f2611

feat(machine_learning): docstrings for labelbinarizer and change model to scaler in it

ff2a80c

feat(machine_learning): add output typings on scalers and basemodel and write or improve docstrings for these

0964ee5

GhislainJ reviewed Dec 2, 2022


Tanguylo added 5 commits December 2, 2022 17:49

feat(machine_learning): remove hasattr(scaler, attr) condition and begin with linearregression docstring

5be6bf5

feat(machine_learning): modify code according to some review comments

ba526ac

feat(machine_learning): change the way kwargs are built

424f067

feat(machine_learning): change the way to instantiate dessia scaler following review comment

a34ec92

feat(machine_learning): ridge docstrings

3f7caa0

feat(ML): this PR shall not be blocked for coverage reasons since it is covered at least 95%

3c99d83

GhislainJ reviewed Mar 6, 2023


GhislainJ added Status: Commented Draw attention to comments written by the dev team and removed Status: Stand by Issue or PR is not evolving labels Mar 6, 2023

Tanguylo added 19 commits March 29, 2023 14:56

Merge remote-tracking branch 'origin/dev' into feat/machine_learning

b3dd9f9

feat(ML): solve pylint conflict

482329e

feat(ML): first changes after review

872f96d

Merge remote-tracking branch 'origin/dev' into feat/machine_learning

b3d313a

feat(ML): change file organization

13bafd9

feat(ML): add files

3af4a02

feat(ML): fix bug in str

8398008

feat(ML): new tests for str in clustering.py

a06c1d4

feat(ML): quite large refactor for better uses and write required didactic tests

9f6326c

feat(ML): full refactor of modeler for a better UX and write scripts for training people and test scripts

6ab0e2e

feat(ML): remove types in docstring when redundant with arguments types

4140a27

feat(ML): last review comment, but the structure of input, output, preds

2724b11

Merge remote-tracking branch 'origin/dev' into feat/machine_learning

042d97e

feat(ML): fix pylint errors

330f2c0

feat(ML): add object to handle input output prediction matrices in one class

8eae28a

feat(ML): uncomment ci_scripts code

2e3a4d3

feat(ML): drone stuff

c994e61

feat(ML): changelog

049eeb6

feat(ML): pylint and docstyle

19d4e54

Tanguylo added the Status: To be validated Issue or PR should be validated by dev team label Jun 5, 2023

Tanguylo added 5 commits June 5, 2023 13:20

feat(ML): increase coverage note

d326173

feat(ML): change assertion rules for scripts in learning models

efaaf0d

feat(ML): set allowed methods (partial)

2d8d512

feat(ML): set more allowed methods

d63f0ac

feat(ML): remove types in docstring and increase protected_access note

cd233cc
