Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug(datasets): remove unsafe PyTorch load and save #850

Open
deepyaman opened this issue Sep 24, 2024 · 8 comments
Open

bug(datasets): remove unsafe PyTorch load and save #850

deepyaman opened this issue Sep 24, 2024 · 8 comments

Comments

@deepyaman
Copy link
Member

deepyaman commented Sep 24, 2024

CI is failing due to https://bandit.readthedocs.io/en/latest/plugins/b704_pytorch_load_save.html, a Bandit rule that was introduced earlier today in their 1.7.10 release.

safetensors is one possible option for addressing this, but I think somebody who is using TensorFlow with Kedro should ideally weigh in (I don't know what the additional considerations there could be).

I'm also not sure if this is a real issue TBH; we're not loading arbitrary data in an application, but rather wrapping an interface that the user could use to load their data.

@deepyaman
Copy link
Member Author

@gitgud5000 I saw you have another PR with a proposed improvement to the TensorFlowModelDataset; assuming that means you're using it, would you be open to taking a look at what makes sense to you here. :)

@DimedS
Copy link
Member

DimedS commented Sep 25, 2024

Thanks for flagging this, @deepyaman. Perhaps we can skip the check for now (especially since this is still an experimental dataset) and take some time to find a more permanent solution. There's also a possibility that PyTorch may update its methods in the future?

@merelcht
Copy link
Member

Perhaps we can skip the check for now (especially since this is still an experimental dataset) and take some time to find a more permanent solution.

This is not an experimental dataset, but perhaps we should demote it and ignore the issue until it's resolved properly.

@ankatiyar
Copy link
Contributor

Perhaps we can skip the check for now (especially since this is still an experimental dataset) and take some time to find a more permanent solution.

This is not an experimental dataset, but perhaps we should demote it and ignore the issue until it's resolved properly.

@merelcht The bandit check is failing in kedro_datasets_experimental.pytorch.PytorchDataset. I've added #nosec comments in #845 to unblock the CI but we can leave this issue open for the author of the dataset to take care of?

@merelcht
Copy link
Member

Oh my bad, I thought it was the TensorFlowModelDataset that was the problem!

Yeah let's ping the author. They might be able to give insight into whether this is a real issue or safe to ignore.

@ankatiyar
Copy link
Contributor

I believe it was @bpmeek!

Copying the traceback here for reference -

[main]	INFO	profile include tests: None
[main]	INFO	profile exclude tests: None
[main]	INFO	cli include tests: None
[main]	INFO	cli exclude tests: None
[main]	INFO	running on Python 3.11.10
Run started:2024-09-25 00:12:07.733319

Test results:
	No issues identified.

Code scanned:
	Total lines of code: 35[28](https://github.com/kedro-org/kedro-plugins/actions/runs/11023860216/job/30615960774#step:8:29)
	Total lines skipped (#nosec): 0
	Total potential issues skipped due to specifically being disabled (e.g., #nosec BXXX): 0

Run metrics:
	Total issues (by severity):
		Undefined: 0
		Low: 49
		Medium: 0
		High: 0
	Total issues (by confidence):
		Undefined: 0
		Low: 0
		Medium: 1
		High: 48
Files skipped (0):
[main]	INFO	profile include tests: None
[main]	INFO	profile exclude tests: None
[main]	INFO	cli include tests: None
[main]	INFO	cli exclude tests: None
[main]	INFO	running on Python 3.11.10
Run started:2024-09-25 00:12:07.635264

Test results:
	No issues identified.

Code scanned:
	Total lines of code: 2671
	Total lines skipped (#nosec): 2
	Total potential issues skipped due to specifically being disabled (e.g., #nosec BXXX): 0

Run metrics:
	Total issues (by severity):
		Undefined: 0
		Low: 6
		Medium: 0
		High: 0
	Total issues (by confidence):
		Undefined: 0
		Low: 0
		Medium: 0
		High: 6
Files skipped (0):
[main]	INFO	profile include tests: None
[main]	INFO	profile exclude tests: None
[main]	INFO	cli include tests: None
[main]	INFO	cli exclude tests: None
[main]	INFO	running on Python 3.11.10
Run started:2024-09-25 00:12:07.607821

Test results:
	No issues identified.

Code scanned:
	Total lines of code: 2374
	Total lines skipped (#nosec): 0
	Total potential issues skipped due to specifically being disabled (e.g., #nosec BXXX): 0

Run metrics:
	Total issues (by severity):
		Undefined: 0
		Low: 2
		Medium: 0
		High: 0
	Total issues (by confidence):
		Undefined: 0
		Low: 0
		Medium: 0
		High: 2
Files skipped (0):
[main]	INFO	profile include tests: None
[main]	INFO	profile exclude tests: None
[main]	INFO	cli include tests: None
[main]	INFO	cli exclude tests: None
[main]	INFO	running on Python 3.11.10
Run started:2024-09-25 00:12:07.685846

Test results:
>> Issue: [B614:pytorch_load_save] Use of unsafe PyTorch load or save
   Severity: Medium   Confidence: High
   CWE: CWE-502 (https://cwe.mitre.org/data/definitions/502.html)
   More Info: https://bandit.readthedocs.io/en/1.7.10/plugins/b614_pytorch_load_save.html
   Location: ./kedro-datasets/kedro_datasets_experimental/pytorch/pytorch_dataset.py:99:15
98	        load_path = get_filepath_str(self._get_load_path(), self._protocol)
99	        return torch.load(load_path, **self._fs_open_args_load)
100	

--------------------------------------------------
>> Issue: [B614:pytorch_load_save] Use of unsafe PyTorch load or save
   Severity: Medium   Confidence: High
   CWE: CWE-502 (https://cwe.mitre.org/data/definitions/502.html)
   More Info: https://bandit.readthedocs.io/en/1.7.10/plugins/b614_pytorch_load_save.html
   Location: ./kedro-datasets/kedro_datasets_experimental/pytorch/pytorch_dataset.py:103:8
102	        save_path = get_filepath_str(self._get_save_path(), self._protocol)
103	        torch.save(data.state_dict(), save_path, **self._fs_open_args_save)
104	

--------------------------------------------------

Code scanned:
	Total lines of code: [33](https://github.com/kedro-org/kedro-plugins/actions/runs/11023860216/job/30615960774#step:8:34)94
	Total lines skipped (#nosec): 1
	Total potential issues skipped due to specifically being disabled (e.g., #nosec BXXX): 0

Run metrics:
	Total issues (by severity):
		Undefined: 0
		Low: 0
		Medium: 2
		High: 0
	Total issues (by confidence):
		Undefined: 0
		Low: 0
		Medium: 0
		High: 2
Files skipped (0):

@gitgud5000
Copy link
Contributor

@deepyaman I agree—this is just wrapping the interface for saving/loading in-house objects, not external or intended for distribution...

I haven't tried the safetensors with TensorFlow yet, but with Keras > 3, their serialization forces you to use the built-in methods/API anyway.

@bpmeek
Copy link
Contributor

bpmeek commented Sep 26, 2024

I can try to look into this more tomorrow. I'm a little confused though because when I wrote this I was attempting to follow PyTorch's recommended load and save method but perhaps something has changed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants