Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update webhdfs to support BasicAuth and apply proxy on every call #1409

Merged
merged 15 commits into from
Nov 3, 2023

Conversation

gdiepen
Copy link
Contributor

@gdiepen gdiepen commented Oct 31, 2023

I am using DVC and I am trying to store my dvc cache on a webHDFS remote.

Unfortunately, I cannot directly use kerberos authentication and I must use Basic Authentication on a proxy server (which introduces some additional part in the path component of the URL (e.g. instead of /webhdfs/v1 it now becomes /gateway/cluster_name/webhdfs/v1)

I found that the current version of filesystem_spec does not support basic authentication, so this PR will add the support for that.

At the same time, I am using the proxy dictionary/callable argument on every outbound call that is made in the _call function.

Would be happy to add additional unit tests, but unfortunately as far as I know the cluster created with the htcluster application does not support basic authentication.

The only thing I could add as test is check whether the right ValueError is raised whenever password is provided, but user is not.

I have tested this code with a modified version of DVC (that allows for the extra parameters) and using this updated version I was able to connect to our WebHDFS server without a problem

@gdiepen
Copy link
Contributor Author

gdiepen commented Nov 1, 2023

I see that the build of my PR is having a lot of failed checks. However, the changes I made should not affect these by any means as far as I can see ?!?

@martindurant
Copy link
Member

Tests are being hit by pytest-dev/pytest-asyncio#655 - we'll have to wait for a fix upstream

@gdiepen
Copy link
Contributor Author

gdiepen commented Nov 1, 2023

Some more testing today and had problems if I used the apply_proxy with dictionary for data_proxy on the location headers that were returned because it would start applying the proxy conversion as duplicates

In case the user is not None and no password has been provided, we will
default back to the original behavior
Later in the code, we use self.password and compare that to None, so we
must always store the argument as attribute
@martindurant
Copy link
Member

This looks good, but tests won't pass until pytest-asyncio resolve their current import issue, or I decide to stop waiting and pin to a previous version.

@gdiepen
Copy link
Contributor Author

gdiepen commented Nov 9, 2023

@martindurant Thanks for merging this!

I only have one quick question about when you are planning to create a new release that will include this modification. I am working downstream on the components in DVC that use this new feature and there it can only be merged if this modification is available in a new release. And only after DVC after that releases new versions can I use it directly in my code at work (instead of relying on some patched local branches where i hardcode some updated versions also)

@martindurant
Copy link
Member

when you are planning to create a new release

I try to do one every month. I may be a bit behind...

@ruben-s
Copy link

ruben-s commented Nov 28, 2023

Hi @martindurant,

I was looking at the the code of the webhdfs implementation: as it stands I could also use a release version with the changes/fixes for basic authentication from Guido. (I'm trying to write files via Kedro to a hdfs cloudera cluster, which is fronted with a Knox server that takes care of authentication via username/password. Kedro uses fsspec behind the scenes to know how to intereact with WebHDFS.)

Thx in advance.

br,
Ruben

@ruben-s
Copy link

ruben-s commented Dec 5, 2023

Hi @martindurant,

thanks for the release!
My kedro project now has an up to date environment!

br,
Ruben

skshetry added a commit to iterative/dvc-webhdfs that referenced this pull request Dec 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants