Release datasets on Hugging Face #7

Open
NielsRogge opened this issue Nov 14, 2024 · 2 comments

@NielsRogge

Hello @sascha-kirch 🤗

I'm Niels and I work as an ML engineer at Hugging Face. I discovered your work as it got featured in AK's daily papers: https://huggingface.co/papers/2411.00233. The paper page lets people discuss your paper and find artifacts related to it (the datasets used in your paper, for instance). You can also claim the paper as yours, which will make it show up on your public profile at HF.

Would you like to host the datasets on https://huggingface.co/datasets? Hosting on Hugging Face will give them more visibility/enable better discoverability, and will also allow people to do:

```python
from datasets import load_dataset

dataset = load_dataset("your-hf-org/yes-but")
```

If you're down, here's a guide: https://huggingface.co/docs/datasets/loading. If the Datasets format doesn't work for your dataset, see https://huggingface.co/docs/huggingface_hub/en/guides/upload.

Besides that, there's the dataset viewer which allows people to quickly explore the first few rows of the data in the browser.

Once they are uploaded, we can also link the datasets to the paper page (read here) so people can discover your work.

What do you think?

Kind regards,

Niels

@sascha-kirch
Owner

sascha-kirch commented Nov 16, 2024

Hi @NielsRogge, thanks for reaching out!

To be honest, I did check the HF guides on Datasets during the development phase of SambaMixer, because I wanted to make it easily accessible (which is certainly not the case with this NASA battery dataset...).

The one thing that did hold me back was the fact that I am not the original creator of that dataset. Is there a way to explicitly give credit to the original creators?

Besides that, I had some technical doubts about the implementation details, which I would figure out after reading the docs.
For example: how to handle different versions of the dataset (e.g. filtered and raw), and how to implement pre-processing (e.g. resampling so the time signals all have the same number of samples). In the end it is very similar to the Audio Dataset example you shared.
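For the resampling question, here is a minimal sketch, assuming plain linear interpolation onto an evenly spaced grid is acceptable for these time signals. The function name `resample_linear` and its parameters are hypothetical and not taken from SambaMixer; the actual pre-processing may well differ.

```python
def resample_linear(signal, num_samples):
    """Resample a 1-D sequence to `num_samples` points by linear interpolation.

    Hypothetical helper for illustration only: the first and last samples are
    kept, and the points in between are interpolated on an even grid.
    """
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    n = len(signal)
    if n == 0:
        raise ValueError("signal must not be empty")
    if n == 1 or num_samples == 1:
        # Nothing to interpolate between: repeat the first value.
        return [float(signal[0])] * num_samples

    step = (n - 1) / (num_samples - 1)  # spacing of the new grid in old-index units
    resampled = []
    for i in range(num_samples):
        position = i * step
        left = min(int(position), n - 2)  # clamp so left + 1 stays in range
        fraction = position - left
        resampled.append(signal[left] * (1 - fraction) + signal[left + 1] * fraction)
    return resampled


short_cycle = [0.0, 1.0, 2.0]
print(resample_linear(short_cycle, 5))
```

With the Datasets library, a function like this could presumably be applied per row via `Dataset.map`, so every stored signal ends up with the same length.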

We have this CSV, which contains a link to an npy file that holds the time signals:
[image]

I might look into that for our follow-up work.

@NielsRogge
Author

> The one thing that did hold me back was the fact that I am not the original creator of that dataset. Is there a way to explicitly give credit to the original creators?

You could upload the datasets under your HF username or organization, and give credit to the original authors in the dataset card (README). Of course, you could also reach out to the authors for explicit permission. But if the datasets can be freely downloaded from the website, they might allow redistribution (does the license say anything about that?).

> We have this CSV and that contains a link to an npy-file that contains the timesignals:

The Datasets library supports CSV files: https://huggingface.co/docs/datasets/loading#csv. Basically, you can load the CSV file as a 🤗 Dataset, then call dataset.push_to_hub("your-hf-username-or-org/your-dataset") to push it to the Hub. We can then also link it to the paper page.
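The CSV route could look roughly like the sketch below. It assumes the metadata CSV has a column holding the path to each npy file; the column name `signal_path`, the file name `metadata.csv`, and both function names are hypothetical. The first helper only checks that every referenced npy file exists; the second loads the CSV and pushes it, and needs the `datasets` package plus a logged-in Hugging Face account.

```python
import csv
from pathlib import Path


def missing_signal_files(csv_path, path_column):
    """Return the paths listed in `path_column` that do not exist on disk.

    Relative paths are resolved against the folder that contains the CSV.
    """
    csv_path = Path(csv_path)
    missing = []
    with csv_path.open(newline="") as handle:
        for row in csv.DictReader(handle):
            if not (csv_path.parent / row[path_column]).exists():
                missing.append(row[path_column])
    return missing


def publish_csv(csv_path, repo_id, config_name="default"):
    """Load a CSV as a 🤗 Dataset and push it to the Hub.

    `config_name` is one possible way to keep variants such as "raw" and
    "filtered" in the same repository (assumes a recent `datasets` release).
    """
    # Imported lazily so the check above works without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset("csv", data_files=str(csv_path))
    dataset.push_to_hub(repo_id, config_name=config_name)


# Example usage (not executed here, since it uploads data):
# if not missing_signal_files("metadata.csv", "signal_path"):
#     publish_csv("metadata.csv", "your-hf-username-or-org/your-dataset", config_name="raw")
```

Note that pushing the CSV this way uploads the table rows only. As far as I can tell, the npy files it points to would still have to be uploaded separately or loaded into a column first.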
