set languages=['en'], but contains results outside of English #122

HT-NEKO · 2024-12-17T07:57:33Z

Thank you for your work; it has been very helpful. However, I have run into an issue.

My code:

from datasets import load_dataset

ds = load_dataset(
    "/data/public/models/RedPajama-Data-V2/RedPajama-Data-V2/RedPajama-Data-V2.py",
    partition="head_middle",
    languages=["en"],
    name="sample",
)

But ds contains results that are not in English.

Thank you for your reply!


mauriceweber · 2025-01-06T08:22:13Z

Hi @HT-NEKO, the sample subset of the dataset cannot be split by language, as it is intended only for a quick glance at the data. If you want a smaller subset of the dataset, you can choose any of sample-10B, sample-100B, or sample-1T (corresponding to 10B, 100B, and 1T tokens, respectively). These support splitting by language.
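For anyone landing here later, a minimal sketch of the suggestion above. It assumes the loading script accepts the same `partition` and `languages` arguments as in the original snippet and that the subset is selected through `name`. The helper names `build_load_kwargs` and `load_subset` are hypothetical, made up for illustration, and are not part of the dataset script or the `datasets` library.

```python
from typing import Any, Dict, List, Optional


def build_load_kwargs(
    name: str,
    partition: str = "head_middle",
    languages: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble keyword arguments for load_dataset.

    Rejects the one combination reported in this thread as unsupported:
    a language filter together with the plain "sample" subset.
    """
    if languages and name == "sample":
        raise ValueError(
            'The "sample" subset cannot be split by language; '
            "use sample-10B, sample-100B or sample-1T instead."
        )
    kwargs: Dict[str, Any] = {"name": name, "partition": partition}
    if languages:
        kwargs["languages"] = list(languages)
    return kwargs


def load_subset(script_path: str, name: str, languages: Optional[List[str]] = None):
    """Hypothetical wrapper around datasets.load_dataset."""
    # Imported lazily so the argument check above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(script_path, **build_load_kwargs(name, languages=languages))
```

With this, `load_subset(path_to_script, "sample-10B", languages=["en"])` would request only English data from the 10B-token subset, while passing `"sample"` together with a language filter fails early instead of silently returning mixed-language data.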

HT-NEKO closed this as completed Dec 17, 2024

HT-NEKO reopened this Dec 17, 2024

HT-NEKO changed the title ~~Set languages=['en']~~ set languages=['en'], but contains results outside of English Dec 17, 2024
