Hi @HT-NEKO, the `sample` subset of the dataset cannot be split by language, as it is intended only for a quick glance at the data. If you want a smaller subset of the dataset, you can choose any of `sample-10B`, `sample-100B`, or `sample-1T` (corresponding to 10B, 100B, and 1T tokens, respectively). These support splitting by language.
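As a rough illustration of the point above, here is a minimal sketch that refuses a language filter on the `sample` config and only builds load arguments for the larger samples. The helper names (`build_load_kwargs`, `load_subset`), the `dataset_id` parameter, and the `languages` keyword are assumptions for illustration; check the dataset card for the exact arguments the loading script accepts.

```python
# Config names taken from the reply above; "sample" is the quick-glance
# subset and is deliberately absent from this set.
LANGUAGE_SPLIT_CONFIGS = {"sample-10B", "sample-100B", "sample-1T"}


def build_load_kwargs(config, languages):
    """Return keyword arguments for a language-filtered load.

    Raises ValueError for configs that cannot be split by language.
    (Hypothetical helper, not part of any library.)
    """
    if config not in LANGUAGE_SPLIT_CONFIGS:
        raise ValueError(
            f"config {config!r} cannot be split by language; "
            f"use one of {sorted(LANGUAGE_SPLIT_CONFIGS)}"
        )
    return {"name": config, "languages": list(languages)}


def load_subset(dataset_id, config, languages):
    """Load a language-filtered subset with Hugging Face `datasets`.

    Assumption: the dataset's loading script accepts a `languages`
    keyword; `load_dataset` forwards extra keywords to the builder config.
    """
    from datasets import load_dataset  # imported lazily; requires `datasets`

    return load_dataset(dataset_id, **build_load_kwargs(config, languages))
```

For example, `build_load_kwargs("sample-10B", ["en"])` yields the arguments for an English-only load, while passing `"sample"` raises an error instead of silently returning mixed-language data.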
Thank you for your work; it has been very helpful. However, I have encountered an issue:
my code:
but `ds` contains results outside of English:

Thank you for your reply!