Adding Gujarati Language support #1845

Merged
merged 5 commits into from
Jan 20, 2025

Conversation

sarjil77
Contributor

Hey @felixdittrich92,

I have added the Gujarati language and made the corresponding changes in datasets.rst. Please review my PR.

@felixdittrich92
Contributor

Hi @sarjil77 👋🏼,

One small issue is left here (see the failing CI job):

AssertionError: Duplicate characters in gujarati_vowels vocab: ['અ']

Please remove all duplicates and afterwards run make style for formatting :)
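For readers following along, the kind of check behind this assertion can be sketched as below. This is a minimal illustration and not the actual doctr test: the helper name `find_duplicates` and the sample string are hypothetical, and the sketch assumes the check counts individual Unicode code points in the vocab string.

```python
from collections import Counter


def find_duplicates(vocab: str) -> list[str]:
    """Return every code point that occurs more than once in `vocab` (hypothetical helper)."""
    counts = Counter(vocab)
    return sorted(ch for ch, n in counts.items() if n > 1)


# Hypothetical sample: three Gujarati vowels with U+0A85 (અ) repeated at the end.
sample_vocab = "\u0a85\u0a86\u0a87\u0a85"
duplicates = find_duplicates(sample_vocab)

# Mirrors the style of the CI message, but only reports instead of asserting.
if duplicates:
    print(f"Duplicate characters in sample vocab: {duplicates}")
```

Because Python iterates over a `str` one code point at a time, a character that looks like a single glyph on screen can still contribute several entries to the count.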

felixdittrich92 added this to the 0.11.0 milestone Jan 19, 2025
felixdittrich92 self-assigned this Jan 19, 2025
felixdittrich92 added labels Jan 19, 2025: topic: documentation (Improvements or additions to documentation), type: enhancement (Improvement), module: datasets (Related to doctr.datasets), ext: docs (Related to docs folder)
@sarjil77
Contributor Author

The 'અ' is not actually repeated. It may look as if it is, but the error occurs because there are entries where 'અ' appears together with '.' and ':'. The check treats these as duplicates, although in reality they are distinct characters.

Solution: I will remove the entries that combine 'અ' with the full stop and colon, keep only the base character, and run the formatting as well.
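One way to see why such entries trip the check is to list the code points behind each glyph. The sketch below assumes that the '.' and ':' mentioned above are the Gujarati anusvara (U+0A82) and visarga (U+0A83) signs attached to 'અ' (U+0A85); that reading is an assumption, since the thread does not show the original vocab string. The helper `describe` is hypothetical.

```python
import unicodedata


def describe(text: str) -> list[tuple[str, str]]:
    """List (code point, Unicode name) for every code point in `text` (hypothetical helper)."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]


# Assumed inputs: the letter A followed by a combining sign.
a_with_anusvara = "\u0a85\u0a82"
a_with_visarga = "\u0a85\u0a83"

for label, text in (("anusvara", a_with_anusvara), ("visarga", a_with_visarga)):
    # Each string renders as one glyph but holds two code points.
    print(label, len(text), describe(text))
```

Under this assumption, each of the two entries contributes its own copy of U+0A85, so a per-code-point check sees 'અ' more than once even though the rendered glyphs differ.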

@sarjil77
Contributor Author

Done, sir, thanks :)

@felixdittrich92
Contributor

Formatting looks good now 👍🏼 But there are still duplicates:

AssertionError: Duplicate characters in gujarati_consonants vocab: ['ક', 'જ', 'ઞ', 'ષ', '્']

:)

@sarjil77
Contributor Author

sarjil77 commented Jan 19, 2025

Shit, this duplication won't let me sleep. Let me fix it now.

@sarjil77
Contributor Author

sarjil77 commented Jan 19, 2025

And yes, there is not a single visibly duplicated character, but some characters contain hidden characters inside them, and that is why they are flagged as duplicates.

For example, because of 'જ્ઞ' we were getting 'જ', 'ઞ', and '્' as duplicates. The same goes for 'ક' and 'ષ': another character in the vocab contains them as sub-characters.

I hope that makes sense.

I am fixing it now :) :)
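The effect described here can be reproduced in a short sketch. It assumes the consonant vocab listed the plain consonants and also the conjuncts 'જ્ઞ' and 'ક્ષ' (the latter is inferred from the error message, not stated in the thread). The helper `dedupe_code_points` is hypothetical and shows only one possible cleanup; the PR's actual fix is not shown in this conversation.

```python
def dedupe_code_points(vocab: str) -> str:
    """Drop repeated code points, keeping the first occurrence and its order (hypothetical helper)."""
    return "".join(dict.fromkeys(vocab))


ka, ja, nya, ssa = "\u0a95", "\u0a9c", "\u0a9e", "\u0ab7"
virama = "\u0acd"  # the joining sign hidden inside a conjunct

# Assumed vocab: four plain consonants followed by the two conjuncts.
consonants_with_conjuncts = ka + ja + nya + ssa + (ka + virama + ssa) + (ja + virama + nya)

# Each conjunct is three code points, so the string is longer than it looks.
print(len(consonants_with_conjuncts))

cleaned = dedupe_code_points(consonants_with_conjuncts)
print(len(cleaned))
```

With this cleanup the conjuncts are no longer stored as separate entries; only their component code points remain in the vocab.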

codecov bot commented Jan 19, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.56%. Comparing base (48873e0) to head (85876b1).
Report is 11 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1845      +/-   ##
==========================================
- Coverage   96.60%   96.56%   -0.05%     
==========================================
  Files         165      165              
  Lines        7929     7941      +12     
==========================================
+ Hits         7660     7668       +8     
- Misses        269      273       +4     
Flag Coverage Δ
unittests 96.56% <100.00%> (-0.05%) ⬇️


@sarjil77
Contributor Author

@felixdittrich92, I think it is fine now. Please let me know if any further errors arise. Going to sleep now :)

Thanks in advance.

Contributor

@felixdittrich92 felixdittrich92 left a comment

Looks good now 👍🏼 thanks 😊

@felixdittrich92 felixdittrich92 merged commit ebfc9f3 into mindee:main Jan 20, 2025
69 of 70 checks passed
@sarjil77
Contributor Author

Thanks @felixdittrich92. I would also like to hop on #1131, which is a good first issue. Could you please guide me there? I have left comments on it.

@felixdittrich92
Contributor

> Thanks @felixdittrich92. I would also like to hop on #1131, which is a good first issue. Could you please guide me there? I have left comments on it.

👍🏼 Sure, I will write something down once I am online in a few minutes 😅

@sarjil77 sarjil77 deleted the jan19_adding_guj branch January 20, 2025 06:33