✨ Add language-codes-standardizations #230
base: main
Hey, thanks for the PR. I can see the value in having such a feature but there is an issue with such a solution. Some languages have multiple dialects. Take the following for example:
From what I understand, the default sorted mapping would have one of these languages overwrite the other.
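The overwrite described here can be reproduced in a few lines. This is only a minimal sketch: it assumes the coarser code is derived by dropping the script suffix from a FLORES-200 code, `to_iso639_3` is a hypothetical helper, and the sample codes were picked for illustration (they are not necessarily the pair referred to above).

```python
from collections import defaultdict

# Illustrative sample: two FLORES-200 codes that share the same three-letter prefix.
FLORES_CODES = ["zho_Hans", "zho_Hant", "eng_Latn"]


def to_iso639_3(flores_code: str) -> str:
    """Hypothetical helper: keep the language part, drop the script suffix."""
    return flores_code.split("_", 1)[0]


# A plain dict keeps only the last value written for each key,
# so one variant silently replaces the other.
naive = {to_iso639_3(code): code for code in sorted(FLORES_CODES)}

# Grouping keeps every variant, leaving the choice to the caller.
grouped: dict[str, list[str]] = defaultdict(list)
for code in sorted(FLORES_CODES):
    grouped[to_iso639_3(code)].append(code)
```

With the grouped form, a consumer who expects a particular variant can pick it explicitly instead of receiving whichever entry happened to be written last.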
Yes, that's correct. The dialects will be overwritten in some cases because the target language code standard does not support them. It's still necessary to map them to get the corresponding language code, even if the dialects disappear. That's also the reason why they introduced FLORES-200: they needed a language code that also supports dialects. But not everyone is using FLORES-200.
Thanks for confirming. This service should support the complete FLORES-200 language set. I believe that we should leave the mapping to the consumers so that they can figure out how they want to handle the dialects. Otherwise, if they were to rely on our endpoints to do so, they might get unexpected behaviour if they were expecting another dialect instead.
To address this, I added a new endpoint that allows users to choose whether they want to use it or not. The main challenge I'm facing is that I'm working with multiple programming languages, and not all of them have libraries that support mapping language codes like that. So it would be easier to add it to the API.
I am still not convinced that this fits here. It sounds like you need a separate service that will return the mappings for you. In fact, the better solution is to generate these mappings once and save them as JSON. FLORES-200 won't change anyway.
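As a rough illustration of the "generate once, save as JSON" suggestion, the sketch below writes a mapping to disk and reads it back. The table is a small hand-written subset for demonstration only, and the `save_mapping` and `load_mapping` helpers are hypothetical names; none of this comes from the project.

```python
import json
from pathlib import Path

# Illustrative subset only; a real table would cover the full FLORES-200 list.
FLORES_TO_BCP47 = {
    "eng_Latn": "en",
    "fra_Latn": "fr",
    "zho_Hans": "zh-Hans",
    "zho_Hant": "zh-Hant",
}


def save_mapping(mapping: dict[str, str], path: Path) -> None:
    """Write the mapping once; the resulting file can be shipped to any client."""
    path.write_text(json.dumps(mapping, indent=2, sort_keys=True), encoding="utf-8")


def load_mapping(path: Path) -> dict[str, str]:
    """Read the mapping back; any language with a JSON parser can do the same."""
    return json.loads(path.read_text(encoding="utf-8"))
```

Because the output is plain JSON, the same file can be consumed from other programming languages without a code-mapping library.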
Here are two examples why I think this feature is necessary:
Also, this list is generated only once, at startup. So it's almost the same as reading a JSON file at startup.
I don't think your implementation does this currently.
Yes, except that you are adding dependencies to the project.
There are many language code standardizations, such as FLORES-200, ISO-639-3, and IETF-BCP-47. Different libraries use different standards, and it can be difficult to map between them.
To address this, I have added a `/languages` endpoint to the API, which can be sorted by a specific standard using a query parameter, such as `/languages?standardization=IETF-BCP-47`. This will help users of the API easily retrieve the language code standardization they need for further processing.
For example, in a live audio translation service I'm working on, the Whisper library returns language codes that are not in FLORES-200, requiring conversion to another format. Furthermore, when displaying the results on a website, I need a different format since the site I'm working with doesn't support FLORES-200. Mapping these codes manually is tedious, which is why I added this feature to simplify the process for everyone.
Additionally, the `/languages` endpoint provides a list of all supported translations in this API, allowing users to quickly determine which languages are available for translation and how to convert between different language code standards.
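A client-side call to the proposed endpoint might look like the sketch below. Only the `/languages` path and the `standardization` query parameter come from the description above; the base URL, the helper names, and the assumption that the response body is JSON are hypothetical.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical base URL; replace with wherever the API is actually hosted.
BASE_URL = "http://localhost:7860"


def languages_url(standardization: Optional[str] = None, base_url: str = BASE_URL) -> str:
    """Build the request URL, adding the query parameter only when one is given."""
    query = f"?{urlencode({'standardization': standardization})}" if standardization else ""
    return f"{base_url}/languages{query}"


def fetch_languages(standardization: Optional[str] = None):
    """Return the decoded body, assuming JSON; its exact shape is not specified here."""
    with urlopen(languages_url(standardization)) as response:
        return json.load(response)
```

For instance, `languages_url("IETF-BCP-47")` builds the same request shown in the description, while calling it with no argument targets the plain `/languages` listing.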