
✨ Add language-codes-standardizations #230

Open · wants to merge 1 commit into main

Conversation

JulianKropp

There are many language code standardizations, such as FLORES-200, ISO-639-3, and IETF-BCP-47. Different libraries use different standards, and it can be difficult to map between them.

To address this, I have added a /languages endpoint to the API, which can be sorted by a specific standard using a query parameter, such as /languages?standardization=IETF-BCP-47. This will help users of the API easily retrieve the language code standardization they need for further processing.

For example, in a live audio translation service I'm working on, the Whisper library returns language codes that are not in FLORES-200, requiring conversion to another format. Furthermore, when displaying the results on a website, I need a different format, since the site I'm working with doesn't support FLORES-200. Mapping these codes manually is tedious, which is why I added this feature to simplify the process for everyone.

Additionally, the /languages endpoint provides a list of all supported translations in this API, allowing users to quickly determine which languages are available for translation and how to convert between different language code standards.
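To illustrate the idea, here is a minimal sketch of the proposed lookup in plain Python. The table is a tiny excerpt and the field names are illustrative, not the PR's actual implementation:

```python
# Illustrative excerpt of a language table keyed by several standards.
LANGUAGES = [
    {"name": "German", "FLORES-200": "deu_Latn", "ISO-639-3": "deu", "IETF-BCP-47": "de"},
    {"name": "Spanish", "FLORES-200": "spa_Latn", "ISO-639-3": "spa", "IETF-BCP-47": "es"},
]

def get_languages(standardization: str) -> dict:
    # Key the language list by the requested standard,
    # mimicking GET /languages?standardization=...
    return {entry[standardization]: entry for entry in LANGUAGES}

print(get_languages("IETF-BCP-47")["de"]["FLORES-200"])  # deu_Latn
```

A consumer would pick the standard it already has and read off the code it needs.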

@winstxnhdw
Owner

Hey, thanks for the PR. I can see the value in having such a feature but there is an issue with such a solution. Some languages have multiple dialects. Take the following for example:

Central Kanuri (Arabic script) knc_Arab
Central Kanuri (Latin script) knc_Latn

From what I understand, the default sorted mapping would have one of these languages overwrite the other.
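The collision can be sketched in a few lines, assuming the mapping is keyed by the base language code (a simplified illustration, not the PR's code):

```python
# Both Central Kanuri entries share the base code 'knc'; keyed by the
# base code alone, the later entry silently overwrites the earlier one.
flores = ["knc_Arab", "knc_Latn"]

by_base = {code.split("_")[0]: code for code in flores}
print(by_base)  # {'knc': 'knc_Latn'} -- knc_Arab was overwritten
```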

@JulianKropp
Author

Yes, that's correct. In some cases the dialects will be overwritten, because the target language code does not distinguish them. Still, the mapping is necessary to get the corresponding language code, even if the dialect information disappears. That's also the reason they introduced FLORES-200: they needed a language code standard that also supports dialects. But not everyone is using FLORES-200.

@winstxnhdw
Owner

Thanks for confirming. This service should support the complete FLORES-200 language set. I believe that we should leave the mapping to the consumers so that they can figure out how they want to handle the dialects. Otherwise, if they were to rely on our endpoints to do so, they might get unexpected behaviour if they were expecting another dialect instead.

@JulianKropp
Author

To address this, I added a new endpoint so users can choose whether to use it or not. The main challenge I'm facing is that I'm working with multiple programming languages, and not all of them have libraries that support mapping language codes like this, so it would be easier to add it to the API.

@winstxnhdw
Owner

I am still not convinced that this fits here. It sounds like you need a separate service that will return the mappings for you. In fact, the better solution is to generate these mappings once and save them as JSON. The FLORES-200 set won't change anyway.
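The generate-once approach could look something like this: a one-off script writes the mapping to a committed JSON file, and consumers just load it (the entries and filename are illustrative):

```python
import json

# One-off script: generate the mapping once and commit the JSON,
# so the service itself needs no extra dependency at runtime.
# The entries here are an illustrative excerpt, not the full set.
mapping = {"de": "deu_Latn", "es": "spa_Latn", "knc": ["knc_Arab", "knc_Latn"]}

with open("language_codes.json", "w", encoding="utf-8") as f:
    json.dump(mapping, f, indent=2, ensure_ascii=False)

# Consumers simply load the pre-generated file:
with open("language_codes.json", encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded["de"])  # deu_Latn
```

Ambiguous entries like knc can carry a list of script variants, leaving the choice to the consumer.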

@JulianKropp
Author

Here are two examples why I think this feature is necessary:

  • Example A:

    • I receive text from a transcription service that uses BCP-47 language tags. For instance, it might return de-AT (Austrian German).
    • Since I already know the language, I don't need to detect it. Instead, I need to convert de-AT into its FLORES-200 equivalent, deu_Latn, for translation purposes.
    • If I want to translate the text to Mexican Spanish (es-MX), I would convert that to the FLORES-200 code spa_Latn.
    • This process ensures I can easily map between different language code standards and proceed with the translation without manual intervention.
  • Example B:

    • In more complex cases, a language might have multiple scripts. For example, Acehnese can be written in:
      • Arabic script (ace_Arab)
      • Latin script (ace_Latn)
    • The API would return both FLORES-200 codes, and I would need to choose the correct one based on what I need.
    • If the output text should be in Arabic script, I would use ace_Arab. If it's in Latin script, I would use ace_Latn. So if I'm always using the Latin script, I can filter accordingly whenever I get two options to choose from.

Also, this list is generated only once, at startup, so it's almost the same as reading a JSON file at startup.
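The two examples above can be sketched as a single lookup that returns every FLORES-200 candidate and leaves the script choice to the caller. The tables are small illustrative excerpts, not real API data:

```python
# Return every FLORES-200 candidate for a BCP-47 tag, so the caller
# can pick a script when a language has more than one.
FLORES200 = ["ace_Arab", "ace_Latn", "deu_Latn", "spa_Latn"]
BCP47_TO_ISO3 = {"de": "deu", "es": "spa", "ace": "ace"}  # illustrative lookup

def candidates(bcp47: str) -> list:
    base = bcp47.split("-")[0].lower()   # de-AT -> de
    iso3 = BCP47_TO_ISO3.get(base, base)
    return [code for code in FLORES200 if code.startswith(iso3 + "_")]

print(candidates("de-AT"))  # ['deu_Latn']
print(candidates("ace"))    # ['ace_Arab', 'ace_Latn'] -- caller filters by script
```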

@winstxnhdw
Owner

> The API would return both FLORES-200 codes, and I would need to choose the correct one based on what I need. If the output text should be in Arabic script, I would use ace_Arab. If it's in Latin script, I would use ace_Latn.

I don't think your implementation currently does this. ace_Latn would overwrite ace_Arab, since you only have one flores-200 field.

> Also, this list is generated only once, at startup, so it's almost the same as reading a JSON file at startup.

Yes, except that you are adding dependencies to the project.
