
✨ Add language-codes-standardizations #230

Open · wants to merge 1 commit into main

Conversation

JulianKropp

There are many language code standardizations, such as FLORES-200, ISO-639-3, and IETF-BCP-47. Different libraries use different standards, and it can be difficult to map between them.

To address this, I have added a /languages endpoint to the API, which can be sorted by a specific standard using a query parameter, such as /languages?standardization=IETF-BCP-47. This will help users of the API easily retrieve the language code standardization they need for further processing.

For example, in a live audio translation service I'm working on, the Whisper library returns language codes that are not in FLORES-200, requiring conversion to another format. Furthermore, when displaying the results on a website, I need a different format, since the site I'm working with doesn't support FLORES-200. Mapping these codes manually is tedious, which is why I added this feature to simplify the process for everyone.

Additionally, the /languages endpoint provides a list of all supported translations in this API, allowing users to quickly determine which languages are available for translation and how to convert between different language code standards.
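To illustrate the idea, here is a minimal sketch of the proposed lookup in plain Python. The table is a tiny excerpt and the field names are illustrative, not the PR's actual implementation:

```python
# Illustrative excerpt of a language table keyed by several standards.
LANGUAGES = [
    {"name": "German", "FLORES-200": "deu_Latn", "ISO-639-3": "deu", "IETF-BCP-47": "de"},
    {"name": "Spanish", "FLORES-200": "spa_Latn", "ISO-639-3": "spa", "IETF-BCP-47": "es"},
]

def get_languages(standardization: str) -> dict:
    # Key the language list by the requested standard,
    # mimicking GET /languages?standardization=...
    return {entry[standardization]: entry for entry in LANGUAGES}

print(get_languages("IETF-BCP-47")["de"]["FLORES-200"])  # deu_Latn
```

A consumer would pick the standard it already has and read off the code it needs.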

@winstxnhdw
Owner

Hey, thanks for the PR. I can see the value in having such a feature but there is an issue with such a solution. Some languages have multiple dialects. Take the following for example:

Central Kanuri (Arabic script) knc_Arab
Central Kanuri (Latin script) knc_Latn

From what I understand, the default sorted mapping would have one of these languages overwrite the other.
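The collision can be sketched in a few lines, assuming the mapping is keyed by the base language code (a simplified illustration, not the PR's code):

```python
# Both Central Kanuri entries share the base code 'knc'; keyed by the
# base code alone, the later entry silently overwrites the earlier one.
flores = ["knc_Arab", "knc_Latn"]

by_base = {code.split("_")[0]: code for code in flores}
print(by_base)  # {'knc': 'knc_Latn'} -- knc_Arab was overwritten
```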

@JulianKropp
Author

Yes, that's correct. In some cases the dialects will be overwritten, because the target language code does not distinguish them. Still, the mapping is necessary to get the corresponding language code, even if the dialect information disappears. That's also the reason they introduced FLORES-200: they needed a language code standard that also supports dialects. But not everyone is using FLORES-200.

@winstxnhdw
Owner

Thanks for confirming. This service should support the complete FLORES-200 language set. I believe that we should leave the mapping to the consumers so that they can figure out how they want to handle the dialects. Otherwise, if they were to rely on our endpoints to do so, they might get unexpected behaviour if they were expecting another dialect instead.

@JulianKropp
Author

To address this, I added a new endpoint so users can choose whether to use it or not. The main challenge I'm facing is that I'm working with multiple programming languages, and not all of them have libraries that support mapping language codes like this, so it would be easier to add it to the API.

@winstxnhdw
Owner

I am still not convinced that this fits here. It sounds like you need a separate service that will return the mappings for you. In fact, the better solution is to generate these mappings once and save them as JSON. The FLORES-200 set won't change anyway.
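The generate-once approach could look something like this: a one-off script writes the mapping to a committed JSON file, and consumers just load it (the entries and filename are illustrative):

```python
import json

# One-off script: generate the mapping once and commit the JSON,
# so the service itself needs no extra dependency at runtime.
# The entries here are an illustrative excerpt, not the full set.
mapping = {"de": "deu_Latn", "es": "spa_Latn", "knc": ["knc_Arab", "knc_Latn"]}

with open("language_codes.json", "w", encoding="utf-8") as f:
    json.dump(mapping, f, indent=2, ensure_ascii=False)

# Consumers simply load the pre-generated file:
with open("language_codes.json", encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded["de"])  # deu_Latn
```

Ambiguous entries like knc can carry a list of script variants, leaving the choice to the consumer.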

@JulianKropp
Author

Here are two examples why I think this feature is necessary:

  • Example A:

    • I receive text from a transcription service that uses BCP-47 language tags. For instance, it might return de-AT (Austrian German).
    • Since I already know the language, I don't need to detect it. Instead, I need to convert de-AT into its FLORES-200 equivalent, deu_Latn, for translation purposes.
    • If I want to translate the text to Mexican Spanish (es-MX), I would convert that to the FLORES-200 code spa_Latn.
    • This process ensures I can easily map between different language code standards and proceed with the translation without manual intervention.
  • Example B:

    • In more complex cases, a language might have multiple scripts. For example, Acehnese can be written in:
      • Arabic script (ace_Arab)
      • Latin script (ace_Latn)
    • The API would return both FLORES-200 codes, and I would need to choose the correct one based on what I need.
    • If the output text should be in Arabic script, I would use ace_Arab. If it's in Latin script, I would use ace_Latn. So if I'm always using the Latin script, I can filter accordingly whenever I get two options to choose from.

Also, this list is generated only once, at startup, so it's almost the same as reading a JSON file at startup.
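The two examples above can be sketched as a single lookup that returns every FLORES-200 candidate and leaves the script choice to the caller. The tables are small illustrative excerpts, not real API data:

```python
# Return every FLORES-200 candidate for a BCP-47 tag, so the caller
# can pick a script when a language has more than one.
FLORES200 = ["ace_Arab", "ace_Latn", "deu_Latn", "spa_Latn"]
BCP47_TO_ISO3 = {"de": "deu", "es": "spa", "ace": "ace"}  # illustrative lookup

def candidates(bcp47: str) -> list:
    base = bcp47.split("-")[0].lower()   # de-AT -> de
    iso3 = BCP47_TO_ISO3.get(base, base)
    return [code for code in FLORES200 if code.startswith(iso3 + "_")]

print(candidates("de-AT"))  # ['deu_Latn']
print(candidates("ace"))    # ['ace_Arab', 'ace_Latn'] -- caller filters by script
```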

@winstxnhdw
Owner

> The API would return both FLORES-200 codes, and I would need to choose the correct one based on what I need. If the output text should be in Arabic script, I would use ace_Arab. If it's in Latin script, I would use ace_Latn.

I don't think your implementation currently does this. ace_Latn would overwrite ace_Arab, since you only have one flores-200 field.

> Also, this list is generated only once, at startup, so it's almost the same as reading a JSON file at startup.

Yes, except that you are adding dependencies to the project.
