feat(guess lang): script to guess language and script from tracklist #502
New script.
Language is guessed through LibreTranslate's detection API by feeding it all track titles plus the release name. We use two open instances: we first try one that does very little (no?) rate limiting but has limited language support, then fall back on an instance that allows 15 calls per minute but supports more languages.
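A rough sketch of the fallback flow described above, not the code in this PR. The instance URLs are placeholders, `pickBestGuess`, `buildQuery` and `guessLanguage` are hypothetical names, and plain `fetch` stands in for whatever request helper the userscript actually uses. It assumes LibreTranslate's `/detect` endpoint, which takes the text in `q` and returns a list of `{ language, confidence }` objects.

```javascript
// Placeholder URLs (assumption): not the instances used by the script.
const INSTANCES = [
    'https://libretranslate-primary.example/detect',  // lenient rate limit, fewer languages
    'https://libretranslate-fallback.example/detect', // 15 calls per minute, more languages
];

// Return the language code of the most confident detection, or null if there is none.
function pickBestGuess(detections) {
    if (!Array.isArray(detections) || detections.length === 0) return null;
    const best = detections.reduce((a, b) => (b.confidence > a.confidence ? b : a));
    return best.language;
}

// Send the release name and all track titles as one text, so we need one request per instance.
function buildQuery(releaseName, trackTitles) {
    return [releaseName, ...trackTitles].join('\n');
}

// Try each instance in order and return the first guess we get, or null if all fail.
async function guessLanguage(releaseName, trackTitles, fetchFn = fetch, instances = INSTANCES) {
    const q = buildQuery(releaseName, trackTitles);
    for (const url of instances) {
        try {
            const resp = await fetchFn(url, {
                method: 'POST',
                headers: { 'Content-Type': 'application/json' },
                body: JSON.stringify({ q }),
            });
            if (!resp.ok) continue; // e.g. rate limited: move on to the next instance
            const guess = pickBestGuess(await resp.json());
            if (guess !== null) return guess;
        } catch (err) {
            // Network or parse error: move on to the next instance.
        }
    }
    return null;
}
```

Passing `fetchFn` and `instances` as parameters is only there to make the sketch easy to try against a stub without hitting a real instance.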
Script is guessed by counting the occurrences of character classes in the tracklist and release name. We use Unicode-aware regular expressions with Unicode property escapes, which Babel transpiles. The guess prefers non-Latin over Latin if both are present (but the non-Latin script has to be at least somewhat common in the tracklist). All the "Frequently used" scripts should be supported, except Katakana; Han support is limited, see below.
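To illustrate the counting approach, here is a minimal sketch under a few assumptions: the script list is only a subset, the `minShare` threshold of 15% is made up to stand in for "at least somewhat common", and `countScripts` and `guessScript` are hypothetical names rather than the functions in this PR.

```javascript
// Illustrative subset of scripts (assumption); each pattern uses a Unicode property escape.
const SCRIPT_PATTERNS = {
    Latin: /\p{Script=Latin}/gu,
    Cyrillic: /\p{Script=Cyrillic}/gu,
    Greek: /\p{Script=Greek}/gu,
    Arabic: /\p{Script=Arabic}/gu,
    Hebrew: /\p{Script=Hebrew}/gu,
    Han: /\p{Script=Han}/gu,
};

// Count how many characters of each script occur in the text.
function countScripts(text) {
    const counts = {};
    for (const [name, pattern] of Object.entries(SCRIPT_PATTERNS)) {
        counts[name] = (text.match(pattern) || []).length;
    }
    return counts;
}

// Prefer the most common non-Latin script if it makes up at least `minShare`
// of all counted characters; otherwise fall back to Latin, or null if nothing matched.
// The 0.15 default is an assumed value, not taken from the PR.
function guessScript(releaseName, trackTitles, minShare = 0.15) {
    const counts = countScripts([releaseName, ...trackTitles].join('\n'));
    const total = Object.values(counts).reduce((a, b) => a + b, 0);
    if (total === 0) return null;
    const nonLatin = Object.entries(counts)
        .filter(([name, n]) => name !== 'Latin' && n / total >= minShare)
        .sort((a, b) => b[1] - a[1]);
    if (nonLatin.length > 0) return nonLatin[0][0];
    return counts.Latin > 0 ? 'Latin' : null;
}
```

With these assumptions, a release whose titles mix Cyrillic and Latin in roughly equal parts would be guessed as Cyrillic, while a single stray Cyrillic character in an otherwise Latin tracklist would not override Latin.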
Known limitations and issues
[multiple languages]
and likely it'll produce no guess at all. This is a limitation of the API: 1) it only ever seems to return a single language; 2) we can't feasibly detect each track individually, as that would require too many requests.