feat(guess lang): script to guess language and script from tracklist #502

Draft
wants to merge 2 commits into
base: main

ROpdebee
Owner

New script.

Language is guessed through LibreTranslate's detection API by feeding it all track titles plus the release name. We use two open instances: we first try one that does very little (if any) rate limiting but has limited language support, then fall back on an instance that allows 15 calls per minute but supports more languages.
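
As a rough sketch of that two-instance flow (not the code in this PR): the `Detector` type, the function names, and the default threshold of 50 are all hypothetical, and the confidence scale is assumed to be 0 to 100. Each detector stands in for one LibreTranslate instance, so no instance URLs are hard-coded here.

```typescript
// Shape of one entry in a language detection response (assumed: confidence on a 0-100 scale).
interface Detection {
    language: string;
    confidence: number;
}

// Hypothetical abstraction: one function per instance, hiding the HTTP call.
type Detector = (text: string) => Promise<Detection[]>;

// Combine the release name and all track titles into a single query text,
// so that only one request per instance is needed.
function buildDetectionInput(releaseTitle: string, trackTitles: string[]): string {
    return [releaseTitle, ...trackTitles]
        .map((title) => title.trim())
        .filter((title) => title.length > 0)
        .join('\n');
}

// Return the most confident language, or null if it is below the threshold.
function pickLanguage(detections: Detection[], minConfidence: number): string | null {
    let best: Detection | null = null;
    for (const detection of detections) {
        if (best === null || detection.confidence > best.confidence) {
            best = detection;
        }
    }
    return best !== null && best.confidence >= minConfidence ? best.language : null;
}

// Try each instance in order; move on when one fails or gives no usable guess.
async function detectWithFallback(
    text: string,
    detectors: Detector[],
    minConfidence = 50, // Hypothetical default, not taken from the PR.
): Promise<string | null> {
    for (const detect of detectors) {
        try {
            const language = pickLanguage(await detect(text), minConfidence);
            if (language !== null) return language;
        } catch {
            // Instance unreachable or rate limited: fall through to the next one.
        }
    }
    return null;
}
```

A real `Detector` would presumably POST the text as the `q` parameter to an instance's `/detect` endpoint. Injecting it as a function keeps the fallback logic testable without network access.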

Script is guessed by counting the occurrences of character classes in the tracklist and release name. We use Unicode-aware regular expressions with Unicode property escapes, which Babel transpiles. Non-Latin scripts are preferred over Latin if both are present, provided the non-Latin script is at least somewhat common in the tracklist. All the "Frequently used" scripts should be supported, except Katakana; Han support is limited (see below).
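
To illustrate the counting approach, here is a minimal sketch, not the actual implementation in this PR. The reduced script list, the function names, and the `minShare` threshold of 15% are assumptions made for the example.

```typescript
// Reduced, illustrative set of scripts; the real script may cover more.
const SCRIPT_PATTERNS: Record<string, RegExp> = {
    Latin: /\p{Script=Latin}/gu,
    Cyrillic: /\p{Script=Cyrillic}/gu,
    Greek: /\p{Script=Greek}/gu,
    Han: /\p{Script=Han}/gu,
    Hangul: /\p{Script=Hangul}/gu,
};

// Count how many characters of each script occur in the text.
function countScripts(text: string): Record<string, number> {
    const counts: Record<string, number> = {};
    for (const name of Object.keys(SCRIPT_PATTERNS)) {
        const matches = text.match(SCRIPT_PATTERNS[name]);
        counts[name] = matches === null ? 0 : matches.length;
    }
    return counts;
}

// Prefer the most common non-Latin script if it makes up at least `minShare`
// of all counted characters; otherwise fall back to Latin, or null if nothing matched.
function guessScript(titles: string[], minShare = 0.15): string | null {
    const counts = countScripts(titles.join('\n'));
    const names = Object.keys(counts);
    const total = names.reduce((sum, name) => sum + counts[name], 0);
    if (total === 0) return null;

    const nonLatin = names
        .filter((name) => name !== 'Latin')
        .sort((a, b) => counts[b] - counts[a]);
    const best = nonLatin[0];
    if (counts[best] / total >= minShare) return best;
    return counts['Latin'] > 0 ? 'Latin' : null;
}
```

With this sketch, a tracklist such as `['Привет (Remix)', 'Мир feat. DJ']` is guessed as Cyrillic even though Latin letters are slightly more numerous, while a single stray Cyrillic letter in an otherwise Latin tracklist stays below the threshold.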

Known limitations and issues

  • Tracklists with mixed languages aren't handled well. The script cannot detect [multiple languages] and will likely produce no guess at all. This is a limitation of the API: 1) it only ever seems to return a single language; 2) we can't feasibly run detection on each track individually, as that would require too many requests.
  • Script detection cannot distinguish between simplified and traditional Han and always uses the generic option. While it might be possible to improve these guesses, the only approaches I've seen so far literally list all of the characters that can appear in simplified or traditional text, and those lists are huge. A heuristic based on release events might work (e.g. Taiwan is likely traditional).
  • Although we could detect Katakana separately, we fill it in as Japanese as well, since the style guidelines say that Katakana should only be used for translations and 1) I don't think we can reliably detect that and 2) I'm not familiar enough with Japanese releases to implement that myself.
  • Language support is somewhat limited. I've seen language detectors that claim to support 180+ languages, whereas the one we use supports maybe 40 at most. However, the other APIs require API keys and, more importantly, only allow something like 50 requests per month on a free plan.
  • Both script and language detection can, of course, produce false matches. Such is life.
  • I'm not a huge fan of the button position. I'd rather put a "Guess case"-like button next to one of the input fields, but there are two input fields and it'd be weird to have a button next to only one of them.
  • Some configuration options might be desirable, e.g. setting the minimum confidence thresholds.
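
The release-event heuristic floated for Han in the list above could look roughly like the sketch below. It is not implemented in this PR. The country lists are my own assumption (Taiwan, Hong Kong, and Macau as traditional; mainland China and Singapore as simplified), the function name is hypothetical, and the return values are ISO 15924 script codes rather than whatever the editor form expects.

```typescript
// ISO 15924 codes: Hans = simplified Han, Hant = traditional Han, Hani = generic Han.
type HanScript = 'Hans' | 'Hant' | 'Hani';

// Assumed mappings, for illustration only.
const TRADITIONAL_COUNTRIES = ['TW', 'HK', 'MO'];
const SIMPLIFIED_COUNTRIES = ['CN', 'SG'];

// Guess the Han variant from the ISO 3166-1 country codes of the release events.
// Falls back to the generic option when the events are missing, unrelated, or conflicting.
function guessHanVariant(releaseCountries: string[]): HanScript {
    const codes = releaseCountries.map((code) => code.toUpperCase());
    const traditional = codes.some((code) => TRADITIONAL_COUNTRIES.indexOf(code) !== -1);
    const simplified = codes.some((code) => SIMPLIFIED_COUNTRIES.indexOf(code) !== -1);

    if (traditional && !simplified) return 'Hant';
    if (simplified && !traditional) return 'Hans';
    return 'Hani';
}
```

Falling back to the generic option on any ambiguity keeps the behaviour no worse than what the PR does today.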

@ROpdebee
Owner Author

/deploy-preview

github-actions bot added a commit that referenced this pull request Jun 19, 2022
feat(guess lang): script to guess language and script from tracklist (#502)
@github-actions

This PR changes 1 built userscript(s):

@codecov

codecov bot commented Jun 19, 2022

Codecov Report

Merging #502 (96acc45) into main (9d08206) will decrease coverage by 3.08%.
The diff coverage is 0.00%.

@@             Coverage Diff             @@
##              main     #502      +/-   ##
===========================================
- Coverage   100.00%   96.91%   -3.09%     
===========================================
  Files           53       55       +2     
  Lines         1222     1266      +44     
  Branches       194      200       +6     
===========================================
+ Hits          1222     1227       +5     
- Misses           0       38      +38     
- Partials         0        1       +1     

Impacted Files                             Coverage Δ
src/lib/util/format.ts                     80.00% <0.00%> (-20.00%) ⬇️
src/mb_guess_language/libretranslate.ts    0.00% <0.00%> (ø)
src/mb_guess_language/script.ts            0.00% <0.00%> (ø)
src/mb_caa_dimensions/Image.ts             98.50% <0.00%> (-1.50%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9d08206...96acc45.

@ROpdebee
Owner Author

ROpdebee commented Aug 8, 2022

chaban has a bunch of feedback at https://community.metabrainz.org/t/ropdebees-userscripts-support-thread/551947/90?u=ropdebee which I still need to process.

@ROpdebee ROpdebee marked this pull request as draft April 23, 2023 13:52