Bib 809 machine translation scoring #620

Open
wants to merge 29 commits into
base: master
from

Conversation

kaseywright
Contributor

This PR addresses the need to evaluate the effectiveness of machine translations and human edits by measuring the similarity between two versions of a resource. To do this, it uses the Levenshtein distance: the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.
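For readers unfamiliar with the algorithm, here is a minimal sketch of Levenshtein distance using the standard two-row dynamic-programming approach. It is not the implementation from this PR; the class name `LevenshteinSketch` and method name `Distance` are made up for illustration, and it compares UTF-16 code units (`char`) rather than user-perceived characters.

```csharp
using System;

// Illustrative only: not the implementation contained in this PR.
public static class LevenshteinSketch
{
    // Returns the minimum number of single-character insertions, deletions,
    // or substitutions needed to turn `source` into `target`.
    public static int Distance(string source, string target)
    {
        if (source.Length == 0) return target.Length;
        if (target.Length == 0) return source.Length;

        // Only two rows of the DP table are kept, so memory is O(target.Length).
        var previous = new int[target.Length + 1];
        var current = new int[target.Length + 1];

        // Turning an empty prefix into the first j characters takes j insertions.
        for (var j = 0; j <= target.Length; j++) previous[j] = j;

        for (var i = 1; i <= source.Length; i++)
        {
            // Turning the first i characters into an empty string takes i deletions.
            current[0] = i;

            for (var j = 1; j <= target.Length; j++)
            {
                var substitutionCost = source[i - 1] == target[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1,          // deletion
                             current[j - 1] + 1),      // insertion
                    previous[j - 1] + substitutionCost // substitution or match
                );
            }

            // The row just computed becomes the "previous" row for the next pass.
            (previous, current) = (current, previous);
        }

        return previous[target.Length];
    }
}
```

Time is proportional to the product of the two lengths, which is presumably why long resources need the extra utility methods mentioned below.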

The PR includes a Levenshtein Distance implementation and the necessary utility methods to effectively score long resources. A new queue, generate-resource-content-similarity-score, has been created, along with a corresponding message, publisher, and subscriber. The message will contain information about the resource content versions being compared, including the type of comparison being performed, as designated by one of the values in the ResourceContentVersionSimilarityComparisonTypes enum. This allows the subscriber to run the appropriate logic for the resource content types being compared.
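As a rough illustration of the message-driven design described above, the sketch below shows a message carrying a comparison type and a subscriber-side switch on it. Everything except the enum's name is an assumption: the enum members, the record `GenerateSimilarityScoreMessageSketch`, its fields, and the `SelectComparison` helper are hypothetical and do not come from the PR.

```csharp
using System;

// Hypothetical members: the PR names this enum, but the excerpt does not list its values.
public enum ResourceContentVersionSimilarityComparisonTypes
{
    MachineTranslationToSnapshot,
    SnapshotToSnapshot
}

// Hypothetical message shape for the generate-resource-content-similarity-score queue.
public sealed record GenerateSimilarityScoreMessageSketch(
    int BaseVersionId,
    int CompareVersionId,
    ResourceContentVersionSimilarityComparisonTypes ComparisonType);

public static class SimilarityScoreSubscriberSketch
{
    // Chooses the comparison to run from the message's comparison type.
    // This returns a label only; a real subscriber would load both texts and score them.
    public static string SelectComparison(GenerateSimilarityScoreMessageSketch message) =>
        message.ComparisonType switch
        {
            ResourceContentVersionSimilarityComparisonTypes.MachineTranslationToSnapshot
                => $"machine translation {message.BaseVersionId} vs snapshot {message.CompareVersionId}",
            ResourceContentVersionSimilarityComparisonTypes.SnapshotToSnapshot
                => $"snapshot {message.BaseVersionId} vs snapshot {message.CompareVersionId}",
            // Fail loudly on values this subscriber does not know how to handle.
            _ => throw new ArgumentOutOfRangeException(
                nameof(message), message.ComparisonType, "Unsupported comparison type")
        };
}
```

Putting the comparison type in the message keeps the publisher unaware of how each comparison is performed; only the subscriber needs to change when a new type is added.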


linear bot commented Dec 20, 2024

Set cancellation token to none on Publisher endpoint
resourceContentVersionMachineTranslationId felt a bit redundant as all resources are 'resource content versions'. So, I added 'translation' to be consistent with DB naming
Move publisher services to that new class
@kaseywright
Contributor Author

Bump @jwinston-bn @NateMerritt

await _dbContext.SaveChangesAsync(ct);
}

private async Task<ResourceContentVersionSimilarityScoreEntity> GenerateResourceContentVersionSimilarityScoreEntity(
Contributor

I generally would like to have the Async suffix on any async methods for consistency (like ProcessAsync above).

As a personal preference, I wouldn't typically include types in method names where they can be implied by the return type. It would be like naming the method GenerateSimilarityScoreInt if the score were an int. I would just call it GenerateSimilarityScoreAsync.

return await dbContext
.ResourceContentVersionMachineTranslations
.AsTracking()
.Join(
Contributor

Would like to see if we can get this using the object references, rather than using this Join operator.

x => x.ResourceContentVersion.ResourceContentVersionSnapshots
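To make the suggestion concrete, here is a small sketch that contrasts an explicit `Join` with traversal of object references. It uses LINQ to Objects over in-memory stand-ins so it runs without a database; the classes `MachineTranslation`, `ResourceContentVersion`, and `Snapshot`, and the `Id` and foreign-key properties, are simplified assumptions rather than the real entities. Only the `ResourceContentVersion` and `ResourceContentVersionSnapshots` navigation names come from the comment above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified, hypothetical stand-ins for the real entities.
public class Snapshot
{
    public int Id { get; set; }
    public int ResourceContentVersionId { get; set; }
}

public class ResourceContentVersion
{
    public int Id { get; set; }
    public List<Snapshot> ResourceContentVersionSnapshots { get; set; } = new();
}

public class MachineTranslation
{
    public int Id { get; set; }
    public int ResourceContentVersionId { get; set; }
    public ResourceContentVersion ResourceContentVersion { get; set; } = new();
}

public static class NavigationSketch
{
    // Explicit Join: pairs rows by matching the foreign key on both sides.
    public static List<(int MachineTranslationId, int SnapshotId)> WithJoin(
        IEnumerable<MachineTranslation> machineTranslations,
        IEnumerable<Snapshot> snapshots) =>
        machineTranslations
            .Join(
                snapshots,
                mt => mt.ResourceContentVersionId,
                s => s.ResourceContentVersionId,
                (mt, s) => (mt.Id, s.Id))
            .ToList();

    // Object references: follow the navigation properties and flatten the collection.
    public static List<(int MachineTranslationId, int SnapshotId)> WithNavigation(
        IEnumerable<MachineTranslation> machineTranslations) =>
        machineTranslations
            .SelectMany(
                x => x.ResourceContentVersion.ResourceContentVersionSnapshots,
                (mt, s) => (mt.Id, s.Id))
            .ToList();
}
```

With EF Core, a `SelectMany` over navigation properties like this is normally translated to a SQL join, provided the relationships are configured on the model, so the query text stays shorter without changing what is fetched.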

3 participants