Added multi_pivot_paraphrases_generation transformation #252
Thanks for the submission! Can you please make sure this is not a duplicate transformation? For instance, #94 already creates paraphrases. Should we merge the two?
@AudayBerro ping!
Okay, this seems to be a great transformation and should be added to NL Augmenter. Here are a few comments. It would be great if you could address them, and we would be happy to merge.
## What type of a transformation is this?
This transformation generates paraphrases of natural English sentences by leveraging pivot-translation techniques. Pivot translation makes it possible to obtain lexically and syntactically diverse paraphrases.
As @mille-s mentioned, you might want to specify how this is different from earlier PRs.
Thank you for this question.
In general, to generate paraphrases we follow a data-flow principle, splitting the process into two main steps:
a) candidate over-generation: generate as many candidates as possible using pivot-translation techniques through a predefined set of pivot languages
b) candidate selection: semantically irrelevant paraphrase candidates are removed from the final list using cosine similarity scores.
To summarize, here are the differences:
- We generate paraphrases by pivot translation using two configurable pivot levels (1-pivot means one pivot language, 2-pivot means two pivot languages); the user chooses the level of paraphrase generation up front. E.g. 1-level => English-Italian-English || 2-level => English-Chinese-Russian-English.
- What makes our work different from others is that we use a manually defined list of pivot languages so that the sentences are more distinct and semantically related to the reference sentence. The languages were selected according to two criteria: a) the more the grammar of the pivot language differs from the source language (English in our case), the more syntactic diversity we get; b) the closer the grammar of the pivot language is to the source language, the more lexical diversity and semantic relatedness we get. In short, we don't use a single pivot language as in the other works; instead we use the entire list of predefined languages at the selected pivot level. E.g. if you choose the 1-pivot level, the paraphrases are generated by translating the sentence into each language in the list in turn and translating it back to English.
- Generating paraphrases is not enough: we should ensure that the paraphrase candidates are semantically related to the reference sentence, since the machine translation engine may generate duplicate and semantically unrelated sentences. To ensure that the result is of high quality (in our case, semantically related to the reference sentence), we need a quality control step, which can run during or after paraphrase generation. In our transformation we apply quality control after generation by computing the cosine similarity between the embedding vectors of the reference sentence and the candidate paraphrase. We support Universal Sentence Encoder (USE) embeddings; other embedding models such as BERT and ELMo could be added, but due to time constraints we used USE.
- Candidate selection, as mentioned in 3, is configurable: the user can choose whether or not to apply it after generation.
- The semantic relatedness threshold is configurable and can be changed; in our work we used a minimum score of 0.5 (if the cosine score is lower than 0.5, the candidate is considered semantically unrelated).
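For readers following the discussion, the over-generation step described above could be sketched roughly as follows. This is not the code from the PR: `pivot_chains`, `over_generate`, and `fake_translate` are hypothetical names, the translator is a stub standing in for a real MT model, and enumerating every ordered pair of pivot languages at the 2-pivot level is an assumption, since the comment does not say how 2-pivot chains are chosen.

```python
from itertools import permutations
from typing import Callable, List

# Hypothetical translator signature: (text, source_lang, target_lang) -> text.
Translate = Callable[[str, str, str], str]


def pivot_chains(pivot_languages: List[str], level: int) -> List[List[str]]:
    """Return every ordered chain of `level` distinct pivot languages.

    Assumption: at level 2 all ordered pairs are used; the PR may pick
    chains differently.
    """
    if level not in (1, 2):
        raise ValueError("pivot level must be 1 or 2")
    return [list(chain) for chain in permutations(pivot_languages, level)]


def over_generate(
    sentence: str,
    pivot_languages: List[str],
    level: int,
    translate: Translate,
    source: str = "en",
) -> List[str]:
    """Translate `sentence` through each pivot chain and back to `source`."""
    candidates = []
    for chain in pivot_chains(pivot_languages, level):
        hops = [source, *chain, source]  # e.g. en -> zh -> ru -> en
        text = sentence
        for src, tgt in zip(hops, hops[1:]):
            text = translate(text, src, tgt)
        candidates.append(text)
    return candidates


def fake_translate(text: str, src: str, tgt: str) -> str:
    """Stub translator that only records the target of each hop."""
    return f"{text}>{tgt}"


if __name__ == "__main__":
    # One candidate per pivot language at level 1.
    print(over_generate("How does COVID-19 spread?", ["it", "zh", "ru"], 1, fake_translate))
```

In a real setting `translate` would wrap an actual machine translation model, and the resulting candidates would then go through the candidate selection step.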
scikit-learn
tensorflow
tensorflow-hub
transformers
I think most of these libraries are present in the main requirements.txt file. You might want to skip adding them.
you are right
TaskType.QUESTION_GENERATION,
TaskType.TEXT_TO_TEXT_GENERATION
]
languages = ["en"]
Please add the relevant keywords here.
Also, at the end you can add a heavy=True parameter.
You are right. I will add the lexical and syntactical keywords.
"inputs": {
"Reference sentence": "How does COVID-19 spread?"
},
"outputs": [
Honestly, these examples look great. :) I would suggest that you also perform the robustness evaluation for your transformation (or at least in a separate PR).
For the evaluation, I wrote to you in an email that I can't get the evaluation script to work properly; I've tried several times and I always hit the same problem.
The problem was a dependency conflict: the runtime environment was not able to download the suggested version of the transformers package.
return response

if __name__ == '__main__':
This could be commented or deleted.
ok
@AudayBerro would you like to address the above comments?
Hi,
I will try to address them during the weekend.
Regards
Auday Berro
On Thu, Oct 28, 2021 at 22:52, Kaustubh Dhole ***@***.***> wrote:
… @AudayBerro <https://github.com/AudayBerro> would you like to address the
above comments?
Dear Kaustubh,
I have resolved the comments. I pushed it on the *paraphraser* branch.
Regards
Auday Berro
This transformation generates a list of paraphrases of an English sentence in two steps:
1- Candidate over-generation by leveraging pivot-translation techniques, translating the sentence through a curated list of languages using the Hugging Face MarianMT and UKPLab EasyNMT machine translation models.
2- Candidate selection. After over-generation, the list may contain some semantically unrelated or duplicated paraphrases. This step filters them out of the final list using the Universal Sentence Encoder (USE) embedding model. The idea is to compute the cosine similarity between the USE embeddings of the reference sentence and the candidate paraphrase. If the score is below 0.5, the candidate is considered semantically unrelated to the reference sentence; if the score is above 0.95, the candidate is considered a duplicate of the reference; if 0.5 < score < 0.95, the candidate is accepted.
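The selection rule in step 2 could be sketched roughly as below. This is an illustration, not the PR's implementation: `select_candidates`, `cosine_similarity`, and the toy `embed` lookup are hypothetical, and the toy 2-dimensional vectors stand in for real Universal Sentence Encoder embeddings. Scores exactly equal to 0.5 or 0.95 are kept here, which is an assumption because the description leaves the boundaries open.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical embedding function: sentence -> vector.
Embed = Callable[[str], Sequence[float]]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_candidates(
    reference: str,
    candidates: List[str],
    embed: Embed,
    low: float = 0.5,    # below this: semantically unrelated
    high: float = 0.95,  # above this: duplicate of the reference
) -> List[str]:
    """Keep candidates whose similarity to the reference is within [low, high]."""
    ref_vec = embed(reference)
    kept: List[str] = []
    seen = set()
    for cand in candidates:
        if cand in seen:  # drop exact repeats among the candidates
            continue
        seen.add(cand)
        score = cosine_similarity(ref_vec, embed(cand))
        if low <= score <= high:  # boundary handling is an assumption
            kept.append(cand)
    return kept


# Toy embeddings for illustration only; a real run would use USE vectors.
TOY_VECTORS: Dict[str, Sequence[float]] = {
    "reference": [1.0, 0.0],
    "near copy": [1.0, 0.0],   # similarity 1.0, rejected as a duplicate
    "paraphrase": [0.8, 0.6],  # similarity about 0.8, accepted
    "unrelated": [0.0, 1.0],   # similarity 0.0, rejected as unrelated
}


def embed(sentence: str) -> Sequence[float]:
    return TOY_VECTORS[sentence]


if __name__ == "__main__":
    print(select_candidates("reference", ["near copy", "paraphrase", "unrelated"], embed))
```

Keeping `low` and `high` as parameters mirrors the statement in the conversation that the relatedness threshold is configurable.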