Hi @xD0135, I faced this issue too a while ago. I left it as it is because the solution would be domain-specific. As you mentioned, implementing the Rule interface can be the solution.
If you create a whitelist of tokens that skip this check and are kept as whole tokens, that could work. However, I think this would be too domain-specific for this repo.
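To make the whitelist idea concrete, here is a minimal standalone sketch. It does not use this package's API: `knownAbbreviations`, `endsSentence` and `splitSentences` are hypothetical names, and the whitelist entries are only examples. It assumes the text is first split on whitespace, so a period inside a token such as `U.S.A` is never treated as a boundary, and a whitelisted token such as `No.` never closes a sentence.

```go
package main

import (
	"fmt"
	"strings"
)

// knownAbbreviations is a hypothetical, domain-specific whitelist.
// Tokens listed here never end a sentence, even with a trailing period.
var knownAbbreviations = map[string]bool{
	"No.":    true,
	"U.S.A.": true,
}

// endsSentence reports whether a token ends with sentence punctuation.
func endsSentence(token string) bool {
	return strings.HasSuffix(token, ".") ||
		strings.HasSuffix(token, "!") ||
		strings.HasSuffix(token, "?")
}

// splitSentences splits text on whitespace and closes a sentence at every
// token that ends with sentence punctuation, unless it is whitelisted.
func splitSentences(text string, abbreviations map[string]bool) []string {
	var sentences []string
	var current []string
	for _, token := range strings.Fields(text) {
		current = append(current, token)
		if abbreviations[token] {
			continue // whitelisted abbreviation: keep it inside the sentence
		}
		if endsSentence(token) {
			sentences = append(sentences, strings.Join(current, " "))
			current = nil
		}
	}
	if len(current) > 0 {
		// Trailing text without closing punctuation is kept as a sentence.
		sentences = append(sentences, strings.Join(current, " "))
	}
	return sentences
}

func main() {
	text := "Made in the U.S.A and shipped. Find us at No. 7 today."
	for _, sentence := range splitSentences(text, knownAbbreviations) {
		fmt.Println(sentence)
	}
}
```

A whitelisted abbreviation that really does end a sentence would be merged with the next one, which is one reason the list has to be tuned per domain.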
Or the sentence separator list in the Rule could have ". " or ".\n" instead of ".". But in this case, not all texts would be parsed well. I would need to know the general usage of this package. If the text usually originates from emails, forums, or chats, then changing the sentence separator could work. But if the text comes from parsed books, it could break the tokenization.
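The separator variant could look roughly like the sketch below. It is also standalone and not this package's implementation: `isSentenceBoundary` and `splitOnPeriodSpace` are hypothetical helpers that treat a period as a boundary only when whitespace or the end of the text follows it.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isSentenceBoundary reports whether the rune at index i is a period that is
// followed by whitespace or by the end of the text.
func isSentenceBoundary(runes []rune, i int) bool {
	if runes[i] != '.' {
		return false
	}
	if i == len(runes)-1 {
		return true
	}
	return unicode.IsSpace(runes[i+1])
}

// splitOnPeriodSpace cuts the text after every period that qualifies as a
// boundary and trims surrounding whitespace from each piece.
func splitOnPeriodSpace(text string) []string {
	runes := []rune(text)
	var sentences []string
	start := 0
	for i := range runes {
		if !isSentenceBoundary(runes, i) {
			continue
		}
		if s := strings.TrimSpace(string(runes[start : i+1])); s != "" {
			sentences = append(sentences, s)
		}
		start = i + 1
	}
	// Keep any trailing text that has no closing period.
	if tail := strings.TrimSpace(string(runes[start:])); tail != "" {
		sentences = append(sentences, tail)
	}
	return sentences
}

func main() {
	for _, sentence := range splitOnPeriodSpace("Made in the U.S.A today. See No. 7") {
		fmt.Printf("%q\n", sentence)
	}
}
```

With this rule `U.S.A` stays in one piece, but `No. 7` is still split after `No.` because a space follows the period, so this variant alone does not cover every abbreviation.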
Hi, first of all thanks for this library, you are awesome 🚀
I'm having an issue ranking text that contains abbreviations such as `U.S.A` (short for United States of America) or `No. 7` (short for Number 7), as the `.` is currently used here https://github.com/DavidBelicza/TextRank/blob/master/parse/rule.go#L21 to set the bounds of words.

Do you currently have a way to get around this problem? Or should I simply create a new rule implementing the `Rule` interface that checks for known abbreviations?