The Segmentation tool you provide is excellent. One feature request:
Unless I am mistaken, the tool always (1) returns the split words in lower case and (2) does not indicate where spaces were inserted. A preserve_case or capitalize parameter would help with (1). The following code restores the capitalization used in the hashtag to the split string.
from ekphrasis.classes.segmenter import Segmenter

segmenter = Segmenter(corpus="twitter")

def word_segmentation(text, fix_case=True):
    words_string = segmenter.segment(text)
    if not fix_case:
        return words_string
    fixed = ""
    n_add = 0  # number of spaces the segmenter has inserted so far
    for i in range(len(words_string)):
        # words_string[i] lines up with text[i - n_add]
        if words_string[i] == " " and text[i - n_add] != " ":
            # space inserted by the segmenter, not present in the input
            n_add += 1
            fixed += " "
            continue
        is_capital = text[i - n_add].isupper()
        if is_capital:
            fixed += words_string[i].upper()
        else:
            fixed += words_string[i]
    return fixed
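Point (2), where the spaces were inserted, can be recovered with the same index alignment. Below is a minimal sketch that does not depend on ekphrasis. It assumes the segmented string is the original lowercased with extra spaces inserted and nothing else changed. `align_segmentation` is a hypothetical helper name, not part of the library.

```python
def align_segmentation(original, segmented):
    """Return (case-restored string, insertion indices).

    Each index is the position in `original` before which the segmenter
    inserted a space. Assumes `segmented` differs from `original` only by
    lowercasing and inserted spaces (an assumption, not a library guarantee).
    """
    restored = []
    inserted_at = []
    j = 0  # current position in `original`
    for ch in segmented:
        if ch == " " and (j >= len(original) or original[j] != " "):
            # Space added by the segmenter: record where it falls.
            restored.append(" ")
            inserted_at.append(j)
            continue
        if j >= len(original) or original[j].lower() != ch.lower():
            raise ValueError("segmented text does not align with the original")
        restored.append(original[j])  # copy the character with its original case
        j += 1
    return "".join(restored), inserted_at


print(align_segmentation("IranProtests", "iran protests"))
# -> ('Iran Protests', [4])
```

Returning the indices alongside the string would let a caller map the segmented words back onto the original token.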
Of course, if the user writes hashtags in camelCase or PascalCase, the capitalization may not be meaningful, but in other cases it can be. For instance:
I #eatsomuch food --> I eat so much food.
I care so much. #IranProtests --> I care so much. Iran Protests
Arguably, a stand-alone hashtag often refers to a proper noun, in which case the original capitalization is meaningful.
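To illustrate the sentence-level behaviour in the examples above, here is a rough sketch that expands every hashtag in a sentence while keeping its original capitalization. `expand_hashtags` and `toy_segment` are hypothetical names. The toy segmenter only stands in for a real one such as `segmenter.segment`, and the sketch assumes the segmenter lowercases and inserts spaces without changing any other character.

```python
import re


def expand_hashtags(text, segment):
    """Replace each #hashtag in `text` with its segmented, case-preserving form.

    `segment` is any callable that maps a string to a lowercased,
    space-separated segmentation of the same characters.
    """
    def _expand(match):
        tag = match.group(1)
        out, j = [], 0
        for ch in segment(tag):
            if ch == " ":
                out.append(" ")      # space inserted by the segmenter
            else:
                out.append(tag[j])   # copy the character with its original case
                j += 1
        return "".join(out)

    # \w+ never contains spaces, so every space in the output was inserted.
    return re.sub(r"#(\w+)", _expand, text)


# Toy stand-in for a real segmenter, used only to make the example runnable.
_TOY_SPLITS = {"eatsomuch": "eat so much", "iranprotests": "iran protests"}


def toy_segment(s):
    return _TOY_SPLITS.get(s.lower(), s.lower())


print(expand_hashtags("I #eatsomuch food", toy_segment))
# -> I eat so much food
print(expand_hashtags("I care so much. #IranProtests", toy_segment))
# -> I care so much. Iran Protests
```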