extend lemmatization #34

dschwalm · 2022-01-26T08:09:06Z


Hello,

I have a question regarding lemmatization.
For the input 'macskás könyvek' I expected the lemmas 'macska' and 'könyv'. Instead, 'macskás' was not lemmatized to 'macska'.
'macskát', 'macskák', 'macskával' could be lemmatized successfully.

Is there a way to enhance the lemmatizer so that it supports this grammatical structure (e.g. 'macskás', 'lovas', 'havas')?
I would gladly contribute to the implementation.

Thanks,
Daniel

ps. my code:

import spacy

nlp = spacy.load("hu_core_news_lg")
doc = nlp("macskás könyvek")
for token in doc:
    print(token, token.lemma, token.lemma_)

Output:
macskás 8631042501551719371 macskás
könyvek 2221763149769740023 könyv

oroszgy · 2022-01-26T09:42:46Z

Maintainer

@dschwalm, thanks for your feedback! The lemmatizer model we use is a machine-learning-based solution, although a fairly simplistic one. This means that there is no straightforward way to handle a specific grammatical case inside this system.
The good news is that we are actively developing a new, better lemmatizer, which we hope to release soon.
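In the meantime, one possible stopgap outside the model is a post-processing lookup that replaces the predicted lemma for known problem forms. The sketch below is only an illustration, not part of HuSpaCy: `LEMMA_OVERRIDES` and `apply_overrides` are hypothetical names, and the table entries simply encode the lemmas the question asks for.

```python
# Hypothetical override table: lowercased surface form -> desired lemma.
# The entries reflect the lemmas requested in the question, not HuSpaCy output.
LEMMA_OVERRIDES = {
    "macskás": "macska",
    "lovas": "ló",
    "havas": "hó",
}


def apply_overrides(pairs, overrides=LEMMA_OVERRIDES):
    """Return (text, lemma) pairs, replacing the lemma where an override exists."""
    return [(text, overrides.get(text.lower(), lemma)) for text, lemma in pairs]


# (text, lemma) pairs as printed in the output above.
predicted = [("macskás", "macskás"), ("könyvek", "könyv")]
print(apply_overrides(predicted))
```

With spaCy, the input pairs could be built as `[(token.text, token.lemma_) for token in doc]`. A lookup like this ignores context, so it would also rewrite forms that the model lemmatizes differently on purpose.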

2 replies

dschwalm Jan 26, 2022
Author

Thanks for your response. So there is no straightforward way to further train this model, right? E.g. by manually labeling, let's say, 500-1000 examples for the structures above.
Sorry for the basic questions; I have only a very rough idea of how these models are developed.

If there is any way I can contribute, any entry-level tasks, please feel free to let me know.
In the meantime I can't wait for the enhanced lemmatizer :)

oroszgy Jan 27, 2022
Maintainer

Thanks for being willing to annotate :). However, I think the new model must come before this effort. What is more, the Szeged Corpus and the NerKor Corpus contain hundreds of thousands of manually lemmatized words, and I am pretty sure they include such cases.

Lemmatization is a high-priority task for the project, so I really hope that we will be able to improve this component soon.

oroszgy · 2022-03-25T07:41:21Z

Maintainer

@dschwalm there is a new, improved lemmatizer in the works. Expect a release within weeks. I hope this will solve your problem.

7 replies

dschwalm Apr 4, 2022
Author

Hi,

I tried it out.
Computation speed on GPU is fine; however, the lemmatizer accuracy is still below my expectations, at least for the example below.
For other examples it is quite accurate, though, e.g. fenekestül -> fenék, pontjáról -> pont, etc.
Any ideas why 'havon' -> 'haov' or 'köve' -> 'kööv'?
Maybe this grammatical structure ('hó' -> 'hav...', 'kő' -> 'köv...', 'ló' -> 'lov...') is not handled properly at all?

`doc = nlp("Macskás könyvek a lovaknak. A havon ott a bölcsek köve.")

for token in doc:
    print(token, token.lemma, token.lemma_)`

Macskás 18159704862610648674 Macskás
könyvek 2221763149769740023 könyv
a 11901859001352538922 a
lovaknak 6940632298909906175 loov
. 12646065887601541794 .
A 11901859001352538922 a
havon 7344964344423818101 haov
ott 14514536082146256733 ott
a 11901859001352538922 a
bölcsek 11169626285022315233 bölcs
köve 1032765474721085731 kööv
. 12646065887601541794 .
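To keep track of such cases across releases, the output above could be turned into a small regression check. This is a minimal sketch in plain Python: `lemma_errors` is a hypothetical helper, the predicted lemmas are copied from the output above, and the expected lemmas are the ones named in this thread ('könyv', 'ló', 'hó', 'kő').

```python
# Expected lemmas, taken from the expectations stated in this thread.
EXPECTED = {"könyvek": "könyv", "lovaknak": "ló", "havon": "hó", "köve": "kő"}

# Predicted lemmas, copied from the output above.
PREDICTED = {"könyvek": "könyv", "lovaknak": "loov", "havon": "haov", "köve": "kööv"}


def lemma_errors(predicted, expected):
    """Return {token: (predicted_lemma, expected_lemma)} for every mismatch."""
    return {
        tok: (predicted.get(tok), gold)
        for tok, gold in expected.items()
        if predicted.get(tok) != gold
    }


errors = lemma_errors(PREDICTED, EXPECTED)
accuracy = 1 - len(errors) / len(EXPECTED)
for tok, (got, gold) in errors.items():
    print(f"{tok}: got {got!r}, expected {gold!r}")
print(f"accuracy: {accuracy:.2f}")
```

Replacing `PREDICTED` with `{token.text: token.lemma_ for token in doc}` would rerun the same check against a newer model.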

oroszgy Apr 4, 2022
Maintainer

Thanks for your input, we'll look into these cases.

dschwalm Apr 4, 2022
Author

Thanks.
Oh, and I forgot to mention that 'macskás' is still lemmatized as 'macskás' :)

oroszgy Apr 7, 2022
Maintainer

"Macskás" should be OK as it is, since it is an adjective in this context; however, the other cases are strange indeed.

oroszgy Mar 22, 2023
Maintainer

@dschwalm There is some progress regarding the lemmatization component, although I must admit it is still not perfect. For the most recent results, you can check this notebook https://colab.research.google.com/drive/1e12X8lpU5NNCJPUKrtfj9sONUmBONP5h
