Use company suffix abbreviations to label company entities #19

Open
jokull opened this issue May 15, 2020 · 2 comments

Comments

@jokull
Contributor

jokull commented May 15, 2020

I’m writing some code to detect mentions of companies. The corpus uses the ehf/of/sf/etc. suffixes so that’s a strong indicator for me, and potentially for Greynir too.

I know that the Greynir website has an entity recognizer, but it seems quite strongly coupled to the database. Is there a case for bintokenizer to adopt a new token type? Or perhaps for Greynir to become company-entity aware?

I have some interesting examples of company names if that’s useful. I’m currently using an imperfect regex to match company names and then using Greynir to go back to the indefinite form.

  • Miðbæjarhótel/Centerhotels ehf.
  • Reitir - hótel ehf.
  • 105 Miðborg slhf.
  • Faxaflóahafnir sf.
  • Bjarg íbúðafélag hses.
  • Efstaleitis Apótek ehf.
  • Íþrótta- og sýningahöllin hf.
  • V-16 ehf.

These are the suffixes I’ve come across:

  • ehf.
  • slhf.
  • sf.
  • hses.
  • hf.
  • ohf.
  • bs.
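
A minimal sketch of the suffix check described above, assuming only the suffixes listed here. This is not the regex I actually use and not part of Greynir; `COMPANY_SUFFIXES`, `find_suffixes` and `split_company_name` are hypothetical names for illustration. It only detects the suffix token, since finding where the company name starts is the imperfect part.

```python
import re

# Suffixes taken from the list above; an assumption, not an exhaustive registry.
COMPANY_SUFFIXES = ("ehf", "slhf", "sf", "hses", "hf", "ohf", "bs")

# A suffix must start a whitespace-delimited token and end with a period.
SUFFIX_RE = re.compile(r"(?<!\S)(?:" + "|".join(COMPANY_SUFFIXES) + r")\.")


def find_suffixes(text):
    """Return every company suffix token found in the text."""
    return [m.group(0) for m in SUFFIX_RE.finditer(text)]


def split_company_name(candidate):
    """Split 'Name suffix.' into (name, suffix), or return None."""
    name, _, last = candidate.strip().rpartition(" ")
    if name and SUFFIX_RE.fullmatch(last):
        return name, last
    return None
```

For example, `split_company_name("105 Miðborg slhf.")` separates the name from the suffix, while a sentence that merely ends in a word like `bréf.` yields no suffix match.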
@jokull
Contributor Author

jokull commented Aug 17, 2020

Better company name tokenization was added recently, but I noticed there is no attempt to detect lemmas. At least being able to get to the indefinite form of company names (from Veitna to Veitur) makes sense. Reducing plural names to the singular, and other word form changes, may not be as useful.

@jokull
Contributor Author

jokull commented Aug 18, 2020

Here’s what I’m thinking.

>>> from planitor import greynir
>>> greynir.parse_single('Bréfið barst loksins til Veitna ohf.').lemmas
['bréf', 'bera', 'loksins', 'til', 'Veitna ohf.']
>>> 

Here the correct lemma would be Veitur ohf., but it is perhaps not a priority for this project to attempt lemmatizing company and/or entity names. I’m doing it manually with a convoluted matching algorithm, so I can take a sentence like that and link companies to company pages. It works OK.
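
To illustrate the kind of linking I mean, here is a rough sketch that swaps in plain fuzzy matching from the standard library's `difflib`; it is not my actual algorithm and not anything Greynir provides. `KNOWN_COMPANIES` is a hypothetical registry of names in their base form, and `nominative_company_name` and the `cutoff` value are assumptions for illustration.

```python
import difflib

# Hypothetical registry of company names in their base (nominative) form.
KNOWN_COMPANIES = ["Veitur ohf.", "Faxaflóahafnir sf.", "Reitir - hótel ehf."]


def nominative_company_name(mention, registry=KNOWN_COMPANIES, cutoff=0.7):
    """Map an inflected mention to the closest registry name, or None."""
    # Only consider registry names that carry the same suffix as the mention.
    suffix = mention.rsplit(" ", 1)[-1]
    candidates = [name for name in registry if name.endswith(" " + suffix)]
    # Pick the single most similar candidate above the similarity cutoff.
    matches = difflib.get_close_matches(mention, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With this registry, the mention `Veitna ohf.` from the parse above resolves to `Veitur ohf.`. String similarity is a crude stand-in for real inflection handling, so short names or heavy stem changes would need something smarter.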
