data corrections & additions management #6
Comments
I think that the solution you proposed in the PR is correct. Let the user decide which files to load, and provide a "default" version that is the latest to date. Through my work I was somewhat aware of the update made by LEI, but it came at a bad time and I forgot to update the package. I also think that the "cleaned" dataset should ultimately disappear. I will contact gleif to see if we can reach a compromise, or if we can add an indicator of position (some legal forms are always at the end, glued to the name, or on both sides).
Ok. I think it's going to be a long time until all the codes in the CSVs provided by gleif are up to date with adequate data, if ever. If for no other reason, then because some parts of the data are not considered so important by the organizations that update them (legal form name abbreviations, for example: many are missing). Yet some uses rely on this data. For example, the https://github.com/psolin/cleanco package uses the abbreviations information to help people determine base names of organizations. The term data definitions list of that package has about 150 unique abbreviations that do not exist in the CSV data provided by gleif. Even if a significant percentage of those are not valid (they might be old etc.), there are still many that are missing from gleif data (I've checked). So I am wondering if the Elf class could, for example, be extended so that it would support incorporating new or updated entries from sources other than the gleif CSV, at runtime. Or perhaps it'd be sufficient to simply be able to point to another CSV :) Perhaps I can submit an attempt at that in another PR.
I also wanted to add additional forms (as I'm working with a big database of companies, ~400M). But I had 2 issues:
For 1., how about adopting an extension code for new forms? The identifiers seem to conform to a
Regarding 2., yeah I guess that's always going to be an issue. Perhaps it can be mitigated by restricting the set of forms by jurisdiction, language etc. in those cases where that is known for the company names. But I don't think there can really be a perfect solution.
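To make the jurisdiction idea a bit more concrete, here is a minimal sketch. It is not how the package works today; the `FORMS` table, the `strip_legal_form` function and the company names are all made up for illustration, and real data would come from the gleif CSV (plus any additions).

```python
# Hypothetical abbreviation table: (country code, abbreviation).
# Illustrative only; not taken from the gleif CSV.
FORMS = [
    ("FI", "Oy"),
    ("FI", "Oyj"),
    ("DE", "GmbH"),
    ("DE", "AG"),
    ("NO", "AS"),
]


def strip_legal_form(name, country=None):
    """Remove a trailing legal form abbreviation from a company name.

    If `country` is given, only abbreviations registered for that
    country are considered, which avoids stripping a suffix that merely
    looks like a legal form from some other jurisdiction.
    """
    candidates = [abbr for cc, abbr in FORMS if country is None or cc == country]
    # Try the longest abbreviations first so "Oyj" wins over "Oy".
    for abbr in sorted(candidates, key=len, reverse=True):
        suffix = " " + abbr.lower()
        if name.lower().endswith(suffix):
            return name[: -len(suffix)].rstrip(" ,")
    return name


print(strip_legal_form("Helmi AS"))            # no country known: "AS" is stripped
print(strip_legal_form("Helmi AS", "FI"))      # Finnish forms only: name is kept
print(strip_legal_form("Example Oyj", "FI"))   # "Oyj" is stripped
```

This only handles forms at the end of the name, so it does not address forms that are glued to the name or appear on both sides.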
I have added some additional legal forms. They are based on a big mix of what I found on the Internet, what I found in the biggest database of companies (wink wink) and on Wikipedia. They may, of course, contain errors. Data is separated by country:
I will try to go through the dataset once again to enhance it. About the ELF identifier pattern, there are two mysterious lines:
ISO data provided by gleif is an incomplete work in progress whose quality currently leaves a lot to be desired. Currently, this package addresses the data issue by providing the original gleif version & a "cleaned" version.
How should data issues be handled going forward? Issues & PRs could be used for data updates of course, and gleif has its own 'challenge' process as it should - ultimately the question of data quality lies at ISO & gleif.
Processing updates in the package would take effort. gleif updates are not that frequent, and even if they were, someone would still need to add them to the package. It's also reasonable to presume that people need something more expedient that they have control over.
Would it therefore be a good idea to provide means to override the data easily? Perhaps by a custom "overrides" CSV file that users could provide if they want?
Thoughts? I can submit a PR for an override mechanism if that seems useful.
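As a rough sketch of what such an override mechanism could look like (assumptions: the column names, the codes `AAAA`, `BBBB` and `X001`, and the helper functions `load_rows` and `apply_overrides` are all made up for illustration and are not part of the package or the gleif data):

```python
import csv
import io


def load_rows(csv_text, key="ELF Code"):
    """Parse CSV text into a dict of rows keyed by the identifier column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def apply_overrides(base, overrides):
    """Return a new mapping where override rows update or extend base rows.

    Empty cells in an override row leave the base value untouched, so an
    overrides file only needs to fill in the columns it wants to change.
    """
    merged = {code: dict(row) for code, row in base.items()}
    for code, row in overrides.items():
        target = merged.setdefault(code, {})
        for column, value in row.items():
            if value:
                target[column] = value
    return merged


# Hypothetical base data, standing in for the CSV shipped with the package.
BASE_CSV = """ELF Code,Country Code,Local Name,Abbreviations
AAAA,FI,Osakeyhtiö,
BBBB,DE,Gesellschaft mit beschränkter Haftung,GmbH
"""

# Hypothetical user-supplied overrides: one fix and one new entry.
OVERRIDES_CSV = """ELF Code,Country Code,Local Name,Abbreviations
AAAA,,,Oy
X001,FI,Example form,Ex
"""

merged = apply_overrides(load_rows(BASE_CSV), load_rows(OVERRIDES_CSV))
print(merged["AAAA"]["Abbreviations"])
print(sorted(merged))
```

With this approach the original data stays untouched, and users keep their corrections in a small file of their own that they can also submit upstream later.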