Pipeline for automated KG generation #13

Open
stap-m opened this issue Oct 11, 2022 · 10 comments
@stap-m
Contributor

stap-m commented Oct 11, 2022

In the internal OvGU meeting with @adelmemariani, @fabianneuhaus, and myself, we developed a workflow for automated KG generation.
The task now is to establish the basic pipeline for this KG so that a first version can be created.
Semantic enrichment etc. should not be considered at this stage and will be addressed later.

[Image: KG and workflow diagram]

@fabianneuhaus

fabianneuhaus commented Oct 11, 2022

Just a minor comment concerning the RDF pattern in the diagram. I think it is unnecessarily complicated. I would suggest that the pattern should be along the following lines.

oekg:scenario123 a oeo:Scenario.
oekg:scenario123 xyz:has_IRI < address of website on OEP > .
oekg:scenario123 xyz:has_record oekg:table456.
oekg:table456 a xyz:Table.
oekg:table456 xyz:has_IRI < address of website on OEP > .
oekg:table456 is about oeo:entity.

I am not sure about the oeo:Scenario, xyz:has_record and xyz:Table entities. Firstly, are the tables associated with a scenario or a scenario projection? Secondly, depending on the answer to the first question, we need a relation that links it to an information entity, namely a table. It is probably a good idea to look at the OBI to see whether we can reuse a relation and a class from there. But regardless of whether we use oeo:Scenario, xyz:has_record and xyz:Table or some other IRIs, the pattern should be correct.

EDIT: Included the lines connecting scenario and table to the OEP. I am not sure which ontology term to use for xyz:has_IRI.
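
A minimal sketch of how the pattern above could be emitted as Turtle lines, using only the Python standard library. Everything beyond the pattern itself is an assumption for illustration: the function name `scenario_pattern`, the `xyz:is_about` property name (the pattern only says "is about"), and the example URLs are hypothetical placeholders, not agreed OEO or OEKG terms.

```python
def scenario_pattern(scenario_id, scenario_url, table_id, table_url, about_terms):
    """Return Turtle lines linking a scenario to one table and its is-about terms.

    The prefixes (oekg:, oeo:, xyz:) are assumed to be declared elsewhere;
    xyz: stands for a not-yet-decided namespace, as in the pattern above.
    """
    scenario = f"oekg:{scenario_id}"
    table = f"oekg:{table_id}"
    lines = [
        f"{scenario} a oeo:Scenario .",
        f"{scenario} xyz:has_IRI <{scenario_url}> .",
        f"{scenario} xyz:has_record {table} .",
        f"{table} a xyz:Table .",
        f"{table} xyz:has_IRI <{table_url}> .",
    ]
    # One is-about triple per OEO term found for this table.
    for term in about_terms:
        lines.append(f"{table} xyz:is_about oeo:{term} .")
    return lines


if __name__ == "__main__":
    # Placeholder URLs; the real addresses would point to pages on the OEP.
    for line in scenario_pattern(
        "scenario123", "https://example.org/scenario123",
        "table456", "https://example.org/table456",
        ["OEO_00050006"],
    ):
        print(line)
```

Keeping the pattern in one function means that swapping `oeo:Scenario`, `xyz:has_record` or `xyz:Table` for other IRIs later only touches a few strings.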

@adelmemariani
Contributor

adelmemariani commented Oct 11, 2022

Sometimes, datasets contain scenarios:

Screenshot 2022-10-11 at 21 41 41

Screenshot 2022-10-11 at 21 51 59

Also, a scenario usually has many datasets (as input: assumptions, model parameters, ...; as output: projections).
This makes it difficult to build a pipeline. Besides, dataset values are not easy to map to OEO concepts because users chose vague and abbreviated terms.

@stap-m
Contributor Author

stap-m commented Oct 12, 2022

Firstly, are the tables associated with a scenario or a scenario projection?

Yes. Currently the connection between tables and scenarios works mainly via the tags in the scenario schema, but in the future this link has to (also) be made via the factsheets/bundles.

Sometimes, datasets contain scenarios:

That means that there are tables that are used in more than one scenario. But that should be no problem, as long as the assignment also exists outside the tables, right?

@adelmemariani
Contributor

adelmemariani commented Oct 12, 2022

That means that there are tables that are used in more than one scenario. But that should be no problem, as long as the assignment also exists outside the tables, right?

That is also my question: whether or not we have an explicit connection (usable via APIs) between the scenario and its datasets. But 'tags' work for filtering in this case.

@fabianneuhaus

That means that there are tables that are used in more than one scenario. But that should be no problem, as long as the assignment also exists outside the tables, right?

No, it should be no problem. At least not for the "dumb and dirty" approach that we are currently following. Our approach consists of going through the content of all tables that are associated with a scenario projection. If an entry is either an OEO term or has been annotated by a third party with an OEO term, we use it as the object in an is-about triple. If it is something else, we try to automatically match it to an OEO term. (In the first approach this is done by simple string matching; at some later stage we can improve on that with more sophisticated approaches.) Since the names of scenarios won't be in the OEO, table entries that are names of other scenarios won't be matched and will thus be ignored. That's ok. Actually, I expect that most of the terms won't be automatically matchable to anything in the OEO, even if we use very sophisticated methods.
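
The three cases described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `classify_entries`, the shape of the inputs (plain dictionaries), and the identifier `OEO_HYPOTHETICAL_1` are made up for the example; only `OEO_00050006` (petajoule) comes from this thread.

```python
def classify_entries(entries, oeo_labels, annotations=None):
    """Decide, per table entry, which OEO term (if any) it is about.

    entries:     iterable of cell values (strings) from one table
    oeo_labels:  dict mapping OEO term id -> label
    annotations: dict mapping entry -> OEO term id (third-party annotations)
    Returns (matched, unmatched): a dict entry -> term id, and a list of
    entries for which no term was found (these are simply ignored).
    """
    annotations = annotations or {}
    # Case-insensitive lookup from label to term id for simple string matching.
    label_index = {label.strip().lower(): term_id
                   for term_id, label in oeo_labels.items()}
    matched, unmatched = {}, []
    for entry in entries:
        if entry in oeo_labels:                # 1. entry already is an OEO term
            matched[entry] = entry
        elif entry in annotations:             # 2. entry was annotated with a term
            matched[entry] = annotations[entry]
        else:                                  # 3. fall back to string matching
            term_id = label_index.get(entry.strip().lower())
            if term_id is not None:
                matched[entry] = term_id
            else:
                unmatched.append(entry)
    return matched, unmatched


if __name__ == "__main__":
    labels = {"OEO_00050006": "petajoule"}
    notes = {"el. power": "OEO_HYPOTHETICAL_1"}  # hypothetical annotation
    print(classify_entries(["Petajoule", "OEO_00050006", "KS_2050", "el. power"],
                           labels, notes))
```

A scenario name such as `KS_2050` ends up in the unmatched list and produces no triple, which is the intended behaviour.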

@adelmemariani
Contributor

adelmemariani commented Oct 12, 2022

As a first step, the following 'dumb and dirty' versions are the results of a pipeline based on simple 'string matching' between values in the tables and OEO concepts:

With IRIs:
https://github.com/OpenEnergyPlatform/oekg/blob/Trial_autogenerated_oekg_via_pieline/Dummy_OEKG_With_Senario_Datasets.ttl

With labels:
https://github.com/OpenEnergyPlatform/oekg/blob/Trial_autogenerated_oekg_via_pieline/Dummy_OEKG_With_Senario_Datasets_With_Labels.ttl

@stap-m
Contributor Author

stap-m commented Oct 12, 2022

The following is the list of 'not assignable terms' for datasets that belong to KS_2050:
https://github.com/OpenEnergyPlatform/oekg/blob/Trial_autogenerated_oekg_via_pieline/not_assignables.txt

Thanks @adelmemariani . Let's continue the discussion here.

Does your script consider synonyms and alternative terms that are given in the OEO? I'm wondering why PJ wasn't found. It is annotated as an exact synonym of petajoule (OEO_00050006).

@adelmemariani
Contributor

adelmemariani commented Oct 12, 2022

Does your script consider synonyms and alternative terms that are given in the OEO? I'm wondering why PJ wasn't found. It is annotated as an exact synonym of petajoule (OEO_00050006).

😮 My script was not aware of 'synonyms' so far. Thanks @stap-m . I will work on it...

@adelmemariani
Contributor

adelmemariani commented Oct 12, 2022

By considering the 'has exact synonym' relations, 'petajoule' and 'PJ' are now mappable, and 'PJ' is no longer in the list of unassignable terms:
https://github.com/OpenEnergyPlatform/oekg/blob/Trial_autogenerated_oekg_via_pieline/Dummy_OEKG_With_Senario_Datasets_With_Labels_And_IRIs.ttl#L376

The overall result would be much better if we had synonyms for the other unassignable terms.
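
For illustration, a synonym-aware lookup could look roughly like the sketch below. It is not the actual script: the names `build_term_index` and `match_values`, the dictionary layout, and the sample value `THG` are assumptions. Only the petajoule/PJ pair and its identifier are taken from this thread.

```python
def build_term_index(terms):
    """Build a lookup from every label and exact synonym to its OEO term id.

    terms: dict mapping term id -> {"label": str, "synonyms": [str, ...]}
    Keys are lower-cased, so matching is case-insensitive.
    """
    index = {}
    for term_id, info in terms.items():
        for name in [info["label"], *info.get("synonyms", [])]:
            # setdefault keeps the first term id if two terms share a name.
            index.setdefault(name.strip().lower(), term_id)
    return index


def match_values(values, index):
    """Split table values into assigned terms and not-assignable values."""
    assigned, not_assignable = {}, []
    for value in values:
        term_id = index.get(value.strip().lower())
        if term_id is None:
            not_assignable.append(value)
        else:
            assigned[value] = term_id
    return assigned, not_assignable


if __name__ == "__main__":
    oeo_terms = {"OEO_00050006": {"label": "petajoule", "synonyms": ["PJ"]}}
    index = build_term_index(oeo_terms)
    print(match_values(["PJ", "petajoule", "THG"], index))
```

One caveat of this design: lower-casing the keys is convenient for labels but can conflate unit symbols that differ only in case, so a case-sensitive index for synonyms may be the safer choice.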

@stap-m
Contributor Author

stap-m commented Oct 13, 2022

😮 My script was not aware of 'synonyms' so far. Thanks @stap-m . I will work on it...

Actually, we agreed on using alternative term instead of synonyms, but apparently there are still some artifacts...

@Ludee Ludee added the enhancement New feature or request label Oct 13, 2022
@Ludee Ludee changed the title create pipeline for automated KG-generation Pipeline for automated KG generation Oct 13, 2022