From 86448b09d9d602d605e252b20bfdaaaf43c4761c Mon Sep 17 00:00:00 2001
From: jacobthill
Date: Mon, 17 Jun 2024 13:28:29 -0400
Subject: [PATCH 1/2] revise docs

---
 README.md | 43 +++++++++++++++++++++++++++++++------------
 1 file changed, 31 insertions(+), 12 deletions(-)

diff --git a/README.md b/README.md
index 008d161..7660aaa 100644
--- a/README.md
+++ b/README.md
@@ -1,19 +1,38 @@
 # rialto-airflow
 [![.github/workflows/test.yml](https://github.com/sul-dlss-labs/rialto-airflow/actions/workflows/test.yml/badge.svg)](https://github.com/sul-dlss-labs/rialto-airflow/actions/workflows/test.yml)
-
-Airflow for harvesting data for open access analysis and research intelligence. The workflow is integrates data from [sul_pub](https://github.com/sul-dlss/sul_pub), [rialto-orgs](https://github.com/sul-dlss/rialto-orgs), [OpenAlex](https://openalex.org/) and [Dimensions](https://www.dimensions.ai/) APIs to provide a view of publication data for Stanford University research. The basic workflow is: fetch Stanford Research publications from sul_pub, look those publications up in OpenAlex and Dimensions using the DOI, merge the the author/department information found in [rialto_orgs], and publish the data to our JupyterHub environment.
+
+Airflow for harvesting data for open access analysis and research intelligence. The workflow integrates data from [sul_pub](https://github.com/sul-dlss/sul_pub), [rialto-orgs](https://github.com/sul-dlss/rialto-orgs), [OpenAlex](https://openalex.org/) and [Dimensions](https://www.dimensions.ai/) APIs to provide a view of publication data for Stanford University research. The basic workflow is: fetch Stanford Research publications from SUL-Pub, OpenAlex, and Dimensions, enrich them with additional metadata from OpenAlex and Dimensions using the DOI, merge the organizational data found in [rialto_orgs], and publish the data to our JupyterHub environment.
 ```mermaid
 flowchart TD
- last_harvest(Determine last harvest) --> sul_pub(Publications from sul_pub)
- sul_pub --> extract_doi(Extract DOIs)
- extract_doi -- DOI --> openalex(OpenAlex)
- extract_doi -- DOI --> dimensions(Dimensions)
- dimensions --> merge_pubs(Merge Publications)
- openalex --> merge_pubs(Merge Publications)
- merge_pubs -- SUNETID --> join_departments(Join Departments)
- join_departments --> publish(Publish)
+ last_harvest(Determine last harvest) --> sul_pub_harvest(SUL-Pub harvest)
+ sul_pub_harvest --> sul_pub_pubs[/SUL-Pub publications/]
+ rialto_orgs_export --> last_harvest
+ last_harvest --> dimensions_harvest_orcid(Dimensions harvest ORCID)
+ last_harvest --> openalex_harvest_orcid(OpenAlex harvest ORCID)
+ dimensions_harvest_orcid --> dimensions_contribs[/Dimensions contributions/]
+ openalex_harvest_orcid --> openalex_contribs[/OpenAlex contributions/]
+ dimensions_contribs --> contribs_to_pubs
+ openalex_contribs --> contribs_to_pubs
+ contribs_to_pubs --> dimensions_pubs[/Dimensions publications/]
+ contribs_to_pubs --> openalex_pubs[/OpenAlex publications/]
+ dimensions_pubs -- DOI --> merge_pubs(Merge publications)
+ openalex_pubs -- DOI --> merge_pubs(Merge publications)
+ sul_pub_pubs -- DOI --> merge_pubs(Merge publications)
+ merge_pubs --> drop_duplicates(Remove duplicates)
+ drop_duplicates --> all_pubs[/All publications/]
+ all_pubs --> extract_dois(Extract DOIs)
+ extract_dois --> dois[/Unique DOIs/]
+ dois --> dimensions_enrich(Dimensions harvest DOI)
+ dois --> openalex_enrich(OpenAlex harvest DOI)
+ openalex_enrich --> openalex_enriched[/OpenAlex enriched publications/]
+ dimensions_enriched -- DOI --> merge_pubs_two(Merge publications)
+ openalex_enriched -- DOI --> merge_pubs_two(Merge publications)
+ rialto_orgs_export --> join_org_data
+ merge_pubs_two -- SUNETID --> join_org_data(Join organizational data)
+ join_org_data --> all_enriched_publications[/All enriched publications/]
+ all_enriched_publications --> publish(Publish)
 ```
 ## Running Locally with Docker
@@ -53,7 +72,7 @@ done
 uv venv
 ```
-This will create the virtual environment at the default location of `.venv/`. `uv` automatically looks for a venv at this location when installing dependencies.
+This will create the virtual environment at the default location of `.venv/`. `uv` automatically looks for a venv at this location when installing dependencies.
 3. Activate the virtual environment:
 ```
@@ -70,7 +89,7 @@ To add a dependency:
 2. Add the dependency to `pyproject.toml`.
 3. To re-generate the locked dependencies in `requirements.txt`:
 ```
-uv pip compile pyproject.toml -o requirements.txt
+uv pip compile pyproject.toml -o requirements.txt
 ```
 Unlike poetry, uv's dependency resolution is not platform-agnostic. If we find we need to generate a requirements.txt for linux, we can use [uv's multi-platform resolution options](https://github.com/astral-sh/uv?tab=readme-ov-file#multi-platform-resolution).

From 709064e1fd0af18b4706dd9093d2b2e16beff4ad Mon Sep 17 00:00:00 2001
From: jacobthill
Date: Tue, 18 Jun 2024 15:01:05 -0400
Subject: [PATCH 2/2] update diagram

---
 README.md | 42 +++++++++++++++++++++---------------------
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/README.md b/README.md
index 7660aaa..d117d24 100644
--- a/README.md
+++ b/README.md
@@ -8,31 +8,31 @@ Airflow for harvesting data for open access analysis and research intelligence.
 flowchart TD
 last_harvest(Determine last harvest) --> sul_pub_harvest(SUL-Pub harvest)
 sul_pub_harvest --> sul_pub_pubs[/SUL-Pub publications/]
- rialto_orgs_export --> last_harvest
+ rialto_orgs_export(Manual RIALTO app export) --> org_data[/Stanford organizational data/]
 last_harvest --> dimensions_harvest_orcid(Dimensions harvest ORCID)
 last_harvest --> openalex_harvest_orcid(OpenAlex harvest ORCID)
- dimensions_harvest_orcid --> dimensions_contribs[/Dimensions contributions/]
- openalex_harvest_orcid --> openalex_contribs[/OpenAlex contributions/]
- dimensions_contribs --> contribs_to_pubs
- openalex_contribs --> contribs_to_pubs
- contribs_to_pubs --> dimensions_pubs[/Dimensions publications/]
- contribs_to_pubs --> openalex_pubs[/OpenAlex publications/]
- dimensions_pubs -- DOI --> merge_pubs(Merge publications)
- openalex_pubs -- DOI --> merge_pubs(Merge publications)
- sul_pub_pubs -- DOI --> merge_pubs(Merge publications)
- merge_pubs --> drop_duplicates(Remove duplicates)
- drop_duplicates --> all_pubs[/All publications/]
- all_pubs --> extract_dois(Extract DOIs)
- extract_dois --> dois[/Unique DOIs/]
+ org_data --> dimensions_harvest_orcid
+ org_data --> openalex_harvest_orcid
+ dimensions_harvest_orcid --> dimensions_orcid_doi_dict[/Dimensions ORCID-DOI dictionary/]
+ openalex_harvest_orcid --> openalex_orcid_doi_dict[/OpenAlex ORCID-DOI dictionary/]
+ dimensions_orcid_doi_dict -- DOI --> doi_set(DOI set)
+ openalex_orcid_doi_dict -- DOI --> doi_set(DOI set)
+ sul_pub_pubs -- DOI --> doi_set(DOI set)
+ doi_set --> dois[/All unique DOIs/]
 dois --> dimensions_enrich(Dimensions harvest DOI)
 dois --> openalex_enrich(OpenAlex harvest DOI)
- openalex_enrich --> openalex_enriched[/OpenAlex enriched publications/]
- dimensions_enriched -- DOI --> merge_pubs_two(Merge publications)
- openalex_enriched -- DOI --> merge_pubs_two(Merge publications)
- rialto_orgs_export --> join_org_data
- merge_pubs_two -- SUNETID --> join_org_data(Join organizational data)
- join_org_data --> all_enriched_publications[/All enriched publications/]
- all_enriched_publications --> publish(Publish)
+ dimensions_enrich --> dimensions_enriched[/Dimensions publications/]
+ openalex_enrich --> openalex_enriched[/OpenAlex publications/]
+ dimensions_enriched -- DOI --> merge_pubs(Merge publications)
+ openalex_enriched -- DOI --> merge_pubs
+ sul_pub_pubs -- DOI --> merge_pubs
+ merge_pubs --> all_enriched_publications[/All publications/]
+ all_enriched_publications --> join_org_data(Join organizational data)
+ org_data --> join_org_data
+ join_org_data --> publication_set[/Publication set/]
+ publication_set -- DOI & (ORCID & SUNET) --> contributions(Publications to contributions)
+ contributions --> contributions_set[/Contributions set/]
+ contributions_set --> publish(Publish)
 ```
 ## Running Locally with Docker
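
The "DOI set" and "Merge publications" nodes in the updated diagram can be illustrated with a rough sketch. This is not code from the repository: the function names, the record shape (dicts with a `doi` key), the ORCID-to-DOI dictionaries mapping an ORCID to a list of DOIs, and the DOI normalization rule are all assumptions made for illustration.

```python
from collections import defaultdict


def normalize_doi(doi):
    """Lower-case a DOI and strip a resolver prefix (an assumed convention)."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def build_doi_set(sul_pub_pubs, dimensions_orcid_dois, openalex_orcid_dois):
    """Union the DOIs from SUL-Pub records and both ORCID-DOI dictionaries.

    Hypothetical shapes: sul_pub_pubs is a list of dicts with an optional
    "doi" key; each dictionary maps an ORCID to a list of DOIs.
    """
    dois = set()
    for pub in sul_pub_pubs:
        if pub.get("doi"):
            dois.add(normalize_doi(pub["doi"]))
    for orcid_dois in (dimensions_orcid_dois, openalex_orcid_dois):
        for doi_list in orcid_dois.values():
            dois.update(normalize_doi(d) for d in doi_list)
    return dois


def merge_publications(**sources):
    """Group records from each named source under their shared DOI."""
    merged = defaultdict(dict)
    for name, pubs in sources.items():
        for pub in pubs:
            if pub.get("doi"):
                merged[normalize_doi(pub["doi"])][name] = pub
    return dict(merged)
```

Using a set for the first step gives the "All unique DOIs" output directly, so the second pass through Dimensions and OpenAlex looks up each DOI only once regardless of how many sources reported it.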