revise docs #20

Merged 2 commits on Jun 19, 2024
43 changes: 31 additions & 12 deletions README.md
@@ -1,19 +1,38 @@
# rialto-airflow

[![.github/workflows/test.yml](https://github.com/sul-dlss-labs/rialto-airflow/actions/workflows/test.yml/badge.svg)](https://github.com/sul-dlss-labs/rialto-airflow/actions/workflows/test.yml)
Airflow for harvesting data for open access analysis and research intelligence. The workflow is integrates data from [sul_pub](https://github.com/sul-dlss/sul_pub), [rialto-orgs](https://github.com/sul-dlss/rialto-orgs), [OpenAlex](https://openalex.org/) and [Dimensions](https://www.dimensions.ai/) APIs to provide a view of publication data for Stanford University research. The basic workflow is: fetch Stanford Research publications from sul_pub, look those publications up in OpenAlex and Dimensions using the DOI, merge the the author/department information found in [rialto_orgs], and publish the data to our JupyterHub environment.

Airflow for harvesting data for open access analysis and research intelligence. The workflow integrates data from [sul_pub](https://github.com/sul-dlss/sul_pub), [rialto-orgs](https://github.com/sul-dlss/rialto-orgs), [OpenAlex](https://openalex.org/) and [Dimensions](https://www.dimensions.ai/) APIs to provide a view of publication data for Stanford University research. The basic workflow is: fetch Stanford research publications from SUL-Pub, OpenAlex, and Dimensions, enrich them with additional metadata from OpenAlex and Dimensions using the DOI, merge the organizational data found in [rialto-orgs](https://github.com/sul-dlss/rialto-orgs), and publish the data to our JupyterHub environment.
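
The "collect DOIs from all three sources" step described above could look roughly like the following. This is a minimal sketch, not the code in this repository: the function names `normalize_doi` and `unique_dois`, the input shapes (a list of DOI strings for SUL-Pub, ORCID-to-DOI dictionaries for Dimensions and OpenAlex), and the prefix-stripping rules are all assumptions made for illustration.

```python
# Hypothetical helpers; names and input shapes are assumptions, not the project's API.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi):
    """Return a lowercase DOI without a URL or 'doi:' prefix, or None if empty."""
    if not doi:
        return None
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi or None


def unique_dois(sul_pub_dois, dimensions_orcid_dois, openalex_orcid_dois):
    """Combine DOIs from the three sources into one de-duplicated set.

    sul_pub_dois: iterable of DOI strings (may contain None)
    dimensions_orcid_dois, openalex_orcid_dois: dict mapping ORCID -> list of DOIs
    """
    candidates = list(sul_pub_dois)
    for orcid_dois in (dimensions_orcid_dois, openalex_orcid_dois):
        for dois in orcid_dois.values():
            candidates.extend(dois)

    result = set()
    for doi in candidates:
        normalized = normalize_doi(doi)
        if normalized is not None:
            result.add(normalized)
    return result
```

Normalizing before de-duplicating matters because the same DOI may arrive as a bare identifier from one source and as a `https://doi.org/` URL from another.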

```mermaid
flowchart TD
last_harvest(Determine last harvest) --> sul_pub(Publications from sul_pub)
sul_pub --> extract_doi(Extract DOIs)
extract_doi -- DOI --> openalex(OpenAlex)
extract_doi -- DOI --> dimensions(Dimensions)
dimensions --> merge_pubs(Merge Publications)
openalex --> merge_pubs(Merge Publications)
merge_pubs -- SUNETID --> join_departments(Join Departments)
join_departments --> publish(Publish)
last_harvest(Determine last harvest) --> sul_pub_harvest(SUL-Pub harvest)
sul_pub_harvest --> sul_pub_pubs[/SUL-Pub publications/]
rialto_orgs_export(Manual RIALTO app export) --> org_data[/Stanford organizational data/]
last_harvest --> dimensions_harvest_orcid(Dimensions harvest ORCID)
last_harvest --> openalex_harvest_orcid(OpenAlex harvest ORCID)
org_data --> dimensions_harvest_orcid
org_data --> openalex_harvest_orcid
dimensions_harvest_orcid --> dimensions_orcid_doi_dict[/Dimensions ORCID-DOI dictionary/]
openalex_harvest_orcid --> openalex_orcid_doi_dict[/OpenAlex ORCID-DOI dictionary/]
dimensions_orcid_doi_dict -- DOI --> doi_set(DOI set)
openalex_orcid_doi_dict -- DOI --> doi_set(DOI set)
sul_pub_pubs -- DOI --> doi_set(DOI set)
doi_set --> dois[/All unique DOIs/]
dois --> dimensions_enrich(Dimensions harvest DOI)
dois --> openalex_enrich(OpenAlex harvest DOI)
dimensions_enrich --> dimensions_enriched[/Dimensions publications/]
openalex_enrich --> openalex_enriched[/OpenAlex publications/]
dimensions_enriched -- DOI --> merge_pubs(Merge publications)
openalex_enriched -- DOI --> merge_pubs
sul_pub_pubs -- DOI --> merge_pubs
merge_pubs --> all_enriched_publications[/All publications/]
all_enriched_publications --> join_org_data(Join organizational data)
org_data --> join_org_data
join_org_data --> publication_set[/Publication set/]
publication_set -- DOI & (ORCID & SUNET) --> contributions(Publications to contributions)
contributions --> contributions_set[/Contributions set/]
contributions_set --> publish(Publish)
```
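
The last stages of the flowchart (merge publications by DOI, join organizational data, and expand publications into contributions) can be sketched as below. This is an illustrative sketch under stated assumptions: the function names, the dictionary-based record shapes, and the `sunet` and `department` fields are hypothetical and are not taken from the repository's code.

```python
# Hypothetical sketch; record shapes and field names are assumptions.
def merge_publications(sul_pub, dimensions, openalex):
    """Merge per-source records (each a dict keyed by DOI) into one record per DOI."""
    merged = {}
    sources = (("sul_pub", sul_pub), ("dimensions", dimensions), ("openalex", openalex))
    for source_name, records in sources:
        for doi, record in records.items():
            # Create the merged record on first sight of a DOI, then attach this source.
            merged.setdefault(doi, {"doi": doi})[source_name] = record
    return merged


def publications_to_contributions(merged, orcid_dois, authors):
    """Emit one row per (DOI, author) pair for authors found in the org data.

    merged: output of merge_publications
    orcid_dois: dict mapping ORCID -> iterable of DOIs
    authors: dict mapping ORCID -> {"sunet": ..., "department": ...}
    """
    rows = []
    for orcid, dois in orcid_dois.items():
        author = authors.get(orcid)
        if author is None:
            continue  # no organizational data for this ORCID
        for doi in dois:
            if doi in merged:
                rows.append(
                    {
                        "doi": doi,
                        "orcid": orcid,
                        "sunet": author["sunet"],
                        "department": author["department"],
                    }
                )
    return rows
```

A publication with several Stanford authors yields several contribution rows, which is what distinguishes the contributions set from the publication set in the diagram.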

## Running Locally with Docker
@@ -53,7 +72,7 @@ done
uv venv
```

This will create the virtual environment at the default location of `.venv/`. `uv` automatically looks for a venv at this location when installing dependencies.

3. Activate the virtual environment:
```
@@ -70,7 +89,7 @@ To add a dependency:
2. Add the dependency to `pyproject.toml`.
3. To re-generate the locked dependencies in `requirements.txt`:
```
uv pip compile pyproject.toml -o requirements.txt
```

Unlike Poetry, uv's dependency resolution is not platform-agnostic. If we find we need to generate a `requirements.txt` for Linux, we can use [uv's multi-platform resolution options](https://github.com/astral-sh/uv?tab=readme-ov-file#multi-platform-resolution).