OpenAlex publications CSV #51

Merged
merged 3 commits into from
Jun 24, 2024

Collaborator

@lwrubel commented Jun 24, 2024

Resolves #8 by querying OpenAlex by DOI to create a publications CSV.

I followed the model you're using for the Dimensions publications lookup, @edsu, so some of the code will look familiar.

@lwrubel force-pushed the t8-openalex-pubs-enrich branch from cb13d18 to 79cf3e1 on June 24, 2024 20:30

from rialto_airflow.utils import invert_dict

config.email = os.environ.get("AIRFLOW_VAR_OPENALEX_EMAIL")
Contributor

I'm kind of wondering if we should just do this for all the environment variables, instead of using airflow.models.Variable and passing things down.

Collaborator Author

It seemed harder to set up the tests when using Variable as well.
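
For illustration, here is a minimal sketch of the environment-variable approach discussed above. The helper name `openalex_email` and the test address are hypothetical and not taken from this PR; only the `AIRFLOW_VAR_OPENALEX_EMAIL` variable name comes from the diff.

```python
import os


def openalex_email(default=None):
    """Hypothetical helper: read the contact email from the environment at call time."""
    return os.environ.get("AIRFLOW_VAR_OPENALEX_EMAIL", default)


# A test can set the value directly in the environment before calling the helper.
os.environ["AIRFLOW_VAR_OPENALEX_EMAIL"] = "test@example.org"  # made-up address
email = openalex_email()
```

Reading the value inside a function rather than at import time is one possible design choice here; it lets a test change the environment without re-importing the module.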

writer.writerow(pub)


def publications_from_dois(dois: list, batch_size=75):
Contributor

I'm just curious: did you arrive at 75 through experimentation to see what was possible, or was it documented?

Collaborator Author

You can do larger batches, but the request ends up being too long (over 4096). I kept getting Bad Request errors when the requests were too large. So this seemed to get us safely below that threshold. I'll add a comment.
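
The batching idea can be sketched roughly as follows. This is not the code from the PR: the helper names, the made-up DOIs, and the use of `|` to join values into one filter string are assumptions for illustration, and the 4096 figure is treated here as a character count, which the comment above does not state.

```python
from typing import Iterator


def batched(items: list, batch_size: int = 75) -> Iterator[list]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def doi_filter_values(dois: list, batch_size: int = 75) -> Iterator[str]:
    """Join each batch with "|" into a single filter value (assumed format)."""
    for batch in batched(dois, batch_size):
        yield "|".join(batch)


# Made-up DOIs, only to show how the batch size bounds the request size.
dois = [f"10.1234/example.{i}" for i in range(200)]
filters = list(doi_filter_values(dois))
longest = max(len(f) for f in filters)
```

With short DOIs like these, each joined string stays well under 4096 characters. Real DOIs vary in length, so a fixed batch size is a margin of safety rather than a guarantee.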

return pub


FIELDS = [
Contributor

We should probably rename the other global variables to all caps when we have a chance. It reads better, I think.

@edsu merged commit 485e4bb into main Jun 24, 2024
1 check passed
@edsu deleted the t8-openalex-pubs-enrich branch June 24, 2024 21:06
Development

Successfully merging this pull request may close these issues.

openalex_harvest_doi (Enrich publication metadata by querying OpenAlex)
2 participants