Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat!: break out CLI for more specific commands/more reasonable defaults #344

Merged
merged 20 commits into from
May 21, 2024

Conversation

jsstevenson
Copy link
Member

@jsstevenson jsstevenson commented May 14, 2024

close #210

Miscellaneous quality of life improvements and feature additions to CLI:

  • Add metakb as a console command.
  • Break out CLI subcommands:
    • metakb update to run a complete harvest/transform/load step
    • metakb check-normalizers and metakb update-normalizers to check and force refreshing of normalizer data. This supports a simple workflow like metakb check-normalizers || metakb load-normalizers to load if unavailable, rather than requiring the user to force normalizer reload while loading the MetaKB graph.
    • metakb harvest to just perform harvest of source(s)
    • metakb transform to just perform transform of source(s), or metakb transform-file to transform a specific harvested file
    • metakb load-cdm to skip harvest/transform and directly load a CDM file, either from local (default location), a specific file, or from S3
    • metakb clear-graph to wipe the graph. No other CLI command will wipe the graph. I thought about calling it when update is used without any source qualifiers, but it seemed a little odd to include additional behavior such that metakb update <source> && metakb update <other source> is different from metakb update. Also thought about including it as an option flag in some other commands, but at that point, you can just do metakb clear-graph && <other command>.
    • Previously, we had one command that could be used to do a lot of things. IMO it's cleaner and more intuitive (and easier to program/maintain) to have subcommands with specific purposes instead of one big one. It does mean you will often have to chain commands together but that's pretty normal.
  • Support selection of specific sources for commands where it makes sense. Pass them as arguments. Otherwise default to all sources.
  • Support output directory option (--output_directory, -o) where it makes sense. Unfortunately, most of these commands all produce n output files so I don't think there's a simple way to specify the name of the output file.
  • Collapse the credentials CLI option to a single username:password option, since you need to provide both at once. (Not sure why Neo4j requires a password).
  • When updating normalizers, keep going to the next one if one of them fails.
  • In general, give precedence to explicit args/options over env vars. Consequently, pass normalizer DB URL as an arg to normalizers rather than injecting it via env var to the transform step, which required a small refactor.
  • Refactor a few normalizer management functions out of the CLI module. In general, I think it's good to separate reusable functions out of CLI modules, and just have cli.py act as gateways/interfaces to them. There's probably a little bit more of this that we could do but nothing else stuck out to me.
  • Various changes to make console printouts a bit prettier/more organized.

@jsstevenson jsstevenson added the priority:medium Medium priority label May 14, 2024
@korikuzma
Copy link
Member

@jsstevenson did you see this old issue #210?

@jsstevenson jsstevenson changed the title feat!: restructure CLI for division of concerns and ease of use feat!: break out CLI for more specific commands/more reasonable defaults May 15, 2024
@jsstevenson
Copy link
Member Author

one possible additional todo: add -o to things that produce file outputs to specify output location

Comment on lines 151 to 161
for name, aws_env_var_name in update_params:
if aws_env_var_name in environ:
msg = (
f"Updating the {name.value} AWS database from the MetaKB CLI is "
f"prohibited. Unset the environment variable:`{aws_env_var_name}` to "
"proceed."
)
_logger.error(msg)
click.echo(msg)
success = False
continue
Copy link
Member Author

@jsstevenson jsstevenson May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we're fully set up to update normalizers in the cloud, I wonder if there's some configuration we could add to the Dynamo table templates that only allow writes/deletes to come from the identity that's granted to the cloud update services. It'd be nice if we didn't have to write all of these checks into the code itself.

@jsstevenson
Copy link
Member Author

@korikuzma having some second thoughts about what commands/combinations of commands should call graph.clear()

@korikuzma
Copy link
Member

@korikuzma having some second thoughts about what commands/combinations of commands should call graph.clear()

@jsstevenson I haven't looked at the changes, but what if we just separate this out and place it on the user to decide and make a note in the documentation?

Comment on lines +24 to +32
def __repr__(self) -> str:
"""Print as simple string rather than enum wrapper, e.g. 'civic' instead of
<NormalizerName.CIVIC: 'civic'>.

Makes Click error messages prettier.

:return: formatted enum value
"""
return f"'{self.value}'"
Copy link
Member Author

@jsstevenson jsstevenson May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bigger explanation of this:

For the most part, when you give enum values to click.Choice, it'll format them correctly. For example:

[ cli-refactor ⚙ .venv] ~/code/metakb % metakb update --help
Usage: metakb update [OPTIONS] [[civic|moa]]...

However, if you pass an invalid arg, it prints them using __repr__, so it's ugly:

Error: Invalid value for '[[civic|moa]]...': 'sdflkd' is not one of <NormalizerName.CIVIC: 'civic'>, <NormalizerName.MOA: 'moa'>.

With this fix it instead looks like this

Error: Invalid value for '[[civic|moa]]...': 'sdflkd' is not one of 'civic', 'moa'.

I don't think there are any major consequences to overwriting __repr__ like this but when debugging you have to be a little careful because it'll print without the enum wrapping. That said, not totally married to this change, I don't think it's a big deal that the validation error prints that way in the first place.

return newest_version


if __name__ == "__main__":
update_metakb_db(_anyio_backend="asyncio")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not totally sure why the asyncio thing was needed? maybe an old click version? I've been able to run the async transform command without problems.

@jsstevenson jsstevenson marked this pull request as ready for review May 17, 2024 13:51
```

For more information on the different CLI arguments, see the [CLI README](docs/cli/README.md).
The `--help` flag can be provided to any CLI command to bring up additional documentation.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also a way to embed Click docs into Sphinx, so I'll do that in the docs branch.

@jsstevenson jsstevenson requested a review from korikuzma May 17, 2024 14:43


@cli.command()
@click.option("--normalizer_db_url", "-n", help=_normalizer_db_url_description)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since Dynamo takes a single URL but Postgres requires different libpq-style URLs that include normalizer-specific database names, I don't think it's really feasible to support postgres connections this way yet. Thought about doing something that could take a base postgres URL and add database names on top but it's probably not worth the work right now.

Copy link
Member

@korikuzma korikuzma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice 🧼

src/metakb/normalizers.py Outdated Show resolved Hide resolved
src/metakb/schemas/app.py Outdated Show resolved Hide resolved
src/metakb/cli.py Outdated Show resolved Hide resolved
src/metakb/cli.py Outdated Show resolved Hide resolved
src/metakb/cli.py Outdated Show resolved Hide resolved
@jsstevenson jsstevenson requested a review from korikuzma May 18, 2024 19:34
Copy link
Member

@korikuzma korikuzma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last small thoughts

src/metakb/normalizers.py Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some places where singular name is used for arguments (source, normalizer). It might make sense to make these plural since they can take > 1 value

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some places where singular name is used for arguments (source, normalizer). It might make sense to make these plural since they can take > 1 value

@korikuzma yeah agreed that this becomes awkward in code. My sense, FWIW, is that it's more common on the help message itself to make the argument name singular and then represent plurality with ellipses, e.g. [FILE ...] or [FILE] .... (See eg here or here or man ls). We can use the metavar param to specify how the arg is named in help messages while using a more appropriate variable name in the CLI code itself.

@jsstevenson jsstevenson requested a review from korikuzma May 21, 2024 01:19
Copy link
Member

@korikuzma korikuzma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small requests

src/metakb/cli.py Outdated Show resolved Hide resolved
src/metakb/normalizers.py Outdated Show resolved Hide resolved
@jsstevenson jsstevenson requested a review from korikuzma May 21, 2024 18:00
@jsstevenson jsstevenson merged commit 5ce50c3 into staging May 21, 2024
11 checks passed
@jsstevenson jsstevenson deleted the cli-refactor branch May 21, 2024 21:19
@github-actions github-actions bot mentioned this pull request May 21, 2024
2 tasks
@jsstevenson jsstevenson mentioned this pull request Jan 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority:medium Medium priority
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants