Credentials in kedro
The basic pattern is as follows:
# conf/base/catalog.yml
dataset_name:
  ...
  credentials: credentials_key

# conf/local/credentials.yml
credentials_key:
  kwarg1: value1
  kwarg2: value2
  ...
To be concrete, here's an example for Azure Blob Storage:
# conf/base/catalog.yml
shuttles:
  type: pandas.CSVDataSet
  filepath: abfs://somewhere/shuttles.csv
  credentials: abs_credentials

# conf/local/credentials.yml
abs_credentials:
  account_name: antonymilne
  account_key: verysecretpassword
The credentials key is injected into the call that instantiates pandas.CSVDataSet when kedro is run. Specifically, here: https://github.com/kedro-org/kedro/blob/a925fd59187a642e124527f0f1097e92ea8d1819/kedro/io/data_catalog.py#L276
Note:
- credentials is a special reserved keyword. This doesn't work for any other key name - this is one of the very few customisations that kedro's ConfigLoader makes to how yaml is parsed. In-file variable injection is (kind of) supported in yaml using anchors, but injecting a variable from another file is not. The mechanism that does the injection here is entirely defined by kedro - the reason for enabling this custom behaviour is that credentials should not be committed to source control. Hence they need to be stored in a separate file, outside the data catalog, that lives in local, and injected into the catalog at runtime.
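To make the injection step concrete, here is a minimal sketch of the idea in plain Python. It is not kedro's implementation (see the link above for that); `resolve_credentials` is a hypothetical helper, and the two dictionaries stand in for the parsed catalog.yml and credentials.yml.

```python
from copy import deepcopy
from typing import Any


def resolve_credentials(
    dataset_config: dict[str, Any], credentials: dict[str, Any]
) -> dict[str, Any]:
    """Return a copy of dataset_config with its `credentials` key looked up.

    Hypothetical helper for illustration only; not kedro's own function.
    """
    resolved = deepcopy(dataset_config)
    key = resolved.get("credentials")
    if isinstance(key, str):
        if key not in credentials:
            raise KeyError(f"No entry '{key}' found in the credentials config")
        # Swap the key name for the kwargs stored under it.
        resolved["credentials"] = credentials[key]
    return resolved


# Stand-ins for the parsed conf/base/catalog.yml and conf/local/credentials.yml
catalog = {
    "shuttles": {
        "type": "pandas.CSVDataSet",
        "filepath": "abfs://somewhere/shuttles.csv",
        "credentials": "abs_credentials",
    }
}
credentials = {
    "abs_credentials": {
        "account_name": "antonymilne",
        "account_key": "verysecretpassword",
    }
}

shuttles_config = resolve_credentials(catalog["shuttles"], credentials)
```

Only the `credentials` key is treated this way, which matches the point above that no other key name triggers the lookup.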
In my experience and from talking to users, this works well when credentials can be stored in a file: very little confusion is caused by the custom behaviour of injecting credentials.
The biggest problem is that credentials might not be stored in a file. Alternatives are:
- very common: storing credentials in environment variables rather than files. kedro can deal with this through TemplatedConfigLoader. This works ok but feels hacky and is so common it shouldn't really require a workaround
- Python objects. e.g. APIDataSet works with a requests.auth.AuthBase object for credentials; pandas.GBQTableDataSet works with google.oauth2.credentials.Credentials. This is handled by instantiating the corresponding credentials class in the dataset using the kwargs given in the credentials.yml file. This works ok but is awkward and not done consistently throughout kedro (e.g. https://github.com/kedro-org/kedro/issues/711; https://github.com/kedro-org/kedro/issues/1621).
- cloud-native solutions like AWS secrets. There's no direct way to use these in a catalog entry. I don't understand much (anything) about how these work but believe the same TemplatedConfigLoader trick as used for env vars would work here. See https://github.com/kedro-org/kedro/issues/1280 and https://github.com/kedro-org/kedro/issues/930 for more.
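For the Python objects case, the pattern described above looks roughly like the sketch below. Everything here is hypothetical: `TokenAuth` stands in for a real credentials class such as requests.auth.AuthBase, and `ExampleDataSet` is not a kedro dataset.

```python
from dataclasses import dataclass
from typing import Any, Union


@dataclass
class TokenAuth:
    """Hypothetical stand-in for a real credentials class."""

    username: str
    token: str


class ExampleDataSet:
    """Hypothetical dataset, for illustration only."""

    def __init__(
        self,
        url: str,
        credentials: Union[dict[str, Any], TokenAuth, None] = None,
    ) -> None:
        self.url = url
        # credentials.yml can only hold plain values, so the dataset has to
        # build the credentials object from the kwargs itself.
        if isinstance(credentials, dict):
            credentials = TokenAuth(**credentials)
        self.auth = credentials


# The dict is what would be injected from credentials.yml.
dataset = ExampleDataSet(
    "https://example.com/api",
    credentials={"username": "me", "token": "secret"},
)
```

Each dataset has to repeat this conversion for its own credentials class, which is where the inconsistency mentioned above comes from.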
Another problem with credentials is that the way they are handled for PartitionedDataSet is pretty complicated. I'm not sure we'll be able to solve that here but it would be nice if we could.
At a bare minimum I think we need a way of directly injecting environment variables into credentials. Given how common this also is outside credentials files (using TemplatedConfigLoader), my opinion is that this mechanism should not be credentials-specific but instead common across all kedro configuration.
e.g. with OmegaConf you'd do this as:
abs_credentials:
  account_name: ${oc.env:ABS_ACCOUNT_NAME}
  account_key: ${oc.env:ABS_ACCOUNT_KEY}
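To illustrate why this kind of interpolation does not need to be credentials-specific, here is a standard-library sketch that resolves `${oc.env:NAME}` placeholders anywhere in a nested config. It does not use OmegaConf and only imitates the placeholder syntax; `resolve_env` is a hypothetical helper, and the environment variables are set inline purely for the demonstration.

```python
import os
import re
from typing import Any

# Matches placeholders of the form ${oc.env:SOME_NAME}
_ENV_PATTERN = re.compile(r"\$\{oc\.env:([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env(config: Any) -> Any:
    """Recursively replace ${oc.env:NAME} placeholders with os.environ values.

    Hypothetical helper; raises KeyError if a referenced variable is unset.
    """
    if isinstance(config, dict):
        return {key: resolve_env(value) for key, value in config.items()}
    if isinstance(config, list):
        return [resolve_env(value) for value in config]
    if isinstance(config, str):
        return _ENV_PATTERN.sub(lambda m: os.environ[m.group(1)], config)
    return config


# Set inline for demonstration only; normally these come from the shell or CI.
os.environ["ABS_ACCOUNT_NAME"] = "antonymilne"
os.environ["ABS_ACCOUNT_KEY"] = "verysecretpassword"

resolved = resolve_env(
    {
        "abs_credentials": {
            "account_name": "${oc.env:ABS_ACCOUNT_NAME}",
            "account_key": "${oc.env:ABS_ACCOUNT_KEY}",
        }
    }
)
```

Because the function walks any nested structure, the same call would work on catalog or parameters config, not only on credentials.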
Quotes from https://github.com/kedro-org/kedro/issues/770:
@idanov: [credentials] is obviously environment specific, but what we should consider doing is adding an environment variables support. Unfortunately this has been on the backlog for a while, but doesn’t seem to be such an important issue that cannot be solved by DevOps, so we never got to implementing the environment variables for credentials.

@Galileo-Galilei: I do not understand what you mean by [this]. My point is precisely that many CI/CD tools expect to communicate with the underlying application through environment variables (to my knowledge: I must confess that I am far from being a devops expert), and it is really weird to me that is not "native" in kedro. I must switch to the TemplatedConfigLoader on deployment mode even if I use a credentials.yml file while developping, and it feels uncomfortable to have to change something for deployment (even if it is very easy to change).
So far the best discussion of this is in https://github.com/kedro-org/kedro/issues/1280. From @Galileo-Galilei:
I have no time for now, and it will likely take weeks before I came up with something intelligible, but this is a topic on which I plan to write a "Universal Kedro Deployment" issue. I think there are some adherence with this #770, but credentials have a lot of specificities indeed. In short my idea is that:
- kedro should have an abstract class (~roughly similar to AbstractDataSet, say CredentialsManager) to implement the _get_credentials() function. It should be able to get credentials from anywhere (e.g. VaultCredentialsManager, GithubCredentialsManager and FileCredentialsManager which would default to current implementation) and return a dict of credentials.
- This class should leverage the ConfigLoader when possible
- This class should be parametrized in the settings.py
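One possible reading of this idea, sketched in plain Python. The names CredentialsManager, FileCredentialsManager and _get_credentials come from the quote above; the method signature, the constructors and EnvCredentialsManager are my own assumptions, and none of this exists in kedro.

```python
import os
from abc import ABC, abstractmethod
from typing import Any


class CredentialsManager(ABC):
    """Hypothetical base class: returns a dict of credentials for a given key."""

    @abstractmethod
    def _get_credentials(self, key: str) -> dict[str, Any]:
        ...


class FileCredentialsManager(CredentialsManager):
    """Serves credentials from an already-parsed credentials.yml (assumed input)."""

    def __init__(self, file_credentials: dict[str, dict[str, Any]]) -> None:
        self._file_credentials = file_credentials

    def _get_credentials(self, key: str) -> dict[str, Any]:
        return dict(self._file_credentials[key])


class EnvCredentialsManager(CredentialsManager):
    """Assumed variant: maps each kwarg to the environment variable holding it."""

    def __init__(self, env_mapping: dict[str, dict[str, str]]) -> None:
        self._env_mapping = env_mapping

    def _get_credentials(self, key: str) -> dict[str, Any]:
        return {
            kwarg: os.environ[var] for kwarg, var in self._env_mapping[key].items()
        }


file_manager = FileCredentialsManager(
    {"abs_credentials": {"account_name": "antonymilne", "account_key": "verysecretpassword"}}
)

# Set inline for demonstration only.
os.environ["ABS_ACCOUNT_NAME"] = "antonymilne"
os.environ["ABS_ACCOUNT_KEY"] = "verysecretpassword"
env_manager = EnvCredentialsManager(
    {"abs_credentials": {"account_name": "ABS_ACCOUNT_NAME", "account_key": "ABS_ACCOUNT_KEY"}}
)

from_file = file_manager._get_credentials("abs_credentials")
from_env = env_manager._get_credentials("abs_credentials")
```

Both managers return the same dict for the same key, so the catalog entry would not need to change between development and deployment; only the manager selected in settings.py would.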
Also worth noting the factory approach of @daBlesr discussed in https://github.com/kedro-org/kedro/issues/711#issuecomment-1159147466 and following comments.
I don't yet have any particular ideas myself here so I'd love to hear what others think and hear @Galileo-Galilei's idea in more detail 🚀 It would be especially great to hear from people who use cloud-native credentials systems like AWS secrets. This is a bit of a blindspot for us at the moment I think.