Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optional tags for DataSets and accessing through the catalog #324

Closed
tdrobbin opened this issue Apr 13, 2020 · 4 comments
Closed

Optional tags for DataSets and accessing through the catalog #324

tdrobbin opened this issue Apr 13, 2020 · 4 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@tdrobbin
Copy link

tdrobbin commented Apr 13, 2020

Description

It would be nice to be able to add optional tags to datasets in the catalog, and list/access them via tags, similar to what is currently implemented in node

Context

The DataSet/Catalog system and the python interface are awesome. As the number of datasets defined in the catalog grows it would be helpful to be able to list/access specific subsets based on common tags.

Possible Implementation

I'm imagining adding an optional tags attribute in the catalog yml like so:

...
bikes:
  type: pandas.CSVDataSet
  filepath: data/01_raw/bikes.csv
  tags: 
    - transportation
    - has_wheels

trains:
  type: pandas.CSVDataSet
  filepath: data/01_raw/bikes.csv
  tags: 
    - transportation
...

The the ability to view them in the data calog from python based on the tags where catalog.list might have the following api:

catalog.list(tags=['transportation'])

which might return

['bikes', 'trains', ...]

If I wanted to do something specifically with just transportation data I can easily loop through this list and load these datasources with catalog.load without having to explicitly list them in my python code. This helps keep the catalog as the one source of truth.

The following simple (but hacky) modifications are working so far. When a DataSet is created there could be an instance variable _tags and then the list method in DataCatalog could look like this:

def list(self, tags: List[str] = None) -> List[str]:
    """List of ``DataSet`` names registered in the catalog.
    Args:
        tags: list of tags that will subset the list of returned datasets
            to those which match on at least one tag.
            
    Returns:
        A List of ``DataSet`` names, corresponding to the entries that are
        registered in the current catalog object.
    """
    if tags is not None:
        matched_data_sets = []
        for key, ds in self._data_sets.items():
            if set(tags).intersection(set(ds._tags)):
                matched_data_sets.append(key)
        
        return matched_data_sets

    return list(self._data_sets.keys())

And the following modification to AbstractDataSet is not ideal but seems to work:

    @classmethod
    def from_config(
        cls: Type,
        name: str,
        config: Dict[str, Any],
        load_version: str = None,
        save_version: str = None,
    ) -> "AbstractDataSet":
        """Create a data set instance using the configuration provided.

        Args:
            name: Data set name.
            config: Data set config dictionary.
            load_version: Version string to be used for ``load`` operation if
                the data set is versioned. Has no effect on the data set
                if versioning was not enabled.
            save_version: Version string to be used for ``save`` operation if
                the data set is versioned. Has no effect on the data set
                if versioning was not enabled.

        Returns:
            An instance of an ``AbstractDataSet`` subclass.

        Raises:
            DataSetError: When the function fails to create the data set
                from its config.

        """
        try:
            class_obj, config = parse_dataset_definition(
                config, load_version, save_version
            )
        except Exception as ex:
            raise DataSetError(
                "An exception occurred when parsing config "
                "for DataSet `{}`:\n{}".format(name, str(ex))
            )

        tags = None
        if 'tags' in list(config.keys()):
            tags = config['tags']
            del config['tags']

        try:
            data_set = class_obj(**config)  # type: ignore
        except TypeError as err:
            raise DataSetError(
                "\n{}.\nDataSet '{}' must only contain "
                "arguments valid for the constructor "
                "of `{}.{}`.".format(
                    str(err), name, class_obj.__module__, class_obj.__qualname__
                )
            )
        except Exception as err:
            raise DataSetError(
                "\n{}.\nFailed to instantiate DataSet "
                "'{}' of type `{}.{}`.".format(
                    str(err), name, class_obj.__module__, class_obj.__qualname__
                )
            )
        
        if tags is not None:
            data_set._tags = tags

        return data_set

Possible Alternatives

(Optional) Describe any alternative solutions or features you've considered.

@tdrobbin tdrobbin added the Issue: Feature Request New feature or improvement to existing feature label Apr 13, 2020
@WaylonWalker
Copy link
Contributor

I personally like the idea of having the datasets tagged as well. I don't necessarily want to maintain tags in two places though and have to manage to keep them in sync.

Alternatively today, without any changes, you can ask for pipeline nodes that have a certain set of tags tp = pipeline.only_nodes_with_tags('transportation'). Then you can ask that pipeline for tp.all_inputs(), tp.all_outputs(), tp.outputs(), tp.inputs(), tp.data_sets().

sarchila pushed a commit to sarchila/kedro that referenced this issue Apr 15, 2020
* Reduced text to add in decription of why Kedro exists
* Changed pipeline image
* Moved content to FAQ
@lorenabalan
Copy link
Contributor

Closing this as duplicate of #400

@lucianoviola
Copy link

I also think this would be very useful.

@yetudada
Copy link
Contributor

This issue was opened forever ago and we've made it possible with #2537. Check out the thread on #1076. Thank you so much for the feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature
Projects
None yet
Development

No branches or pull requests

5 participants