Optional tags for DataSets and accessing through the catalog #324

tdrobbin · 2020-04-13T01:59:08Z

Description

It would be nice to be able to add optional tags to datasets in the catalog and to list or access them by tag, similar to what is currently implemented for nodes.

Context

The DataSet/Catalog system and the Python interface are awesome. As the number of datasets defined in the catalog grows, it would be helpful to be able to list/access specific subsets based on common tags.

Possible Implementation

I'm imagining adding an optional tags attribute in the catalog yml like so:

...
bikes:
  type: pandas.CSVDataSet
  filepath: data/01_raw/bikes.csv
  tags: 
    - transportation
    - has_wheels

trains:
  type: pandas.CSVDataSet
  filepath: data/01_raw/trains.csv
  tags: 
    - transportation
...

Then there would be the ability to view them in the data catalog from Python based on their tags, where catalog.list might have the following API:

catalog.list(tags=['transportation'])

which might return

['bikes', 'trains', ...]

If I wanted to do something specifically with just transportation data I can easily loop through this list and load these datasources with catalog.load without having to explicitly list them in my python code. This helps keep the catalog as the one source of truth.
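
To make the intended behaviour concrete, here is a minimal sketch that does not use Kedro at all. It filters a plain dictionary standing in for the parsed catalog yml. The `list_by_tags` helper, the `weather` entry, and the `trains.csv` path are assumptions made up for illustration; they are not part of the proposal or of Kedro's API.

```python
from typing import Any, Dict, Iterable, List

# Stand-in for the parsed catalog yml. The "weather" entry and the
# trains.csv path are hypothetical, added only to show an untagged entry.
CATALOG_CONFIG: Dict[str, Dict[str, Any]] = {
    "bikes": {
        "type": "pandas.CSVDataSet",
        "filepath": "data/01_raw/bikes.csv",
        "tags": ["transportation", "has_wheels"],
    },
    "trains": {
        "type": "pandas.CSVDataSet",
        "filepath": "data/01_raw/trains.csv",
        "tags": ["transportation"],
    },
    "weather": {
        "type": "pandas.CSVDataSet",
        "filepath": "data/01_raw/weather.csv",
    },
}


def list_by_tags(config: Dict[str, Dict[str, Any]], tags: Iterable[str]) -> List[str]:
    """Return the names of entries that share at least one tag with ``tags``."""
    wanted = set(tags)
    return [
        name
        for name, entry in config.items()
        # Entries without a "tags" key are treated as having no tags.
        if wanted & set(entry.get("tags", []))
    ]


# Loop over the matching entries without naming them explicitly.
for name in list_by_tags(CATALOG_CONFIG, ["transportation"]):
    print(name, CATALOG_CONFIG[name]["filepath"])
```

In a real project the loop body would call catalog.load(name) instead of printing the file path.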

The following simple (but hacky) modifications are working so far. When a DataSet is created there could be an instance variable _tags and then the list method in DataCatalog could look like this:

def list(self, tags: List[str] = None) -> List[str]:
    """List of ``DataSet`` names registered in the catalog.
    Args:
        tags: list of tags that will subset the list of returned datasets
            to those which match on at least one tag.
            
    Returns:
        A List of ``DataSet`` names, corresponding to the entries that are
        registered in the current catalog object.
    """
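    if tags is not None:
        matched_data_sets = []
        for key, ds in self._data_sets.items():
            # Data sets created without tags have no ``_tags`` attribute,
            # so fall back to an empty list instead of raising AttributeError.
            if set(tags).intersection(getattr(ds, "_tags", None) or []):
                matched_data_sets.append(key)

        return matched_data_sets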

    return list(self._data_sets.keys())

And the following modification to AbstractDataSet is not ideal but seems to work:

    @classmethod
    def from_config(
        cls: Type,
        name: str,
        config: Dict[str, Any],
        load_version: str = None,
        save_version: str = None,
    ) -> "AbstractDataSet":
        """Create a data set instance using the configuration provided.

        Args:
            name: Data set name.
            config: Data set config dictionary.
            load_version: Version string to be used for ``load`` operation if
                the data set is versioned. Has no effect on the data set
                if versioning was not enabled.
            save_version: Version string to be used for ``save`` operation if
                the data set is versioned. Has no effect on the data set
                if versioning was not enabled.

        Returns:
            An instance of an ``AbstractDataSet`` subclass.

        Raises:
            DataSetError: When the function fails to create the data set
                from its config.

        """
        try:
            class_obj, config = parse_dataset_definition(
                config, load_version, save_version
            )
        except Exception as ex:
            raise DataSetError(
                "An exception occurred when parsing config "
                "for DataSet `{}`:\n{}".format(name, str(ex))
            )

        # Remove the optional ``tags`` entry so it is not passed to the constructor.
        tags = config.pop("tags", None)

        try:
            data_set = class_obj(**config)  # type: ignore
        except TypeError as err:
            raise DataSetError(
                "\n{}.\nDataSet '{}' must only contain "
                "arguments valid for the constructor "
                "of `{}.{}`.".format(
                    str(err), name, class_obj.__module__, class_obj.__qualname__
                )
            )
        except Exception as err:
            raise DataSetError(
                "\n{}.\nFailed to instantiate DataSet "
                "'{}' of type `{}.{}`.".format(
                    str(err), name, class_obj.__module__, class_obj.__qualname__
                )
            )
        
        if tags is not None:
            data_set._tags = tags

        return data_set
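
The key step in the modification above is removing `tags` from the config before the constructor is called, because the data set class does not accept a `tags` argument. The following is a minimal sketch of just that step, outside Kedro. `DemoDataSet` and `build_with_tags` are hypothetical names used only for illustration, and defaulting `_tags` to an empty list is one possible design choice, not what the snippet above does.

```python
from typing import Any, Dict, List


class DemoDataSet:
    """Hypothetical stand-in for a data set class; not part of Kedro."""

    def __init__(self, filepath: str) -> None:
        self._filepath = filepath
        # Always present, so tag lookups never hit a missing attribute.
        self._tags: List[str] = []


def build_with_tags(config: Dict[str, Any]) -> DemoDataSet:
    """Create a ``DemoDataSet`` from ``config``, treating ``tags`` as metadata."""
    # Work on a shallow copy so the caller's config is left untouched.
    kwargs = dict(config)
    # Strip "tags" so the constructor only sees arguments it accepts.
    tags = kwargs.pop("tags", None) or []
    data_set = DemoDataSet(**kwargs)
    data_set._tags = list(tags)
    return data_set


tagged = build_with_tags({"filepath": "data/01_raw/bikes.csv", "tags": ["transportation"]})
untagged = build_with_tags({"filepath": "data/01_raw/bikes.csv"})
print(tagged._tags, untagged._tags)
```

Giving every instance a `_tags` list, even an empty one, means a tag filter can read the attribute without special-casing untagged data sets.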

WaylonWalker · 2020-04-14T13:41:02Z

I personally like the idea of having the datasets tagged as well. I don't necessarily want to maintain tags in two places though and have to manage to keep them in sync.

Alternatively, today and without any changes, you can ask for the pipeline nodes that have a certain set of tags: tp = pipeline.only_nodes_with_tags('transportation'). You can then query that pipeline with tp.all_inputs(), tp.all_outputs(), tp.outputs(), tp.inputs(), or tp.data_sets().
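
The idea behind this workaround, deriving dataset names from node tags instead of tagging datasets directly, can be sketched without Kedro installed. `DemoNode` and `data_sets_for_tags` below are hypothetical stand-ins that only mimic the concept; they are not Kedro's Pipeline API, and the node and dataset names are invented for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class DemoNode:
    """Hypothetical stand-in for a pipeline node; not Kedro's node class."""

    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]
    tags: FrozenSet[str] = frozenset()


def data_sets_for_tags(nodes: Iterable[DemoNode], *tags: str) -> List[str]:
    """Collect the inputs and outputs of every node carrying any of ``tags``."""
    wanted = set(tags)
    names = set()
    for node in nodes:
        if wanted & node.tags:
            names.update(node.inputs)
            names.update(node.outputs)
    # Sort so the result is deterministic.
    return sorted(names)


NODES = [
    DemoNode("clean_bikes_node", ("bikes",), ("clean_bikes",), frozenset({"transportation"})),
    DemoNode("clean_trains_node", ("trains",), ("clean_trains",), frozenset({"transportation"})),
    DemoNode("clean_weather_node", ("weather",), ("clean_weather",), frozenset({"climate"})),
]

print(data_sets_for_tags(NODES, "transportation"))
```

One consequence of this approach is that tags live only on nodes, so a dataset that no tagged node reads or writes cannot be found this way.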

lorenabalan · 2020-10-05T14:28:12Z

Closing this as duplicate of #400

lucianoviola · 2021-01-17T02:27:16Z

I also think this would be very useful.

yetudada · 2023-05-26T09:51:30Z

This issue was opened forever ago and we've made it possible with #2537. Check out the thread on #1076. Thank you so much for the feedback!

tdrobbin added the Issue: Feature Request label Apr 13, 2020

sarchila pushed a commit to sarchila/kedro that referenced this issue Apr 15, 2020

[KED-1174] README.md redesign (kedro-org#324)

67aa3f4

* Reduced text to add in decription of why Kedro exists * Changed pipeline image * Moved content to FAQ

tdrobbin mentioned this issue Jul 1, 2020

Allow new attributes to be added to DataSets #400

Closed

lorenabalan closed this as completed Oct 5, 2020

yetudada mentioned this issue Nov 30, 2021

Allow new attributes to be added to DataSets #1076

Closed
