Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modular Pipeline breaks when external outputs are used as inputs to the same modular pipeline and another modular pipeline #1105

Closed
1 task done
rashidakanchwala opened this issue Sep 29, 2022 · 16 comments · Fixed by #1651
Assignees
Labels
Issue: Bug Report Python Pull requests that update Python code

Comments

@rashidakanchwala
Copy link
Contributor

rashidakanchwala commented Sep 29, 2022

Description

If an output from a modular pipeline is used both as an input to a node in the same modular pipeline or as an input to a node in another modular pipeline then Kedro-viz no longer recognises it as an external output to the modular pipeline.

Context

User @david-zihao-xu reported this issue on his Kedro-project.

See a visualization of the problem here.

Steps to Reproduce

  1. On demo project, notice the 'Prm Spine Table' does not have any input. When you expand it, you can see it is an output of a node from the modular pipeline - Ingestion. But when you collapse it again, it has no connection with ingestion.
  2. In the demo-project/src/pipeline/data-ingestion/pipeline.py remove the lambda function which uses 'Prm_spine_table' as an input.
  3. Once that is removed it all works fine. Ingestion is now having Prm Spine Table as output.

the above shows that this only happens when an internal output (output created inside a modular pipeline is used as both internal input and external input to another modular pipeline.

Expected Result

This needs some technical discussion because 'Prm_spine_table' should ideally be an external output to modular pipeline "ingestion" but if it's outside then how it will be an input to the same modular pipeline. This is where it becomes circular interaction and it's not clear how we should represent it.

Checklist

  • Include labels so that we can categorise your issue
@limdauto
Copy link
Collaborator

@tynandebold @rashidakanchwala iirc it was a deliberate decision to throw away edges to avoid cycles. If you scroll to the Edge cases section in this PDF, I noted a couple of them down: https://github.com/kedro-org/kedro-viz/blob/main/package/kedro_viz/docs/assets/expand_collapse_modular_pipelines_presentation.pdf

@rashidakanchwala
Copy link
Contributor Author

thanks @limdauto , this is so helpful. I remember something around these lines being discussed back then but could not find which documentation had it.

Also FYI - @jmholzer

@tynandebold
Copy link
Member

Notes from Technical Design session:

  • The current implementation seems to be removing the inbound edge to a node instead of the outbound
  • Let's investigate in the code how it's being done and play around with removing the other edge
  • We should aim to work out the possible combinations of these happenings and see what will be the best solution

@NeroOkwa
Copy link
Contributor

NeroOkwa commented Nov 25, 2022

Context

This pain point was also an output from the Kedro-Viz adoption synthesis #987

Sometimes in the flowchart, the layers can't be activated because of circular dependency. This makes it harder to navigate
your flowchart in Kedro-Viz.

Supporting quotes

  • “Sometimes on the layer level we have, we run into this that the layer can't be activated because of circular dependencies. I think there is an open issue for that, but that has been (especially in like big projects) a pain points sometimes because the graph that you then end up with is really much harder to navigate as opposed to with the layers that, that we created in the catalogs. And yeah, just being able to debug which data set is causing the issue would already be incredibly helpful”.

@DrDaDe
Copy link

DrDaDe commented Jan 24, 2023

I push this issue and provide another simple minimal working example:

pipeline code:

from kedro.pipeline import Pipeline, node, pipeline

def create_pipeline(**kwargs) -> Pipeline:

    sub_pipeline = pipeline(
        [
            node(lambda x: x,
                 inputs="dataset_1",
                 outputs="dataset_2",
                 name="step2"),
            node(lambda x: x,
                 inputs="dataset_2",
                 outputs="dataset_3",
                 name="step3"),
        ],
        inputs={"dataset_1"},
        outputs={"dataset_3"},
        namespace="sub_pipeline"
    )

    new_pipeline = pipeline(
        [
            node(lambda x: x,
                 inputs="dataset_in",
                 outputs="dataset_1",
                 name="step1"),
            sub_pipeline,
            node(lambda x: x,
                 inputs="dataset_3",
                 outputs="dataset_out",
                 name="step4"
            )
        ],
            namespace="main_pipeline",
        inputs=None,
        outputs={"dataset_out", "dataset_3"}
    )
    return new_pipeline

catalog:

dataset_in:
  type: json.JSONDataSet
  filepath: data/dataset_in.json

dataset_out:
  type: json.JSONDataSet
  filepath: data/dataset_out.json

dataset_in.json:

{
  "key": "value"
  }

Expanded Pipeline (correct)
image

Collapsed Pipeline (incorrect - dataset_3 should be output of overall pipeline)
image

@DrDaDe
Copy link

DrDaDe commented Oct 6, 2023

Hi @tynandebold ,

I wanted to re-raise this issue.

I was very happy when I saw that you fixed "incorrect rendering of datasets in modular pipelines." in the context of #1123 and #1439 and I went back to my old minimal working example to check whether it works now.

Unfortunately, it still does not.

Same code as before:

Expanded:
image
Collapsed:
image

Dataset 3 is still not connected to the collapsed pipeline even though it is explicitly listed as an output of that pipeline.

I actually don't even need a nested scenario for that. Removing " namespace="sub_pipeline" produces the same error: I get a properly connected queue of tasks/datasets if "main pipeline" is expanded, but a fully disconnected 'dataset_3' if the main_pipeline is collapsed.

I find it kind of strange that even after all the effort you put in to solve #1123, this seemingly trivial case still produces errors. But at the same time I carry high hopes that with all the learnings from 0.6.4, you may be able to fix this easily after fixing #1123.

For me, unfortunately, fixing this error is still vital because my steadily growing production pipeline has several datasets which are used at multiple steps in the pipeline and I still cannot properly modularise the pipeline without breaking the visualisation.

Cheers

@tynandebold
Copy link
Member

Thanks for raising this again @DrDaDe.

I've added it to our next sprint, which starts next week on Monday. We'll have a look then.

@tynandebold tynandebold moved this from Backlog to Todo in Kedro-Viz Oct 16, 2023
@rashidakanchwala rashidakanchwala moved this from Todo to In Progress in Kedro-Viz Oct 19, 2023
@rashidakanchwala
Copy link
Contributor Author

rashidakanchwala commented Oct 19, 2023

hi @DrDaDe - this is a known issue on Kedro-viz and we currently do not have a way to resolve this.

This is because dataset_3 acts as both an input and an output of main_pipeline. This creates a cyclic dependency and as cycles are not allowed in DAGS - we remove these edges.
We are working on having a better Kedro-viz error handling that informs users of this if it happens.

@tynandebold tynandebold moved this from In Progress to Done in Kedro-Viz Oct 20, 2023
@tynandebold tynandebold moved this from Done to In Progress in Kedro-Viz Oct 20, 2023
@DrDaDe
Copy link

DrDaDe commented Oct 27, 2023

Hi @rashidakanchwala,

Thank you for giving an update on this matter.

However, I have to disagree with the cyclic dependency problem as a general explanation. Again, here is my example from above, simplified as much as possible:

from kedro.pipeline import Pipeline, node, pipeline

def create_pipeline(**kwargs) -> Pipeline:


    new_pipeline = pipeline(
        [
            node(lambda x: x,
                 inputs="dataset_in",
                 outputs="dataset_1",
                 name="step1"),
            node(lambda x: x,
                 inputs="dataset_1",
                 outputs="dataset_2",
                 name="step2"),
            node(lambda x: x,
                 inputs="dataset_2",
                 outputs="dataset_3",
                 name="step3"),
            node(lambda x: x,
                 inputs="dataset_3",
                 outputs="dataset_out",
                 name="step4"
            )
        ],
            namespace="main_pipeline",
        inputs=None,
        outputs={"dataset_out", "dataset_3"}
    )
    return new_pipeline

it produces the exact same graphs as already provided in my earlier message. Especially, dataset_3 gets disconnected if the main pipeline is collapsed.

Clearly, there is no cyclic dependency at all in this pipeline, neither on the expanded nor at the collapsed level. At least not logically.

If, as you suggest, the edge is removed due to a cyclic dependency detection, then there must either be an algorithmic error in that detection or another artifact in the code which involuntarily created a cyclic dependency even if logically there is none.

But to be honest, I believe the root of this problem lies somewhere else.

@rashidakanchwala
Copy link
Contributor Author

rashidakanchwala commented Oct 27, 2023

@DrDaDe - this is a correct DAG in Kedro. However the cyclic dependency is introduced during the collapsed pipeline view in Kedro-viz

dataset3 is an output of main_pipeline.step3
dataset3 is an input of main_pipeline.step4

in a collapsed view when you cannot see both step3 and step4 -- a cycle is introduced in the DAG where dataset_3 becomes both input and output to main_pipeline.

Please see more details on the edge case section of this pdf - https://github.com/kedro-org/kedro-viz/blob/main/package/kedro_viz/docs/assets/expand_collapse_modular_pipelines_presentation.pdf

cc - @noklam

This is a known issue but it would be interesting to understand from you how you would be able to visualise this in a collapsed view.

@tynandebold tynandebold moved this from In Progress to Backlog in Kedro-Viz Oct 30, 2023
@DrDaDe
Copy link

DrDaDe commented Nov 1, 2023

Hi @rashidakanchwala,

first of all: Many thanks for the explanation and link to the source. I understand the issue a little better now and maybe we can sort this out faster:

What I agree with:

  • Dataset 3 is definitely considered an output of the modular pipeline, because I explicitly declare it to be.
  • If a dependency cycle occurs, edges have to be removed to be able to construct a tree view.
  • If dataset 3 is considered an input for the collapsed modular pipeline, a cycle occurs. (But mind the if, see below).

What I don't agree with

  • dataset 3 should never be considered an input for that pipeline. Neither do I declare it to be, nor does it make sense logically: An input to a pipeline must not and cannot be something the pipeline itself produces.

If you could tell me the part of the code which determines the inputs and outputs for the modular pipeline, I might be able to investigate further myself (I was unable to dig through the code by myself as I am not very good in understanding frontend project code).

Hence I can only vaguely guess what the issue is: I believe the algorithm creates dataset 3 as an external node because I declare it as an output. If the algorithm then determines if any external output node (including the one I just created) is used as any input for any of the internal nodes of the pipeline, then it will declare dataset3 as an input node and we have a problem.

From this random guess and not knowing what the actual algorithm is, I propose two possible solutions:

  • If a dataset is the output of any node inside a modular pipeline. then it cannot be an input dataset of that modular pipeline.
  • If a dataset is the output of a modular pipeline, then it cannot be an input dataset of that same modular pipeline.

Let me know if I can be of any more help. Also, if you could tell me which part of the code determines the inputs and outputs of the collapsed pipeline, I'd be happy to check and test for myself. It would also be interesting if you could provide counter-examples to my proposal above, i.e. examples where it is important that internal datasets are considered as input datasets even though not explicitly declared as such.

Addendum: Another thought of mine: Why is it that the algorithm believes dataset3 is an input but not, e.g., dataset2? The fact that I declare dataset3 as "output" seems to label it as a potential "candidate" for an input node. Proposal 2 should therefore straightforwardly fix this.

@DrDaDe
Copy link

DrDaDe commented Nov 1, 2023

If anyone stumbles across here: A colleague of mine found a nice workaround:

If I have a dataset like dataset3 (I'll call it internal_dataset) which is both an external output and an "internal input", I can circumvent the problem as follows:

  • add a pseudonode with a trivial lambda function which maps the input to itself
  • use internal_dataset as an input and define a new dataset external_dataset as output of that node
  • use external_dataset as the output of the modular pipeline.

Now the ambiguity is broken: internal dataset is purely internal (produced and consumed), external dataset is purely external (only produced and available as an external node)

from kedro.pipeline import Pipeline, node, pipeline

def create_pipeline(**kwargs) -> Pipeline:


    new_pipeline = pipeline(
        [
            node(lambda x: x,
                 inputs="dataset_in",
                 outputs="dataset_1",
                 name="step1"),
            node(lambda x: x,
                 inputs="dataset_1",
                 outputs="dataset_2",
                 name="step2"),
            node(lambda x: x,
                 inputs="dataset_2",
                 outputs="dataset_3",
                 name="step3"),
            node(lambda x: x,
                 inputs="dataset_3",
                 outputs="pseudodataset_3",
                 name="pseudostep3"),
            node(lambda x: x,
                 inputs="dataset_3",
                 outputs="dataset_out",
                 name="step4"
            )
        ],
            namespace="main_pipeline",
        inputs=None,
        outputs={"dataset_out", "pseudodataset_3"}
    )
    return new_pipeline
grafik grafik

Grain of salt: The additional pseudonode may create confusion when reading the pipeline. In my case, for example, my trivial chain of nodes now becomes an actual tree with a branch.

@rashidakanchwala
Copy link
Contributor Author

rashidakanchwala commented Nov 2, 2023

Let me know if I can be of any more help. Also, if you could tell me which part of the code determines the inputs and outputs of the collapsed pipeline, I'd be happy to check and test for myself. It would also be interesting if you could provide counter-examples to my proposal above, i.e. examples where it is important that internal datasets are considered as input datasets even though not explicitly declared as such.

@DrDaDe - Thanks for the above suggestions and workarounds :) Your thoughts are very helpful. We will investigate the above in our next sprint. Also here's where the code resides - https://github.com/kedro-org/kedro-viz/blob/main/package/kedro_viz/data_access/managers.py#L389

Why is it that the algorithm believes dataset3 is an input but not, e.g., dataset2? The fact that I declare dataset3 as "output" seems to label it as a potential "candidate" for an input node. Proposal 2 should therefore straightforwardly fix this.

That is because when dataset3 becomes an external output to the modular_pipeline - in the collapsed mode it lies outside the modular pipeline whereas dataset2 resides inside the modular pipeline hence it is a hidden when the modular pipeline is collapsed.

@DrDaDe
Copy link

DrDaDe commented Nov 2, 2023

Hi @rashidakanchwala

First of all: My minimal working example has now boiled further down to this:

from kedro.pipeline import Pipeline, node, pipeline

def create_pipeline(**kwargs) -> Pipeline:


    new_pipeline = pipeline(
        [
            node(lambda x: x,
                 inputs="dataset_in",
                 outputs="dataset_middle",
                 name="step_in"),
            node(lambda x: x,
                 inputs="dataset_middle",
                 outputs="dataset_out",
                 name="step_out"
            )
        ],
            namespace="main_pipeline",
        inputs=None,
        outputs={"dataset_out", "dataset_middle"}
    )
    return new_pipeline

I tried running my pipeline through the create_modular_pipelines_tree_for_registered_pipelinemethod you referenced and check what happens:

Here is the list of "inputs", "outputs", "external inputs" and "external outputs":

inputs: ['main_pipeline.dataset_in']
outputs  ['dataset_out']
external_inputs: ['dataset_middle']
external_outputs: ['dataset_out', 'dataset_middle']

Node that the cycle-detector looks for the "descendants" which are as follows:

descendants = nx.descendants(digraph, modular_pipeline_id)
bad_inputs = modular_pipeline.inputs.intersection(descendants)

Here are the descendants, the graph produces

['dataset_out']

Funnily, since this does not intersect with "inputs", nothing gets removed! And I actually observed while debugging, that the "cycle node remover" code is never executed.

So as a fun summary: What you suspected indeed happens: dataset_middle is considered an "external_input" but not an input. Still, it has nothing to do with the code you referenced, because the cycle detector looks for overlaps between inputs and descendents, not external inputs.

To be honest, I am even more confused than before but I hope maybe we can get closer to the core of the problem :-)

My followup questions would be:

  • which earlier part of the code decides that "dataset_middle" is an input? (Because this is where I disagree)
  • which later part of the code translates this pipeline into the 'collapsed view'?

@rashidakanchwala
Copy link
Contributor Author

rashidakanchwala commented Nov 8, 2023

@DrDaDe , you are right. On further investigation based on your above point we realised that this part of the code cancels external_inputs and external_outputs and this prevents dataset_middle to be an output of the modular pipeline.

This in indeed a bug! We are planning to address this issue in our upcoming sprint and will keep you updated once the fix is released. Thank you for bringing this to our attention!

@rashidakanchwala rashidakanchwala added Python Pull requests that update Python code and removed Technical Design labels Nov 9, 2023
@rashidakanchwala rashidakanchwala moved this from Backlog to Todo in Kedro-Viz Nov 13, 2023
@rashidakanchwala rashidakanchwala moved this from Todo to In Progress in Kedro-Viz Nov 14, 2023
@rashidakanchwala rashidakanchwala moved this from In Progress to In Review in Kedro-Viz Nov 21, 2023
@github-project-automation github-project-automation bot moved this from In Review to Done in Kedro-Viz Nov 22, 2023
@rashidakanchwala
Copy link
Contributor Author

@DrDaDe - The fix for this issue has been released in the latest Kedro-viz 7.0.0. Thank you so much for your support and relentless pursuit to have this issue fixed! 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Bug Report Python Pull requests that update Python code
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

5 participants