Spike: how to handle overlap in example spaceflights projects? #2874
Comments
The different starters we need are:
I've mapped out the differences between these various projects. Green highlighting (shown here as a leading +) means a change is required in the file, and ⭐️ indicates a new file that needs to be added.
Spaceflights Pandas -> Spaceflights Pyspark
├── conf
│ ├── base
+ │ ├── catalog.yml
+ │ ├── spark.yml ⭐️
│ │ ├── parameters.yml
│ │ ├── logging.yml
│ ├── local
├── data
├── docs
├── notebooks
├── src
│ ├── spaceflights
│ │ ├── pipelines
│ │ │ ├── data_processing
+ │ │ │ ├── nodes.py
+ │ │ │ ├── pipeline.py
│ │ │ ├── data_science
│ │ │ │ ├── nodes.py
+ │ │ │ ├── pipeline.py
│ │ ├── __init__.py
│ │ ├── main.py
+ │ ├── hooks.py ⭐️
│ │ ├── pipeline_registry.py
+ │ ├── settings.py
│ ├── tests
+ ├── requirements.txt
│ ├── setup.py
└── pyproject.toml
Spaceflights Pandas -> Spaceflights Databricks
├── conf
│ ├── base
+ │ ├── catalog.yml
+ │ ├── spark.yml ⭐️
│ │ ├── parameters.yml
+ │ ├── logging.yml
│ ├── local
├── data
├── docs
├── notebooks
├── src
│ ├── spaceflights
│ │ ├── pipelines
│ │ │ ├── data_processing
+ │ │ │ ├── nodes.py
+ │ │ │ ├── pipeline.py
│ │ │ ├── data_science
│ │ │ │ ├── nodes.py
+ │ │ │ ├── pipeline.py
│ │ ├── __init__.py
│ │ ├── main.py
+ │ ├── databricks_run.py ⭐️
+ │ ├── hooks.py ⭐️
│ │ ├── pipeline_registry.py
+ │ ├── settings.py
│ ├── tests
+ ├── requirements.txt
│ ├── setup.py
└── pyproject.toml
Spaceflights Pyspark -> Spaceflights Databricks
├── conf
│ ├── base
+ │ ├── catalog.yml
│ │ ├── spark.yml
│ │ ├── parameters.yml
+ │ ├── logging.yml
│ ├── local
├── data
├── docs
├── notebooks
├── src
│ ├── spaceflights
│ │ ├── pipelines
│ │ │ ├── data_processing
│ │ │ │ ├── nodes.py
│ │ │ │ ├── pipeline.py
│ │ │ ├── data_science
│ │ │ │ ├── nodes.py
│ │ │ │ ├── pipeline.py
│ │ ├── __init__.py
│ │ ├── main.py
+ │ ├── databricks_run.py ⭐️
│ │ ├── hooks.py
│ │ ├── pipeline_registry.py
│ │ ├── settings.py
│ ├── tests
│ ├── requirements.txt
│ ├── setup.py
└── pyproject.toml
Spaceflights Pandas -> Spaceflights Pandas Viz
Viz features added: experiment tracking, plotting with Plotly, and plotting with Matplotlib
├── conf
│ ├── base
+ │ ├── catalog.yml
│ │ ├── parameters.yml
│ │ ├── logging.yml
│ ├── local
├── data
├── docs
├── notebooks
├── src
│ ├── spaceflights
│ │ ├── pipelines
│ │ │ ├── data_processing
+ │ │ │ ├── nodes.py
+ │ │ │ ├── pipeline.py
│ │ │ ├── data_science
+ │ │ │ ├── nodes.py
+ │ │ │ ├── pipeline.py
+ │ │ ├── reporting ⭐️
+ │ │ │ ├── nodes.py ⭐️
+ │ │ │ ├── pipeline.py ⭐️
│ │ ├── __init__.py
│ │ ├── main.py
│ │ ├── pipeline_registry.py
+ │ ├── settings.py
│ ├── tests
+ ├── requirements.txt
│ ├── setup.py
└── pyproject.toml
Spaceflights Pyspark -> Spaceflights Pyspark Viz
├── conf
│ ├── base
+ │ ├── catalog.yml
│ │ ├── spark.yml
│ │ ├── parameters.yml
│ │ ├── logging.yml
│ ├── local
├── data
├── docs
├── notebooks
├── src
│ ├── spaceflights
│ │ ├── pipelines
│ │ │ ├── data_processing
+ │ │ │ ├── nodes.py
+ │ │ │ ├── pipeline.py
│ │ │ ├── data_science
+ │ │ │ ├── nodes.py
+ │ │ │ ├── pipeline.py
+ │ │ ├── reporting ⭐️
+ │ │ │ ├── nodes.py ⭐️
+ │ │ │ ├── pipeline.py ⭐️
│ │ ├── __init__.py
│ │ ├── main.py
│ │ ├── hooks.py
│ │ ├── pipeline_registry.py
+ │ ├── settings.py
│ ├── tests
+ ├── requirements.txt
│ ├── setup.py
└── pyproject.toml
Based on the above, the only obvious merge I see is between the Pyspark and Databricks examples. The other combinations require a lot of changes, and the reduction in maintenance burden for the starters would be offset by added complexity in the logic for pulling in the correct examples for users in the
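To make the comparison above concrete, here is a rough sketch that tallies the files marked as changed or new in each tree and ranks the transitions by total files touched. The file lists are transcribed by hand from the trees, the paths are shortened for readability, and the ⭐️ reporting directory is counted through its two files only. The names DIFFS and rank_by_effort are made up for illustration and are not part of Kedro.

```python
# Hypothetical tally of the trees above: which files change, which are new.
DP = ["data_processing/nodes.py", "data_processing/pipeline.py"]
DS = ["data_science/nodes.py", "data_science/pipeline.py"]
REPORTING = ["reporting/nodes.py", "reporting/pipeline.py"]
COMMON = ["catalog.yml", "settings.py", "requirements.txt"]

DIFFS = {
    ("Pandas", "Pyspark"): {
        "changed": COMMON + DP + ["data_science/pipeline.py"],
        "new": ["spark.yml", "hooks.py"],
    },
    ("Pandas", "Databricks"): {
        "changed": COMMON + DP + ["data_science/pipeline.py", "logging.yml"],
        "new": ["spark.yml", "hooks.py", "databricks_run.py"],
    },
    ("Pyspark", "Databricks"): {
        "changed": ["catalog.yml", "logging.yml"],
        "new": ["databricks_run.py"],
    },
    ("Pandas", "Pandas Viz"): {"changed": COMMON + DP + DS, "new": REPORTING},
    ("Pyspark", "Pyspark Viz"): {"changed": COMMON + DP + DS, "new": REPORTING},
}


def rank_by_effort(diffs):
    """Return (pair, files_touched) tuples, smallest number of files first."""
    totals = {
        pair: len(files["changed"]) + len(files["new"])
        for pair, files in diffs.items()
    }
    return sorted(totals.items(), key=lambda item: item[1])


ranking = rank_by_effort(DIFFS)
for (base, target), total in ranking:
    print(f"{base} -> {target}: {total} files touched")
```

File counts are only a proxy for effort, since they say nothing about how large each change is, but they do line up with the observation that Pyspark -> Databricks is the cheapest pair to merge.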
Based on these findings, would you recommend merging the Pyspark and Databricks examples then? What projects would we expose to users in the new Starter repo? When we last discussed this, there was a difference between how it all worked "behind the scenes" and what it would look like to the user, i.e. no need for the user to know that projects were constructed using cookiecutter, if that's what we chose.
How are we managing files that are the same across starters? Maybe we can have a starter template internally so that in future we ensure all starters have the same core, avoiding the problem we had before.
Yes, that one is easy to merge and also de-duplicate.
I think that's basically the "vanilla" spaceflights. If we ever create a new starter it should just be based on that.
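On the question of files that are the same across starters: one way to check which files really form a shared core is to hash every file in each starter and keep the relative paths whose contents match everywhere. This is only a sketch under the assumption that each starter is checked out as a plain directory. The helpers file_digests and shared_core are hypothetical names, not existing Kedro utilities.

```python
import hashlib
from pathlib import Path


def file_digests(root):
    """Map each file path (relative to root) to a SHA-256 digest of its bytes."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in root.rglob("*")
        if path.is_file()
    }


def shared_core(starter_dirs):
    """Return sorted relative paths that are byte-identical in every starter."""
    digests = [file_digests(directory) for directory in starter_dirs]
    if not digests:
        return []
    first, rest = digests[0], digests[1:]
    return sorted(
        path
        for path, digest in first.items()
        if all(other.get(path) == digest for other in rest)
    )
```

Calling shared_core with the paths of the starter directories would list the files that could live in a single internal template, while everything else stays starter-specific.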
Conclusion
My recommendation is to merge only the Pyspark and Databricks starters and keep the rest separate. This means we need to create:
Description
Follow up on #2758 and #2838
We should look at how we deal with overlaps in the spaceflights projects. Can we somehow combine them to lessen the maintenance burden?
Context
In #2838 we'll add several new spaceflights-based projects that will also serve as the examples a user can add to a project at creation with the new utilities flow. These examples will likely all have similar files, so the question is: do we need each complete project, or can we somehow combine them and still provide users with different examples?
Possible Implementation
The aim of this spike is to come up with possible implementations for merged examples.