How do I convert a Jupyter notebook into a Kedro project? #2461
Comments
I like this! I have done it so many times :)
I've dumped the content here just to see what it looks like formatted in GitHub. If you'd like to give feedback, I've created a Box note to make it easier :) https://mckinsey.box.com/s/xvzwguj8hy37436xdgtold0me5kaiya3 The companion Jupyter notebook that the tutorial refers to can be found here: https://mckinsey.box.com/s/86r7op40jx9i3oxy89unvpjyqsmld7bm
Have you just finished doing some modelling in a Jupyter notebook? Are you interested in how it might be converted into a Kedro project? Being new to Kedro, I started my learning with the Kedro Spaceflights tutorial. However, it wasn’t immediately clear how the different steps in the data science process I was familiar with mapped to Kedro. Hopefully, this step-by-step walkthrough helps. It starts with a Jupyter notebook of the Spaceflights project and walks through how we might convert it to a Kedro project, all while following the flow of a typical data science project.
Kedro Setup
This tutorial assumes you have Python 3.7+ and
Set up a virtual environment for your project
It is recommended to create a new virtual environment for each new Kedro project to isolate its dependencies from those of other projects. To create a new environment called
To activate the environment:
Install Kedro
Create new Kedro project
When prompted for a project name, enter
Now you are ready to start converting your Jupyter notebook into a brand new Kedro project!
1. Import Data
In our Jupyter notebook, we begin by reading in our data sources.
In Kedro, we store our CSV files in
Registering our Data
For our Kedro project to be able to access and use that data, we then have to ‘register our datasets’. It’s a little more complicated than calling
In the
2. Process Data
Now that our data is loaded, we can start processing it to prepare for model building. In our Jupyter notebook, we have three cells, each fixing up different columns in one dataset. Then we merge them into one big table suitable for input into a model.
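For illustration, the merge step could look roughly like the sketch below. The function name, the column names (`id`, `shuttle_id`, `company_id`) and the three toy frames are assumptions modelled on the public Spaceflights data, not taken from the notebook itself.

```python
import pandas as pd


def create_model_input_table(
    shuttles: pd.DataFrame, companies: pd.DataFrame, reviews: pd.DataFrame
) -> pd.DataFrame:
    """Join the three cleaned tables and drop rows with missing values."""
    # Attach each review to the shuttle it refers to.
    rated_shuttles = shuttles.merge(reviews, left_on="id", right_on="shuttle_id")
    # Attach the operating company; both sides have an `id` column,
    # so pandas adds the suffixes `_x` and `_y`.
    table = rated_shuttles.merge(companies, left_on="company_id", right_on="id")
    return table.dropna()


# Toy data (hypothetical values) to show the shape of the result.
shuttles = pd.DataFrame({"id": [1, 2], "company_id": [10, 20], "engines": [2.0, 4.0]})
reviews = pd.DataFrame({"shuttle_id": [1, 2], "review_scores_rating": [90.0, None]})
companies = pd.DataFrame({"id": [10, 20], "company_rating": [1.0, 0.5]})

model_input_table = create_model_input_table(shuttles, companies, reviews)
print(model_input_table)
```

The second shuttle has no review score, so `dropna()` removes it and only one row remains.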
Creating Nodes
In Kedro, each of these cells would become a function, which is then encapsulated by a Node to be placed in a pipeline. A function, given an input (in this case, a dataset), performs a set of actions to generate some output (in this case, a cleaned dataset). Its behaviour is consistent, repeatable and predictable: putting the same dataset into a Node will always return the same cleaned dataset. In
As you can see, the contents of these functions can be copied directly from the relevant cells in our Jupyter notebook, but a few minor changes would improve code quality. We extract some utility functions to reduce code duplication: (these can be pasted into the top of
Then we adjust our preprocessing functions to use these utility functions.
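A minimal sketch of what the refactored code might look like, assuming the Spaceflights column names `iata_approved` and `company_rating`. The helper names and the sample values are illustrative, not necessarily the ones used in the notebook.

```python
import pandas as pd


def _is_true(x: pd.Series) -> pd.Series:
    """Map the strings 't'/'f' to booleans."""
    return x == "t"


def _parse_percentage(x: pd.Series) -> pd.Series:
    """Turn strings such as '67%' into fractions such as 0.67."""
    return x.str.replace("%", "", regex=False).astype(float) / 100


def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; the input frame is left untouched."""
    companies = companies.copy()
    companies["iata_approved"] = _is_true(companies["iata_approved"])
    companies["company_rating"] = _parse_percentage(companies["company_rating"])
    return companies


# Hypothetical raw data in the same shape as the companies table.
raw = pd.DataFrame({"iata_approved": ["t", "f"], "company_rating": ["100%", "67%"]})
clean = preprocess_companies(raw)
print(clean)
```

Working on a copy keeps the function free of side effects, which is what makes a node repeatable: the same input always yields the same output.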
Assembling the Pipeline
Now that we have moved our data pre-processing functions into the Kedro framework, how do we tell it the order in which to execute them? In a Jupyter notebook, two functions in the same cell would execute one after the other. If they were in different cells, we could choose to run them in cell order, but we could also run them in any order by executing each cell manually. When it comes to data processing, it is easy to see why executing cells in a specific order is important, as we do not want to construct
The pipeline is assembled in the
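Kedro does not rely on cell order: it works out the run order from each node's declared inputs and outputs. The sketch below illustrates that idea with Python's standard `graphlib` module rather than with Kedro itself. The node and dataset names are hypothetical, and this is not how Kedro is implemented internally.

```python
from graphlib import TopologicalSorter

# Hypothetical node declarations: name -> (input datasets, output datasets).
# They are deliberately listed in the "wrong" order.
NODES = {
    "create_model_input_table": (
        ["preprocessed_shuttles", "preprocessed_companies", "reviews"],
        ["model_input_table"],
    ),
    "preprocess_shuttles": (["shuttles"], ["preprocessed_shuttles"]),
    "preprocess_companies": (["companies"], ["preprocessed_companies"]),
}


def execution_order(nodes):
    """Order nodes so each one runs after the nodes that produce its inputs."""
    # Which node produces each dataset?
    producer = {out: name for name, (_, outs) in nodes.items() for out in outs}
    # For every node, the set of nodes it depends on. Raw inputs such as
    # "shuttles" have no producer and therefore add no dependency.
    graph = {
        name: {producer[i] for i in inputs if i in producer}
        for name, (inputs, _) in nodes.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(NODES)
print(order)
```

The two preprocessing nodes have no dependency on each other, so their relative order is not fixed. The merge node always comes last because it consumes their outputs.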
This creates a pipeline that first calls the
Persisting Output Datasets
Each node defines an output, in this case a processed dataset. If we want these to persist beyond each run of the pipeline, we need to register the dataset. This is similar to how we registered our input datasets in 1. Import Data. We do this by adding them to the Data Catalog. In our
This registers them as Parquet datasets, and the outputs from the corresponding nodes will be saved into them on every run. If we choose not to register a dataset, the data will be stored in memory as temporary Python objects during the run and cleared after the run is complete.
Creating the Model Input Table
Next, we will use the node that outputs the model input table as an example to walk through all the steps needed to add a node to the pipeline.
1. Create Node Function (
One thing to consider in this workflow: As a user, my starting point could be
(not all projects will have their requirements explicitly declared, but they will have implicit requirements nonetheless). If my
(which is JupyterLab, plus the Language Server Protocol extensions for JLab and Python, which add IDE-like features to JLab https://github.com/jupyter-lsp/jupyterlab-lsp) by default, kedro/kedro/templates/project/{{ cookiecutter.repo_name }}/src/requirements.txt Lines 6 to 8 in 7f44733
Reconciling these requirements is a bit of work, which I'm not sure people other than Python packaging nerds could successfully do. At the moment I don't have specific proposals on how to address this, but wanted to write down the problem anyway. Related: #2276
Just to comment that I've promised to take a look at the draft on Notion and the feedback received (plus the work that @astrojuanlu did on a similar project with a notebook using Polars, and the MLOps article shared on Slack) to work out next steps. I won't get to this until sometime in w/c 24/04 since I'm on leave after today.
Reference to that project: https://github.com/astrojuanlu/workshop-jupyter-kedro
I'm closing this ticket now as we have a new direction for the post -- it'll be split into a series of 4 smaller posts and these will be published on the blog in turn. See kedro-org/kedro-devrel#80 for more detail.
Description
This actions part of #410 and describes a workflow where users learn how to convert a Jupyter notebook into a Kedro project. The scope of this work includes:
Context
Users often ask this question when they have existing notebooks and want to use Kedro as part of their refactoring cycle.
We have seen interest in this from the views on these videos: