- Link to the challenge: https://statistics-awards.eu/competitions/4
- Link to the public repository: https://github.com/antoine-palazz/deduplication
- Link to the Datalab used for the coding part of the challenge: https://datalab.sspcloud.fr/. For more information about the Onyxia project: https://github.com/InseeFrLab/onyxia-web.
- For more information, you can contact Antoine Palazzolo, who represents team Nins.
This is a Kedro project, which was generated using Kedro 0.18.6.
Take a look at the Kedro documentation to get started.
There are two ways to provide the data:
- Manually load the initial dataset wi_dataset.csv into the folder data/01_raw/ (and any past approaches or submissions into data/09_past_approaches/).
- In the file setup.sh, change the s3 path to point to your dataset wi_dataset.csv and to any past approaches to the problem, then uncomment the import.
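The first option can also be scripted. The following is a minimal sketch, not part of the repository: the function name place_dataset and the source path in the usage comment are hypothetical, and only the target location data/01_raw/wi_dataset.csv comes from the instructions above.

```python
from pathlib import Path
import shutil


def place_dataset(source_csv: Path, project_root: Path = Path(".")) -> Path:
    """Copy a local dataset file to data/01_raw/wi_dataset.csv under project_root."""
    raw_dir = project_root / "data" / "01_raw"
    raw_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    target = raw_dir / "wi_dataset.csv"
    shutil.copyfile(source_csv, target)
    return target


# Hypothetical usage, run from the project root:
# place_dataset(Path("~/Downloads/wi_dataset.csv").expanduser())
```

Past approaches or submissions could be copied into data/09_past_approaches/ in the same way.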
To install the dependencies (and possibly import your data), run:
./setup.sh
To visualize the different elements of the code interactively, run:
kedro viz
You can run your Kedro project with:
kedro run
To run only the part that was selected for the final submission, add --tags=final_models. To see which parts of the pipelines this filter keeps, you can apply the same filter in the visual representation generated by kedro viz.
If you have several CPUs at your disposal and want to speed up the execution, you can run the following commands:
kedro run --tags=final_models_parallel_part --runner=ParallelRunner
kedro run --tags=final_models_sequential_part --runner=SequentialRunner
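The two commands above can be chained from a small script. This is a hedged sketch rather than something the repository provides: the helper run_final_models is hypothetical, and it assumes the commands should run in the order listed, stopping if the first one fails. Only the tags and runner names come from the commands above.

```python
import subprocess

# The two commands from the text, written as argument lists.
COMMANDS = [
    ["kedro", "run", "--tags=final_models_parallel_part", "--runner=ParallelRunner"],
    ["kedro", "run", "--tags=final_models_sequential_part", "--runner=SequentialRunner"],
]


def run_final_models(commands=COMMANDS, runner=subprocess.run):
    """Run each command in the listed order, stopping at the first failure."""
    for command in commands:
        # check=True makes subprocess.run raise CalledProcessError on a non-zero exit
        runner(command, check=True)


# Hypothetical usage, run from the project root with the dependencies installed:
# run_final_models()
```

The runner argument exists only so the sequence can be tested without launching Kedro.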
The final output will be stored in data/07_model_output/best_duplicates.csv, and a description of the output will be available in data/08_reporting/best_duplicates_description.csv.
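For a quick look at the output before opening the notebook, a sketch like the one below can help. It assumes the file is comma-separated, UTF-8 encoded, and starts with a header row. None of these are stated here, and the function summarize_csv is hypothetical.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return the header row and the number of data rows of a CSV file."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # empty list if the file has no rows
        row_count = sum(1 for _ in reader)
    return header, row_count


# Hypothetical usage, run from the project root after kedro run:
# header, row_count = summarize_csv("data/07_model_output/best_duplicates.csv")
# print(header, row_count)
```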
To further analyze and possibly improve the output by using past approaches, you can then use the notebook stored in notebooks/use_past_approaches.ipynb.
To generate or update the dependency requirements for your project:
kedro build-reqs
This will pip-compile the contents of src/requirements.txt into a new file src/requirements.lock. You can see the output of the resolution by opening src/requirements.lock.
After this, if you'd like to update your project requirements, please update src/requirements.txt and re-run kedro build-reqs.
For further information about project dependencies, building project documentation, and packaging your project, see the Kedro documentation.