Submission: Group 15: Contraceptive Method Predictor Report #18

Open
1 task done
harryyikhchan opened this issue Nov 30, 2021 · 5 comments

@harryyikhchan
harryyikhchan commented Nov 30, 2021

Submitting authors: @harryyikhchan @christopheralex @abhiket @valli180

Repository: https://github.com/UBC-MDS/contraceptive_method_predictor
Report link: https://github.com/UBC-MDS/contraceptive_method_predictor/blob/main/doc/contraceptive_method_predictor_report.md
Abstract/executive summary:
Here we attempt to build a classification model using the SVC classifier algorithm to help predict a woman's use of contraceptives based on her demographic and socio-economic characteristics. The target, which originally had 3 classes, has been modified to 2 classes, with target 1 indicating contraceptive use (short term or long term) and 0 indicating no use.

Our model performed fairly well on unseen data, with an overall accuracy of ~74% and an area under the curve (AUC) of 78%. However, the model still makes some false predictions about non-use of contraceptives. These cases were false positives, that is, predicting contraceptive use when in fact the person does not use contraceptives. Such predictions give misleading insights into contraceptive usage, so we feel further work to improve the model's predictions is needed before we could put this model into the real world.
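For context, a minimal sketch of the target binarization and SVC setup described in the summary, assuming the processed training split referenced in the repository and the usual UCI coding of the target (1 = no use, 2 = long term, 3 = short term); the column name is an assumption, not taken from the project's code:

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

train = pd.read_csv("data/processed/train.csv")
# Collapse the assumed 3-class coding (1 = no use, 2 = long term, 3 = short term)
# into the binary target described above: 1 = any use, 0 = no use.
y = (train["Contraceptive_method_used"] != 1).astype(int)  # hypothetical column name
X = train.drop(columns=["Contraceptive_method_used"])

pipe = make_pipeline(StandardScaler(), SVC(probability=True))
print("Mean cross-validated AUC:", cross_val_score(pipe, X, y, scoring="roc_auc", cv=5).mean())
```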

Editor: @flor14
Reviewers: Affrin Sultana, Samuel Quist, Rong Li, Cuthbert Chow

  • I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.
@cuthchow

cuthchow commented Dec 3, 2021

Data analysis review checklist

Reviewer: @cuthchow

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well-known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5 hours

Review Comments:

  • In general, the structure of the repository was well thought out, and files/folders were placed in logical locations. It was easy to understand the steps of research, and what was being done by the scripts at each stage.
  • However, I tried running the usage steps as indicated in the README but was unable to run them to completion. The EDA figure output required the altair_saver package, which does not appear to have been included in the environment file (this is based on the 0.1.0 milestone version, although it appears to have been addressed in later versions). Moreover, the paths provided as inputs for some of the steps appear to be incorrect. For instance, preprocess_model_selection.py was referenced, but the actual file is located inside the /src folder.
  • Although the required dependencies were included in an environment file, it might be better to list the dependencies in the README as well so it is easier for viewers to understand which dependencies are being used by the project.
  • Although the scripts were mostly functioning, it may be helpful to include print messages in the scripts when their operation is successful, to indicate what has just been done (see the sketch after this list).
  • The research question is well stated and elaborated, and a good explanation was given about the background, context and importance of the question.
  • Good use of tables in the report to clarify important information in a concise manner (e.g., which transformations were applied to which columns and column types), although it would be informative to indicate why certain transformations were applied, rather than just which ones.
  • Similarly, the methodology could be further elaborated upon, explaining why particular models and evaluation metrics were chosen for this task.
  • The explanation of the ROC curve is a bit lacking in the report, and should ideally explain the significance of the AUC value with reference to the problem at hand.
  • Although the research question being asked pertains to prediction, it may still be helpful to include a section discussing the feature importances, in order to better understand which specific factors affect the prediction of the models, as the question you posed specifically references the use of 'demographic and socio-economic status' as predictors.
  • I would further recommend a short ethical discussion regarding the inclusion of certain factors in the predictive task (e.g., religion).
  • Overall, the research question and scope were well specified, the report did a great job of addressing the key questions being asked, and barring a couple of issues with the usage, the code was easy to use and follow along with. Good work!
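Regarding the print-message and altair_saver points above, a minimal illustrative pattern (not the project's actual code) for failing fast on the missing dependency and confirming success at the end of a script:

```python
import sys

try:
    import altair_saver  # noqa: F401  -- needed to write Altair charts to disk
except ImportError:
    sys.exit("altair_saver is required to save the EDA figures; "
             "add it to the environment file and re-create the environment.")

def main(out_path="results/eda_figure.png"):  # hypothetical output path
    # ... chart-building code would go here ...
    print(f"EDA figures saved to {out_path}")

if __name__ == "__main__":
    main()
```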

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

@lirnish

lirnish commented Dec 4, 2021

Data analysis review checklist

Reviewer: @lirnish

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well-known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5h

Review Comments:

  • Although the example usage is clear, I was still unable to create the conda environment using conda env create -f env-cmp.yaml. I believe this is because of entries like catboost==1.0.3=py39h6e9494a_1, where the pinned build string makes the environment not portable across platforms. Please see the details here.
  • When I am in the root directory, I was unable to run the command as given in the usage:
    • pre-process data and train model
    • python src/preprocess_model_selection.py --path="../data/processed/train.csv" --score_file="../results/val_score_results.csv" --model_path=results/models/final_svc.pkl
    • Instead, this one works: python src/preprocess_model_selection.py --path="data/processed/train.csv" --score_file="results/val_score_results.csv" --model_path=results/models/final_svc.pkl
    • It is a bit counterintuitive that we would need to go into a subdirectory to run this command (see the path-handling sketch after this list).
    • Similarly, the 'test model' command has the same issue.
  • At first glance, I was unable to find the final report, as there is no pointer to it in the README file. I would recommend adding a link to your report in the README so that it is easy to find.
  • The final report is well structured and I especially like the graphs used, it clearly presents the results.
  • I agree with Cuth's review that the methodology used in the data analysis could be further justified. Maybe cite some external sources to explain why the particular models/scoring metrics were used.
  • One thing I am a little concerned about is the change of the target from 3 categories to 2 categories. As the introduction suggests, part of the importance of this question is 'the adverse effects contraceptive’s can have on a person’s health based on the usage'. If the target changes and the model is no longer trying to differentiate between long-term and short-term usage, the introduction may need a revision.
  • It is understandable that some short scripts like split_data.py and download_data.py may look straightforward, but I would still recommend adding some documentation there, because you may check this several years later and the documentation could help your understanding.
  • Just a small thing: maybe restructure the Contributing.md file a bit? Some of those items don't seem to belong under 'Fixing typos'.
  • In general, this project is well organized. It is easy to follow the developer's steps and understand what is going on there.
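On the path issue above, one possible sketch of making script paths independent of the working directory by resolving them against the repository root (the helper name and layout are assumptions, not the project's actual code):

```python
from pathlib import Path

# Assumed layout: the script lives in <repo_root>/src/, so two parent hops
# from the resolved file path reach the repository root.
REPO_ROOT = Path(__file__).resolve().parent.parent

def resolve_path(path_arg: str) -> Path:
    """Interpret a relative --path argument against the repo root so the
    script behaves the same from any working directory."""
    p = Path(path_arg)
    return p if p.is_absolute() else REPO_ROOT / p

# e.g. resolve_path("data/processed/train.csv") -> <repo_root>/data/processed/train.csv
```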

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

@squisty

squisty commented Dec 5, 2021

Data analysis review checklist

Reviewer: @squisty

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2

Review Comments:

  • I was unable to run the make all command, running into an issue with not having reticulate installed. If there were a list of the R dependencies in the README, then we could see which packages we need that are not included in the environment file.
  • I did not notice any tests ensuring that the functions will not run with incorrect types of inputs; that's an easy fix by adding in some assert statements (see the sketch after this list).
  • We are told that the model will make predictions but are not told exactly how it works (I'm not sure how in-depth this would be, maybe just some quick overview of classifiers?), or what a different method for answering this question could be.
  • I love the layout of the final report; it looks fantastic on GitHub, although there are a few references that aren't working as intended (@ref(fig:histTarget)). If this is just a GitHub rendering issue, feel free to ignore this.
  • I think it could be clearer that this analysis is for married women only; I believe it is only mentioned in passing in one section of the final report apart from the data description. Also, the data is from a single country, so that could be mentioned too. The project asks a big question, and specifying the demographic of the data being used could aid understanding.
  • There is a section in the final report (part 7.1) where the confusion matrix for the model is discussed. I think it could be a good idea to address the number of false positives and false negatives and what they mean for the study.
  • Also in section 7 of the final report, the statement about improving the model was a good addition as users may not have a good reference as to what "decent" machine learning scores are. Saying that there is improvement to be done gives context to the score values that are otherwise a bit difficult to understand from the outside.
  • Good repo organization, everything looks clean and professional.
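On the input-validation point above, a minimal sketch of the kind of assert-based checks described; the function and column names are illustrative, not the project's actual API:

```python
import pandas as pd

def split_features_target(df, target_col):
    """Split a training frame into X and y, with basic input checks."""
    assert isinstance(df, pd.DataFrame), "df must be a pandas DataFrame"
    assert isinstance(target_col, str), "target_col must be a string"
    assert target_col in df.columns, f"'{target_col}' not found in the data"
    return df.drop(columns=[target_col]), df[target_col]

# A matching test can then assert that bad inputs raise, e.g. with pytest:
# with pytest.raises(AssertionError):
#     split_features_target("not a dataframe", "target")
```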

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

@Affrin101

Data analysis review checklist

Reviewer: @Affrin101

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well-known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2 h

Review Comments:


  • I tried to create the environment using the conda create command but was not able to, nor was I able to run make all.

  • Although the command to run make all is provided in the Usage section, in instances like these when make all doesn't work, the commands to run the individual files could also be provided in the Usage section to help the reviewer run them.

  • Since they are using RStudio and various R packages like knitr and kable, I think those could be added to the dependencies section and to the References as well.

  • There are a lot of features in this topic that might need ethical discussion, such as religion.

  • In the introduction: "Here we approach this problem by using machine learning algorithm to predict a contraceptive method
    preferred by the individual given the women’s demographic and socioeconomic status."
    However, that is not what is being predicted here: the problem is predicting whether a contraceptive is used or not,
    not which contraceptive method is used (it would be nice if the language in the introduction could be improved). A similar change could be made to the x-axis title of the first plot, where the distribution of the classes is shown.

  • In Section 6 of the report: before finding the best model, some background on why the four predictive models were chosen could be added to the Methods section; it looks a little abrupt now.

  • In Table 6.1 the column names look a bit off (they contain '.' between words), and a few of the model names are in lower case; very minor changes are needed here.

  • In the README.md, a direct link to the final report has been provided. To give this a more finished look, the link could be embedded in descriptive text instead of being pasted directly.

  • Additionally, although the area-under-curve plots are very well made, the authors could also plot the ROC and precision-recall (AP) curves across different threshold values and highlight the best threshold (see the sketch after this list).

  • In the conclusion section, the authors could place a little more emphasis on how the results obtained tie back to their research question, relating contraceptive use to socio-economic and demographic status.

  • I ran each script after looking at the commands for the individual scripts in the Makefile; all the scripts ran well, created the plots as expected, and the final report rendered as well.

  • The tables are presented very well, with all the required metrics for the model.

  • Great work done on the EDA part!!

  • The project was very well organized and well structured.
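On the threshold point above, one possible sketch using scikit-learn to sweep thresholds and pick the one maximizing Youden's J; the variable names and the choice of criterion are assumptions, not the authors' method:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve

# y_true: binary targets; y_score: decision_function or predict_proba[:, 1]
# values from the fitted classifier (names are illustrative).
def best_threshold(y_true, y_score):
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    j = tpr - fpr  # Youden's J statistic at each threshold
    return thresholds[np.argmax(j)]

# Precision-recall points for the same scores, if an AP-style plot is wanted:
# precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)
```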

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

@abhiket

abhiket commented Dec 11, 2021

Thank you for the comments! We really appreciate your feedback.

  • A lot of concerns were raised about using religion as one of the model features. We accept this and have added disclaimers. This can be found in commit 70e53e4.
  • As mentioned by @squisty: we have made changes throughout to bring out that the sample consisted of married women in Indonesia and was collected back in 1987, so we have to be cautious when using this model in the present world. The same can be found in commit aa74577.
  • Regarding point number 4 from @cuthchow: we have included print messages in the scripts. This can be found in commit 5bd8e11.
  • All the grammatical errors, typos, and reproducibility issues highlighted above have been taken care of in this final version.
