Submission: Group 13: Predicting Wine Quality #20

NikitaShymberg opened this issue Nov 30, 2021 · 5 comments

@NikitaShymberg

Submitting authors: @NikitaShymberg @gutermanyair @aldojasb @SonQBChau
Repository: https://github.com/UBC-MDS/predicting_wine_quality
Report link: https://github.com/UBC-MDS/predicting_wine_quality/blob/main/doc/Quality_white_wine_predictor.pdf
Abstract/executive summary:

This report uses the "vinho verde" white wine database to predict wine quality based on physicochemical properties. Quality is a subjective measure, given by the average grade of three experts.

Before making predictions, the report performs an exploratory data analysis (EDA) to look for features that may provide good prediction results, and also gives a short explanation of the metrics used in the models. During data preparation, the dataset is downloaded and processed in Python; the training and testing sets created in this phase are used during model building.

The report briefly explains the models used. Other important machine learning concepts, such as ensembling and cross-validation, are also discussed.

The results section presents the best model for predicting quality and discusses why it was chosen for this purpose.

Editor: @flor14
Reviewers: VORE_Margot, Owoseni_Taiwo, Nguyen_Nobby

  • I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.
@thayeylolu

thayeylolu commented Dec 2, 2021

Data analysis review checklist

Reviewer: @thayeylolu

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1 hour 20 mins

Review Comments:

(Note: these comments are time-ordered.)

  1. I suggest using backticks to enclose the code snippets in the usage section of the README.md to improve readability.
  2. I commend you on the documentation (the docstrings of your functions): it looks neat. However, I noticed that split.py doesn't check that the expected input file is a .csv file. What happens when it is not? I suggest writing a check (e.g. a try/except) to catch unexpected cases and tell the user to use a .csv file (a sketch follows this list).
  3. I suggest listing the dependencies in the README.md.
  4. I also noticed there is no check that the .csv file has the column names your analysis expects. I suggest adding a check that the input CSV has the expected columns.
  5. I suggest including a title for the last three plots in eda.ipynb.
  6. I suggest also showing the relationships between features in your EDA, perhaps with a heatmap.
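
To make points 2, 4, and 6 concrete, here is a minimal sketch; the function name, column subset, and chart encodings are illustrative assumptions, not the project's actual code:

import sys

import altair as alt
import pandas as pd

# Hypothetical subset of the dataset's columns; adapt to the real analysis.
EXPECTED_COLUMNS = {"fixed acidity", "volatile acidity", "residual sugar", "quality"}

def read_wine_csv(path):
    """Read the raw wine data, failing loudly on unexpected input (points 2 and 4)."""
    if not path.endswith(".csv"):
        raise ValueError(f"Expected a .csv file, got: {path}")
    try:
        df = pd.read_csv(path, sep=";")  # the UCI wine quality CSVs are semicolon-separated
    except (FileNotFoundError, pd.errors.ParserError) as err:
        sys.exit(f"Could not read {path} as a CSV file: {err}")
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Input CSV is missing expected columns: {missing}")
    return df

# Point 6: a feature-correlation heatmap built from the long-form correlation matrix.
df = read_wine_csv("data/raw/winequality/winequality-white.csv")
corr = df.corr().reset_index().melt("index", var_name="feature", value_name="correlation")
heatmap = alt.Chart(corr).mark_rect().encode(
    x="index:N", y="feature:N", color="correlation:Q"
).properties(title="Correlations between wine features")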

Analysis report

  1. I suggest listing the authors of the report.
  2. Your project tells us what it aims to do, but it does not state a research question.
  3. I suggest including figure captions.
  4. I suggest including a reference to a paper about wine quality; I could not find one in your references.
  5. The report also does not give background information or explain the importance of the research question. What prior work has attempted a similar study?

Kind remarks: there may be typos here 😄.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

@voremargot

Data analysis review checklist

Reviewer: @voremargot

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • [ ] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • [x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • [ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?
    • The split.py script has no tests for its functions.

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
    • The authors are missing in the PDF version of the report! Make sure you give yourselves credit for the work!
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
    • There is little discussion about what your model results mean, how the other models you tried performed, or what the best hyperparameters were.
    • I wish there were more focus on the model building and less on the EDA plots.
  • Conclusions: Are the conclusions presented by the authors correct?
    • You need to show how the other models performed to prove to the reader that the model you created is really the best.
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1 hour

Review Comments:

OVERALL COMMENTS:

  1. Overall, I really like your project. You did a great job of describing why the problem you were solving is important and how your findings could be used. The quality of wine did not seem important to me at first, but you made it very clear how your work would be impactful to the industry.

  2. The final report would be stronger if it included more specific details. When reading, it felt like the modeling results were overlooked, and they are one of the most interesting parts of your project. There were a lot of general statements about the models and methods being used, but what I would have found more interesting were details about which tuned parameters were best, how all the model scores compare, and what specifics you decided on for data cleaning (see the sketch after this list). I was not convinced your final model was the best because I had no data telling me it was better than the other models you tried. Remember that your reader likely has a background in data analysis methods and is going to your report to look for specifics on what was done and what you found.

  3. It would be good to check over your rendered final report. A bunch of the figures looked like they were cut off and it would be helpful if you added captions. Also, you worked so hard on this so make sure you put your names in the report!

  4. I really liked how the ReadMe was set up with clear sections. It made the document clear and easy to follow. In the usage section, adding some style formatting might make it clearer as to what is code and what are comments.

  5. You clearly did a lot of work on the model. The model script is very easy to follow and shows all the hyperparameter tuning that went into finding the best model for the problem. It would be really nice to see a summary of this work in the final report. I think it would make it clearer that the model you chose was the best. You put a ton of work into your model so it would be nice to showcase it more in the report!

  6. Adding more specifics to your summary would be beneficial. As the reader, I want to immediately know some specifics about your EDA, hyperparameter tuning, and, most importantly, your results. At least in the scientific community, it is expected that reading the summary/abstract gives you a condensed version of what was done, what the findings were, and what conclusions you came to.
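
To illustrate comments 2 and 5: a small comparison table of cross-validation scores would go a long way. A minimal sketch, assuming the project's processed data paths and an illustrative subset of the models; the metric is a guess, not necessarily what the report uses:

import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsRegressor

# Assumed file layout; swap in the project's real processed data paths.
X_train = pd.read_csv("data/processed/X_train.csv")
y_train = pd.read_csv("data/processed/y_train.csv").squeeze()

models = {
    "Dummy": DummyRegressor(),
    "Ridge": Ridge(),
    "KNN": KNeighborsRegressor(),
}

rows = {}
for name, model in models.items():
    scores = cross_validate(model, X_train, y_train,
                            scoring="neg_mean_absolute_error", cv=5)
    rows[name] = pd.Series(scores["test_score"]).agg(["mean", "std"])

# One row per model; a table like this in the report would justify the winner.
comparison = pd.DataFrame(rows).T
print(comparison)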

SMALL ERRORS I NOTICED:

  1. In the documentation for EDA.py, it would be helpful to mention which specific figures are output so the reader doesn't have to search through the code to find out.
  2. The preprocess script does not have any documentation or usage instructions in it.
  3. You have 4 files in your raw data folder, but you only mention two of them in the report/README. What are the others?
  4. The figures in your report need captions.
  5. Figures are being cut off in the PDF report and are missing in the markdown file.

@nobbynguyen

nobbynguyen commented Dec 3, 2021

Data analysis review checklist

Reviewer: @nobbynguyen

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2 hours

Review Comments:

Overall, I enjoyed reading your interesting data analysis. I am impressed by how you challenged yourselves to work with different algorithms for the task. However, in my opinion, there is still room for improvement, as follows:

  • It is great that you have an environment.yaml file in your root folder; it would also be nice to explain in the README.md how to install the environment.
  • There are 6 scripts in the src folder, but only 5 of them are used in the README.md.
  • The script names are inconsistent with the README.md. For example, the script in the src folder is named analyze.py, while the README.md calls it analysis.py; the same goes for the eda.py script.
  • I agree with @thayeylolu and @voremargot that the graphs (especially in the report) could be made more readable by including titles or figure captions. In addition, the plots do not show fully in the report: there are only 6 plots, and I believe you have more variables than that.
  • In the Methods part of the final report, you indicate that your task focuses on which white wine features are important for getting promising results. It would be clearer if you summarised your feature-importance conclusions in this part: which features were chosen, and why (see the sketch after this list).
  • Related to Methods: references could be added in this part, for example to sklearn.
  • In the Results and Discussion part of the final report, it would be great if the results of the other models were also displayed, to show how much better the KNN model is. The best parameters have not been discussed yet.
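
On the feature-importance point, here is one hedged sketch of how it could be summarised, assuming a fitted KNN regressor and the processed data paths used elsewhere in the repo (the n_neighbors value is hypothetical):

import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsRegressor

# Assumed paths, matching the processed split used elsewhere in the project.
X_train = pd.read_csv("data/processed/X_train.csv")
y_train = pd.read_csv("data/processed/y_train.csv").squeeze()

knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)  # hypothetical tuned K

# Permutation importance: how much the error worsens when one feature is shuffled.
result = permutation_importance(knn, X_train, y_train,
                                scoring="neg_mean_absolute_error",
                                n_repeats=10, random_state=522)
importances = pd.Series(result.importances_mean, index=X_train.columns)
print(importances.sort_values(ascending=False))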

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

@katerinkus

katerinkus commented Dec 5, 2021

Data analysis review checklist

Reviewer: @katerinkus

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • [ ] Installation instructions: Is there a clearly stated list of dependencies?
  • [ ] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?

Could not complete this part

  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.75h (changed from 1.25)

Review Comments:

README and folder organization

  • It is a bit unclear how to find the dependencies. My first instinct was to search for "Dependencies" on the page, and only then did I notice the environment.yaml file. It would be easier to see this information in the README under a Dependencies heading.
  • I could not locate the command for installing the environment from the .yaml file (i.e. conda env create -f environment.yaml). People may not know how to run these commands.
  • I am curious to know how the data was collected, or what your assumptions were with regard to data collection (i.e. are these randomly collected, independent observations?).
  • Nitpicking comment: I found a typo -- "Navie" instead of "Naive".

Replicating the project

  • As others have pointed out, I would include conda activate somelier in the instructions. If someone has not used conda before, they may be unsure what to do. The same goes for the deactivation step after the project is done.
  • I really appreciated the message describing what is happening (i.e. "Starting to train"). It keeps the user engaged. Otherwise, it would have been just the fan noise to keep me company.

The report

  • I found the model descriptions well written. Even someone who has not used these models before could understand what they do from the descriptions.
  • The graphs looked great. My suggestion would be to use violin plots instead of box plots, in case the box plots hide a bimodal or other distribution, but that may be excessive.
  • I would elaborate on the metrics; I was not sure what mae stands for, for example (see the note after this list).
  • Otherwise, the report was clear and concise.
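
(For reference, mae presumably stands for mean absolute error: the average absolute difference between predicted and true quality, in quality points. A minimal sketch with made-up numbers:)

from sklearn.metrics import mean_absolute_error

# MAE = mean(|y_true - y_pred|), in the same units as the target.
y_true = [6, 5, 7, 5]    # hypothetical true quality scores
y_pred = [5.5, 5, 6, 6]  # hypothetical predictions
print(mean_absolute_error(y_true, y_pred))  # 0.625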

Side note regarding Make

  • I wanted to try the Makefile just for fun, but I struggled to replicate the project this way. You can see what I got below. I replaced my user name with --- and the folder path with the word mypath for privacy reasons. This was not listed in the instructions, and I believe it is unnecessary for this review, but maybe it is helpful for you:
$ make all
python src/download_data.py --url=http://www3.dsi.uminho.pt/pcortez/wine/winequality.zip --path=data/raw/
Data downloaded to data/raw/
python src/split.py data/raw/winequality/winequality-white.csv data/processed
python src/ml_models.py data/processed results/raw_results
Starting to train Dummy...
C:\Users\---\miniconda3\envs\somelier\lib\site-packages\sklearn\model_selection\_search.py:292: UserWarning: The total space of parameters 2 is smaller than n_iter=10. Running 2 iterations. For exhaustive searches, use GridSearchCV.
  warnings.warn(
Finished training Dummy!
 ------------
Starting to train Ridge...
C:\Users\---\miniconda3\envs\somelier\lib\site-packages\sklearn\model_selection\_search.py:292: UserWarning: The total space of parameters 6 is smaller than n_iter=10. Running 6 iterations. For exhaustive searches, use GridSearchCV.
  warnings.warn(
Finished training Ridge!
 ------------
Starting to train Random Forest...
Finished training Random Forest!
 ------------
Starting to train KNN...
Finished training KNN!
 ------------
Starting to train Bayes...
Finished training Bayes!
 ------------
Starting to train SVM...
Finished training SVM!
 ------------
python src/analyze.py --r_path=results
best_model.csv created at location /results/
python src/EDA.py data/processed/X_train.csv data/processed/y_train.csv results
Traceback (most recent call last):
  File "C:\Users\---\mypath\predicting_wine_quality\src\EDA.py", line 93, in <module>
    main()
  File "C:\Users\---\mypath\predicting_wine_quality\src\EDA.py", line 44, in main
    chart1.save(
  File "C:\Users\---\miniconda3\envs\somelier\lib\site-packages\altair\vegalite\v4\api.py", line 476, in save
    result = save(**kwds)
  File "C:\Users\---\miniconda3\envs\somelier\lib\site-packages\altair\utils\save.py", line 112, in save
    mimebundle = spec_to_mimebundle(
  File "C:\Users\---\miniconda3\envs\somelier\lib\site-packages\altair\utils\mimebundle.py", line 60, in spec_to_mimebundle
    return altair_saver.render(spec, format, mode=mode, **kwargs)
  File "C:\Users\---\miniconda3\envs\somelier\lib\site-packages\altair_saver\_core.py", line 257, in render
    mimebundle.update(saver.mimebundle(fmt))
  File "C:\Users\---\miniconda3\envs\somelier\lib\site-packages\altair_saver\savers\_saver.py", line 90, in mimebundle
    bundle[mimetype] = self._serialize(fmt, "mimebundle")
  File "C:\Users\---\miniconda3\envs\somelier\lib\site-packages\altair_saver\savers\_node.py", line 114, in _serialize
    spec = self._vl2vg(spec)
  File "C:\Users\---\miniconda3\envs\somelier\lib\site-packages\altair_saver\savers\_node.py", line 68, in _vl2vg
    return json.loads(vg_json)
  File "C:\Users\---\miniconda3\envs\somelier\lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "C:\Users\---\miniconda3\envs\somelier\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\---\miniconda3\envs\somelier\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 2)
make: *** [Makefile:17: relationship_between_individual_features_and_the_quality_3.png] Error 1

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

@NikitaShymberg
Author

Thank you all for the feedback!

  1. Regarding the comments about missing authors here, here, and here, we have added a list of authors to the report in this commit.
  2. Regarding this comment about needing more info about training the model, we added this info in this commit.
  3. Regarding this comment, we added a heatmap to our EDA in this commit.
  4. Regarding this comment, we proofread the document and fixed the grammar and spelling errors in this commit.
  5. Regarding this comment, we added usage instructions in this commit.
  6. Regarding this comment, we added figure captions in this commit.
