Submission: Group 13: Predicting Wine Quality #20

NikitaShymberg opened this issue Nov 30, 2021 · 5 comments

@NikitaShymberg

Submitting authors: @NikitaShymberg @gutermanyair @aldojasb @SonQBChau
Repository: https://github.com/UBC-MDS/predicting_wine_quality
Report link: https://github.com/UBC-MDS/predicting_wine_quality/blob/main/doc/Quality_white_wine_predictor.pdf
Abstract/executive summary:

This report uses the "vinho verde" white wine database to predict wine quality based on physicochemical properties. Quality is a subjective measure, given by the average grade of three experts.

Before making predictions, the report performs an exploratory data analysis (EDA) to look for features that may provide good prediction results, and also gives a short explanation of the metrics used in the models. During data preparation, the dataset is downloaded and processed in Python; the training and testing sets created in this phase are used during model building.

The report briefly explains the models used. Other important machine learning concepts, such as ensembling and cross-validation, are also discussed.

The results section presents the best model for predicting quality and discusses why it was chosen for this purpose.

Editor: @flor14
Reviewers: VORE_Margot, Owoseni_Taiwo, Nguyen_Nobby

  • I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.
@thayeylolu

thayeylolu commented Dec 2, 2021

Data analysis review checklist

Reviewer: @thayeylolu

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1 hour 20 mins

Review Comments:

(Note: these comments are time-ordered.)

  1. I suggest using backticks to enclose the code snippets in the usage section of the README.md to improve readability.
  2. I commend you on the documentation (the docstrings of your functions): it looks neat. However, I noticed that split.py doesn't check that the expected input file is a .csv file. What happens when it is not? I suggest writing a check (e.g. a try/except) to catch unexpected cases and tell the user to use a .csv file (a sketch follows this list).
  3. I suggest listing the dependencies in the README.md.
  4. I also noticed there is no check that the .csv file has the column names your analysis expects. I suggest adding a check that the input CSV has the expected columns.
  5. I suggest including a title for the last three plots in eda.ipynb.
  6. I suggest also showing the relationships between features in your EDA, perhaps with a heatmap.
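
To make points 2, 4, and 6 concrete, here is a minimal sketch; the function name, column subset, and chart encodings are illustrative assumptions, not the project's actual code:

import sys

import altair as alt
import pandas as pd

# Hypothetical subset of the dataset's columns; adapt to the real analysis.
EXPECTED_COLUMNS = {"fixed acidity", "volatile acidity", "residual sugar", "quality"}

def read_wine_csv(path):
    """Read the raw wine data, failing loudly on unexpected input (points 2 and 4)."""
    if not path.endswith(".csv"):
        raise ValueError(f"Expected a .csv file, got: {path}")
    try:
        df = pd.read_csv(path, sep=";")  # the UCI wine quality CSVs are semicolon-separated
    except (FileNotFoundError, pd.errors.ParserError) as err:
        sys.exit(f"Could not read {path} as a CSV file: {err}")
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Input CSV is missing expected columns: {missing}")
    return df

# Point 6: a feature-correlation heatmap built from the long-form correlation matrix.
df = read_wine_csv("data/raw/winequality/winequality-white.csv")
corr = df.corr().reset_index().melt("index", var_name="feature", value_name="correlation")
heatmap = alt.Chart(corr).mark_rect().encode(
    x="index:N", y="feature:N", color="correlation:Q"
).properties(title="Correlations between wine features")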

Analysis report

  1. I suggest listing the authors of the report.
  2. Your project tells us what it aims to do, but it does not state a research question.
  3. I suggest including figure captions.
  4. I suggest including a reference to a paper about wine quality; I could not find one in your references.
  5. The report also does not give background information or explain the importance of the research question. What prior work has attempted a similar study?

Kind remarks: there may be typos here 😄.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

@voremargot

Data analysis review checklist

Reviewer: @voremargot

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • [ ] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • [x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • [ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?
    • The split.py script has no tests for its functions.

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
    • The authors are missing in the PDF version of the report! Make sure you give yourselves credit for the work!
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
    • There is little discussion about what your model results mean, how the other models you tried performed, or what the best hyperparameters were.
    • I wish there were more focus on the model building and less on the EDA plots.
  • Conclusions: Are the conclusions presented by the authors correct?
    • You need to show how the other models performed to prove to the reader that the model you created is really the best.
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1 hour

Review Comments:

OVERALL COMMENTS:

  1. Overall, I really like your project. You did a great job of describing why the problem you were solving is important and how your findings could be used. The quality of wine did not seem important to me at first, but you made it very clear how your work would be impactful to the industry.

  2. The final report would be stronger if it included more specific details. When reading, it felt like the modeling results were overlooked, and they are one of the most interesting parts of your project. There were a lot of general statements about the models and methods being used, but what I would have found more interesting were details about which tuned parameters were best, how all the model scores compare, and what specifics you decided on for data cleaning (see the sketch after this list). I was not convinced your final model was the best because I had no data telling me it was better than the other models you tried. Remember that your reader likely has a background in data analysis methods and is going to your report to look for specifics on what was done and what you found.

  3. It would be good to check over your rendered final report. A bunch of the figures looked like they were cut off and it would be helpful if you added captions. Also, you worked so hard on this so make sure you put your names in the report!

  4. I really liked how the ReadMe was set up with clear sections. It made the document clear and easy to follow. In the usage section, adding some style formatting might make it clearer as to what is code and what are comments.

  5. You clearly did a lot of work on the model. The model script is very easy to follow and shows all the hyperparameter tuning that went into finding the best model for the problem. It would be really nice to see a summary of this work in the final report. I think it would make it clearer that the model you chose was the best. You put a ton of work into your model so it would be nice to showcase it more in the report!

  6. Adding more specifics to your summary would be beneficial. As the reader, I want to immediately know some specifics about your EDA, hyperparameter tuning, and, most importantly, your results. At least in the scientific community, it is expected that reading the summary/abstract gives you a condensed version of what was done, what the findings were, and what conclusions you came to.
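
To illustrate comments 2 and 5: a small comparison table of cross-validation scores would go a long way. A minimal sketch, assuming the project's processed data paths and an illustrative subset of the models; the metric is a guess, not necessarily what the report uses:

import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsRegressor

# Assumed file layout; swap in the project's real processed data paths.
X_train = pd.read_csv("data/processed/X_train.csv")
y_train = pd.read_csv("data/processed/y_train.csv").squeeze()

models = {
    "Dummy": DummyRegressor(),
    "Ridge": Ridge(),
    "KNN": KNeighborsRegressor(),
}

rows = {}
for name, model in models.items():
    scores = cross_validate(model, X_train, y_train,
                            scoring="neg_mean_absolute_error", cv=5)
    rows[name] = pd.Series(scores["test_score"]).agg(["mean", "std"])

# One row per model; a table like this in the report would justify the winner.
comparison = pd.DataFrame(rows).T
print(comparison)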

SMALL ERRORS I NOTICED:

  1. In the documentation for EDA.py, it would be helpful to mention which specific figures are output so the reader doesn't have to search through the code to find out.
  2. The preprocess script does not have any documentation or usage instructions in it.
  3. You have 4 files in your raw data folder, but you only mention two of them in the report/README. What are the others?
  4. The figures in your report need captions.
  5. Figures are being cut off in the PDF report and are missing in the markdown file.

@nobbynguyen

nobbynguyen commented Dec 3, 2021

Data analysis review checklist

Reviewer: @nobbynguyen

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2 hours

Review Comments:

Overall, I enjoyed reading your interesting data analysis. I am impressed by how you challenged yourselves to work with different algorithms for the task. However, in my opinion, there is still room for improvement, as follows:

  • It is great that you have an environment.yaml file in your root folder; it would also be nice to explain in the README.md how to install the environment.
  • There are 6 scripts in the src folder, but only 5 of them are used in the README.md.
  • The script names are inconsistent with the README.md. For example, the script in the src folder is named analyze.py, while the README.md calls it analysis.py; the same goes for the eda.py script.
  • I agree with @thayeylolu and @voremargot that the graphs (especially in the report) could be made more readable by including titles or figure captions. In addition, the plots do not show fully in the report: there are only 6 plots, and I believe you have more variables than that.
  • In the Methods part of the final report, you indicate that your task focuses on which white wine features are important for getting promising results. It would be clearer if you summarised your feature-importance conclusions in this part: which features were chosen, and why (see the sketch after this list).
  • Related to Methods: references could be added in this part, for example to sklearn.
  • In the Results and Discussion part of the final report, it would be great if the results of the other models were also displayed, to show how much better the KNN model is. The best parameters have not been discussed yet.
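
On the feature-importance point, here is one hedged sketch of how it could be summarised, assuming a fitted KNN regressor and the processed data paths used elsewhere in the repo (the n_neighbors value is hypothetical):

import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsRegressor

# Assumed paths, matching the processed split used elsewhere in the project.
X_train = pd.read_csv("data/processed/X_train.csv")
y_train = pd.read_csv("data/processed/y_train.csv").squeeze()

knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)  # hypothetical tuned K

# Permutation importance: how much the error worsens when one feature is shuffled.
result = permutation_importance(knn, X_train, y_train,
                                scoring="neg_mean_absolute_error",
                                n_repeats=10, random_state=522)
importances = pd.Series(result.importances_mean, index=X_train.columns)
print(importances.sort_values(ascending=False))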

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

@katerinkus

katerinkus commented Dec 5, 2021

Data analysis review checklist

Reviewer: @katerinkus

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • [ ] Installation instructions: Is there a clearly stated list of dependencies?
  • [ ] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?

Could not complete this part

  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.75h (changed from 1.25)

Review Comments:

README and folder organization

  • It is a bit unclear how to find the dependencies. My first instinct was to search for "Dependencies" on the page, and only then did I notice the environment.yaml file. It would be easier to see this information in the README under a Dependencies heading.
  • I could not locate the command for installing the environment from the .yaml file (i.e. conda env create -f environment.yaml). People may not know how to run these commands.
  • I am curious to know how the data was collected, or what your assumptions were with regard to data collection (i.e. are these randomly collected, independent observations?).
  • Nitpicking comment: I found a typo -- "Navie" instead of "Naive".

Replicating the project

  • As others have pointed out, I would include conda activate somelier in the instructions. If someone has not used conda before, they may be unsure what to do. The same goes for the deactivation step after the project is done.
  • I really appreciated the message describing what is happening (i.e. "Starting to train"). It keeps the user engaged. Otherwise, it would have been just the fan noise to keep me company.

The report

  • I found the model descriptions well written. Even someone who has not used these models before could understand what they do from the descriptions.
  • The graphs looked great. My suggestion would be to use violin plots instead of box plots, in case the box plots hide a bimodal or other distribution, but that may be excessive.
  • I would elaborate on the metrics; I was not sure what mae stands for, for example (see the note after this list).
  • Otherwise, the report was clear and concise.
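
(For reference, mae presumably stands for mean absolute error: the average absolute difference between predicted and true quality, in quality points. A minimal sketch with made-up numbers:)

from sklearn.metrics import mean_absolute_error

# MAE = mean(|y_true - y_pred|), in the same units as the target.
y_true = [6, 5, 7, 5]    # hypothetical true quality scores
y_pred = [5.5, 5, 6, 6]  # hypothetical predictions
print(mean_absolute_error(y_true, y_pred))  # 0.625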

Side note regarding Make

  • I wanted to try the Makefile just for fun, but I struggled to replicate the project this way. You can see what I got below. I replaced my user name with --- and the folder path with the word mypath for privacy reasons. This was not listed in the instructions, and I believe it is unnecessary for this review, but maybe it is helpful for you:
$ make all
python src/download_data.py --url=http://www3.dsi.uminho.pt/pcortez/wine/winequality.zip --path=data/raw/
Data downloaded to data/raw/
python src/split.py data/raw/winequality/winequality-white.csv data/processed
python src/ml_models.py data/processed results/raw_results
Starting to train Dummy...
C:\Users\---\miniconda3\envs\somelier\lib\site-packages\sklearn\model_selection\_search.py:292: UserWarning: The total space of parameters 2 is smaller than n_iter=10. Running 2 iterations. For exhaustive searches, use GridSearchCV.
  warnings.warn(
Finished training Dummy!
 ------------
Starting to train Ridge...
C:\Users\---\miniconda3\envs\somelier\lib\site-packages\sklearn\model_selection\_search.py:292: UserWarning: The total space of parameters 6 is smaller than n_iter=10. Running 6 iterations. For exhaustive searches, use GridSearchCV.
  warnings.warn(
Finished training Ridge!
 ------------
Starting to train Random Forest...
Finished training Random Forest!
 ------------
Starting to train KNN...
Finished training KNN!
 ------------
Starting to train Bayes...
Finished training Bayes!
 ------------
Starting to train SVM...
Finished training SVM!
 ------------
python src/analyze.py --r_path=results
best_model.csv created at location /results/
python src/EDA.py data/processed/X_train.csv data/processed/y_train.csv results
Traceback (most recent call last):
  File "C:\Users\---\mypath\predicting_wine_quality\src\EDA.py", line 93, in <module>
    main()
  File "C:\Users\---\mypath\predicting_wine_quality\src\EDA.py", line 44, in main
    chart1.save(
  File "C:\Users\---\miniconda3\envs\somelier\lib\site-packages\altair\vegalite\v4\api.py", line 476, in save
    result = save(**kwds)
  File "C:\Users\---\miniconda3\envs\somelier\lib\site-packages\altair\utils\save.py", line 112, in save
    mimebundle = spec_to_mimebundle(
  File "C:\Users\---\miniconda3\envs\somelier\lib\site-packages\altair\utils\mimebundle.py", line 60, in spec_to_mimebundle
    return altair_saver.render(spec, format, mode=mode, **kwargs)
  File "C:\Users\---\miniconda3\envs\somelier\lib\site-packages\altair_saver\_core.py", line 257, in render
    mimebundle.update(saver.mimebundle(fmt))
  File "C:\Users\---\miniconda3\envs\somelier\lib\site-packages\altair_saver\savers\_saver.py", line 90, in mimebundle
    bundle[mimetype] = self._serialize(fmt, "mimebundle")
  File "C:\Users\---\miniconda3\envs\somelier\lib\site-packages\altair_saver\savers\_node.py", line 114, in _serialize
    spec = self._vl2vg(spec)
  File "C:\Users\---\miniconda3\envs\somelier\lib\site-packages\altair_saver\savers\_node.py", line 68, in _vl2vg
    return json.loads(vg_json)
  File "C:\Users\---\miniconda3\envs\somelier\lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "C:\Users\---\miniconda3\envs\somelier\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\---\miniconda3\envs\somelier\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 2)
make: *** [Makefile:17: relationship_between_individual_features_and_the_quality_3.png] Error 1

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

@NikitaShymberg
Author

Thank you all for the feedback!

  1. Regarding the comments about missing authors here, here, and here, we have added a list of authors to the report in this commit.
  2. Regarding this comment about needing more info about training the model, we added this info in this commit.
  3. Regarding this comment, we added a heatmap to our EDA in this commit.
  4. Regarding this comment, we proofread the document and fixed the grammar and spelling errors in this commit.
  5. Regarding this comment, we added usage instructions in this commit.
  6. Regarding this comment, we added figure captions in this commit.
