
Submission: GROUP 7: tech_salary_predictor_canada #17

Open · 1 task
hjw0703 opened this issue Nov 30, 2021 · 8 comments
hjw0703 commented Nov 30, 2021

Submitting authors: @suuuuperNOVA @khalidcawl @Sanchit120496 @hjw0703
**Repository:** https://github.com/UBC-MDS/tech_salary_predictor_canada_us
**Report link:** https://github.com/UBC-MDS/tech_salary_predictor_canada_us/blob/main/doc/tech_salary_predictor_report/03_result.ipynb
Abstract/executive summary:
Graduates and seasoned tech employees alike may wonder how much they should be paid, because salary information is rarely transparent. Lacking this information, graduates may feel lost and insecure, and job seekers may be at a disadvantage when discussing salary with HR. Hence, we came up with the idea of building a model to predict the pay that tech workers can expect based on several explicit factors, including education level, previous experience, and location.

The data set used in this project is sourced from the Stack Overflow Annual Developer Survey, which is conducted annually and draws nearly 80,000 responses from people of different backgrounds. From the survey results, many useful features can be extracted, such as education level, location, languages used, and job type, all of which are potentially associated with annual compensation.

Editor: @flor14
Reviewer: @suuuuperNOVA @khalidcawl @Sanchit120496 @hjw0703

  • I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.
@hjw0703 hjw0703 changed the title Submission: <GROUP 7: tech_salary_predictor_canada> Submission: GROUP 7: tech_salary_predictor_canada> Nov 30, 2021
@hjw0703 hjw0703 changed the title Submission: GROUP 7: tech_salary_predictor_canada> Submission: GROUP 7: tech_salary_predictor_canada Nov 30, 2021

khbunyan commented Dec 1, 2021

Data analysis review checklist

Reviewer: @khbunyan

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2

Review Comments:

Nice work! This is a really interesting and relevant topic; I enjoyed reading your analysis. Overall it's coming along well, and I appreciate that it's clear how to replicate your analysis. I would take a look at the following things in the project:

  1. Review the grammar throughout the report and in the README file. The content is fine and your motivation and ideas come across, but some sentences need fixing and the overall flow could be improved, so I'd recommend giving the written sections a full review.

  2. A few technical things:

    • I had to install the jupyter-book package (not just JupyterLab) to get the book to render, so I would consider adding that to your dependencies.
    • Your preprocessing script wouldn't run for me unless I manually created a "processed" folder inside "data". Not sure if that's just on my end, but it's worth looking into; re-downloading the repo didn't fix it. (See the directory-creation sketch after this list.)
  3. I think some of your Jupyter Book pages could be consolidated, for example 00_index.md and 01_introduction.md. Putting these together would clean up the report and make it easier to navigate; there's no need for a title on its own page.

  4. Consider including a "future analysis" section in your report and discussing your results. Were they what you expected? How do you think your model could be improved in the future? Right now the results are just stated; some context would help guide and inform the reader.

  5. I would add your conclusions and an expanded discussion of your analysis and model decisions to the README file so the reader can get a full bird's-eye view of the analysis from start to finish.

  6. Add the license to the README file so it's easier to find.
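
To make the "processed" folder issue in item 2 concrete: a script can create its output directory at run time instead of assuming it exists in a fresh clone. The project's preprocessing script appears to be R, so the Python version below is only a hedged sketch of the idea; the R equivalent would be dir.create("data/processed", recursive = TRUE, showWarnings = FALSE). The paths and file names are illustrative, not the authors' actual code.

```python
# Hedged sketch: create the output directory before writing, so the
# script does not fail when data/processed is absent from a fresh clone.
from pathlib import Path

out_dir = Path("data/processed")            # illustrative output location
out_dir.mkdir(parents=True, exist_ok=True)  # no-op if it already exists

# ...then write the processed splits, e.g.:
# train_df.to_csv(out_dir / "training.csv", index=False)
```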

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.


arlincherian commented Dec 1, 2021

Data analysis review checklist

Reviewer: @arlincherian

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis? (some are missing here)
  • Automation: Can someone other than the authors easily reproduce the entire data analysis? (had some trouble with automation)

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures? (written part could be expanded on)
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: ~ 45 minutes

Review Comments:


Nice work, everyone. This is a very fitting topic for all of us as we will be job hunting pretty soon. The proposal was very clear in understanding the research questions and methodologies that were being applied. I also liked that you provided instructions for the users to download and replicate data.

A couple of comments:

  1. When I run the pre-processing script, I receive an error: Error: Cannot open file for writing: 'data/processed/training.csv' Execution halted. The processed folder within the data folder had to be created manually before the rest of the scripts would run.
  2. Running the EDA script required me to install the altair and altair-saver packages. These Python packages are not in your dependencies list, so you may want to add them. I see some packages mentioned in requirements.txt, but the R packages are missing, so consider consolidating all of these into the requirements file, under the dependencies in the README, or into an environment.yml file (a sketch follows this list).
  3. In your repo's directory organization, the 'doc' folder currently shows the report as 'doc/tech_salary_predictor_report/'; I think if you delete the '/' after the report name, this will be fixed, since the latter should be a file and not a folder.
  4. I would suggest consolidating the various literate .md documents into one final report that summarizes the project, research question, data, methods, EDA, modelling, testing, results, and discussion in more detail.
  5. I would also suggest updating the README file with some conclusions from the EDA and the final analysis. What did you find from your analysis? Does it answer your prediction question and sub-questions?
  6. The contributing file needs to be updated: it seems the project title was meant to be added but is missing.
  7. Suggestion: I would suggest explicitly stating the data licensing information in the README file as well.
  8. Suggestion: include the names of the authors/creators in the proposal/README file.
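
A hypothetical starting point for item 2: a single conda environment.yml that captures both the Python and R dependencies mentioned in this thread. The package names and versions below are assumptions, not the project's actual pin list.

```yaml
# Hypothetical environment.yml; names/versions are illustrative only.
name: tech_salary_predictor
channels:
  - conda-forge
dependencies:
  - python=3.9
  - pandas
  - altair
  - altair_saver        # conda-forge name for altair-saver
  - jupyter-book
  - r-base
  - r-tidyverse
# Note: drop the machine-specific `prefix:` line that `conda env export`
# appends, so the file works on other machines.
```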

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.


nd265 commented Dec 3, 2021

Data analysis review checklist

Reviewer: @nd265

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2

Review Comments:

Overall, the project looks good to me. The thinking behind it is sound, the report is well written, the idea comes across clearly, and the code quality is very good.


Below are some areas where you could improve and make your project crisp:

  1. I could see one of your team members' folder paths mentioned as a prefix towards the end of the environment.yaml file; I think you can remove it, as it is unnecessary.
  2. The heading in the Contributing.md file should include the project name. Below is a screenshot for your reference.

[screenshot: heading reads "Contributing to the Project Name"]

  3. The grammar in the Code of Conduct seems a bit off; I suggest that you work on it and make it error-free.
  4. The variables in the plots could be renamed to better, self-explanatory ones. Here is a snippet of your code as an example (see the Altair sketch after this list).

[screenshot of code snippet]

  5. I suggest adding the Python dependencies to the README.md file, as they are missing.

  6. I would also suggest updating the README file with some conclusions from the EDA and the final analysis: what did you find from your analysis, and how does it answer your research question?

  7. I suggest including the jupyter-book dependency in the dependency list and environment.yaml file; when I tried to run the command, I had to install it manually.

  8. The processed folder is not created by the commands in the data folder.

  9. I would suggest explicitly stating the data licensing information in the README.md file as well.

  10. I would also suggest including the names of the authors/creators in the proposal/README file and report.

  11. I would suggest considering a "future analysis" section in your report and discussing your results: how do you think your model could be improved in the future? That would provide good insight for readers.
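
For item 4, a hedged illustration of self-explanatory plot labels in Altair (which the EDA scripts reportedly use); the data frame and column names below are made up for the example.

```python
# Hypothetical example: give encodings human-readable titles instead of
# raw column names, so the plot explains itself.
import altair as alt
import pandas as pd

df = pd.DataFrame(
    {"yrs_code_pro": [1, 3, 5, 10], "comp_total": [60_000, 80_000, 95_000, 130_000]}
)

chart = (
    alt.Chart(df)
    .mark_point()
    .encode(
        x=alt.X("yrs_code_pro:Q", title="Years of professional coding experience"),
        y=alt.Y("comp_total:Q", title="Annual compensation (CAD)"),
    )
)
```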

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

khalidcawl commented

Thank you @khbunyan @arlincherian and @nd265 for the constructive feedback. We will incorporate these changes into the project. Much appreciated!


Anahita97 commented Dec 4, 2021

Data analysis review checklist

Reviewer: @Anahita97

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1 hour

Review Comments:

Good job! I really enjoyed reading your project. Here are some comments:

  • Very minor suggestion for the README file: adding --- after each relevant section would divide the headings into their own subsections. For example, it would look nicer if the About and Dependencies sections were separated by a line divider.

  • It was a bit unclear to me how you determined the top five significant features for salary prediction. There are many features in your data, so looking into every single one is not practical; however, it could be worthwhile to try some pair plots to get a glimpse of other distributions and a better idea of why they were not selected.

  • It could also be worthwhile to explain why R-squared was chosen for performance evaluation. Since the regression model is multivariate, it might be beneficial to consider the adjusted R-squared score instead, which is adjusted for the number of predictors in the model (see the sketch after this list).

  • It could also be a good idea to discuss your limitations in the Results & Discussion section.

  • How were the outliers detected? It seems you removed the top 8%. It could also be a good idea to look into outlier-detection methods such as Cook's distance.

  • I really liked how you organized your files; the repository is very tidy.
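
To make the adjusted R-squared and Cook's distance suggestions above concrete, here is a hedged sketch using statsmodels on synthetic data; X, y, and the 4/n cutoff are placeholders rather than the authors' actual pipeline.

```python
# Sketch: adjusted R-squared and Cook's distance with statsmodels.
# The data here is synthetic; swap in the real feature matrix and target.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                  # placeholder predictors
y = X @ rng.normal(size=5) + rng.normal(size=200)

model = sm.OLS(y, sm.add_constant(X)).fit()

# Adjusted R^2 penalizes extra predictors:
#   R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)
n, p = X.shape
r2_adj = 1 - (1 - model.rsquared) * (n - 1) / (n - p - 1)
assert np.isclose(r2_adj, model.rsquared_adj)  # statsmodels reports it too

# Cook's distance flags influential points more directly than trimming
# a fixed top percentage of the data.
cooks_d, _ = model.get_influence().cooks_distance
flagged = np.where(cooks_d > 4 / n)[0]         # common rule-of-thumb cutoff
```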

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Anahita97 self-assigned this Dec 6, 2021

suuuuperNOVA commented Dec 10, 2021

Data analysis review checklist

Reviewer: @khbunyan

  1. Review the grammar throughout the report and in the README file. The content is fine and your motivation and ideas come across, but some sentences need fixing and the overall flow could be improved, so I'd recommend giving the written sections a full review.

  2. A few technical things:

    • I had to install the jupyter-book package (not just JupyterLab) to get the book to render, so I would consider adding that to your dependencies.
    • Your preprocessing script wouldn't run for me unless I manually created a "processed" folder inside "data". Not sure if that's just on my end, but it's worth looking into; re-downloading the repo didn't fix it.
  3. I think some of your Jupyter Book pages could be consolidated, for example 00_index.md and 01_introduction.md. Putting these together would clean up the report and make it easier to navigate; there's no need for a title on its own page.

  4. Consider including a "future analysis" section in your report and discussing your results. Were they what you expected? How do you think your model could be improved in the future? Right now the results are just stated; some context would help guide and inform the reader.

  5. I would add your conclusions and an expanded discussion of your analysis and model decisions to the README file so the reader can get a full bird's-eye view of the analysis from start to finish.

  6. Add the license to the README file so it's easier to find.

Hi,

Thanks for your evaluation. Based on your suggestions, our team has made the following changes:

  1. The structure of the report was changed; it is now divided into two parts, the project proposal and the final report. Please check commit@785ac27.
  2. The bug where the processed folder was missing after downloading the data has been fixed. Please check commit@f5cb046.
  3. We added a "future analysis" section to our report. Please check commit@6740199.


suuuuperNOVA commented Dec 10, 2021

Data analysis review checklist

Reviewer: @nd265

  1. I could see one of your team members' folder paths mentioned as a prefix towards the end of the environment.yaml file; I think you can remove it, as it is unnecessary.
  2. The heading in the Contributing.md file should include the project name. Below is a screenshot for your reference.
[screenshot: heading reads "Contributing to the Project Name"]
  3. The grammar in the Code of Conduct seems a bit off; I suggest that you work on it and make it error-free.
  4. The variables in the plots could be renamed to better, self-explanatory ones. Here is a snippet of your code as an example.
[screenshot of code snippet]
  5. I suggest adding the Python dependencies to the README.md file, as they are missing.
  6. I would also suggest updating the README file with some conclusions from the EDA and the final analysis: what did you find from your analysis, and how does it answer your research question?
  7. I suggest including the jupyter-book dependency in the dependency list and environment.yaml file; when I tried to run the command, I had to install it manually.
  8. The processed folder is not created by the commands in the data folder.
  9. I would suggest explicitly stating the data licensing information in the README.md file as well.
  10. I would also suggest including the names of the authors/creators in the proposal/README file and report.
  11. I would suggest considering a "future analysis" section in your report and discussing your results: how do you think your model could be improved in the future? That would provide good insight for readers.

Hi,

Thanks for your evaluation. Based on your suggestions, our team has made the following changes:

  1. The project name is added to the title of CONTRIBUTING.md. Please check commit@16495ef.
  2. Labels of plots have been taken care of. Please check commit@e636894.
  3. Grammar mistakes have been corrected. Please check commit@785ac27.
  4. environment.yml has been added to the project. Please check commit@2c0aba.
