
Submission: Group 20 - Covid Reddit Behaviour #9

Closed
1 task done
nobbynguyen opened this issue Nov 30, 2021 · 7 comments

nobbynguyen commented Nov 30, 2021

Submitting authors: @mel-liow @LukeAC @nobbynguyen @MaeveShi

Repository: https://github.com/UBC-MDS/covid_reddit_behaviour
Report link: https://ubc-mds.github.io/covid_reddit_behaviour/introduction.html

Abstract/executive summary:
Here we examine the Reddit mental health dataset (https://zenodo.org/record/3941387#.YZl5BC1h1QL), from which we have selected 15 mental-health-specific subreddit datasets. These datasets contain collections of Reddit user posts from 2018-2020. We aim to find the impact of COVID-19 on mental health support groups by examining the data before and after the onset of the pandemic. Specifically, we focus on the question:

Has the frequency of explicit descriptions of substance abuse in mental-health-oriented subreddits changed over the period of 2018 through 2020?

For the first week, we conducted exploratory data analysis on 30 datasets (15 mental-health subreddits, each with a self-described 'pre' and 'post' pandemic dataset), which can be found here. The exploratory data analysis mainly focused on these parts:

  • Features: Guided by the published paper, we explored the features in detail and decided to keep only substance_use_total, subreddit, author, date, and post, excluding all other features, because these are the only ones relevant to the question.

  • High-Level Analysis: We checked whether there are any missing values in the datasets, as well as what needs to be cleaned. We then concatenated the pre- and post-pandemic datasets to compare their descriptive statistics.

  • Visualization: We plotted the distribution of substance_use_total before and after the onset of COVID-19 to gain a better understanding of our question.
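
The high-level-analysis steps above can be sketched in pandas. The column names follow the dataset description in this submission, but the values below are toy stand-ins, not real Reddit data:

```python
import pandas as pd

# Toy stand-ins for one subreddit's 'pre' and 'post' CSVs; the real files
# come from the Zenodo archive and carry many more text-derived columns.
pre = pd.DataFrame({
    "subreddit": ["depression"] * 3,
    "author": ["a", "b", "c"],
    "date": ["2019/01/05", "2019/03/14", "2019/11/30"],
    "post": ["...", "...", "..."],
    "substance_use_total": [0.0, 1.2, 0.4],
})
post = pd.DataFrame({
    "subreddit": ["depression"] * 3,
    "author": ["d", "e", "f"],
    "date": ["2020/01/10", "2020/02/20", "2020/04/01"],
    "post": ["...", "...", "..."],
    "substance_use_total": [0.8, 2.1, 0.9],
})

# High-level check: any missing values that need cleaning?
print(pre.isna().sum())

# Concatenate pre/post with a period label, then compare descriptive stats.
combined = pd.concat(
    [pre.assign(period="pre"), post.assign(period="post")],
    ignore_index=True,
)
print(combined.groupby("period")["substance_use_total"].mean())
```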

Dataset:

  • The datasets we used contain reddit user posts and text-derived metrics (e.g. the substance_use_total feature) from 15 mental health subreddits: r/EDAnonymous, r/addiction, r/alcoholism, r/adhd, r/anxiety, r/autism, r/bipolarreddit, r/bpd, r/depression, r/healthanxiety, r/lonely, r/ptsd, r/schizophrenia, r/socialanxiety, and r/suicidewatch.

Timeframe of datasets:

  • 'post' [pandemic]: Jan 1 to April 20, 2020 (called "mid-pandemic" in manuscript; r/COVID19_support appears). Unique users: 320,364.
  • 'pre' [pandemic]: Dec 2018 to Dec 2019. A full year which provides more data for a baseline of Reddit posts. Unique users: 327,289.

More information can be found here

Editor: @mel-liow @LukeAC @nobbynguyen @MaeveShi
Reviewer: @Arushi282 @Anupriya-Sri @xiangwxt @Stoll_Allyson

  • I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.

xiangwxt commented Dec 2, 2021

Data analysis review checklist

Reviewer: @xiangwxt

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 3

Review Comments:

Well done. This topic is really interesting and I enjoyed exploring your data analysis. I have a few comments, as follows:

  • It would have been nice to include figure titles in the EDA and make the charts more readable.

  • For the references part of the final report, the content seems to be missing, although I did find a .bib file for the references. It may be worth double-checking whether something was missed. Also, some of the hyperlinks in the final report were broken; you may want to verify that they work properly.

  • The code was very readable since you included descriptions for each step, which is really good. But it would have been better to also include code that verifies its functionality, for example some tests.

  • I really like the workflow summary chart in the readme file. It helps audiences understand the structure of this repo.

  • I think the repo contains the source code needed for the data analysis, but the organization is a little confusing. It might be better to store the code files together in a directory like /src.

  • Overall, the research question in this data analysis is clear, and the conclusion states that the pre/post-pandemic difference was significant in 2 subsets of the data. It would have been better to also state an overall conclusion for the study.
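
The testing suggestion could take a shape like the pytest sketch below; `load_and_clean` is a hypothetical function name standing in for however the project factors its cleaning step, not an actual function in the repo:

```python
# test_preprocess.py -- a minimal pytest-style sketch. `load_and_clean` is a
# hypothetical helper; the real project would import its own cleaning function.
import pandas as pd


def load_and_clean(df, cols):
    """Keep only the requested columns and drop rows with missing values."""
    return df[cols].dropna().reset_index(drop=True)


def test_load_and_clean_drops_missing_rows():
    raw = pd.DataFrame({
        "author": ["a", "b", None],
        "substance_use_total": [0.1, None, 0.3],
    })
    cleaned = load_and_clean(raw, ["author", "substance_use_total"])
    # Only the first row has no missing values.
    assert cleaned.shape == (1, 2)
    assert cleaned.loc[0, "author"] == "a"
```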

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Originally posted by @khbunyan in #17 (comment)


mel-liow commented Dec 2, 2021

Thanks @xiangwxt for your feedback! Just on the references point - there's a page called references in our report which can be found here: https://ubc-mds.github.io/covid_reddit_behaviour/references.html. Not sure if you missed this!

@Arushi282

Data analysis review checklist

Reviewer: @Arushi282

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2.5

Review Comments:

This is a very interesting project and overall I enjoyed reading your analysis. However, I do feel that some improvements can be made to your project to make it more accessible and clear:

  • While it is great that you have a conda environment setup and that you go through how to install the environment, it would also be nice to have a Dependencies section in your Readme, listing all the packages we need to run the project.
  • I don’t see the raw or preprocessed data in your project and so I just wanted to bring this to your attention. I know that you have a text file (file_to_download.txt) with the links to all the datasets, but I would encourage you to also include the CSVs in your raw data file.
  • It would be nice if you could explain more about the Reddit platform, because there could be people who are not familiar with the platform and how it actually works.
  • While it is really great and interesting to see EDA done for all your different datasets, I do think that the EDA notebooks could use a little bit more documentation. It would be better if you could give small explanations or document the code used. Especially for the EDA summary notebook, more explanation would definitely increase understanding.
  • I think it is important to give a little bit more background on your datasets in your Readme, especially if you are going to talk about the features used. When I first started reviewing your project, the first thing I went over was your Readme, and I was definitely confused about what “substance_use_total” meant. I think it is also good to include what kind of analysis you are doing (inferential or predictive) and what tests (in this case the Wilcoxon rank-sum test) you are using. It would also be nice to include a high-level conclusion from your analysis.
  • Building on my previous point, it would also be nice to explain what the Wilcoxon rank-sum test is and why it works for your analysis. This could be done in the report or Readme.
  • I agree with @xiangwxt that the graphs used (especially in the report) could be made more readable by including titles or figure captions. Additionally for the histograms, you could also maybe change the opacity to see overlaps between pre and post data.
  • In general I noticed grammatical errors in the report, especially in the results part of the Data Analysis section. Just wanted to bring that to your attention.
  • Also, just a suggestion: I think it would be better to rename the results part of the Data Analysis section of your report to something like “Conclusions from plots”; it would help clear up any confusion with your next section, which is named “Results”.

Overall, really great work! I like how all of you challenged yourself to work with so many datasets and you really provided a lot of information on each dataset used. Additionally, I also really liked how you decided to do something so relevant to today's times. Mental health is a subject that constantly needs to be talked about and so I really appreciate you taking this opportunity to do that.
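
The histogram-opacity suggestion above can be sketched with matplotlib (the scores below are synthetic stand-ins for the real substance_use_total values, and the project itself may use a different plotting library):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic stand-ins for the per-post pre/post substance_use_total scores.
pre_scores = rng.normal(1.0, 0.5, 500)
post_scores = rng.normal(1.2, 0.5, 500)

fig, ax = plt.subplots()
# alpha < 1 lets the overlapping region of the two histograms show through.
ax.hist(pre_scores, bins=30, alpha=0.5, label="pre-pandemic")
ax.hist(post_scores, bins=30, alpha=0.5, label="post-pandemic")
ax.set_xlabel("substance_use_total")
ax.set_ylabel("post count")
ax.set_title("Substance-use language before vs. after pandemic onset")
ax.legend()
fig.savefig("substance_use_overlap.png")
```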

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

@nobbynguyen (Author)

Hi @Arushi282, thank you for your comments! Regarding the Methods point - there's a section in our report discussing the methodology of our data analysis. Could you elaborate on this point so that we can improve our report?

@datallurgy

Data analysis review checklist

Reviewer: @datallurgy (Allyson Stoll)

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5 hrs

Review Comments:

  • The repo is well organized but information seems to be missing (raw and processed data, figures, etc.).
  • The license does not have the team member names, only Master of Data Science.
  • It appears tests exist for the EDA script, but I could not find them to review.
  • There is no data in the raw or processed folders.
  • The report links did not work for me. (Resulted in a 404 error no matter which link I followed for the report.) I reviewed the sections based on the TOC and the individual md files in the repo.
  • I did not see an authors list in the report. Maybe I missed it?
  • Concerning the methodology, could you explain LIWC? Something to do with word count? It doesn't make sense to me that there are 62 columns, though. It's a good idea to spell out acronyms before using them.

Overall, very clean and well organized. The analysis seems reasonable and sound, and I'm impressed to see that comments from the other reviewers have already been implemented. I agree that some additional background on Reddit may be warranted for those not familiar with the platform, and additional comments in the EDA notebooks would be helpful (in addition to axis labels and titles).

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

@Anupriya-Sri

Data analysis review checklist

Reviewer: @Anupriya-Sri (Anupriya Srivastava)

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2 hrs

Review Comments:

This study is very relevant to our time, and I really enjoyed going through the analysis and the report. The repository was well structured and had clear instructions on how to replicate the results. I particularly liked the book format of the report, as it had all sections clearly defined. I do have a few comments for your consideration:

  • Organization: The project repo was well structured and easy to navigate. The only problem I had was that I was not able to locate the dataset. This was a new domain for me, so I wanted to understand the data better before reviewing the analysis, and the absence of the .csv files made it a bit difficult.
  • Dataset: It is good that multiple datasets were used, but I am not sure the amount of data used is enough for making statistical conclusions. We are using data for only 1 year before and 1 year after the pandemic. We could use power analysis or some other technique to evaluate whether the amount of data is sufficient for a significance level of alpha = 0.05, so we can be sure of the results.
  • Methodology: I am not very strong at statistics, so I had to google what the Wilcoxon rank-sum statistic is. It would be good to provide some explanation of this test. I also could not understand why it was used as opposed to the more popular t-test, f-test, or z-test. So, it would be good to explain why a particular test is being used.
  • Readability: I think that readability across all aspects was quite good – the code was thoroughly commented, the charts had good resolution, and the report was well written. However, I think that the EDA charts were not clear in terms of the message that was being conveyed. So, adding a line or two about what they are showing may be a good idea.
  • Additional Comment: I think that the report has a few acronyms and terms that were not clear to me, such as LIWC, TF-IDF, substance_use_total, adhd, bpd, ptsd. It would be good to convert the topic names to self-explanatory terms.
    I think that the work was very well done and had a lot of positive aspects, both in terms of analysis and in terms of presentation. I wish you the best as you develop it further.
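
For readers in the same position on the methodology point, a minimal illustration of the Wilcoxon rank-sum test via scipy, using synthetic right-skewed scores rather than the project's real data (the skew is the usual reason to prefer a rank-based test over a t-test):

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(42)
# Synthetic, right-skewed stand-ins for per-post substance_use_total scores.
pre_scores = rng.exponential(scale=1.0, size=400)
post_scores = rng.exponential(scale=1.3, size=400)

# Wilcoxon rank-sum (a.k.a. Mann-Whitney U) compares the two samples'
# rank distributions without assuming normality.
stat, p_value = ranksums(pre_scores, post_scores)
print(f"statistic={stat:.3f}, p-value={p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the pre/post distributions differ.")
```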

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.


LukeAC commented Dec 9, 2021

Summary of Addressed Feedback

1. Concerning EDA figure titles, axis labels, and cosmetic changes:

eg. 1

...additional comments in the EDA notebooks would be helpful (in addition to axis labels and titles).

eg. 2

I agree with @xiangwxt that the graphs used (especially in the report) could be made more readable by including titles or figure captions. Additionally for the histograms, you could also maybe change the opacity to see overlaps between pre and post data.

See commits:

https://github.com/UBC-MDS/covid_reddit_behaviour/pull/61/commits

2. Concerning added description of charts:

eg. 1

However, I think that the EDA charts were not clear in terms of the message that was being conveyed. So, adding a line or two about what they are showing may be a good idea.

See commits:

https://github.com/UBC-MDS/covid_reddit_behaviour/pull/61/commits

3. Concerning feature descriptions:

eg. 1

I think it is important to give a little bit more background on your datasets in your Readme, especially if you are going to talk about features used. Like initially when I first started reviewing your project, the first thing I went over was your Readme and so I was definitely getting confused with what “substance_use_total” meant.

See commits:

https://github.com/UBC-MDS/covid_reddit_behaviour/pull/60/commits

4. Concerning misc. requests:

eg. 1

I think that the report has a few acronyms and terms that were not clear to me, such as LIWC, TF-IDF, ...

See commits:

UBC-MDS/covid_reddit_behaviour@1986e1e

@mel-liow mel-liow closed this as completed Nov 4, 2022