
Submission: GROUP_10: Online Shoppers Purchasing Intention #5

Open
1 task done
ytz opened this issue Nov 29, 2021 · 4 comments

ytz commented Nov 29, 2021

Submitting authors: @nicovandenhooff, @arijc76, @ytz

Repository: https://github.com/UBC-MDS/online-shoppers-purchasing-intention
Report link: https://ubc-mds.github.io/online-shoppers-purchasing-intention/intro.html
Abstract/executive summary: The research question that we are attempting to answer with our analysis is a predictive question, stated as follows:

Given clickstream and session data of a user who visits an e-commerce website, can we predict whether or not that visitor will make a purchase?

Nowadays, it is common for companies to sell their products online, with little to no physical presence such as a traditional brick-and-mortar store. Answering this question is critical for these types of companies to ensure that they remain profitable. This information can be used to nudge a potential customer in real time to complete an online purchase, increasing overall purchase conversion rates. Examples of nudges include highlighting popular products through social proof, and exit-intent overlays on webpages.

Our final model is a tuned random forest, which outputs 268 false positives and 88 false negatives. The macro-average recall score is 0.827 and the macro-average precision score is 0.748, above the budget of 0.60 that we set at the beginning of our project.
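
For readers unfamiliar with macro-averaged scores, here is a minimal sketch (not the project's code) of how they and the false positive/negative counts can be computed with scikit-learn; `y_test` and `y_pred` are hypothetical stand-ins for the test labels and model predictions.

```python
# Minimal sketch: macro-averaged precision/recall and FP/FN counts.
# y_test/y_pred are hypothetical stand-ins, not the project's data.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_test = [0, 1, 1, 0, 1, 0, 1, 0]  # true labels (1 = purchase)
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]  # model predictions

print("macro precision:", precision_score(y_test, y_pred, average="macro"))
print("macro recall:", recall_score(y_test, y_pred, average="macro"))

# For binary labels, cm[0, 1] counts false positives, cm[1, 0] false negatives.
cm = confusion_matrix(y_test, y_pred)
print("FP:", cm[0, 1], "FN:", cm[1, 0])
```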

Editor: @flor14
Reviewers: @Sanchit120496, @MacyChan, @shivajena

  • I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.
@Sanchit120496

Data analysis review checklist

Reviewer: @Sanchit120496

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 2

Review Comments:

Nice work! This is really interesting and I thoroughly enjoyed reading the source of this analysis. The project looks pretty good; below are a few issues I ran into when trying to replicate the process.

The main issues arise in the usage section of the project:

  1. The following folders are missing: data, data/raw, data/processed, and results. One has to create them manually to get the scripts running.
  2. In the usage section, please add the word "python" before every command line so that the user just has to copy and paste the code.
  3. The usage code for model selection is wrong, since there is no file in src named "ml_modelling". I think it is just a naming issue.
  4. After changing the file name to the correct one, the script throws an error in one of the assert statements, because of which I cannot run the subsequent scripts.
  5. If the user were given the exact command lines to create the environment above the usage section, it would make their life much easier.
  6. One suggestion I would make is to create placeholder folders for the ones mentioned in point 1 (see the sketch after this list).
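
On points 1 and 6, a lightweight alternative to committing placeholder files would be for the scripts to create the expected folders themselves. A minimal sketch, assuming the folder names listed in point 1:

```python
# Minimal sketch: ensure the expected output folders exist before writing.
# The folder names are taken from point 1 above.
from pathlib import Path

for folder in ("data/raw", "data/processed", "results"):
    Path(folder).mkdir(parents=True, exist_ok=True)
```

Either approach works; creating the folders at runtime keeps the repository free of dummy files, while committed placeholders make the expected layout visible up front.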

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.


MacyChan commented Dec 3, 2021

Data analysis review checklist

Reviewer: @MacyChan

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: ~90 mins

Review Comments:

This is a practical topic and I can totally see the possibility of real-life use cases in the e-commerce industry.

  1. The classification question is clearly defined, with the reasons for interest and the motivation.
  2. Clear project plan, tools, and procedures on how to achieve your final result. Is there any reason you picked those models?
  3. It would be better to have some visualisation in Introduction - Data Cleaning, for example, visualising the outliers (see the sketch after this list). I guess this also relates to the Data Analysis - Distribution part, but it is not clearly pointed out.
  4. Even though the README describes the data structure and explains the fields, it is hard to picture the data that you are studying. The EDA (big correlation graph / bar chart) is a little bit overwhelming. It would be nice to pick some important features and explain them in detail as well.
  5. The Model Selection part is easy to follow. A nitpicky comment: maybe the best score could be highlighted among the models, or some indicators/visualisations could show the differences in the scoring.
  6. I appreciate the Statement of future direction section. I know what to look forward to in the upcoming release. =)
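
To illustrate point 3, a hypothetical sketch of an outlier visualisation (the file path and column choice are assumptions, not the authors' code):

```python
# Hypothetical sketch: box plots to surface outliers in two numeric
# columns of the online-shoppers dataset. Path and columns are assumptions.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/raw/online_shoppers_intention.csv")
df[["ProductRelated_Duration", "PageValues"]].plot(
    kind="box", subplots=True, layout=(1, 2), figsize=(8, 3)
)
plt.tight_layout()
plt.show()
```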

Since @Sanchit120496 focused on the scripts, I spent more time on the reading material.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.


shivajena commented Dec 4, 2021

Data analysis review checklist

Reviewer: @shivajena

Conflict of interest

  • As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

General checks

  • Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

  • Installation instructions: Is there a clearly stated list of dependencies?
  • Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
  • Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

  • Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
  • Style guidelines: Does the code adhere to well known language style guides?
  • Modularity: Is the code suitably abstracted into scripts and functions?
  • Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

  • Data: Is the raw data archived somewhere? Is it accessible?
  • Computational methods: Is all the source code required for the data analysis available?
  • Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
  • Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

  • Authors: Does the report include a list of authors with their affiliations?
  • What is the question: Do the authors clearly state the research question being asked?
  • Importance: Do the authors clearly state the importance for this research question?
  • Background: Do the authors provide sufficient background information so that readers can understand the report?
  • Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
  • Results: Do the authors clearly communicate their findings through writing, tables and figures?
  • Conclusions: Are the conclusions presented by the authors correct?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
  • Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5

Review Comments:

I enjoyed reading the report, which is very well structured and highlights the importance of the analysis along with its practicality. For example, in the model selection metrics, linking the focus areas on the errors with the business context was excellent; as an ex-management consultant, I cannot stress enough how important this is for convincing decision makers at the senior management level. Having reviewed the work, here are my observations on some of the sections:

  1. Feature Engineering: While it is good to see the new features created, they need a bit more explanation in terms of the rationale behind the process, or why they were needed in the first place. Further, in the analysis part, they could be evaluated on whether they are statistically meaningful to add, through ANOVA. This is a bit far-fetched, but could be tried to add much more credibility.
  2. Model Selection: Hyperparameter tuning wasn't done for the different models, and as such, the models were set to their default hyperparameter values. In such a scenario, an individual model, particularly SVC (not logistic regression), may not be compared appropriately with tree-based classifiers such as XGBoost and random forest, which create multiple sub-trees and optimise the fit. Although it might be computationally very intensive, the best hyperparameters across models could be tried (see the sketch after this list), because in most cases the random forest will automatically stand out as the best classifier by this approach!
  3. The storytelling in the EDA can be a bit more aligned towards the next step. While this has been attempted at atomic levels, there could be an EDA summary section conveying the whole message in a crisp manner, leading into the next section. In other words, rather than putting key observations in subsections, you could summarise them briefly in one section for better comprehension.
  4. In presenting the distributions of the features, the x-axis scale can be truncated to ignore extreme values and visualise important features of the distributions, such as the extent of class imbalance and the central tendencies.
  5. The model tuning and results section comes to a bit of an abrupt end, without explaining the future scope of the classifications, the limitations of the analysis, or what else could be done for better predictions.
  6. The authors' names are missing in the report and may be added.
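
To make point 2 concrete, a rough sketch of tuning hyperparameters across candidate models with cross-validated random search, scored on macro-averaged recall (the parameter grids and the stand-in data are assumptions, not the project's code):

```python
# Rough sketch: cross-validated random search over two candidate models.
# Parameter grids and the synthetic data are assumptions for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, weights=[0.85], random_state=0)

searches = {
    "random forest": RandomizedSearchCV(
        RandomForestClassifier(random_state=0),
        {"n_estimators": [100, 300, 500], "max_depth": [None, 5, 10]},
        n_iter=5, scoring="recall_macro", random_state=0,
    ),
    "svc": RandomizedSearchCV(
        SVC(),
        {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
        n_iter=5, scoring="recall_macro", random_state=0,
    ),
}
for name, search in searches.items():
    search.fit(X, y)
    print(name, round(search.best_score_, 3), search.best_params_)
```

Comparing best_score_ across the searches then compares each model at roughly its best settings rather than at its defaults.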

Otherwise, I think this is one of the best reports I have read, and commendable effort has been put in here. I must say I learnt quite a lot from your analysis, such as the smart use of feature engineering, for one. All the best.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.


ytz commented Dec 10, 2021

😊 Thanks for the feedback!

  1. Regarding point 6 of @shivajena's review, we have added our names to the report, as seen in UBC-MDS/online-shoppers-purchasing-intention@91b1c67
  2. Regarding point 3 of @shivajena's review, we have summarized the key observations under data analysis in the report, as seen in UBC-MDS/online-shoppers-purchasing-intention@96be4d8
  3. Regarding point 1 of @Sanchit120496's review, we have added the missing folders, using .gitkeep instead of a dummy text file, as seen in UBC-MDS/online-shoppers-purchasing-intention@6c0383e
  4. Regarding point 5 of @shivajena's review, we have added a conclusion to our report, as seen in UBC-MDS/online-shoppers-purchasing-intention@81c701f
