Authors: Ashley Ho & Mizuho Fukuda
Course: DSC80 at UCSD
Website: https://a1ho.github.io/Recipe-Ratings-Analysis/
GitHub: https://github.com/a1ho/Recipe-Ratings-Analysis
As nutritional science has become increasingly mainstream, more and more people are making conscious decisions to adopt healthier diets when cooking at home. With so much information surrounding nutrition, what factors do we need to consider when choosing recipes that are both healthy and tasty? One nutrient that many people are concerned about is saturated fat, a kind of fat that is typically considered unhealthy and has been associated with heart and circulatory disease.
In this study, we explore this central question:
Are recipes that are lower in saturated fat content more popular than recipes that are higher in saturated fat content?
This research question explores whether individuals are conscious about making healthy food choices when deciding which recipes to try. It may be of interest to people studying dietary habits and could provide insight into the factors that influence food choices in the ever-evolving landscape of nutrition and wellness.
The dataset that we explore in this project contains recipes and ratings from food.com that have been posted since 2008, with 83,782 observations in the recipes dataset and 731,927 reviews/ratings in the interactions dataset. The columns from the recipes dataset that are of interest to us throughout this project are:
Column | Description |
---|---|
'id' | unique ID of recipe |
'name' | name of recipe |
'minutes' | reported time each recipe takes to make, in minutes |
'nutrition' | nutrition information in the form [calories (#), total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), carbohydrates (PDV)]; PDV stands for "percentage of daily value" |
'n_steps' | number of steps to make recipe |
'n_ingredients' | number of ingredients to make recipe |
The columns from the interactions dataset that are of interest to us throughout this project are:
Column | Description |
---|---|
'recipe_id' | unique ID of recipe |
'rating' | rating (out of 5) of recipe given by a reviewer |
The recipes DataFrame contains one row per recipe, and the interactions DataFrame contains one row per review of a recipe. The id column in recipes and the recipe_id column in interactions identify the same recipes, so we merge the two DataFrames with a left merge on recipes. This results in a DataFrame merged with one row per review of every recipe that appears in recipes.
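The merge described above can be sketched as follows. This is a minimal illustration on toy data, not the project's actual code; the small `recipes` and `interactions` frames here are made-up stand-ins for the food.com files.

```python
import pandas as pd

# Toy stand-ins for the real datasets (values are invented for illustration).
recipes = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["pie", "soup", "salad"],
})
interactions = pd.DataFrame({
    "recipe_id": [1, 1, 2],
    "rating": [5, 0, 4],
})

# A left merge keeps every recipe, including those with no reviews (recipe 3),
# and produces one row per review for recipes that have reviews.
merged = recipes.merge(
    interactions, how="left", left_on="id", right_on="recipe_id"
)
print(merged)
```

Recipes without any matching review still appear once in `merged`, with missing values in the columns that come from `interactions`.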
From merged we calculate the average 'rating' for each recipe. Before doing so, we replace all ratings of 0 with np.nan, since a rating of 0 may actually mean that the rating is missing, for example if a reviewer forgot to add a rating at the end of their review. This ensures that missing ratings are not counted as ratings of 0 when calculating the average. Then, we add these average ratings as a new column 'mean_rating' to the recipes DataFrame.
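A small sketch of this step, assuming toy data: ratings of 0 become np.nan, and the per-recipe mean then skips them. Mapping the result back onto recipes via `Series.map` is one possible way to attach the column and is an assumption, not necessarily how the project does it.

```python
import numpy as np
import pandas as pd

# Toy review-level data (invented values); recipe 3 only has a 0 rating.
merged = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "rating": [5, 0, 4, 0],
})
recipes = pd.DataFrame({"id": [1, 2, 3]})

# Treat a rating of 0 as missing rather than as a genuine score.
merged["rating"] = merged["rating"].replace(0, np.nan)

# groupby(...).mean() ignores NaN by default, so recipe 1 averages to 5.0
# instead of 2.5, and recipe 3 has no defined mean.
mean_rating = merged.groupby("id")["rating"].mean()

# Attach the per-recipe mean to the recipe-level frame.
recipes["mean_rating"] = recipes["id"].map(mean_rating)
print(recipes)
```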
For our research question, we focus on the popularity of a recipe. For the purposes of this project, we measure a recipe's popularity by the number of reviews it receives, since we reason that recipes with more reviews have reached a wider audience and hence are more popular. We use the merged DataFrame to calculate the number of reviews each recipe receives and add these values as a new column 'n_reviews' to the recipes DataFrame.
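One way to count reviews per recipe is sketched below on toy data. It assumes that a review is counted when the 'review' text is present; the write-up does not spell out the exact counting rule, so treat this as an illustration only.

```python
import pandas as pd

# Toy review-level data (invented values); recipe 3 has no review text.
merged = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "review": ["great", "ok", "nice", None],
})
recipes = pd.DataFrame({"id": [1, 2, 3]})

# count() tallies non-missing entries per group, so a recipe whose only
# row has a missing review gets a count of 0.
n_reviews = merged.groupby("id")["review"].count()

recipes["n_reviews"] = recipes["id"].map(n_reviews)
print(recipes)
```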
After checking the data types of the columns, we notice that the 'nutrition' column, which is present in both recipes and merged, actually contains strings that are formatted as lists instead of actual lists. For both DataFrames, we split the values in the 'nutrition' column into these separate columns: 'calories', 'total_fat', 'sugar', 'sodium', 'protein', 'saturated_fat', and 'carbohydrates'. We note that all of these values except 'calories' are in PDV units, or percentage of daily value; 'calories' is a raw count.
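The split of the 'nutrition' strings could look roughly like this. The sketch assumes each string is a valid Python list literal and parses it with `ast.literal_eval`; the helper name `NUTRITION_COLS` is introduced here for illustration.

```python
import ast

import pandas as pd

# Column order follows the 'nutrition' description in the table above.
NUTRITION_COLS = [
    "calories", "total_fat", "sugar", "sodium",
    "protein", "saturated_fat", "carbohydrates",
]

# One toy row whose string mirrors the list format of the real column.
recipes = pd.DataFrame({
    "id": [275022],
    "nutrition": ["[386.1, 34.0, 7.0, 24.0, 41.0, 62.0, 8.0]"],
})

# Parse each string into a real list, then expand the lists into columns.
parsed = recipes["nutrition"].apply(ast.literal_eval)
nutrition = pd.DataFrame(
    parsed.tolist(), index=recipes.index, columns=NUTRITION_COLS
)

# Replace the original string column with the seven numeric columns.
recipes = pd.concat([recipes.drop(columns="nutrition"), nutrition], axis=1)
print(recipes)
```

`ast.literal_eval` only evaluates literals, which makes it a safer choice than `eval` for strings read from a file.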
Since the contents of the 'review' and 'description' columns are not relevant to our analysis (except for their missingness), we truncate these long strings to their first 20 characters for a better view of the DataFrame.
For the purposes of our analysis, we keep only the 'id', 'name', 'minutes', 'n_ingredients', 'n_steps', 'mean_rating' ('rating' in merged), 'n_reviews' ('review' in merged) and all the nutrition columns in both recipes and merged (merged also retains 'description' for the missingness analysis). We also sort the recipes DataFrame by id for organizational purposes.
Here are the first 5 rows of the cleaned recipes DataFrame:
id | name | minutes | calories | total_fat | sugar | sodium | protein | saturated_fat | carbohydrates | n_ingredients | n_steps | mean_rating | n_reviews |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
275022 | impossible macaroni and cheese pie | 50 | 386.1 | 34 | 7 | 24 | 41 | 62 | 8 | 7 | 11 | 3 | 3 |
275024 | impossible rhubarb pie | 55 | 377.1 | 18 | 208 | 13 | 13 | 30 | 20 | 8 | 6 | 3 | 1 |
275026 | impossible seafood pie | 45 | 326.6 | 30 | 12 | 27 | 37 | 51 | 5 | 9 | 7 | 3 | 2 |
275030 | paula deen s caramel apple cheesecake | 45 | 577.7 | 53 | 149 | 19 | 14 | 67 | 21 | 9 | 11 | 5 | 10 |
275032 | midori poached pears | 25 | 386.9 | 0 | 347 | 0 | 1 | 0 | 33 | 9 | 8 | 5 | 1 |
Here are the first 5 rows of the cleaned merged DataFrame:
id | name | description | minutes | calories | total_fat | sugar | sodium | protein | saturated_fat | carbohydrates | n_ingredients | n_steps | rating | review |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
333281 | 1 brownies in the world best ever | these are the most; ... | 40 | 138.4 | 10 | 50 | 3 | 3 | 19 | 6 | 9 | 10 | 4 | These were pretty go... |
453467 | 1 in canada chocolate chip cookies | this is the recipe t... | 45 | 595.1 | 46 | 211 | 22 | 13 | 51 | 26 | 11 | 12 | 5 | Originally I was gon... |
306168 | 412 broccoli casserole | since there are alre... | 40 | 194.8 | 20 | 6 | 32 | 22 | 36 | 3 | 9 | 6 | 5 | This was one of the ... |
306168 | 412 broccoli casserole | since there are alre... | 40 | 194.8 | 20 | 6 | 32 | 22 | 36 | 3 | 9 | 6 | 5 | I made this for my s... |
306168 | 412 broccoli casserole | since there are alre... | 40 | 194.8 | 20 | 6 | 32 | 22 | 36 | 3 | 9 | 6 | 5 | Loved this. Be sure... |
Here we plot the distribution of the 'minutes' column. We temporarily drop all recipes with 'minutes' greater than 600 to get a better view of the plot. When exploring the 'minutes' column, we found a recipe titled 'how to preserve a husband' with more than 1 million minutes of cooking time. We think that many of the extreme outliers are likely caused by fake recipes like this one. Temporarily dropping these outliers does not materially change the picture here, since only about 1% of the recipes have a cooking time of more than 600 minutes. We see that most recipes have a cooking time of under 60 minutes, with 30 minutes being the most common.
We also looked at the distribution of the 'mean_rating' column. It shows that the reviews on food.com are overwhelmingly positive: most of the mean ratings are above 4, and more than half of the mean ratings are 5.
The scatter plot below shows the 'mean_rating' column vs. the 'saturated_fat' column. Because the mean ratings are so heavily negatively skewed, we cannot conclude that there is any meaningful correlation between 'mean_rating' and 'saturated_fat'. We also noticed some outliers in 'saturated_fat', which may also be a result of fake recipes like the one mentioned in the above section.
This second scatterplot shows the 'saturated_fat' column vs. the 'n_steps' column. Contrary to our intuition, recipes with more steps tend to have lower saturated fat. This could be due to the impact of outliers, as there are many recipes with abnormally high saturated fat content.
Here we see the conditional distribution of 'mean_rating' for higher vs. lower calories. Again, for the sake of the analysis, we ignore all rows with calories higher than 20,000. We define "low calories" as calories lower than the median and "high calories" as calories higher than the median. We see that while the mean ratings in both categories are still overwhelmingly positive, recipes with lower calories seem to have a larger variance in mean rating. This can be seen in the isolated blue bars around mean ratings of 1 to 3. We hypothesize that since the 'fake' recipes tend to have extreme calories, the recipes with lower calories are probably more legitimate, and thus have more meaningful ratings.
Note: the y-axis is in log scale for better visualization.
<iframe src="assets/conditional_logcount.html" width=800 height=600 frameBorder=0></iframe>

The histogram below shows the trend of average 'minutes' when grouped by 'n_ingredients'. For most of the plot, we can clearly see a positive correlation between the average minutes required for a recipe and its number of ingredients. The trend does not continue for recipes with more than 28 ingredients, however. This could simply be because few recipes have more than 28 ingredients, which makes the mean minutes for those groups less reliable.
We use the merged DataFrame for the entirety of this section. Here is the count of missing values in each column of merged:
column | n_missing |
---|---|
id | 0 |
name | 1 |
description | 114 |
minutes | 0 |
calories | 0 |
total_fat | 0 |
sugar | 0 |
sodium | 0 |
protein | 0 |
saturated_fat | 0 |
carbohydrates | 0 |
n_ingredients | 0 |
n_steps | 0 |
rating | 15036 |
review | 58 |
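A per-column missingness count like the one above can be produced with `isna().sum()`. The sketch below uses a toy frame with invented values rather than the real merged data.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing entries (invented values).
merged = pd.DataFrame({
    "id": [1, 1, 2],
    "description": ["a pie", "a pie", None],
    "rating": [5.0, np.nan, np.nan],
})

# isna() marks missing cells; summing down each column counts them.
n_missing = merged.isna().sum().to_frame(name="n_missing")
print(n_missing)
```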
We believe that the 'description' column is NMAR because certain recipes may not have much to describe, and their descriptions are therefore left blank. For example, recipes for foods such as cookies or hot chocolate may not require much of an explanation, and thus their recipes are not accompanied by a description. To make this missingness MAR, we could collect data on how common each food item is, since we believe that more popular or well-known dishes may not need a description while more uncommon foods, like those that are specific to a culture, may be more likely to require one.
From the missingness summary above, we notice that the 'rating' column has a substantial number of missing values compared to the 'description' and 'review' columns. In this section, we conduct two separate permutation tests to analyze the dependence of the 'rating' column's missingness on the 'saturated_fat' column and the 'minutes' column.
The above plot shows the distribution of the 'saturated_fat' column for missing and non-missing 'rating'. For this permutation test we analyze whether the missingness of the ratings depends on the saturated fat content. We use the difference in group means as our test statistic, as 'saturated_fat' is numerical. For this test, we have the following:
- Null Hypothesis: The saturated fat for recipes with missing ratings and recipes with non-missing ratings are drawn from the same distribution (i.e. group_mean(missing) - group_mean(non-missing) = 0).
- Alternate Hypothesis: The mean saturated fat for recipes with missing ratings is greater than that of the recipes with non-missing ratings (i.e. group_mean(missing) - group_mean(non-missing) > 0).
Here is the plot of the results from our permutation test using 2000 permutations:
<iframe src="assets/missing_rating_fat_perm.html" width=800 height=600 frameBorder=0></iframe>

We get a p-value of 0.0, which is lower than the significance level of 0.05, and therefore we reject the null hypothesis. As such, we conclude that the missingness in the 'rating' column depends on the 'saturated_fat' column. In other words, the missingness of ratings appears to be MAR, dependent on saturated fat.
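The kind of permutation test used in this section can be sketched as below. This is a simplified illustration on toy data, not the project's code: the function name `missingness_perm_test`, the seed, and the toy values are all assumptions. It shuffles the missingness labels and uses the one-sided statistic mean(missing) - mean(non-missing).

```python
import numpy as np
import pandas as pd


def missingness_perm_test(df, col, missing_col, n_perm=2000, seed=0):
    """One-sided permutation test: is mean(col) larger where missing_col is NaN?"""
    rng = np.random.default_rng(seed)
    is_missing = df[missing_col].isna().to_numpy()
    values = df[col].to_numpy(dtype=float)

    def diff_in_means(mask):
        # mean of the "missing" group minus mean of the "non-missing" group
        return values[mask].mean() - values[~mask].mean()

    observed = diff_in_means(is_missing)
    # Shuffling the labels breaks any link between missingness and col.
    sims = np.array(
        [diff_in_means(rng.permutation(is_missing)) for _ in range(n_perm)]
    )
    # Share of shuffled statistics at least as large as the observed one.
    p_value = np.mean(sims >= observed)
    return observed, p_value


# Toy data (invented): ratings are missing only for high-fat recipes.
toy = pd.DataFrame({
    "saturated_fat": [10, 12, 11, 9, 50, 55, 60, 52],
    "rating": [5, 4, 5, 4, np.nan, np.nan, np.nan, np.nan],
})
observed, p_value = missingness_perm_test(toy, "saturated_fat", "rating")
print(observed, p_value)
```

For a two-sided alternative, such as the one used for 'minutes', the comparison would instead use absolute values of the simulated and observed statistics.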
The above plot shows the distribution of the 'minutes' column for missing and non-missing 'rating'. For this permutation test we analyze whether the missingness of the ratings depends on the minutes each recipe takes to make. We use the difference in group means as our test statistic, as 'minutes' is numerical. For this test, we have the following:
- Null Hypothesis: The minutes for recipes with missing ratings and recipes with non-missing ratings are drawn from the same distribution (i.e. group_mean(missing) - group_mean(non-missing) = 0).
- Alternate Hypothesis: The minutes for recipes with missing ratings and recipes with non-missing ratings are not drawn from the same distribution (i.e. group_mean(missing) - group_mean(non-missing) != 0).
Here is the plot of the results from our permutation test using 2000 permutations:
<iframe src="assets/missing_rating_min_perm.html" width=800 height=600 frameBorder=0></iframe>

We get a p-value of 0.1205, which is greater than the significance level of 0.05, and therefore we fail to reject the null hypothesis. As such, we cannot conclude that the missingness in the 'rating' column depends on the 'minutes' column. In other words, the missingness of ratings does not appear to depend on the minutes.
Now we return to our research question:
Are recipes that are lower in saturated fat content more popular than recipes that are higher in saturated fat content?
Note that we use the recipes DataFrame for the entirety of this section, specifically the 'n_reviews' and 'saturated_fat' columns. For the purposes of this analysis, we define a recipe to be popular if it has received more than the median number of reviews, and unpopular if it has received fewer than the median number of reviews; we add this categorization to recipes in a new column called popularity. We use the number of reviews rather than the mean rating as a gauge of popularity because the ratings in this dataset are overwhelmingly high, so the mean ratings have too little variability to give meaningful results for our question. The number of reviews lets us estimate how many people attempted a specific recipe, and we assume that most people evaluate the nutritional information when choosing which recipes to try. Thus, we treat the number of reviews as a reasonable estimate of a recipe's popularity. Note that in this context, popularity does not imply a positive review of a recipe, only the number of people who attempted it.
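The median split can be sketched as follows on toy data. How recipes with exactly the median number of reviews are labeled is not stated in the write-up; this sketch assumes they fall into the unpopular group.

```python
import numpy as np
import pandas as pd

# Toy recipe-level data (invented values).
recipes = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "n_reviews": [1, 1, 2, 3, 10],
})

median_reviews = recipes["n_reviews"].median()

# Assumption: ties at the median are labeled "unpopular".
recipes["popularity"] = np.where(
    recipes["n_reviews"] > median_reviews, "popular", "unpopular"
)
print(recipes)
```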
Here is a boxplot of the distribution of saturated fat for popular versus unpopular recipes:
<iframe src="assets/popularity_box.html" width=800 height=600 frameBorder=0></iframe>

In order to analyze this question, we run a permutation test with the difference in group means as our test statistic, since the 'saturated_fat' column is numerical and, from the plot above, the shapes of the two distributions look roughly the same. We choose a significance level of 0.05 and have the following hypotheses:
- Null Hypothesis: The saturated fat contents of popular recipes and unpopular recipes come from the same distribution (i.e. group_mean(unpopular) - group_mean(popular) = 0).
- Alternate Hypothesis: The saturated fat content of popular recipes is lower than the saturated fat content of unpopular recipes (i.e. group_mean(unpopular) - group_mean(popular) > 0).
- Test Statistic: mean saturated fat of unpopular recipes - mean saturated fat of popular recipes
Here is the plot of the results from our permutation test using 10,000 permutations:
<iframe src="assets/hyp_test.html" width=800 height=600 frameBorder=0></iframe>

We get a p-value of 0.0, which is lower than the significance level of 0.05, and therefore we reject the null hypothesis.
From these results, we infer that many people may be conscious of saturated fat content when trying new foods and likely avoid recipes that are high in saturated fat because they are believed to be quite unhealthy. This result could also be driven by fake recipes, which people do not use or review and which have unreasonably high amounts of saturated fat.