Analyzing white wines by Thorben Schlätzer

Short Intro

This project was written in R using RStudio.

This data set contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). In the following, we will analyze which chemical properties influence the quality of white wines.

Univariate Plots Section

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00
##  Median : 5.200   Median :0.04300   Median : 34.00
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00
##  total.sulfur.dioxide    density             pH          sulphates
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800
##     alcohol         quality
##  Min.   : 8.00   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.40   Median :6.000
##  Mean   :10.51   Mean   :5.878
##  3rd Qu.:11.40   3rd Qu.:6.000
##  Max.   :14.20   Max.   :9.000

Our dataset consists of twelve variables and X as the descriptor for each wine, with almost 5,000 observations.

Acidity variables

We have:

Fixed Acidity variable explains: tartaric acid as g / dm^3
Short description: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

Volatile acidity variable explains: acetic acid as g / dm^3
Short description: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste

Citric acid variable explains: citric acid as g / dm^3
Short description: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

At first, all of the acidity variables will be plotted.

##
##  3.8  3.9  4.2  4.4  4.5  4.6  4.7  4.8  4.9    5  5.1  5.2  5.3  5.4  5.5
##    1    1    2    3    1    1    5    9    7   24   23   28   27   28   31
##  5.6  5.7  5.8  5.9    6  6.1 6.15  6.2  6.3  6.4 6.45  6.5  6.6  6.7  6.8
##   71   88  121  103  184  155    2  192  188  280    1  225  290  236  308
##  6.9    7  7.1 7.15  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9    8  8.1  8.2
##  241  232  200    2  206  178  194  123  153   93   93   74   80   56   56
##  8.3  8.4  8.5  8.6  8.7  8.8  8.9    9  9.1  9.2  9.3  9.4  9.5  9.6  9.7
##   52   35   32   25   15   18   16   17    6   21    3   11    2    5    4
##  9.8  9.9   10 10.2 10.3 10.7 11.8 14.2
##    8    2    3    1    2    2    1    1

Most values in the histogram lie between slightly under 6 and slightly under 9. Let’s try using the 1% and 99% quantiles as limits and a smaller bin width to get a clearer distribution.
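The trimming step can be sketched roughly as follows. This is only an illustration, not the code behind the plots: `trim_quantiles` is a hypothetical helper, and `x` is simulated stand-in data rather than the wine data.

```r
# Hypothetical helper: keep only the values between two quantiles.
trim_quantiles <- function(x, lower = 0.01, upper = 0.99) {
  limits <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  x[!is.na(x) & x >= limits[1] & x <= limits[2]]
}

# Simulated, right-skewed stand-in variable (NOT the wine data).
set.seed(1)
x <- rlnorm(1000, meanlog = 2, sdlog = 0.3)

# Drop the lowest and highest 1% of the values.
trimmed <- trim_quantiles(x)

# Bin the trimmed values into roughly 50 narrow bins;
# plot = FALSE only computes the counts without drawing.
bins <- hist(trimmed, breaks = 50, plot = FALSE)
length(trimmed)
```

Calling `hist(trimmed, breaks = 50)` without `plot = FALSE` would draw the histogram itself.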

Looks like we have the same problem here. Let’s do the same thing we did with fixed acidity.

This shows us a clear distribution peaking at about 0.26 - 0.28.

Making better visualizations

All in all, the plots show us the distribution, but there is a better way to visualize the different variables. Let us investigate the distribution more closely by removing the first and last 1% and setting the bin width to a lower number.

As expected, only a very small percentage lies above the 9 g / dm^3 mark or under 5 g / dm^3.

This variable is more skewed to the right. As the summary tells us, the median and the mean are close together at about 0.26 to 0.28 g / dm^3.

This histogram confirms the earlier distribution, but lets us see the one peak at 0.49.

Conclusion for acidity levels

You can see that the distribution looks somewhat similar for each of the acids. The next chapter will show how the different acids in the wines correlate with each other and how they influence the quality of the white wine.

Residual sugar

variable explains: residual sugar as g / dm^3
short description: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet.

Let’s investigate how the variable is distributed:

## [1] 65.8

Looks like there is a strongly right-skewed distribution with only one wine considered as “sweet” (65.8 g/dm^3).

Let’s look at the values that are most common (cutting out the first 1% and the last 1%):

This plot gives more detailed information about the distribution within the most common residual sugar values.

Chlorides in white wine

variable explains: sodium chloride as g / dm^3
short description: the amount of salt in the wine

You really don’t notice the salt in the wine. Let’s see how much the amount of salt per liter differs between the wines. Let’s use the 1% and 99% quantiles of the data to get a better view of the most common values.

As we have already seen for other variables, this distribution is skewed to the right and lies mostly in a range from 0.01 to 0.16. In the next chapter, we will have a look at how the quality is influenced by the salt.

Free sulfur dioxide & total sulfur dioxide

Free sulfur dioxide variable explains: free sulfur dioxide as mg / dm^3
short description: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.

Total sulfur dioxide variable explains: total sulfur dioxide as mg / dm^3
short description: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.

Obviously, the two variables seem to be correlated. Let’s plot the distribution:

This looks roughly normally distributed for both variables. In the next chapter, we will discuss which sulfur dioxide actually influences quality. Before that, let’s create a variable that contains “other sulfur dioxide” (total - free).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     4.0    78.0   100.0   103.1   125.0   331.0

Seems legit.
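As a small sketch of how such a derived variable can be built: the values below are the first ten rows of the two sulfur dioxide columns as shown in the `str()` output at the top, and the data frame name `wines_head` is made up for this illustration. The column name `other.sulfur.dioxide` matches the one that appears in the correlation output later on.

```r
# First ten rows of the two columns, copied from the str() output above.
wines_head <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)
)

# "Other" (bound) sulfur dioxide is what remains after subtracting the free part.
wines_head$other.sulfur.dioxide <-
  wines_head$total.sulfur.dioxide - wines_head$free.sulfur.dioxide

# Five-number summary plus mean of the new column.
summary(wines_head$other.sulfur.dioxide)
```

Because total sulfur dioxide covers the free and bound forms, the new column should never be negative, which is a quick sanity check for the full data set as well.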

Density

variable explains: density as g / cm^3
short description: the density of wine is close to that of water depending on the percent alcohol and sugar content.

This histogram shows a distribution that looks almost normal, only slightly right-skewed.

pH-value

variable explains: pH on a scale of 0 to 14
short description: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic)

Most wines seem to be between 3 and 4 on the pH scale. For that reason we will not put the pH value into categories as suggested, because the range that contains most of the values is nowhere near either end of the scale.

Sulphates

variable explains: potassium sulphate as g / dm^3
short description: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

This is a right-skewed distribution where the first 1% and the last 1% are cut out. The next chapter will analyze the relation between sulphates and dioxides.

Alcohol

variable explains: alcohol as % by volume
short description: the percent alcohol content of the wine

This distribution is also right-skewed. This will be analyzed further in the next chapter.

Quality

Last, but definitely not least, the quality will be presented as a very important factor for our analysis.

variable explains: quality (score between 0 and 10)
short description: At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent)

It seems that no wine received a quality of 0, 1, 2 or 10. The most frequently given quality is 6.

Let’s create a new variable containing the description of the quality of wine. I chose the following categories:

3, 4, 5 = bad
6 = normal
7, 8, 9 = good

##    bad normal   good
##   1640   2198   1060
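One way to derive such a category from the numeric score is `cut()`. The sketch below is an assumption about how this could be done, not necessarily the code used here; the name `quality.label` is hypothetical, and the input is simply each rating that occurs in the data (3 to 9) once.

```r
# Each rating that occurs in the data set, once (the summary shows min 3, max 9).
quality <- 3:9

# Intervals are right-closed: (2,5] = bad, (5,6] = normal, (6,9] = good.
quality.label <- cut(quality,
                     breaks = c(2, 5, 6, 9),
                     labels = c("bad", "normal", "good"))

# Show which rating ends up in which category.
data.frame(quality, quality.label)
```

Applied to the full `quality` column, `table()` on the result gives the counts per category.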

Knowing what the variables are about, we can dive deeper into analyzing the dependencies between the variables more closely…

Univariate Analysis

What is the structure of your dataset?

The dataset consists of twelve variables with almost 5,000 white wines.

What is/are the main feature(s) of interest in your dataset?

The main feature to take a look at is the quality, which is the rating of experts from 0 to 10. The goal here is to take a look at the factors that influence the quality of the wine. The ingredients (the other variables) often have almost normal distributions, most of them skewed to the right.

The end goal is to be able to predict the white wine quality by only having the ingredients of it.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

There are some ingredients (e.g. the different acid levels) that seem to be distributed in the same way. If we find a high correlation, we could combine more than one variable into a single factor and look at how they influence the quality of wines in total.

Did you create any new variables from existing variables in the dataset?

I created “other sulfur dioxide” as a new variable to be able to see the effect of other sulfur dioxides on the quality of wine.

Bivariate Plots Section

Table of correlations

First, let’s see which correlations the dataset contains overall. I found the library “PerformanceAnalytics” on www.sthda.com. It produces a nice graphic analyzing the correlations.

There’s another way to show the strength of each correlation.

We can easily see the correlations of each variable to another. Let’s dive deeper into analyzing the interdependence between different variables.
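For readers without the plotting packages, the numbers behind such a graphic can be obtained with base R's `cor()`. The sketch below runs on a simulated toy data frame; the three columns, their values and the coefficients used to generate them are arbitrary assumptions for illustration and are not results from the wine data.

```r
# Simulated toy data (NOT the wine data); all coefficients are arbitrary.
set.seed(42)
n <- 200
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
density <- 1.02 - 0.0025 * alcohol + rnorm(n, sd = 0.001)
quality <- round(6 + 0.4 * (alcohol - 10.5) + rnorm(n, sd = 0.8))
toy <- data.frame(alcohol, density, quality)

# Pearson correlation matrix of all numeric columns.
corr <- cor(toy, method = "pearson")

# Correlation of every other variable with quality, strongest first.
with_quality <- corr[rownames(corr) != "quality", "quality"]
with_quality[order(abs(with_quality), decreasing = TRUE)]
```

Sorting by absolute value keeps strong negative correlations next to strong positive ones, which is what matters when picking candidates for a closer look.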

Relationship between acidity and quality

The first part of the correlation matrix shows a moderately negative correlation between volatile acidity and quality. Let’s look more closely.

You cannot really see the negative correlation. Let’s try the variable containing all acid levels.

This doesn’t represent the correlation either. The means and distributions seem to be less scattered, though. This variable might be a better choice for a predictive model.

Correlation between density and quality

This correlation seems to be more obvious: we can see that the density gets slightly lower with higher quality. The range of density levels also gets smaller.

Correlation of alcohol concentration and quality

The alcohol concentration for different qualities of wine seems to vary a lot, whereas the higher quality wines tend to have rather higher values between 10 and 13.5 %. But you can’t really say there is a specific concentration that hints at the quality of white wines.

Correlation of other sulfur dioxide and quality

The last variable that seems to have a rather high impact on quality is the one we derived as “other sulfur dioxide”. Let’s look at the amount of other sulfur dioxide and its impact on the quality.

The range of the amount of other sulfur dioxide seems to become smaller the higher the quality gets. In the next chapter we will take a look at all the sulfur dioxides the dataset holds to see the whole impact.

Other sulfur dioxides and free sulfur dioxides

Now we want to take a look at how the free sulfur dioxides are correlated with the other sulfur dioxides, to get an idea of whether there is a dependency in the use of different sulfur dioxides.

##
##  Pearson's product-moment correlation
##
## data:  wines$other.sulfur.dioxide and wines$free.sulfur.dioxide
## t = 19.116, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2372821 0.2894077
## sample estimates:
##       cor
## 0.2635373

There seems to be a weak positive correlation (about 0.26) between free and other sulfur dioxide, but it doesn’t give clear evidence of a dependency.
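To make the test output a bit less of a black box, the reported statistics can be recomputed from the correlation and the degrees of freedom alone. This is a sketch using the standard formulas for the Pearson test, t = r * sqrt(df) / sqrt(1 - r^2), and a confidence interval via the Fisher z-transformation; the only inputs are the two numbers printed above.

```r
r  <- 0.2635373   # sample correlation reported above
df <- 4896        # degrees of freedom reported above (n - 2)

# Test statistic of the Pearson correlation test.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95 % confidence interval via the Fisher z-transformation.
n  <- df + 2
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

t_stat
ci
```

Both results should come out close to the values in the output above. With almost 5,000 observations even a weak correlation yields a very large t value, which is why the tiny p-value says little about the strength of the relationship.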

Impacts on quality

Let’s summarize the chapter with the representation of different impacts on quality by showing the variables in boxplots on different axes:

As you can see, we have relatively clear differences for only a few variables. It seems that especially alcohol and density have a big impact on white wine quality.

All Acids vs. pH

## # A tibble: 6 x 4
##      pH acid_mean acid_median     n
##   <dbl>     <dbl>       <dbl> <int>
## 1  2.72     10.5        10.5      1
## 2  2.74     10.4        10.4      1
## 3  2.77     10.6        10.6      1
## 4  2.79      8.64        8.98     3
## 5  2.8       8.72        8.6      3
## 6  2.82      7.61        7.61     1

You can see that the higher the pH value, the lower the acid mean gets. That shows the great impact the acid levels have on the pH value.
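A grouped summary like the one above can be sketched in base R as follows. Two things here are assumptions: that the combined acid variable is the sum of the three acidity columns, and the name `total.acidity` for it. The rows are the first ten wines from the `str()` output at the top, so the numbers will not match the table above, which covers the whole data set.

```r
# First ten rows of the relevant columns, copied from the str() output above.
wines_head <- data.frame(
  fixed.acidity    = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.3, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27, 0.3, 0.22),
  citric.acid      = c(0.36, 0.34, 0.4, 0.32, 0.32, 0.4, 0.16, 0.36, 0.34, 0.43),
  pH               = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22)
)

# ASSUMPTION: "all acids" is taken to be the sum of the three acidity columns.
wines_head$total.acidity <- with(wines_head,
                                 fixed.acidity + volatile.acidity + citric.acid)

# Mean, median and count of the combined acidity for each distinct pH value.
acid_by_ph <- data.frame(
  pH          = sort(unique(wines_head$pH)),
  acid_mean   = as.vector(tapply(wines_head$total.acidity, wines_head$pH, mean)),
  acid_median = as.vector(tapply(wines_head$total.acidity, wines_head$pH, median)),
  n           = as.vector(tapply(wines_head$total.acidity, wines_head$pH, length))
)
acid_by_ph
```

The `n` column is worth keeping: many pH values occur only once or a few times, so the corresponding means rest on very few wines.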

Residual sugar vs density

##
##  Pearson's product-moment correlation
##
## data:  wines$residual.sugar and wines$density
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8304732 0.8470698
## sample estimates:
##       cor
## 0.8389665

You can see that the density rises the more sugar the wine has. In the multivariate analysis part we will take a look at how the different qualities are distributed within this relationship.

Total sulfur dioxide vs density

##
##  Pearson's product-moment correlation
##
## data:  wines$total.sulfur.dioxide and wines$density
## t = 43.719, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5094349 0.5497297
## sample estimates:
##       cor
## 0.5298813

The points are a lot more scattered, but show the same trend.

Density vs. alcohol

##
##  Pearson's product-moment correlation
##
## data:  wines$density and wines$alcohol
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7908646 -0.7689315
## sample estimates:
##        cor
## -0.7801376

There is a clear trend as well. Now we know the biggest impacts on density within our dataset.

Bivariate Analysis

How did the feature(s) of interest vary with other features in
the dataset?

The clearest observation about the dataset is that there are a lot of factors that have an impact on the quality of white wines.

To name the four main factors:

acidity
There are three different variables that describe the acidity of white wines. Volatile acidity has the highest negative correlation with quality. But it is still quite moderate.

sugar
Although sugar has only a small impact on quality in the correlation matrix, you can clearly see that good wines are less spread out in their sugar concentration and tend to contain less sugar. It also seems to affect the density a lot, which is one of the biggest factors for wine quality.

density
The density of white wines itself already depends on different variables (see above) and has a moderate impact on the quality of white wine.

alcohol
Alcohol has the strongest correlation with quality. The more alcohol the wine has, the better it seems to be.

We obviously do not have THE factor that can predict the quality of wine. Instead, we look at a range of factors - that have interdependencies as well (e.g. sugar/density, alcohol/density, sulfur dioxides/density).

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

The relationships between density and other variables of the dataset show interdependencies as well. This is important to know to be able to predict quality levels in the end.

What was the strongest relationship you found?

The strongest relationship for the quality of wine seems to be the percentage of alcohol. But as I said, you have to include multiple variables to be able to make predictions for wine quality.

Multivariate Plots Section

After we saw the interdependencies within the dataset, we can analyze more than two variables together.

Density, Alcohol and quality

There seems to be a trend that the quality rises when the density is lower and the alcohol content is higher.

Density/Sugar and quality

This plot shows us that lower quality wines tend to have a higher density, which is raised by the concentration of sugar.

Alcohol, residual sugar and Density

## [0.987,0.992] (0.992,0.995]  (0.995,1.04]
##          1667          1600          1631
##    low medium   high
##   1667   1600   1631

You can see that the density is suited best in combination with higher sulfur dioxide.
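The two tables above have roughly equal group sizes, which suggests buckets cut at the terciles. One way to build such buckets in base R is sketched below. The variable names `density.bucket` and `density.level` are hypothetical, the input is simulated rather than the wine data, and the code actually used may differ.

```r
# Simulated stand-in for the density column (NOT the wine data).
set.seed(7)
density <- rnorm(300, mean = 0.994, sd = 0.003)

# Break points at the minimum, the two terciles and the maximum.
breaks <- quantile(density, probs = c(0, 1/3, 2/3, 1))

# Same cut twice: once with interval labels, once with readable names.
# include.lowest = TRUE keeps the minimum value inside the first bucket.
density.bucket <- cut(density, breaks = breaks, include.lowest = TRUE)
density.level  <- cut(density, breaks = breaks, include.lowest = TRUE,
                      labels = c("low", "medium", "high"))

table(density.bucket)
table(density.level)
```

With many tied values, as in the real data, the groups will not be exactly equal in size, which matches the slightly different counts shown above.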

Mix of sugar and alcohol

We have already seen that sugar and alcohol have an impact on white wine quality. Let’s see if there is a special pattern.

This plot doesn’t give us a perfect concentration for either sugar or alcohol. You can see, though, that the low quality wines are distributed in the lower area of the plot, where the alcohol concentration is low, whereas the sugar concentration is spread from 0 to almost 20. The higher quality wines seem to be distributed in the upper left corner of the plot, with higher alcohol and lower sugar concentration.

pH value and its impact

## [2.72,3.12] (3.12,3.24] (3.24,3.82]
##        1709        1622        1567
##    low medium   high
##   1709   1622   1567

There is no way to find out anything about the concentration of sulfur dioxides and acids for the pH value from this plot.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

What definitely strengthens each other is that alcohol concentration is a huge factor for wine quality, as is density. Looking from different points of view, by taking more than two variables into account, you get a clear picture of how big the roles of density and alcohol are. But at the same time, alcohol seems to be a factor for density as well.

You could observe that sugar has an impact on wine quality too, since it is linked to density.

Were there any interesting or surprising interactions between features?

The interaction of sugar, alcohol and density was interesting. Quite surprising for me is the fact that the pH value doesn’t have that great an impact on the wine quality. I thought it would matter more.


Final Plots and Summary

Plot One

Description One

I have picked this graphic to demonstrate which factors have the biggest impact on the quality of white wine. Seems like alcohol is a big factor. Probably most non-experts would agree on that as well ;-) The next graphic shows that this is not the whole truth, because there are some interdependencies between variables as well.

Plot Two

Description Two

As you can see, the density varies with alcohol and residual sugar. It looks like these are the two main factors for the density. Taking that into account, we can say that the density itself shouldn’t be given that much weight in a predictive model.

Plot Three

Description Three

This graphic sums up all the main aspects we can find out about the impact on the quality of white wines. We can see that the high quality wines tend to have a high alcohol concentration and less sugar, whereas the low quality wines show the opposite.


Reflection

After seeing the many variables that impact quality of wine, it was really hard to find out the correlations between them. I found the main factors that impact the quality of white wine, but there is still a lot we can find out about. There are a few more variables we could take into account. The next step would be to create a predictive model considering the interactions between the variables of the dataset.

The important decisions I made in the analysis were to find out the main variables that affect the quality of white wines, to analyze how they influence the quality and then to find interdependencies with other variables in the data set. It was also helpful to create factor variables, so that plots could be made for different qualities of white wine and show what distinguishes a good wine from a bad one.

Difficulties arose when I had to choose the right colors and the right alpha values as well as the right axes, especially when I wanted to find out the most influential variables. I first used the quality on the y-axis, which made most sense to me at first. Then using boxplots with quality on the x-axis made for better visualisations, in which signs of correlations could be identified more easily.

Learning R was very hard at first (because of the very different syntax in comparison to Python, JavaScript or others), but it then turned out to have a lot of patterns you can easily remember and use to get results quickly.

All in all, the observations I have made are already showing you insights, though there is still a lot to find out.