Analyzing white wines by Thorben Schlätzer

Short Intro

This project was written in R, using RStudio.

This data set contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine on a scale from 0 (very bad) to 10 (excellent). In the following, we will analyze which chemical properties influence the quality of white wines.

Univariate Plots Section

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00
##  Median : 5.200   Median :0.04300   Median : 34.00
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00
##  total.sulfur.dioxide    density             pH          sulphates
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800
##     alcohol         quality
##  Min.   : 8.00   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.40   Median :6.000
##  Mean   :10.51   Mean   :5.878
##  3rd Qu.:11.40   3rd Qu.:6.000
##  Max.   :14.20   Max.   :9.000

Our dataset consists of twelve variables and X as the descriptor for each wine, with almost 5,000 observations.
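The two printouts above are the kind of output that `str()` and `summary()` give for a data frame. The following is a minimal sketch of that inspection step, not the original code: it assumes base R only, types in the first three wines from the `str()` output above instead of reading the real file, and the file path in the comment is hypothetical.

```r
# First three wines, typed in from the str() output above (values as displayed
# there, so density is rounded). For the full analysis the CSV would be read
# from disk instead, e.g. read.csv("path/to/white-wines.csv")  # hypothetical path
wines <- read.csv(text = "X,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
1,7,0.27,0.36,20.7,0.045,45,170,1.001,3,0.45,8.8,6
2,6.3,0.3,0.34,1.6,0.049,14,132,0.994,3.3,0.49,9.5,6
3,8.1,0.28,0.4,6.9,0.05,30,97,0.995,3.26,0.44,10.1,6")

str(wines)      # column types and first values
summary(wines)  # min, quartiles, mean and max per column

# X is only a running row number, so it is not counted as a variable
variables <- setdiff(names(wines), "X")
length(variables)
```

With only three rows the summary statistics will of course differ from the ones printed above; the point is just the sequence of calls.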

Acidity variables

We have:

Fixed acidity variable explains: tartaric acid as g / dm^3
Short description: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

Volatile acidity variable explains: acetic acid as g / dm^3
Short description: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste

Citric acid variable explains: citric acid as g / dm^3
Short description: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

First, all of the acidity variables will be plotted.

##
##  3.8  3.9  4.2  4.4  4.5  4.6  4.7  4.8  4.9    5  5.1  5.2  5.3  5.4  5.5
##    1    1    2    3    1    1    5    9    7   24   23   28   27   28   31
##  5.6  5.7  5.8  5.9    6  6.1 6.15  6.2  6.3  6.4 6.45  6.5  6.6  6.7  6.8
##   71   88  121  103  184  155    2  192  188  280    1  225  290  236  308
##  6.9    7  7.1 7.15  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9    8  8.1  8.2
##  241  232  200    2  206  178  194  123  153   93   93   74   80   56   56
##  8.3  8.4  8.5  8.6  8.7  8.8  8.9    9  9.1  9.2  9.3  9.4  9.5  9.6  9.7
##   52   35   32   25   15   18   16   17    6   21    3   11    2    5    4
##  9.8  9.9   10 10.2 10.3 10.7 11.8 14.2
##    8    2    3    1    2    2    1    1

In the histogram, most values fall between slightly under 6 and slightly under 9. Let’s try using the 1% and 99% quantiles as limits and a smaller bin width to get a clearer view of the distribution.
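The trimming idea can be sketched as follows. This is a hedged base R illustration, not the code behind the plots in this report: `trim_to_quantiles` is a hypothetical helper name, and the data are simulated stand-ins (a bell-shaped bulk plus two extreme values) rather than the real fixed acidity column.

```r
# Hypothetical helper: keep only the values between two quantiles of x.
trim_to_quantiles <- function(x, probs = c(0.01, 0.99)) {
  limits <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  x[!is.na(x) & x >= limits[1] & x <= limits[2]]
}

# Simulated stand-in for an acidity variable (NOT the real data):
# 500 ordinary values plus two far-out wines.
set.seed(42)
acidity <- c(rnorm(500, mean = 6.8, sd = 0.8), 11.8, 14.2)

trimmed <- trim_to_quantiles(acidity)

# Narrow bins of width 0.1 over the trimmed range.
breaks <- seq(floor(min(trimmed)), ceiling(max(trimmed)), by = 0.1)

# plot = FALSE only computes the bin counts; see the note below for drawing.
h <- hist(trimmed, breaks = breaks, plot = FALSE)
range(trimmed)
```

Calling `plot(h, main = "Fixed acidity, 1% to 99% quantile", xlab = "g / dm^3")` then draws the histogram. Trimming changes only what is shown; the extreme wines remain in the data set.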

It looks like we have the same problem here. Let’s do the same thing we did with fixed acidity.

This shows us a clear distribution peaking at about 0.26 - 0.28.

Making better visualizations

All in all, the plots show us the distribution, but there is a better way to visualize the different variables. Let us investigate the distribution more closely by removing the first and last 1%, as well as setting the bin width to a lower number.

As expected, only a very small percentage of wines lies above the 9 g / dm^3 mark or under 5 g / dm^3.

This variable is more skewed to the right. As the summary tells us, the median and the mean are close together at about 0.26 to 0.28 g / dm^3.

This histogram confirms the earlier distribution, but it also lets us see the one peak at 0.49.

Conclusion for acidity levels

You can see that the distributions of the three acids look roughly similar. The next chapter will show how the different acids in the wines correlate with each other and how the quality of the white wine is influenced by them.

Residual sugar

variable explains: residual sugar as g / dm^3
short description: the amount of sugar remaining after fermentation stops. It’s rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet.

Let’s investigate how the variable is distributed:

## [1] 65.8

Looks like there is a strongly right-skewed distribution with only one wine considered as “sweet” (65.8 g/dm^3).
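A check like this can be written in a few lines. The sketch below is an assumption about how one might do it in base R, using only values that appear in this report: the first ten residual sugar entries from the `str()` output plus the maximum printed above. It is not the full data set, and the variable names are made up for the example.

```r
# Illustrative values only: first ten residual.sugar entries from the str()
# output plus the reported maximum of 65.8. Not the full data set.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5, 65.8)

sweet_threshold <- 45  # g / dm^3, taken from the variable description

max(residual_sugar)                       # the largest value
n_sweet <- sum(residual_sugar > sweet_threshold)
n_sweet                                   # how many wines count as "sweet"

# A mean well above the median is a quick hint at a right-skewed distribution.
mean(residual_sugar) > median(residual_sugar)
```

On the full data frame the same expressions would be applied to the `residual.sugar` column.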

Let’s look at the values that are most common (cutting out the first 1% and the last 1%):

This plot gives more detailed information about the distribution of the most common residual sugar values.

Chlorides in white wine

variable explains: sodium chloride as g / dm^3
short description: the amount of salt in the wine

You really don’t notice the salt in the wine. Let’s see how much the amount of salt per liter differs between wines. We use the 1% and 99% quantiles of the data to get a better view of the most common values.

As we have already seen for other variables, this distribution is skewed to the right, with most values in a range from 0.01 to 0.16. In the next chapter, we will have a look at how the quality is influenced by the salt.

Free sulfur dioxide & total sulfur dioxide

Free sulfur dioxide variable explains: free sulfur dioxide as mg / dm^3
short description: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.

Total sulfur dioxide variable explains: total sulfur dioxide as mg / dm^3
short description: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.

Obviously, the two variables seem to be correlated. Let’s plot the distribution:

Both variables look roughly normally distributed. In the next chapter, we will discuss which form of sulfur dioxide actually influences quality. Before that, let’s create a variable that contains “other sulfur dioxide” (total - free).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     4.0    78.0   100.0   103.1   125.0   331.0

Seems legit.
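The derived variable and the suspected correlation can be sketched like this. It is a minimal base R illustration under stated assumptions: the column name `other.sulfur.dioxide` is made up for the example, and only the first ten wines from the `str()` output are used, so the numbers will not match the summary above.

```r
# First ten wines from the str() output above (not the full data set).
so2 <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)
)

# "Other" sulfur dioxide = total minus free (assumed column name).
so2$other.sulfur.dioxide <- so2$total.sulfur.dioxide - so2$free.sulfur.dioxide

summary(so2$other.sulfur.dioxide)

# Sanity check: free SO2 is part of total SO2, so the difference cannot be negative.
stopifnot(all(so2$other.sulfur.dioxide >= 0))

# Pearson correlation between the two original variables.
r <- cor(so2$free.sulfur.dioxide, so2$total.sulfur.dioxide)
r
```

The non-negativity check is the same reasoning as above: a minimum of 4.0 in the full data means no wine has more free than total sulfur dioxide.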

Density

variable explains: density as g / cm^3
short description: the density of wine is close to that of water, depending on the percent alcohol and sugar content.