Explore Red Wine Dataset

Louis Tian

Key Question

What are the most important physicochemical attributes associated with the perceived quality of red wine?

# A Quick Overview of the Dataset

This dataset contains a total of 1599 rows and 12 columns.

Top 5 rows
For readability, the table is transposed: each column is one wine.

| Attribute | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| fixed.acidity | 7.400 | 7.800 | 7.800 | 11.200 | 7.400 |
| volatile.acidity | 0.700 | 0.880 | 0.760 | 0.280 | 0.700 |
| citric.acid | 0.000 | 0.000 | 0.040 | 0.560 | 0.000 |
| residual.sugar | 1.900 | 2.600 | 2.300 | 1.900 | 1.900 |
| chlorides | 0.076 | 0.098 | 0.092 | 0.075 | 0.076 |
| free.sulfur.dioxide | 11.000 | 25.000 | 15.000 | 17.000 | 11.000 |
| total.sulfur.dioxide | 34.000 | 67.000 | 54.000 | 60.000 | 34.000 |
| density | 0.998 | 0.997 | 0.997 | 0.998 | 0.998 |
| pH | 3.510 | 3.200 | 3.260 | 3.160 | 3.510 |
| sulphates | 0.560 | 0.680 | 0.650 | 0.580 | 0.560 |
| alcohol | 9.400 | 9.800 | 9.800 | 9.800 | 9.400 |
| quality | 5.000 | 5.000 | 5.000 | 6.000 | 5.000 |

Each row of the dataset represents one observation of a red wine. The columns contain various objective physicochemical attributes of the wine, as well as its average quality score.
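As a minimal sketch of how a transposed preview like the one above could be produced, the following Python snippet parses a small inline excerpt (three of the twelve columns, first two wines, values copied from the table) and flips rows into columns. The inline CSV text and the variable names are assumptions for illustration; the report itself does not show its loading code.

```python
import csv
import io

# Inline excerpt for illustration only: three columns, first two wines.
excerpt = """fixed.acidity,volatile.acidity,quality
7.4,0.70,5
7.8,0.88,5
"""

rows = list(csv.reader(io.StringIO(excerpt)))
header, records = rows[0], rows[1:]

# zip(*records) turns a list of rows into a list of columns.
columns = [list(map(float, col)) for col in zip(*records)]
transposed = dict(zip(header, columns))

# One attribute per line, one value per wine.
for name, values in transposed.items():
    print(name, *values)
```

Transposing keeps the twelve attribute names in a narrow first column, which is easier to read than twelve wide column headers.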

Basic Statistics

| Attribute | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| fixed.acidity | 4.600 | 7.100 | 7.900 | 8.320 | 9.200 | 15.900 |
| volatile.acidity | 0.120 | 0.390 | 0.520 | 0.528 | 0.640 | 1.580 |
| citric.acid | 0.000 | 0.090 | 0.260 | 0.271 | 0.420 | 1.000 |
| residual.sugar | 0.900 | 1.900 | 2.200 | 2.539 | 2.600 | 15.500 |
| chlorides | 0.012 | 0.070 | 0.079 | 0.087 | 0.090 | 0.611 |
| free.sulfur.dioxide | 1.000 | 7.000 | 14.000 | 15.870 | 21.000 | 72.000 |
| total.sulfur.dioxide | 6.000 | 22.000 | 38.000 | 46.470 | 62.000 | 289.000 |
| density | 0.990 | 0.996 | 0.997 | 0.997 | 0.998 | 1.004 |
| pH | 2.740 | 3.210 | 3.310 | 3.311 | 3.400 | 4.010 |
| sulphates | 0.330 | 0.550 | 0.620 | 0.658 | 0.730 | 2.000 |
| alcohol | 8.400 | 9.500 | 10.200 | 10.420 | 11.100 | 14.900 |
| quality | 3.000 | 5.000 | 6.000 | 5.636 | 6.000 | 8.000 |

The table above provides high-level summary statistics for each variable in the dataset.

Correlation Matrix

The first step towards understanding the relationship between wine quality and physicochemical attributes is to compute the correlations. However, a large correlation matrix is hard to read and decipher, so I created a visualisation for the correlation matrix.

Figure: Correlation matrix visualisation

This correlation matrix visualisation uses both size and colour saturation to represent the magnitude of each correlation, and colour hue to represent its direction.

Using this correlation matrix, it is easy to see that alcohol and volatile.acidity have the strongest linear relationships with wine quality. I also found it interesting, and somewhat surprising, that residual sugar shows almost no correlation with quality.
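To make the quantity behind each cell of the matrix concrete, here is a minimal sketch of the Pearson correlation coefficient in plain Python. It uses only the five wines from the preview table, which is far too few for a meaningful estimate, so the numbers it prints are not the values shown in the matrix. The function name `pearson` is an assumption for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# First five rows of the dataset (from the preview table); used here
# only to show the mechanics, not to reproduce the reported values.
alcohol = [9.4, 9.8, 9.8, 9.8, 9.4]
volatile_acidity = [0.70, 0.88, 0.76, 0.28, 0.70]
quality = [5, 5, 5, 6, 5]

r_alcohol = pearson(alcohol, quality)
r_volatile = pearson(volatile_acidity, quality)
print(round(r_alcohol, 3), round(r_volatile, 3))
```

Even on these five rows the signs agree with the full matrix: positive for alcohol and negative for volatile acidity.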

I wonder how good a linear model based on these two variables will be.

| | Dependent variable: quality |
|---|---|
| alcohol | 0.314*** (0.016) |
| volatile.acidity | -1.384*** (0.095) |
| Constant | 3.095*** (0.184) |
| Observations | 1,599 |
| R2 | 0.317 |
| Adjusted R2 | 0.316 |
| Residual Std. Error | 0.668 (df = 1596) |
| F Statistic | 370.379*** (df = 2; 1596) |

Note: standard errors in parentheses; *p<0.1; **p<0.05; ***p<0.01

The R^2 is only 0.317, which is definitely not good enough.
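As a rough illustration of what the fitted model says, the sketch below plugs the first wine from the preview table into the regression equation and then checks that the reported adjusted R2 and F statistic follow from R2, the number of observations and the number of predictors. The coefficients are copied from the regression table; the function and variable names are assumptions for illustration.

```python
# Coefficients copied from the regression table.
INTERCEPT = 3.095
B_ALCOHOL = 0.314
B_VOLATILE = -1.384

def predict_quality(alcohol, volatile_acidity):
    """Fitted value of the two-variable linear model."""
    return INTERCEPT + B_ALCOHOL * alcohol + B_VOLATILE * volatile_acidity

# First wine in the preview table: alcohol 9.4, volatile acidity 0.70, quality 5.
fitted = predict_quality(9.4, 0.70)
residual = 5 - fitted
print(f"fitted = {fitted:.3f}, residual = {residual:.3f}")

# Consistency check of the summary lines: n observations, p predictors.
n, p, r2 = 1599, 2, 0.317
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
f_statistic = (r2 / p) / ((1 - r2) / (n - p - 1))
print(f"adjusted R2 = {adjusted_r2:.3f}, F = {f_statistic:.1f}")
```

Because the check starts from the rounded R2 of 0.317, the recomputed F statistic agrees with the table only approximately.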

Linear regression only performs well if the relationship is actually linear. Given the poor performance of this simple linear model, I need to turn my attention to non-linear relationships.

Next, I will explore each physicochemical attribute individually. Using visualisation, I hope to uncover some non-linear relationships.

# Univariate and Bivariate Plots


Wine Quality

The goal is to explore the factors that have a strong influence on wine quality. Hence, it is natural to look at the distribution of quality first.

Figure: Histogram of wine quality

Wine Quality Frequency

| Quality | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Freq | 0 | 0 | 10 | 53 | 681 | 638 | 199 | 18 |

Most of the wines in our sample are of medium quality, with a rating of 5 or 6 (1,319 of the 1,599 wines).
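The frequency table can be cross-checked against the summary statistics for quality. The sketch below expands the counts back into one rating per wine and recomputes the total, mean and quartiles with Python's standard library. The `method="inclusive"` setting is chosen on the assumption that it interpolates like R's default quantile type; the variable names are illustrative.

```python
from statistics import mean, quantiles

# Frequency of each quality rating, copied from the frequency table.
freq = {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}

# Expand the counts back into one rating per wine.
ratings = [q for q, count in freq.items() for _ in range(count)]

total = len(ratings)
share_mid = (freq[5] + freq[6]) / total  # share of wines rated 5 or 6

# Assumption: "inclusive" mirrors R's default quantile interpolation.
q1, median, q3 = quantiles(ratings, n=4, method="inclusive")

print(total, round(mean(ratings), 3), q1, median, q3, round(share_mid, 3))
```

The recomputed total, mean and quartiles can then be compared line by line with the quality row of the basic statistics table.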

Now, let’s continue exploring the other variables in the dataset.

Fixed Acidity

Figure: Histogram for fixed.acidity

Fixed acidity has a distribution that is slightly skewed to the right. There are a few outliers with very high values.

Figure: Density distribution for fixed.acidity

It looks like higher-quality wines tend to have higher fixed acidity, as their distributions extend further to the right compared with the others.

Figure: Boxplots of fixed.acidity

It seems that both the median and the 75% quantile increase as quality increases. However, the highest- and lowest-quality wines are not exactly consistent with this statement.

| quality | median | 75% |
|---|---|---|
| 3 | 7.50 | 9.875 |
| 4 | 7.50 | 8.400 |
| 5 | 7.80 | 8.900 |
| 6 | 7.90 | 9.400 |
| 7 | 8.80 | 10.100 |
| 8 | 8.25 | 10.225 |

However, this could be caused by the much smaller sample sizes of these two groups. As we can see, the variance of both is much larger than that of the rest.
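The exceptions to the upward trend can be located mechanically. The sketch below copies the medians and 75% quantiles from the table, adds the group sizes from the quality frequency table, and lists the quality levels at which a statistic drops relative to the level just below. The helper name `breaks_in_trend` is an assumption for illustration.

```python
# Median and 75% quantile of fixed.acidity per quality level, with group
# sizes taken from the quality frequency table.
by_quality = {
    3: {"median": 7.50, "q75": 9.875, "n": 10},
    4: {"median": 7.50, "q75": 8.400, "n": 53},
    5: {"median": 7.80, "q75": 8.900, "n": 681},
    6: {"median": 7.90, "q75": 9.400, "n": 638},
    7: {"median": 8.80, "q75": 10.100, "n": 199},
    8: {"median": 8.25, "q75": 10.225, "n": 18},
}

def breaks_in_trend(stat):
    """Quality levels whose statistic is lower than at the level just below."""
    levels = sorted(by_quality)
    return [
        hi for lo, hi in zip(levels, levels[1:])
        if by_quality[hi][stat] < by_quality[lo][stat]
    ]

median_breaks = breaks_in_trend("median")
q75_breaks = breaks_in_trend("q75")
print(median_breaks, q75_breaks)

# Sizes of the two extreme groups involved in those breaks.
print([by_quality[q]["n"] for q in (3, 8)])
```

Both breaks involve one of the two extreme ratings (the step from 3 to 4 and the step from 7 to 8), which are also the two smallest groups.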

Volatile Acidity

Figure: Histogram for volatile acidity

Volatile acidity has a bimodal distribution, with the first mode around 0.4 and the second around 0.6. From the density plot, we can see that the second mode comes largely from wines with a rating of 5.

Figure: Density distribution for volatile acidity

It is quite clear that higher quality wine tends to have lower volatile acidity.

Volatile Acidity vs. Fixed Acidity

I wonder whether there is any connection between volatile and fixed acidity, since both seem to have some relationship with quality.

Figure: Volatile acidity vs. fixed acidity

I failed to see any interesting pattern between fixed acidity and volatile acidity. However, we can definitely see a divergence between the high- and low-quality wines.

Citric Acid

Figure: Histogram for citric acid

A lot of the wines do not contain any citric acid at all. Other than the spike just under 0.5, the distribution appears to be quite uniform up to 0.5, where it starts to fade off.

Figure: Density distribution for citric acid

Figure: Boxplots for citric acid

Both the density distributions and the boxplots suggest a positive relationship between citric acid and quality.

Residual Sugar

Figure: Histogram for residual sugar

Figure: Distribution of residual sugar by quality rating

There isn’t much going on here. The distributions look roughly normal, with a slightly long tail and a few outliers.

Figure: Quality vs. residual sugar (limit from 1 to 4)

I made two boxplots this time; the second one is restricted to residual.sugar smaller than 3.5. The majority of wines have a residual sugar between 1.5 and 2.5.

It is somewhat surprising to me that I don’t see any relationship between residual sugar and quality here, as I would personally prefer a slightly sweeter taste.

Chlorides

Figure: Histogram for chlorides

Again, a very standard distribution, but the outliers might cause some problems. Let me try discarding the values above 0.2 and see what happens.

Figure: Histogram for chlorides (<0.2)

It still looks normally distributed.
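To show why a cutoff can matter, here is a small sketch of the filtering step. The sample is only illustrative: the chloride values of the first five wines from the preview table plus the dataset maximum of 0.611 from the summary table. It compares the mean and median before and after discarding values above 0.2; the variable names are assumptions.

```python
from statistics import mean, median

# Chloride values of the first five wines (preview table) plus the dataset
# maximum of 0.611 (summary table); a tiny illustrative sample, not the full data.
chlorides = [0.076, 0.098, 0.092, 0.075, 0.076, 0.611]

CUTOFF = 0.2  # same cutoff as in the text
kept = [value for value in chlorides if value <= CUTOFF]

print(f"mean:   {mean(chlorides):.3f} -> {mean(kept):.3f}")
print(f"median: {median(chlorides):.3f} -> {median(kept):.3f}")
```

In this toy sample a single extreme value moves the mean far more than the median, which is one reason to re-plot with the outliers removed.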

Figure: Density for chlorides

It looks like the lowest-quality wines have the largest variance in chloride content. The distributions for the other quality levels look almost identical.

Figure: Quality vs. chlorides

Similar to residual sugar, there are a few outliers with large chloride amounts in our sample, so I generated a second boxplot with 0.2 as the cutoff point.

We can see that the median chloride value decreases as quality improves.

Free Sulfur Dioxide

Figure: Histogram for free.sulfur.dioxide

The distribution is not normal; it looks more like a chi-square distribution.

Figure: Density distributions for free.sulfur.dioxide

Not much can be said about this graph; no pattern is obvious.

Figure: Boxplots of free.sulfur.dioxide

As with the previous plot, it is very hard to say anything about it; no pattern is obvious.

Alcohol

Figure: Histogram for alcohol

This one also looks like a chi-square distribution, and is definitely not normal. I can already see that the higher-quality wines (darker fill) cluster to the right.

Figure: Histogram for alcohol

Things are even more obvious now: higher-quality wines tend to have a higher alcohol content.

Figure: Quality vs. alcohol

While it is quite clear that high-quality wines (with a rating of 7 or 8) tend to have a higher alcohol content, it is not so clear for the mid-to-low range. In fact, wines with a rating of 5 have the lowest alcohol content as measured by quantiles.

The correlation between the two is 0.476.

# Multivariate Plots Section

In the previous section, I found that alcohol, volatile acidity and citric acid seem to be the best predictors of quality. I wonder whether combining those three will give me a good explanation.

First, let me try two variables at a time.

Low-quality wines are located towards the top left of the plot and high-quality ones towards the bottom right. With the diverging colour scheme, the separation between low and high quality is quite clear.

Again, there is quite good separation between the teal and brown colours.

This one is not as good as the previous two, but the pattern is still visible.

Now, how about combining all three?

Visually, I can’t say there is much improvement in separation over the previous three plots. We can see the separation of the high-quality and low-quality wines by looking at the colour of the dots.

However, there is still a lot of unexplained variance. When the quality ratings are close to each other, it becomes much harder to tell the wines apart.

# Final Plots and Summary

Plot One

Figure: Correlation matrix for red wine attributes

This plot provides a very compact visualisation of the correlations between the variables in the dataset. Both the size and the transparency of each circle encode the magnitude of the correlation, and the colour hue represents its direction. Using this visualisation, it is easy to see that alcohol and volatile.acidity have the strongest linear relationships with wine quality.

I chose this plot because it:

* Draws comparisons.
* Enables the reader to digest large amounts of information.

Plot Two

Figure: Boxplots of alcohol content by wine quality

Given the relatively strong linear correlation (0.476) that was discovered, I was somewhat surprised to see this box plot. The lowest alcohol content, as measured by quantiles, appears at the quality rating of 5. It almost seems that the positive linear relationship only applies to mid- to high-quality wines.

I chose this plot because it:

* Identifies trends.
* Clarifies a gap between perception and reality.

Plot Three

Figure: Explaining wine quality with alcohol and volatile acidity

With the divergent colour scheme, one can see a clear separation between lower-quality and higher-quality wines.

High-quality wines tend to have a high alcohol content and low volatile acidity.

However, there is still a lot of unexplained variation. There is no clear separation between wines that have close ratings. For example, you can see that the light brown and teal dots are scattered somewhat randomly around the centre of the plot. It is hard to tell them apart when the ratings are close.

I chose this plot because it:

* Draws comparisons.
* Explains a complicated finding.


# Reflection

I started the data exploration by calculating the correlation matrix. In particular, I was interested to see which variables have a strong linear relationship with wine quality. I found that, among all of the variables, alcohol content and volatile acidity have the strongest linear relationships with wine quality.

Using this information, I fitted a simple linear regression and found that those two variables explain only about 32% of the total variation in quality.

At this point, I suspected that there might be some non-linear relationship between quality and some of the variables. So I plotted a histogram and a boxplot against quality for each of the variables. Somewhat to my surprise, I didn’t find any obvious, strong relationship. And despite the relatively strong correlation, I found that the relationship between alcohol and quality is not that linear.

Finally, I made some scatterplots with quality encoded using a divergent colour palette. It is quite clear to me that there is a separation between high-quality wines (rating above 5) and low-quality wines (rating below 5). However, one can also see a lot of noise.

After the exploration, I think it is quite likely that there simply isn’t enough relevant data for accurate prediction. The quality scores are measured as the average of ratings by at least three wine experts. One might argue that, when this average is taken over a large number of experts’ ratings, it forms a somewhat objective measurement, thanks to the Central Limit Theorem. However, when only a small number of experts’ ratings are used to derive the wine quality, the rating becomes very subjective and heavily influenced by the preferences of the individual judges. This is especially true in our case because the ratings are not even derived from the same group of experts for all the wines. An expert might systematically rate wines lower than the other experts, or vice versa.
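The argument about panel size can be illustrated with a small simulation. This is only a sketch under loudly stated assumptions: a single wine with a hypothetical "true" score of 6.0, and judges whose ratings are independent Gaussian noise around it with a hypothetical standard deviation of 1.0. It does not model a judge who is systematically harsher than the others, and none of these numbers come from the dataset.

```python
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the sketch is reproducible

TRUE_QUALITY = 6.0   # hypothetical "objective" score of one wine
JUDGE_NOISE = 1.0    # hypothetical spread of an individual judge's rating

def averaged_rating(n_judges):
    """Average of n_judges noisy ratings of the same wine."""
    return mean(random.gauss(TRUE_QUALITY, JUDGE_NOISE) for _ in range(n_judges))

def spread_of_average(n_judges, n_trials=2000):
    """Standard deviation of the averaged rating over repeated panels."""
    return pstdev(averaged_rating(n_judges) for _ in range(n_trials))

spread_small = spread_of_average(3)
spread_large = spread_of_average(30)
print(f"3 judges:  {spread_small:.2f}")
print(f"30 judges: {spread_large:.2f}")
```

Under these assumptions the spread of the averaged rating shrinks roughly with the square root of the panel size, so a three-judge average remains much noisier than a thirty-judge one.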

A much better prediction might be possible if more granular details were available in the dataset, for example, each expert’s rating of each wine instead of just a simple average.