What are the most important physicochemical attributes associated with the perceived quality of red wine?
This dataset contains a total of 1599 rows and 12 columns.
Top 5 rows
For readability, the data is shown transposed.
Attribute | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
fixed.acidity | 7.400 | 7.800 | 7.800 | 11.200 | 7.400 |
volatile.acidity | 0.700 | 0.880 | 0.760 | 0.280 | 0.700 |
citric.acid | 0.000 | 0.000 | 0.040 | 0.560 | 0.000 |
residual.sugar | 1.900 | 2.600 | 2.300 | 1.900 | 1.900 |
chlorides | 0.076 | 0.098 | 0.092 | 0.075 | 0.076 |
free.sulfur.dioxide | 11.000 | 25.000 | 15.000 | 17.000 | 11.000 |
total.sulfur.dioxide | 34.000 | 67.000 | 54.000 | 60.000 | 34.000 |
density | 0.998 | 0.997 | 0.997 | 0.998 | 0.998 |
pH | 3.510 | 3.200 | 3.260 | 3.160 | 3.510 |
sulphates | 0.560 | 0.680 | 0.650 | 0.580 | 0.560 |
alcohol | 9.400 | 9.800 | 9.800 | 9.800 | 9.400 |
quality | 5.000 | 5.000 | 5.000 | 6.000 | 5.000 |
Each row of the dataset represents an observation of a red wine. The columns contain various objective physicochemical attributes of the wine as well as an average quality score.
Basic Statistics
Attribute | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|---|
fixed.acidity | 4.600 | 7.100 | 7.900 | 8.320 | 9.200 | 15.900 |
volatile.acidity | 0.120 | 0.390 | 0.520 | 0.528 | 0.640 | 1.580 |
citric.acid | 0.000 | 0.090 | 0.260 | 0.271 | 0.420 | 1.000 |
residual.sugar | 0.900 | 1.900 | 2.200 | 2.539 | 2.600 | 15.500 |
chlorides | 0.012 | 0.070 | 0.079 | 0.087 | 0.090 | 0.611 |
free.sulfur.dioxide | 1.000 | 7.000 | 14.000 | 15.870 | 21.000 | 72.000 |
total.sulfur.dioxide | 6.000 | 22.000 | 38.000 | 46.470 | 62.000 | 289.000 |
density | 0.990 | 0.996 | 0.997 | 0.997 | 0.998 | 1.004 |
pH | 2.740 | 3.210 | 3.310 | 3.311 | 3.400 | 4.010 |
sulphates | 0.330 | 0.550 | 0.620 | 0.658 | 0.730 | 2.000 |
alcohol | 8.400 | 9.500 | 10.200 | 10.420 | 11.100 | 14.900 |
quality | 3.000 | 5.000 | 6.000 | 5.636 | 6.000 | 8.000 |
The table above provides some high-level statistics for each variable in the dataset.
The first step towards understanding the relationship between wine quality and physicochemical attributes is to compute the correlations. However, a large correlation matrix is hard to read and decipher, so I created a visualisation for the correlation matrix.
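As a minimal sketch of what each cell of a correlation matrix contains, the Python snippet below computes the Pearson correlation coefficient by hand. The function name `pearson` is an assumption made for illustration, and the input is only the five rows shown in the "Top 5 rows" table, so the resulting values are a toy example and not the correlations of the full dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy input: the five rows from the "Top 5 rows" table (not the full dataset).
alcohol = [9.4, 9.8, 9.8, 9.8, 9.4]
volatile_acidity = [0.70, 0.88, 0.76, 0.28, 0.70]
quality = [5, 5, 5, 6, 5]

r_alcohol = pearson(alcohol, quality)
r_volatile = pearson(volatile_acidity, quality)
print(f"alcohol vs quality:          {r_alcohol:+.3f}")
print(f"volatile.acidity vs quality: {r_volatile:+.3f}")
```

A full correlation matrix simply repeats this calculation for every pair of columns. Note that the function divides by zero if either input is constant.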
Visualise Correlation Matrix
This correlation matrix visualisation uses both size and colour saturation to represent the magnitude of the correlations, and colour hue to represent their direction.
Using this correlation matrix, it is easy to see that alcohol and volatile.acidity have the strongest linear relationships with the quality of wine. I also found it interesting and somewhat surprising that residual sugar has almost no correlation with quality.
I wonder how good a linear model based on these two variables will be.
Dependent variable: quality
Term | Estimate (Std. Error) |
---|---|
alcohol | 0.314*** (0.016) |
volatile.acidity | -1.384*** (0.095) |
Constant | 3.095*** (0.184) |
Observations | 1,599 |
R2 | 0.317 |
Adjusted R2 | 0.316 |
Residual Std. Error | 0.668 (df = 1596) |
F Statistic | 370.379*** (df = 2; 1596) |
Note: *p<0.1; **p<0.05; ***p<0.01
The R^2 is only 0.317, which is definitely not good enough.
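To make the regression table more concrete, here is a small Python sketch that plugs the reported coefficients into the fitted equation and re-derives the adjusted R2 and the F statistic from R2, the number of observations and the number of predictors. All numbers are copied from the table above; the function names are assumptions made for illustration.

```python
# Values copied from the regression table above.
INTERCEPT = 3.095
COEF_ALCOHOL = 0.314
COEF_VOLATILE_ACIDITY = -1.384
R_SQUARED = 0.317
N_OBS = 1599
N_PREDICTORS = 2


def predict_quality(alcohol, volatile_acidity):
    """Predicted quality from the fitted two-variable linear model."""
    return (INTERCEPT
            + COEF_ALCOHOL * alcohol
            + COEF_VOLATILE_ACIDITY * volatile_acidity)


def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def f_statistic(r2, n, p):
    """Overall F statistic implied by R^2 (df = p; n - p - 1)."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))


# First wine in the "Top 5 rows" table: alcohol 9.4, volatile.acidity 0.70.
first_row_prediction = predict_quality(alcohol=9.4, volatile_acidity=0.70)
adj_r2 = adjusted_r_squared(R_SQUARED, N_OBS, N_PREDICTORS)
f_stat = f_statistic(R_SQUARED, N_OBS, N_PREDICTORS)

print(f"predicted quality for row 1: {first_row_prediction:.2f}")
print(f"adjusted R^2: {adj_r2:.3f}")
print(f"F statistic:  {f_stat:.1f}")
```

Because the table reports R2 rounded to three decimals, the recomputed F statistic agrees with the reported value only approximately.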
Linear regression works well only if the relationship is actually linear. Given the poor performance of our simple linear model, I need to turn my attention to non-linear relationships.
Next, I will explore each physicochemical attribute individually. Using visualisation, I hope to uncover some non-linear relationships.
The goal is to explore the factors that have a strong influence on wine quality. Hence, it is natural to look at the distribution of quality first.
Histogram of Wine Quality
Wine Quality Frequency
Quality | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
Freq | 0 | 0 | 10 | 53 | 681 | 638 | 199 | 18 |
Most of the wines in our sample are of medium quality, with a rating of 5 or 6.
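The frequency table is enough to check this claim numerically. The Python sketch below recovers the total count, the mean, the median and the share of wines rated 5 or 6 from the counts alone. The counts are copied from the table above; the helper name `median_from_counts` is an assumption made for illustration.

```python
# Counts copied from the "Wine Quality Frequency" table.
freq = {1: 0, 2: 0, 3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}

total = sum(freq.values())
mean_quality = sum(rating * count for rating, count in freq.items()) / total


def median_from_counts(counts):
    """Median of a sample given as {value: count}."""
    n = sum(counts.values())
    # 1-based positions of the middle observation(s); equal when n is odd.
    lower_pos, upper_pos = (n + 1) // 2, (n + 2) // 2

    def value_at(position):
        seen = 0
        for value in sorted(counts):
            seen += counts[value]
            if seen >= position:
                return value

    return (value_at(lower_pos) + value_at(upper_pos)) / 2


median_quality = median_from_counts(freq)
share_5_or_6 = (freq[5] + freq[6]) / total

print(f"total wines: {total}")
print(f"mean quality: {mean_quality:.3f}, median quality: {median_quality}")
print(f"share rated 5 or 6: {share_5_or_6:.1%}")
```

The total, mean and median recovered this way can be compared with the row count and the quality row of the Basic Statistics table.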
Now, let’s continue exploring the other variables in the dataset.
Histogram for fixed.acidity
The fixed acidity has a distribution that is slightly skewed to the right. There are a few outliers with very high values.
Density Distribution for fixed.acidity
It looks like higher-quality wines tend to have higher fixed acidity, as their distribution is shifted further to the right compared with the others.
Boxplots fixed.acidity
It seems that both the median and the 75% quantile increase as quality increases. However, the highest- and lowest-quality wines are not exactly consistent with this statement.
quality | median | 75% |
---|---|---|
3 | 7.50 | 9.875 |
4 | 7.50 | 8.400 |
5 | 7.80 | 8.900 |
6 | 7.90 | 9.400 |
7 | 8.80 | 10.100 |
8 | 8.25 | 10.225 |
However, this could be caused by their much smaller sample sizes. As we can see, the variance of both groups is much larger than that of the rest.
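The exceptions to the upward trend can be located mechanically. The Python sketch below scans the median and 75% quantile columns of the table above and reports the quality levels at which the value drops compared with the previous level. The numbers are copied from the table; the function name `trend_breaks` is an assumption made for illustration.

```python
# (median, 75% quantile) of fixed.acidity per quality rating, from the table.
table = {
    3: (7.50, 9.875),
    4: (7.50, 8.400),
    5: (7.80, 8.900),
    6: (7.90, 9.400),
    7: (8.80, 10.100),
    8: (8.25, 10.225),
}


def trend_breaks(values_by_quality):
    """Quality levels whose value is lower than that of the previous level."""
    levels = sorted(values_by_quality)
    return [
        current
        for previous, current in zip(levels, levels[1:])
        if values_by_quality[current] < values_by_quality[previous]
    ]


medians = {quality: row[0] for quality, row in table.items()}
upper_quartiles = {quality: row[1] for quality, row in table.items()}

median_breaks = trend_breaks(medians)
upper_quartile_breaks = trend_breaks(upper_quartiles)

print("median drops at quality:", median_breaks)
print("75% quantile drops at quality:", upper_quartile_breaks)
```

Both breaks involve the extreme ratings (3 and 8), which are also the two smallest groups in the frequency table.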
Histogram for Volatile Acidity
The distribution of volatile acidity is bimodal, with the first mode around 0.4 and the second around 0.6. From the density plot, we can see that the second mode is largely contributed by wines with a rating of 5.
Density Distribution for Volatile Acidity
It is quite clear that higher quality wine tends to have lower volatile acidity.
I wonder if there is any connection between volatile and fixed acidity, since both seem to have some relationship with quality.
Volatile Acidity vs Fixed Acidity
I failed to see any interesting pattern between fixed acidity and volatile acidity. However, we can definitely see a divergence between the high- and low-quality wines.
Histogram for Citric Acid
There are a lot of wines that don’t have any citric acid at all. Other than the spike just under 0.5, the distribution appears to be quite uniform until 0.5, where it starts to fade off.
Density Distribution for Citric Acid
Boxplots for Citric Acid
Both the density distribution and the boxplots suggest a positive relationship between citric acid and quality.
Histogram for Residual Sugar
Distribution of Residual Sugar by Quality Rating
There isn’t much going on here. The distributions look normal, perhaps with a slightly long tail and a few outliers.
Quality vs Residual Sugar (limit from 1 to 4)
I made two boxplots this time. The second one is created for residual.sugar smaller than 3.5. The majority of wines have residual sugar between 1.5 and 2.5.
It is somewhat surprising to me that I don’t see any relationship between residual sugar and quality here, as I personally prefer a slightly sweeter taste.
Histogram for Chlorides
Again, a very standard distribution, but the outliers might be causing some problems. Let me discard the values above 0.2 and see what happens.
Histogram for Chlorides (<0.2)
It still looks normally distributed.
Density for Chlorides
It looks like the lowest-quality wine has the largest variance in chloride content. The distributions for the other quality levels look almost identical.
Quality vs Chlorides
Similar to residual sugar, there are a few outliers with large chloride amounts in our sample. I generated a second boxplot with 0.2 as the cutoff point.
We can see that the median value for chlorides decreases as the quality improves.
Histogram for free.sulfur.dioxide
The distribution is not normal; it looks more like a chi-square distribution.
Density Distributions for free.sulfur.dioxide
Not much can be said about this graph. No pattern is obvious.
Boxplots free.sulfur.dioxide
As with the previous plot, it is very hard to say anything about this one. No pattern is obvious.
Histogram for alcohol
This one also looks like a chi-square distribution, definitely not normal. I can already see that the higher-quality wines (darker fill) cluster to the right.
Histogram for alcohol
Things are even more obvious now. Higher quality wine tends to have higher alcohol content.
Quality vs alcohol
While it is quite clear that high-quality wine (with a rating of 7 or 8) tends to have higher alcohol content, it is not so clear for the mid-to-low range. In fact, wine with a rating of 5 has the lowest alcohol content as measured by quantiles.
The correlation between the two is 0.4761663
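To put this number in context: in a linear regression with a single predictor, R2 equals the squared correlation. The short Python sketch below uses that fact to estimate how much of the variation in quality alcohol alone would explain, and compares it with the R2 of the two-variable model reported earlier. Both input values are copied from the text; the variable names are assumptions made for illustration.

```python
# Correlation between alcohol and quality, as reported in the text.
R_ALCOHOL_QUALITY = 0.4761663
# R2 of the alcohol + volatile.acidity model, from the regression table.
R2_TWO_VARIABLE_MODEL = 0.317

# With one predictor, R^2 of the linear fit is the squared correlation.
r2_alcohol_only = R_ALCOHOL_QUALITY ** 2
gain_from_second_variable = R2_TWO_VARIABLE_MODEL - r2_alcohol_only

print(f"R^2 with alcohol only:        {r2_alcohol_only:.3f}")
print(f"gain from volatile.acidity:   {gain_from_second_variable:.3f}")
```

So a correlation that looks fairly strong still leaves most of the variation in quality unexplained by alcohol on its own.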
In the previous section, I found that alcohol, volatile acidity and citric acid seem to be the best predictors of quality. I wonder whether combining those three will give me a good explanation.
First, let me try two variables at a time.
Low-quality wines are located towards the top left of the plot and high-quality ones towards the bottom right. Using the diverging colour scheme, the separation between low and high quality is quite clear.
Again, there is quite good separation between the teal and brown colours.
Not as good as the previous two, but you can still see the pattern.
Visually, I can’t say there is much improvement in separation over the previous three. We can see the separation of the high-quality and low-quality wines by looking at the colour of the dots.
However, there is still a lot of unexplained variance. When the quality ratings of two wines are close to each other, it becomes much harder to tell them apart.
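The idea behind a diverging colour scheme can be sketched in a few lines of Python: ratings below the midpoint shade towards one hue, ratings above it towards another, and ratings near the midpoint stay close to neutral. The specific RGB values, the assumption that brown marks low quality and teal marks high quality, and the function names are all illustrative choices, not the palette used for the plots above.

```python
# Illustrative endpoint colours (assumed, not taken from the original plots).
BROWN = (166, 97, 26)      # assumed colour for the lowest rating
NEUTRAL = (245, 245, 245)  # assumed colour at the midpoint
TEAL = (1, 133, 113)       # assumed colour for the highest rating


def lerp(start, end, t):
    """Linear interpolation between two RGB triples, t in [0, 1]."""
    return tuple(round(a + (b - a) * t) for a, b in zip(start, end))


def diverging_colour(rating, low=3, high=8):
    """Map a rating to an RGB colour on a brown-neutral-teal scale."""
    rating = min(max(rating, low), high)  # clamp to the rating range
    mid = (low + high) / 2
    if rating <= mid:
        return lerp(BROWN, NEUTRAL, (rating - low) / (mid - low))
    return lerp(NEUTRAL, TEAL, (rating - mid) / (high - mid))


# Hex colour for every rating that occurs in the dataset (3 to 8).
palette = {r: "#%02x%02x%02x" % diverging_colour(r) for r in range(3, 9)}
for rating, colour in palette.items():
    print(rating, colour)
```

Ratings 5 and 6 sit on either side of the midpoint and therefore receive the two palest colours, which is one reason neighbouring ratings are hard to tell apart in the scatterplots.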
Correlation Matrix for red wine attributes
This plot provides a very compact view of the correlations between the variables in the dataset. Both the size and the transparency of the circles are used to encode the magnitude of the correlation. The colour hue is used to represent the direction of the correlation. Using this visualisation, it is easy to see that alcohol and volatile.acidity have the strongest linear relationships with the quality of wine.
I chose this plot because it:
* Draws comparisons.
* Enables the reader to digest large amounts of information.
Boxplots of alcohol content for various quality of wine
Given the relatively strong linear correlation (0.476) that was discovered, I was somewhat surprised to see this box plot. The lowest alcohol content, as measured by the quantiles, appears at the quality rating of 5. It almost seems that the positive linear relationship only applies to mid- to high-quality wine.
I chose this plot because it:
* Identifies trends.
* Clarifies a gap between perception and reality.
Explaining Wine Quality with Alcohol and Volatile Acidity
Using the diverging colour scheme, one can see a clear separation between lower-quality and higher-quality wine.
High-quality wine tends to have high alcohol content and low volatile acidity.
However, there is still a lot of unexplained variation. There is no clear separation between wines that have close ratings. For example, you can see that the light brown and teal dots are scattered somewhat randomly around the centre of the plot. It is hard to tell them apart when the ratings are close.
I chose this plot because it:
* Draws comparisons.
* Explains a complicated finding.
I started the data exploration by calculating the correlation matrix. In particular, I was interested to see which variables have the strongest linear relationship with wine quality. I found that, among all of the variables, alcohol content and volatile acidity have the strongest linear relationships with wine quality.
Using this information, I ran a simple linear regression on the dataset and found that those two variables explain only about 32% of the total variation in quality.
At this point, I suspected that there might be some non-linear relationship between quality and some of the variables. So I plotted one histogram, and one boxplot against quality, for each variable. Somewhat to my surprise, I didn’t find any obvious and strong relationship. And despite the relatively strong correlation, I found that the relationship between alcohol and quality is not that linear.
Finally, I made some scatterplots with the quality encoded using a diverging colour palette. It is quite clear to me that there is a separation between high-quality (rating above 5) and low-quality (rating below 5) wine. However, one can also see a lot of noise.
After the exploration, I think it is quite likely that there simply isn’t enough relevant data for accurate prediction. The qualities are measured as the average of ratings by at least three wine experts. One might argue that, when this average is taken from a large number of experts’ ratings, it forms a somewhat objective measurement, thanks to the Central Limit Theorem. However, when only a small number of experts’ ratings are used in deriving the wine quality, the rating becomes very subjective and heavily influenced by the preferences of the individual judges. This is especially true in our case because the rating is not even derived from the same group of experts for all the wines. An expert might systematically rate wines lower than the other experts, or vice versa.
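A rough back-of-the-envelope calculation illustrates why a small panel is a problem. The Python sketch below assumes that experts rate independently with the same spread, which is a simplification, and uses a hypothetical rater standard deviation of 1 rating point. It shows how the standard error of the average rating shrinks with panel size, and how far a single systematically harsh expert moves the average. All names and input values are assumptions made for illustration.

```python
from math import sqrt


def standard_error(rater_sd, n_raters):
    """Standard error of the mean of n independent ratings with equal spread."""
    return rater_sd / sqrt(n_raters)


def shift_from_one_biased_rater(bias, n_raters):
    """Change in the panel average when one rater is off by `bias` points."""
    return bias / n_raters


RATER_SD = 1.0     # hypothetical spread of a single expert's rating
HARSH_BIAS = -1.0  # hypothetical expert who rates one point lower

for panel_size in (3, 10, 30):
    se = standard_error(RATER_SD, panel_size)
    shift = shift_from_one_biased_rater(HARSH_BIAS, panel_size)
    print(f"{panel_size:>2} raters: standard error {se:.2f}, "
          f"shift from one harsh rater {shift:+.2f}")

se_3 = standard_error(RATER_SD, 3)
se_30 = standard_error(RATER_SD, 30)
shift_3 = shift_from_one_biased_rater(HARSH_BIAS, 3)
```

Under these assumptions, a panel of three leaves a noise level of more than half a rating point, which is comparable to the residual standard error of the linear model reported earlier.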
A much better prediction might be possible if more granular details were available in the dataset, for example, each expert’s rating of each wine instead of just a simple average.