Red Wine Discoveries

Steven Diamond
4 min read · Jun 29, 2020

So, my data science study is complete, but what did I learn? As data scientists, we need to ask this question at the completion of every study. Sometimes the answer is surprising, sometimes it’s confounding, sometimes obvious and sometimes just inconclusive. To find nuggets of understanding, we must be open to discovery at each step in the process, so that we can gather insights even when our modeling doesn’t do the job. In fact, a model’s failure can be a critical finding.

Data scientists need to expect and embrace that some of our conclusions are immediately useful, while others matter mainly as we plan future study. In the study of red wine pricing that I just completed, I learned things from each step in the process. Some I can put to use today at the local wine shop, and others will be useful for next steps.

Gathering Data
Data availability, in and of itself, can be a learning experience. In this case, I learned that wine data is hard and/or expensive to access. It is also quite spotty when you do find it — very arbitrary as to which wines are reviewed, little agreement on what to call varieties/regions and very inconsistent as to what data is included.

To improve the study in the future, I would need a budget for access to better data, which includes several fields, like vintage year, that I could not be confident in for this study. Perhaps the best thing that I would get from this source is average pricing. A new challenge/opportunity I foresee is the vast quantity of data that’s available. I would need to decide which wines to exclude and how to deal with non-reviewed wines.

Exploring Data
For most studies, the process of exploratory data analysis uncovers actionable insights and this was certainly the case here. Some of the key learnings included:

  • The distribution of wine pricing has a strong right skew.
  • Wine pricing and review scores are strongly correlated at scores below 95, but the highest reviews lead to spikes in pricing.
  • A study of average cost per point revealed wines that could be bargains, assuming that you believe in review consistency.
    o Malbec, merlot and primitivo wines had low averages.
    o Nebbiolo, cabernet sauvignon & pinot noir had high averages.
    o Regional analysis shows further potential for finding underpriced wines.
  • A natural language processing study revealed words that were more common when discussing wines in certain price ranges.
    o Words associated with higher prices included “oak” and “cabernet.”
    o Words associated with lower prices included “soft” and “fruit.”
  • A cluster analysis on our data was also revealing:
    o Wines imported from France and Italy made up separate clusters.
    o Varieties that were dominant in the dataset (cabernet sauvignon and pinot noir) were the driving force for the four other clusters.
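The cost-per-point idea from the list above can be sketched in a few lines. This is only an illustration: the rows are made-up numbers rather than data from the study, the function name is hypothetical, and dividing price by raw review score is an assumed definition of "cost per point."

```python
from collections import defaultdict

# Hypothetical rows (variety, price in USD, review score); NOT the study's data.
wines = [
    ("malbec", 18.0, 90),
    ("malbec", 27.0, 90),
    ("nebbiolo", 46.0, 92),
    ("nebbiolo", 69.0, 92),
]

def avg_cost_per_point(rows):
    """Return {variety: mean of price / score} for the given rows.

    Assumes cost per point means price divided by the raw review score.
    """
    totals = defaultdict(lambda: [0.0, 0])  # variety -> [running sum, count]
    for variety, price, score in rows:
        totals[variety][0] += price / score
        totals[variety][1] += 1
    return {variety: total / count for variety, (total, count) in totals.items()}

cost_per_point = avg_cost_per_point(wines)

# List varieties from cheapest to most expensive per review point.
for variety, value in sorted(cost_per_point.items(), key=lambda kv: kv[1]):
    print(f"{variety}: ${value:.2f} per point")
```

Sorting by this figure puts the potential bargains at the top of the list, which is only meaningful if review scores are consistent across reviewers.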

Assuming that future study would continue to focus on wines available in the US, it will be interesting to look for consistency in the data. I would also want to see how non-reviewed wine is treated by clustering.

Modeling
For this study, the target variable was the log of the price, which accounts for the strongly skewed data. I used several different regression techniques to try to create a model that could predict the price of a bottle of wine. The best model was created using Extra Trees Regressor, improving on the baseline model by 22% to 30%. Still, the model was not a great forecaster. The linear regression model, which the extra trees model beat by only 10%, had an R-squared of 0.58, suggesting that I have a long way to go.
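To show why a log target helps with right-skewed prices, here is a minimal sketch using only the standard library. The prices are invented for illustration, the natural log is an assumption (the study does not say which base was used), and the skewness helper is a simple population formula rather than anything taken from the study.

```python
import math
import statistics

# Hypothetical bottle prices in USD with a long right tail; NOT the study's data.
prices = [10, 12, 15, 18, 20, 25, 30, 45, 80, 250]

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Assumed transform: natural log of price.
log_prices = [math.log(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)
print(f"skew of price:      {raw_skew:.2f}")
print(f"skew of log(price): {log_skew:.2f}")

# A prediction made on the log scale converts back to dollars with exp().
predicted_log_price = math.log(40)  # stand-in for a model's output
predicted_price = math.exp(predicted_log_price)
print(f"predicted price: ${predicted_price:.2f}")
```

The log scale pulls in the expensive outliers, so a regression is less dominated by a handful of very pricey bottles. Any error metric computed on the log scale needs the same back-transformation before it can be read in dollars.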

No model is perfect, so data scientists have to make do. In this case, I was able to extract actionable insights from the coefficients of the linear regression. This examination revealed several regions, varieties and countries that have a strongly positive or negative effect on price.

I also uncovered a few other nuggets:

  • Wines designated as “Reserve” or “Grand Cru,” which our data exploration didn’t show to be significantly better (based on reviews), lead to a 7.2% increase in price (assuming all other factors are constant).
  • Each point of review score translates to a 10% increase in price (assuming all other factors are constant).
  • Wine value increases by 0.8% for each year that it ages (assuming all other factors are constant).
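Percentages like the ones above typically come from converting coefficients of a log-price model. A minimal sketch of that conversion follows, assuming the target was the natural log of price; the function names are hypothetical, and the study does not state exactly how its percentages were derived.

```python
import math

def pct_change_from_coef(beta):
    """Percent change in price for a one-unit increase in a feature,
    given coefficient beta from a model of natural-log price."""
    return (math.exp(beta) - 1.0) * 100.0

def coef_from_pct_change(pct):
    """Inverse of the above: the log-scale coefficient implied by a percent change."""
    return math.log(1.0 + pct / 100.0)

# Coefficient implied by the roughly 10% per review point quoted above.
beta_score = coef_from_pct_change(10.0)

one_point = pct_change_from_coef(beta_score)
# Effects add on the log scale, so they compound on the price scale.
five_points = pct_change_from_coef(5 * beta_score)

print(f"coefficient per point: {beta_score:.4f}")
print(f"one extra point:   +{one_point:.1f}%")
print(f"five extra points: +{five_points:.1f}%")
```

One practical consequence is that the effects compound rather than add: under these assumptions, five extra review points imply a larger increase than five times 10%.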

Conclusions
So, you can see how a fair to middling model can produce some very interesting insights and set the groundwork for something even better. Hopefully, I can find the funding to explore more in the future. For now, I’ll just have to taste some wine and judge these results for myself.

For the full study and references, please click here.


Steven Diamond

After spending my career in Marketing and Business Development, I am taking the Data Science Immersive course at GA and looking forward to the next step.