Recently, FiveThirtyEight, a popular website that uses statistical analysis to forecast sports, politics, and other cultural events, posted a review of how accurate its models have been. The post focuses on *calibration*, which “measures whether, over the long run, events occur about as often as you say they’re going to occur.” In other words, if a model (or person) predicts that an outcome has a 70% chance of occurring, the outcome should actually occur close to 70% of the time over a large enough set of trials.^{[1]} FiveThirtyEight reviewed all of its predictions and grouped them in five-percentage-point increments by the probability assigned to the outcome (5-10%, 10-15%, and so on). It then compared the predicted probabilities with how often the outcomes actually occurred.
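To make the grouping step concrete, here is a minimal sketch of a calibration table in Python. It is not FiveThirtyEight's code: the function name `calibration_table`, the 20-bin default, and the synthetic forecasts are all assumptions made for illustration.

```python
import random

def calibration_table(forecasts, outcomes, n_bins=20):
    """Bucket forecasts into equal-width probability bins (20 bins = 5-point
    increments) and compare each bin's mean forecast with how often the
    event actually happened."""
    buckets = {}
    for prob, happened in zip(forecasts, outcomes):
        # Clamp so a forecast of exactly 1.0 lands in the top bin.
        idx = min(int(prob * n_bins), n_bins - 1)
        buckets.setdefault(idx, []).append((prob, happened))

    table = []
    for idx in sorted(buckets):
        rows = buckets[idx]
        mean_forecast = sum(prob for prob, _ in rows) / len(rows)
        observed_rate = sum(1 for _, happened in rows if happened) / len(rows)
        table.append((idx / n_bins, (idx + 1) / n_bins, len(rows),
                      mean_forecast, observed_rate))
    return table

# Synthetic data only: each event occurs with exactly its forecast
# probability, so this hypothetical forecaster is well calibrated by design.
rng = random.Random(0)
forecasts = [rng.random() for _ in range(20000)]
outcomes = [rng.random() < prob for prob in forecasts]

for low, high, count, mean_forecast, observed_rate in calibration_table(forecasts, outcomes):
    print(f"{low:4.0%}-{high:4.0%}  n={count:5d}  "
          f"forecast={mean_forecast:.3f}  observed={observed_rate:.3f}")
```

For a well-calibrated forecaster, the `forecast` and `observed` columns should track each other closely in every bin. Persistent gaps in one direction would point to over- or under-confidence.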

As you could probably guess, their models are fairly well calibrated. This is not surprising given that they are in the business of making predictions: if they were bad at it, they probably wouldn’t still be around (survivorship bias) or be posting about it (publication bias). The same could be said about us!

Oddly, for an industry that routinely makes predictions, investment management is notoriously poor at making them well. Real outcomes often do not line up with expectations. One way to measure this is the calibration technique FiveThirtyEight used.

We constantly monitor and evaluate our quantitative models. One of our methods for judging a model’s efficacy is to compare its live performance with its backtested performance. Specifically, we compare the return we would expect from the model in a given market environment with the return the model actually earned under those conditions. We “normalize” the data to account for different market environments, which allows comparisons between favorable and unfavorable markets.

To set our expectations, we find all the environments in the backtest period that closely match the current environment and build a distribution of possible returns. We then remove the “tails” of the distribution (the top and bottom 10%); the remaining middle 80% is our “inline” with expectations bucket. If our model is well calibrated, we should expect about 20% of its actual live returns to fall “outside” expectations (10% above and 10% below). A perfectly calibrated model is an unreasonable expectation. Every model is “overfit” or “underfit” to some degree; the question is how much.
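The trimming step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our production code: the helper names (`expectation_band`, `classify`, `share_outside`) are hypothetical, the returns are random numbers rather than model results, and the step that matches backtest environments to the current one is assumed to have been done already.

```python
import random
import statistics

def expectation_band(matched_backtest_returns):
    """Return the (10th, 90th) percentile cut-offs of the backtest returns
    drawn from environments that resemble the current one."""
    deciles = statistics.quantiles(matched_backtest_returns, n=10,
                                   method="inclusive")
    return deciles[0], deciles[-1]

def classify(live_return, band):
    """Label a live return as below, inline with, or above expectations."""
    lower, upper = band
    if live_return < lower:
        return "below"
    if live_return > upper:
        return "above"
    return "inline"

def share_outside(live_returns, bands):
    """Fraction of live returns that fall outside their expectation band."""
    labels = [classify(r, band) for r, band in zip(live_returns, bands)]
    return sum(label != "inline" for label in labels) / len(labels)

# Synthetic data only (hypothetical, volatility-normalized returns).
rng = random.Random(1)
backtest = [rng.gauss(0.0, 1.0) for _ in range(5000)]
band = expectation_band(backtest)

# Live returns from the same distribution: roughly 20% should land outside.
live = [rng.gauss(0.0, 1.0) for _ in range(1000)]
calibrated_share = share_outside(live, [band] * len(live))

# Live returns that are more dispersed than the backtest suggested.
wider_live = [rng.gauss(0.0, 1.5) for _ in range(1000)]
wider_share = share_outside(wider_live, [band] * len(wider_live))

print(f"same distribution: {calibrated_share:.1%} outside")
print(f"wider live returns: {wider_share:.1%} outside")
```

A share well above 20% means live results are landing in the tails more often than the backtest implied. A share well below 20% means the band is wider than it needs to be.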

Below are examples of our short-term counter trend model’s return distributions (our expectations) in a good environment and a bad environment.^{[2]} The returns have been normalized for volatility.^{[3]}

## Good Environment

## Bad Environment

We have about five years of live trading results for our current short-term counter trend models (1/1/14 – 3/31/19, or 63 months), and we have traded 14 different equity markets, which equates to 882 data points^{[4]} (63 months × 14 markets). In our actual live results, 20.75%^{[5]} of returns fell “outside” expectations, just slightly above the 20% we expected; this suggests our model is not meaningfully “overfit” or “underfit”. Over time, our model has performed as expected given the market environment, which has given us greater confidence in its ability to capture the market anomaly it was designed to capture.
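As a rough sanity check on how close 20.75% is to 20%, the arithmetic can be laid out in Python. The standard-error step is an added illustration, not part of the analysis above. It assumes each of the 882 observations is an independent draw, which is a strong simplification because equity markets tend to move together, so the true uncertainty is likely larger than this figure.

```python
import math

months, markets = 63, 14
n_obs = months * markets                 # 882 market-months
expected_share = 0.20                    # share expected outside the band
observed_share = 0.2075                  # share reported for live results

# Number of market-months the reported share implies.
implied_count = round(observed_share * n_obs)

# Binomial standard error of the outside share IF observations were
# independent (an assumption made here purely for illustration).
std_error = math.sqrt(expected_share * (1 - expected_share) / n_obs)
z_score = (observed_share - expected_share) / std_error

print(f"observations: {n_obs}")
print(f"implied outside count: {implied_count}")
print(f"standard error: {std_error:.4f}")
print(f"z-score: {z_score:.2f}")
```

Under that simplifying assumption, the gap between 20.75% and 20% is a bit over half a standard error, which is well within the range ordinary sampling noise would produce.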

Model evaluation metrics such as calibration can help give investors confidence that quantitative investment models are not “overfit” and are capturing what they were designed to capture. This is just one step in the investment due diligence process, but an important one.
