# How to Judge a Quantitative Model

Recently, FiveThirtyEight, a popular website that uses statistical analysis to predict sports, politics, and other cultural events, posted a review of how accurate their models have been. The post discusses calibration, which “measures whether, over the long run, events occur about as often as you say they’re going to occur.” In other words, if a model (or person) predicts that an outcome has a 70% chance of occurring, then over a large-enough set of trials it should actually occur at a frequency close to 70%.[1] FiveThirtyEight reviewed all their predictions and grouped them, in five percent increments, by the probability assigned to the outcome (e.g., 5-10%, 10-15%, etc.). They then compared their predicted probabilities to what actually occurred.
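The bucketing exercise described above is straightforward to sketch in code. The example below is a minimal illustration with simulated data, not FiveThirtyEight's actual methodology: it groups predicted probabilities into 5% buckets and compares each bucket's average forecast to the observed frequency of the event.

```python
import numpy as np

# Hypothetical calibration check on simulated forecasts.
rng = np.random.default_rng(0)
predicted = rng.uniform(0, 1, 10_000)              # model's stated probabilities
occurred = rng.uniform(0, 1, 10_000) < predicted   # simulate a well-calibrated forecaster

edges = np.arange(0.0, 1.05, 0.05)                 # 0-5%, 5-10%, ... buckets
buckets = np.digitize(predicted, edges) - 1

for b in range(len(edges) - 1):
    mask = buckets == b
    if mask.sum() == 0:
        continue
    forecast = predicted[mask].mean()              # average stated probability in bucket
    observed = occurred[mask].mean()               # how often the event actually happened
    print(f"{edges[b]:4.0%}-{edges[b + 1]:4.0%}: "
          f"forecast {forecast:.1%}, observed {observed:.1%}")
```

For a well-calibrated forecaster, the two columns track each other closely in every bucket; a systematic gap (say, 70% forecasts coming true only 55% of the time) is the signature of miscalibration.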

As you could probably guess, their models are fairly well calibrated. This is not surprising: they are in the business of making predictions, and if they were bad at it they probably wouldn’t still be around (survivorship bias) or be posting about it (publication bias). The same could be said about us!

Oddly, for an industry that routinely makes predictions, investment management is notoriously poor at making them well. Real outcomes often fail to line up with expectations. One way to judge this is the calibration technique FiveThirtyEight used.

We constantly monitor and evaluate our quantitative models. One of our methods for judging a model’s efficacy is to compare its live performance to its backtested performance. Specifically, we compare how we would expect the model to perform in a given market environment with its actual return under those conditions. We “normalize” the data to account for different market environments, which allows comparisons between favorable and unfavorable markets. To set our expectations, we find all the environments during the backtest period that closely match the current environment and build a distribution of possible returns from them. We then remove the “tails” of that distribution (the top and bottom 10%) to define our “in line with expectations” band. If the model is well calibrated, we should expect about 20% of its actual live returns to fall outside expectations (10% above and 10% below). A perfectly calibrated model is an unreasonable expectation; every model is “overfit” or “underfit” to some degree. The question is: how much?

Below are some examples of our short-term counter-trend model’s return distributions (our expectations) in good and bad environments.[2] The returns have been normalized for volatility.[3]