## Talk overview

• Introduction
• Combination schemes
• Examples
• Discussion and takeaways

### Question: when facing different models, it is advisable to choose the best-performing model.

Option 1 = Agree

Option 2 = Disagree

# In uncertain or dynamic environments we should use all the help we can get

## Combining $P$ different models

Simple enough:

$$f^{combined} = \frac{\sum_{i = 1}^P f_i }{P}$$
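A minimal sketch of this equal-weight average, using three hypothetical model forecasts (the numbers are purely illustrative):

```python
import numpy as np

# Forecasts from P = 3 hypothetical models for the same target
forecasts = np.array([2.1, 1.8, 2.4])

# Equal-weight combination: the simple average of the individual forecasts
f_combined = forecasts.mean()
```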

But should we?

## Combining $P$ different models

Yes, we should. It works, because it diversifies across:

• Biases
• Model risk

### Different models perform differently

in different circumstances and/or at different points in time.

## The thing is, going forward we don't know which forecasting model will outperform

So, just as we don't bet on a single horse when investing, we don't bet on a single model here either.

Diversification works in forecasting in the same manner it works in investing.

That is the idea, but how do we combine?

## Regression based (OLS)

$$y_t = {\alpha} + \sum_{i = 1}^P {\beta_i} f_{i,t} +\varepsilon_t,$$

The combined forecast is then given by:

$$f^{comb} = \widehat{\alpha} + \sum_{i = 1}^P \widehat{\beta}_i f_i,$$
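A toy sketch of the OLS combination, with simulated data standing in for the target series and the $P$ individual forecasts (all names and numbers here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T, P = 100, 3

y = rng.normal(size=T)                                # target series (toy data)
F = y[:, None] + rng.normal(scale=0.5, size=(T, P))   # P noisy individual forecasts

# Regress y on a constant and the P forecasts:
# y_t = alpha + sum_i beta_i f_{i,t} + eps_t
X = np.column_stack([np.ones(T), F])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha, beta = coef[0], coef[1:]

# Combined forecast for a new vector of individual forecasts
f_new = np.array([0.2, 0.1, 0.3])
f_comb = alpha + beta @ f_new
```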

## Regression based (LAD)

Estimate the combination weights using:

$$y_t = {\alpha} + \sum_{i = 1}^P {\beta_i} f_{i,t} +\varepsilon_t,$$ (as before)

But minimise the absolute loss function:

$$\sum_t |\varepsilon_t|$$ instead of the squared loss function $$\sum_t {\varepsilon_t}^2$$
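One way to fit the absolute-loss version, assuming SciPy is available; a derivative-free optimiser is used for simplicity (in practice this is median regression, e.g. quantile regression at the 0.5 quantile). Data and names are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
T, P = 100, 3
y = rng.normal(size=T)                                # toy target series
F = y[:, None] + rng.normal(scale=0.5, size=(T, P))   # toy individual forecasts
X = np.column_stack([np.ones(T), F])

# Minimise the sum of absolute residuals instead of squared residuals
def abs_loss(coef):
    return np.abs(y - X @ coef).sum()

res = minimize(abs_loss, x0=np.zeros(P + 1), method="Nelder-Mead")
alpha, beta = res.x[0], res.x[1:]
```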

## Regression based (CLS)

Estimate the combination weights using:

$$y_t = {\alpha} + \sum_{i = 1}^P {\beta_i} f_{i,t} +\varepsilon_t,$$

Minimise the squared loss function $$\sum_t {\varepsilon_t}^2,$$ but under additional constraints: $\beta_i \geq 0 \; \forall i$, or $\sum_{i = 1}^P \beta_i = 1$, or both.
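A sketch of the constrained fit, assuming SciPy's SLSQP solver and imposing both constraints at once (non-negative weights that sum to one); the intercept is dropped here for simplicity, and the data are simulated:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
T, P = 100, 3
y = rng.normal(size=T)                                # toy target series
F = y[:, None] + rng.normal(scale=0.5, size=(T, P))   # toy individual forecasts

# Squared loss as a function of the combination weights
def sq_loss(beta):
    return ((y - F @ beta) ** 2).sum()

res = minimize(
    sq_loss,
    x0=np.full(P, 1.0 / P),                 # start from equal weights
    method="SLSQP",
    bounds=[(0, None)] * P,                 # beta_i >= 0
    constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1}],  # sum to 1
)
beta = res.x
```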

## Accuracy-based (Inverse MSE)

Use some accuracy measure, for example mean squared error (MSE):

$$\operatorname {MSE_i} ={\frac {1}{T}}\sum _{t=1}^{T}({{f_{i,t}}} - y_{t})^{2} ,$$

and combine the forecasts based on how well each individual is doing:

$$f^c = \sum_{i = 1}^P \frac{\left(\frac{MSE_{i}}{\sum_{j = 1}^P MSE_{j}}\right)^{-1}}{\sum_{j = 1}^P \left(\frac{MSE_{j}}{\sum_{k = 1}^P MSE_{k}}\right)^{-1}} f_i = \sum_{i = 1}^P \frac{\frac{1}{MSE_{i}}}{\sum_{j = 1}^P \frac{1}{MSE_{j}}} f_i$$
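A small sketch of the inverse-MSE weighting, with hypothetical per-model MSEs and current forecasts:

```python
import numpy as np

# Hypothetical historical MSEs and current forecasts of P = 3 models
mse = np.array([4.0, 2.0, 1.0])
f = np.array([2.0, 2.5, 3.0])

# Weight each model in proportion to the inverse of its MSE
weights = (1.0 / mse) / (1.0 / mse).sum()
f_comb = weights @ f
```

The most accurate model (smallest MSE) gets the largest weight, but no model is discarded entirely.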

## Best individual (BI)

Basically (ex-post) model selection

$$f^c = \sum_{i = 1}^P w_i f_i, \quad \mbox{where} \qquad$$

$w_i = 1 \quad \mbox{if} \quad MSE_{i} < MSE_{j} \quad \forall j \neq i,$

$w_i = 0 \quad \mbox{otherwise}$
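As a degenerate combination scheme, this puts all weight on one model; a minimal sketch with hypothetical MSEs and forecasts:

```python
import numpy as np

# Hypothetical historical MSEs and current forecasts of P = 3 models
mse = np.array([4.0, 2.0, 1.0])
f = np.array([2.0, 2.5, 3.0])

# Put all weight on the model with the smallest historical MSE
weights = np.zeros_like(mse)
weights[np.argmin(mse)] = 1.0
f_comb = weights @ f   # here: the forecast of the third model
```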

## Housing price forecasting (Example)

There are 14 attributes in each case of the dataset. They are:

1. CRIM - per capita crime rate by town
2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS - proportion of non-retail business acres per town.
4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
5. NOX - nitric oxides concentration (parts per 10 million)
6. RM - average number of rooms per dwelling
7. AGE - proportion of owner-occupied units built prior to 1940
8. DIS - weighted distances to five Boston employment centres
9. RAD - index of accessibility to radial highways
10. TAX - full-value property-tax rate per $10,000
11. PTRATIO - pupil-teacher ratio by town
12. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT - % lower status of the population
14. MEDV - Median value of owner-occupied homes in $1000's

Description of the Boston dataset. Source: U.S. Census Service

## Housing price forecasting

Individual forecasts (RMSE):

• Linear: 4.73
• Principal component regression: 7.62
• Boosting: 3.85
• Random forests: 3.26
• Support vector machine: 3.06
• Neural network: 3.97

Forecast combinations (RMSE):

• Simple: 3.64
• OLS: 2.77
• LAD: 2.77
• Variance based: 3.2
• CLS: 2.95
• BI: 3.06

## GDP measurements

“The current system emphasizes data on spending, but the bureau also collects data on income. In theory the two should match perfectly - a penny spent is a penny earned by someone else. But estimates of the two measures can diverge widely” [Aruoba et al., 2015]

## Some discussion

### Many familiar techniques can be cast in terms of averaging:

$$D_t = (1-\lambda) \sum_{i=1}^ \infty \lambda^{i-1} (\varepsilon_{t-i+1}\varepsilon^ \prime_{t-i+1}) = (1-\lambda)(\varepsilon_{t}\varepsilon^ \prime_{t})+\lambda D_{t-1}$$
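The recursive form of this exponentially weighted average is cheap to compute; a minimal sketch, assuming a smoothing parameter of $\lambda = 0.94$ and toy 2-dimensional error vectors:

```python
import numpy as np

lam = 0.94                        # assumed smoothing parameter
rng = np.random.default_rng(3)
eps = rng.normal(size=(200, 2))   # toy forecast-error vectors

# Recursion D_t = (1 - lam) * eps_t eps_t' + lam * D_{t-1},
# i.e. an exponentially weighted average of the outer products
D = np.eye(2)                     # arbitrary starting value
for e in eps:
    D = (1 - lam) * np.outer(e, e) + lam * D
```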

## Other ideas

• Different regimes
• Dynamic model averaging
## Why not use it?

1. Interpretation is lost
2. Does not always add value (garbage in $\Rightarrow$ garbage out)

## Why use it?

1. Good "hedge" against wrong modelling choices
2. No consensus on the best approach
3. Simple average is very robust
4. Useful in changing environment where structural breaks are likely