Ensemble learning

(also known as forecast combination)

Eran Raviv

Talk overview

Question: when facing different models, it is advisable to choose the best performing model.

Option 1 = Agree

Option 2 = Disagree

Some targets are easy to forecast

Solar eclipse

Some.. not so much

Lorenz Curve

Some.. evidence is mixed

S&P 500

In uncertain or dynamic environments we should use all the help we can get

Combining $P$ different models

Simple enough:

\begin{equation} f^{combined} = \frac{\sum_{i = 1}^P f_i }{P} \end{equation}
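As a minimal sketch of the equal-weight average above, assuming the forecasts are stored in a NumPy array with one column per model (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical forecasts: rows are time periods, columns are the P models.
forecasts = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 2.0, 5.0],
])

# Equal-weight combination: average across the P columns.
f_combined = forecasts.mean(axis=1)
print(f_combined)  # [2. 3.]
```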

But should we?

Combining $P$ different models

Yes we should

It works


  • Biases
  • Model risk

Different models perform differently

in different circumstances and/or at different points in time:

Forecasting day-ahead electricity prices: Utilizing hourly prices

Source: Forecasting day-ahead electricity prices

The thing is, going forward we don't know which forecasting model will outperform

So, just as we don't bet on a single horse when investing, we don't bet on a single horse here either

Diversification works in forecasting the same way it works in investing

That is the idea, but how to combine?

Regression based (OLS)

$$ y_t = {\alpha} + \sum_{i = 1}^P {\beta_i} f_{i,t} +\varepsilon_t, $$

The combined forecast is then given by:

$$f^{comb} = \widehat{\alpha} + \sum_{i = 1}^P \widehat{\beta}_i f_i,$$
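A minimal sketch of the regression-based combination, assuming synthetic data and `numpy.linalg.lstsq` for the least-squares fit (the data-generating choices and variable names are illustrative assumptions, not part of the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): T periods, P = 3 forecasts
# that each track the target y with a different bias and noise level.
T, P = 200, 3
y = rng.normal(size=T)
F = np.column_stack([
    y + 0.5 + rng.normal(scale=0.5, size=T),
    0.8 * y + rng.normal(scale=0.3, size=T),
    y - 0.3 + rng.normal(scale=1.0, size=T),
])

# Regress y on an intercept and the P forecasts.
X = np.column_stack([np.ones(T), F])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha_hat, beta_hat = coef[0], coef[1:]

# Combined forecast: alpha_hat + sum_i beta_hat_i * f_i.
f_comb = alpha_hat + F @ beta_hat
```

In this sketch the weights are fitted and applied on the same sample; in practice they would typically be estimated on a training window and applied to later forecasts.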

Regression based (LAD)

Estimate the combination weights from the same regression:

$$y_t = {\alpha} + \sum_{i = 1}^P {\beta_i} f_{i,t} +\varepsilon_t,$$ (as before)

But minimise the absolute loss function:

$$\sum_t |\varepsilon_t|$$ instead of the squared loss function $$\sum_t {\varepsilon_t}^2$$
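One way to minimise the absolute loss is to write it as a linear programme. The sketch below assumes SciPy's `linprog` and synthetic data with a few outliers; the helper name `lad_fit` is hypothetical:

```python
import numpy as np
from scipy.optimize import linprog


def lad_fit(F, y):
    """Return (alpha, beta) minimising sum_t |y_t - alpha - F[t] @ beta|."""
    T, P = F.shape
    X = np.column_stack([np.ones(T), F])
    k = P + 1
    # Decision variables: k free coefficients, then the positive and
    # negative parts (u, v >= 0) of each residual, so |eps_t| = u_t + v_t.
    cost = np.concatenate([np.zeros(k), np.ones(2 * T)])
    A_eq = np.hstack([X, np.eye(T), -np.eye(T)])  # X b + u - v = y
    bounds = [(None, None)] * k + [(0, None)] * (2 * T)
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=bounds)
    return res.x[0], res.x[1:k]


# Synthetic data (an assumption): two forecasts of y, plus a few outliers in y.
rng = np.random.default_rng(0)
T = 80
signal = rng.normal(size=T)
F = np.column_stack([signal + rng.normal(scale=0.3, size=T),
                     signal + rng.normal(scale=0.6, size=T)])
y = signal.copy()
y[:4] += 15.0  # outliers that would pull a squared-loss fit

alpha_lad, beta_lad = lad_fit(F, y)
f_comb_lad = alpha_lad + F @ beta_lad
```

The absolute loss penalises large errors linearly rather than quadratically, so a handful of extreme observations has less influence on the estimated weights.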

Regression based (CLS)

Estimate the combination weights from the same regression:

$$y_t = {\alpha} + \sum_{i = 1}^P {\beta_i} f_{i,t} +\varepsilon_t,$$

Minimise the squared loss function: $$\sum_t {\varepsilon_t}^2,$$ but under additional constraints: $\beta_i \geq 0 \; \forall i$, or $\sum_{i = 1}^P \beta_i = 1$, or both.
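A sketch of the constrained fit with both constraints imposed, assuming SciPy's general-purpose SLSQP solver and synthetic data (the helper name `cls_fit` and the choice to leave the intercept unconstrained are assumptions):

```python
import numpy as np
from scipy.optimize import minimize


def cls_fit(F, y):
    """Least squares with beta_i >= 0 and sum(beta) = 1 (intercept left free)."""
    T, P = F.shape

    def sse(params):
        resid = y - params[0] - F @ params[1:]
        return resid @ resid

    start = np.concatenate([[0.0], np.full(P, 1.0 / P)])  # equal weights
    bounds = [(None, None)] + [(0.0, None)] * P            # beta_i >= 0
    sum_to_one = {"type": "eq", "fun": lambda p: p[1:].sum() - 1.0}
    res = minimize(sse, start, method="SLSQP", bounds=bounds,
                   constraints=[sum_to_one])
    return res.x[0], res.x[1:]


# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
T = 200
y = rng.normal(size=T)
F = np.column_stack([
    y + rng.normal(scale=0.5, size=T),
    y + rng.normal(scale=1.0, size=T),
    y + 0.5 + rng.normal(scale=0.8, size=T),
])

alpha_cls, beta_cls = cls_fit(F, y)
f_comb_cls = alpha_cls + F @ beta_cls
```

With both constraints active the weights can be read as shares, which is easier to interpret than unrestricted OLS coefficients.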

Accuracy-based (Inverse MSE)

Use some accuracy measure, for example mean squared error (MSE):

$$ \operatorname {MSE_i} ={\frac {1}{T}}\sum _{t=1}^{T}({{f_{i,t}}} - y_{t})^{2} , $$

and combine the forecasts based on how well each individual is doing:

$$ f^c = \sum_{i = 1}^P w_i f_i, \qquad w_i = \frac{\left(\frac{MSE_{i}}{\sum_{j = 1}^P MSE_{j}}\right)^{-1}}{\sum_{k = 1}^P \left(\frac{MSE_{k}}{\sum_{j = 1}^P MSE_{j}}\right)^{-1}} = \frac{\frac{1}{MSE_{i}}}{\sum_{j=1}^P\frac{1}{MSE_{j}}} $$
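A small worked example of the inverse-MSE weights, using made-up numbers for two models over four periods (all values are illustrative assumptions):

```python
import numpy as np

# Hypothetical target and forecasts: rows are periods, columns are models.
y = np.array([1.0, 2.0, 3.0, 4.0])
F = np.array([[1.0, 3.0],
              [2.0, 4.0],
              [4.0, 5.0],
              [5.0, 6.0]])

mse = ((F - y[:, None]) ** 2).mean(axis=0)   # MSE per model: [0.5, 4.0]
weights = (1.0 / mse) / (1.0 / mse).sum()    # weights: [8/9, 1/9]
f_c = F @ weights                            # weighted sum of the forecasts
```

The more accurate model receives the larger weight, but unlike model selection the weaker model still contributes.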

Best individual (BI)

Basically (ex-post) model selection:

$$ f^c = \sum_{i = 1}^P w_i f_i, \quad \mbox{where} $$

$w_i = 1 \quad \mbox{if} \quad MSE_{i} < MSE_{j} \quad \forall j \neq i$

$w_i = 0 \quad \mbox{otherwise}$
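As an illustration, the selection rule can be applied to the individual RMSE figures reported in the housing example later in this talk. Ranking by RMSE is the same as ranking by MSE, since squaring preserves the order; the dictionary layout is an assumption made for the sketch:

```python
# Individual RMSEs as reported in the housing price example of this talk.
rmse = {
    "Linear": 4.73,
    "Principal component regression": 7.62,
    "Boosting": 3.85,
    "Random forests": 3.26,
    "Support vector machine": 3.06,
    "Neural network": 3.97,
}

# MSE_i = RMSE_i ** 2, so the ordering of the models is unchanged.
mse = {name: value ** 2 for name, value in rmse.items()}

# Weight 1 for the model with the smallest MSE, 0 for all others.
best = min(mse, key=mse.get)
weights = {name: 1.0 if name == best else 0.0 for name in mse}
print(best)  # Support vector machine
```

This matches the reported BI result of 3.06, which equals the support vector machine's individual RMSE.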

Housing price forecasting (Example)

There are 14 attributes in each case of the dataset. They are:

  1. CRIM - per capita crime rate by town
  2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS - proportion of non-retail business acres per town.
  4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  5. NOX - nitric oxides concentration (parts per 10 million)
  6. RM - average number of rooms per dwelling
  7. AGE - proportion of owner-occupied units built prior to 1940
  8. DIS - weighted distances to five Boston employment centres
  9. RAD - index of accessibility to radial highways
  10. TAX - full-value property-tax rate per $10,000
  11. PTRATIO - pupil-teacher ratio by town
  12. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  13. LSTAT - % lower status of the population
  14. MEDV - Median value of owner-occupied homes in $1000's

Description of the Boston dataset. Source: U.S. Census Service

Housing price forecasting

Individual forecasts (RMSE)

  • Linear: 4.73
  • Principal component regression: 7.62
  • Boosting: 3.85
  • Random forests: 3.26
  • Support vector machine: 3.06
  • Neural network: 3.97

Forecast combinations (RMSE)

  • Simple: 3.64
  • OLS: 2.77
  • LAD: 2.77
  • Variance based: 3.2
  • CLS: 2.95
  • BI: 3.06


GDP measurements

“The current system emphasizes data on spending, but the bureau also collects data on income. In theory the two should match perfectly - a penny spent is a penny earned by someone else. But estimates of the two measures can diverge widely” [Aruoba et al., 2015]

Some discussion

Many familiar techniques can be cast in terms of averaging:

$$ D_t = (1-\lambda) \sum_{j=1}^{\infty} \lambda^{j-1} (\varepsilon_{t-j}\varepsilon^\prime_{t-j}) = (1-\lambda)(\varepsilon_{t-1}\varepsilon^\prime_{t-1})+\lambda D_{t-1} $$
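A quick numerical check that the recursive update and the weighted average of past outer products agree. The sketch assumes synthetic shocks, a decay parameter of 0.94, a recursion started from a zero matrix, and the convention that the most recent shock receives weight $(1-\lambda)$:

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 0.94                        # assumed decay parameter
eps = rng.normal(size=(500, 2))   # synthetic shocks, one row per period

# Recursive form, started from a zero matrix.
D = np.zeros((2, 2))
for e in eps:
    D = (1 - lam) * np.outer(e, e) + lam * D

# Averaging form: weight (1 - lam) * lam**(j - 1) on the j-th most recent shock.
j = np.arange(1, len(eps) + 1)
weights = (1 - lam) * lam ** (j - 1)
recent_first = eps[::-1]
outer_products = np.einsum("ti,tk->tik", recent_first, recent_first)
D_avg = np.tensordot(weights, outer_products, axes=1)

print(np.allclose(D, D_avg))  # True
```

The weights decay geometrically and sum to almost one over a long sample, so the exponentially weighted covariance is a weighted average of past outer products, in the same spirit as a forecast combination.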

Other ideas

  • Different regimes
  • Dynamic model averaging

Why not use it?

  1. Interpretation is lost
  2. Does not always add value (garbage in $\Rightarrow$ garbage out)

Why use it?

  1. Good "hedge" against wrong modelling choices
  2. No consensus on the best approach
  3. Simple average is very robust
  4. Useful in changing environments where structural breaks are likely

Thank you