SCIENTIFIC REPORT
99-8
Monthly Temperature Forecasts for Denmark -
Statistical or Dynamical?
Henrik Feddersen
Copenhagen 1999
Reliability of probability forecasts is an important issue for long-range forecasts. An example is included in this report that illustrates how reliability of probability forecasts can be ensured, if the probability forecasts are derived as the conditional probability density of observed temperature given the raw model forecast.
Monthly forecasts can be computed in two fundamentally different
ways: using empirical/statistical methods, or using
physical/dynamical models. For some time now, DMI has issued
monthly surface air temperature outlooks based on the statistical
relation between the monthly mean temperature in Denmark in one
month and the monthly mean temperature in the following month.
As an example, Table 1 shows the observed relation between
January and February mean temperatures in the period 1890-1990.
For example, a cold January was followed by a cold February in
68% of the years, while in only 4% of the years was a cold January
followed by a warm February, i.e. there is a clear tendency for
temperature anomalies to persist from January into February. A
month is here categorised as ``cold'' when it belongs to the
coldest quartile; similarly, a month is ``warm'' when it belongs
to the warmest quartile. Thus, the observed distribution of
temperature in February after a cold January deviates strongly
from the climatological distribution (which is 25% cold, 50%
normal and 25% warm).
Table 1. Observed distribution of February temperature categories (cold/normal/warm) for each January category, 1890-1990.

January \ February | Cold | Normal | Warm
Warm | 4% | 44% | 52%
Normal | 14% | 65% | 22%
Cold | 68% | 28% | 4%
Forecasts based on statistical relations, such as the example shown in Table 1, have the advantage that they are reliable by construction (unless the climate has changed since the period used to derive the relations between monthly mean temperatures), i.e. the forecast probabilities agree with the observed frequencies of cold, normal and warm months.
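As an illustration of how such a table can be constructed, the following sketch (in Python, with synthetic data standing in for the 1890-1990 record) classifies two temperature series into quartile-based cold/normal/warm categories and tabulates the distribution of February categories for each January category:

```python
import numpy as np

def categorise(temps):
    """Assign 'cold'/'normal'/'warm' labels from the 25% and 75% quantiles."""
    lo, hi = np.percentile(temps, [25, 75])
    return np.where(temps <= lo, "cold", np.where(temps >= hi, "warm", "normal"))

def conditional_distribution(jan, feb):
    """Frequency of February categories given each January category (cf. Table 1)."""
    jan_cat, feb_cat = categorise(np.asarray(jan)), categorise(np.asarray(feb))
    table = {}
    for jc in ("cold", "normal", "warm"):
        subset = feb_cat[jan_cat == jc]
        table[jc] = {fc: float(np.mean(subset == fc)) for fc in ("cold", "normal", "warm")}
    return table

# Illustrative use with synthetic, persistent temperatures (not the 1890-1990 record):
rng = np.random.default_rng(0)
jan = rng.normal(0.0, 3.0, 101)
feb = 0.7 * jan + rng.normal(0.0, 2.0, 101)
print(conditional_distribution(jan, feb))
```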
The main purpose of the present study is to determine whether the simple statistical monthly forecasts described above can be improved by including additional information from dynamical seasonal or medium-range forecasts.
As an example where the statistical monthly outlook certainly
would have benefited from additional forecast information from a
10-day medium-range forecast, consider the monthly outlook for
February 1997. Figure 1 shows daily mean temperatures during
January and February 1997 for the station Tirstrup. At around 13
January there was a jump in temperature which was associated with
a shift in the atmospheric circulation over Europe from blocked
to more zonal flow (Woetmann Nielsen and
Cappelen, 1997), which led to milder conditions
in Denmark. The relatively mild conditions continued more or
less for the rest of the month and, except for a couple of days
in the middle of the month, very mild conditions dominated
throughout February.
After a brief description in Sect. 2 of the data used in the present study, Sect. 3 describes a systematic investigation of monthly predictive skill associated with different predictors, including observed monthly mean temperature for the previous month as well as monthly and 10-day dynamical simulations of surface air temperature. Section 4 discusses ways to express the forecasts in terms of probabilities, and Sect. 5 contains concluding remarks.
The primary source of data for the present study is the ``ECMWF Seasonal Simulations CD-ROM'' set (Becker, 1998) which contains selected data from the ECMWF PROVOST simulations (Brankovic and Palmer, 1999) and from the ECMWF ERA-15 reanalysis project. In the PROVOST project (PRediction Of climate Variations On Seasonal and interannual Timescales) the ECMWF IFS atmospheric model (cycle 13r4, horizontal resolution T63, 31 vertical levels; Simmons et al., 1988) was used to generate nine-member ensembles of simulated seasonal climate. Sea surface temperature (SST) was prescribed with observed values. The ensembles are approximately 120 days long, covering the seasons MAMJ, JJAS, SOND and DJFM during the period March 1979 - December 1993. The CD-ROM set contains 10-day means on a global 2.5°×2.5° grid for selected surface and pressure level fields.
The fields that are used in the present study include reanalysed surface temperature (including SST) and simulated max and min 2m temperature. Mean 2m temperature is not included on the CD-ROMs, so when reference in the following is made to simulated monthly mean 2m temperature we actually mean the average of max and min 2m temperature.
The nine members of the ensembles are generated from analysed initial conditions for the last nine days of February, May, August and November. Thus, realistic initial conditions (corresponding to short forecast lead times) are only available for March, June, September and December. For the remaining months of the year the model atmosphere has essentially no memory of the initial conditions (corresponding to forecast lead times of at least one month).
Observed monthly mean 2m temperature in Denmark is taken from the station Tranebjerg (55.9°N, 10.6°E). Homogenised monthly data for Tranebjerg from 1890 - 1990 is available in the ``North Atlantic Climatological Dataset'' (Frich and Coauthors, 1996); monthly mean 2m temperature after 1990 is taken from DMI's climate database.
For statistical climate predictions it is not uncommon to have a large set of potential predictors, e.g. a gridded analysis of monthly surface temperatures, where the temperature in each grid point is a potential predictor. It is in general not difficult to construct a linear regression equation that describes the observed relation between a large set of predictors and one predictand (e.g. an observed temperature time series for a single station). However, unless the regression equation describes a physical mechanism that relates the predictand to the set of predictors, it is very likely that such an equation simply represents fitting of random variations (noise) to the predictand, and that the equation, therefore, is useless for prediction of future observations.
This overfitting problem can be alleviated by reducing the dimension of the predictor field, and several methods exist for doing so. One class of methods is known as screening regression. A simple version, so-called forward selection, in which predictors are successively added to a linear regression equation until the addition of further predictors no longer leads to a statistically significant reduction of the residual variance (e.g. Conradsen, 1984), was tested for monthly forecasts but was found to give unsatisfactory results. Another class of methods is based on expansion of the predictor field in terms of ``dominating'' patterns. Linear regression in which the predictors are the first few principal components of the original predictor field was found to give much better results than screening regression, and the results shown later in this section are based on linear regression on the first three principal components of the predictor field.
The obvious dynamical model based forecast for the 2m temperature at a particular location, would be the simulated surface air temperature in the grid box containing that location. However, local orography which is not properly resolved in the global model, may cause the forecast to be in error. Moreover, as mentioned in the previous section, simulated surface air temperature is not readily available on the CD-ROMs. Therefore, in order to predict monthly mean 2m temperature in Denmark, linear regression is applied to the first three principal components of the average of simulated monthly max and min 2m temperature in a grid covering Northern Europe. That is, the statistical method that is used to predict temperature in Denmark when the predictor is a field of previously observed surface temperature in Northern Europe, is also applied to a predictor field of dynamically simulated surface air temperature.
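A minimal sketch of this principal-component regression is given below. It illustrates the general technique rather than the report's own code: the predictor field is assumed to be arranged as an array with one row per year and one column per grid point, the anomalies are reduced to their first three principal components by a singular value decomposition, and an ordinary least-squares fit relates those components to the station predictand.

```python
import numpy as np

def pc_regression_fit(field, y, n_pc=3):
    """Fit y (station temperature) on the first n_pc principal components of 'field'.

    field : (n_years, n_gridpoints) predictor field (e.g. monthly mean temperatures)
    y     : (n_years,) predictand (e.g. Tranebjerg monthly mean temperature)
    """
    field_mean = field.mean(axis=0)
    anom = field - field_mean                     # anomalies w.r.t. training climatology
    u, s, vt = np.linalg.svd(anom, full_matrices=False)
    eofs = vt[:n_pc]                              # leading spatial patterns
    pcs = anom @ eofs.T                           # principal component time series
    A = np.column_stack([np.ones(len(y)), pcs])   # regression design matrix
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return field_mean, eofs, coef

def pc_regression_predict(field_mean, eofs, coef, new_field):
    """Hindcast for a new predictor field (one year or several years)."""
    pcs = (new_field - field_mean) @ eofs.T
    return coef[0] + pcs @ coef[1:]
```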
Thus, in the following the predictand is given by observed monthly mean 2m temperature at Tranebjerg, and the predictors include:
- observed monthly mean 2m temperature at Tranebjerg for the month preceding the target month (persistence);
- analysed surface temperature (including SST) in Northern Europe for the preceding month;
- dynamically simulated monthly mean 2m temperature over Northern European land points, at long and at short lead time;
- a combination of the above predictor fields;
- dynamically simulated 2m temperature averaged over the first 10 days of the target month (short lead time).
Field predictors are reduced to their first three principal components as described above.
The derivation of a statistical prediction model requires a model training period, and verification requires an independent validation period. The statistical model is simply a statistical relationship between predictor(s) and predictand(s), e.g. a linear regression equation, which is based on data from the training period. The predictive skill of the model is estimated from predictions or hindcasts in the validation period.
In order to introduce the cross-validation technique, consider the case where the first half of a given period is used to derive a model (the training period) and the second half is used to validate the model (the validation period). This approach can easily be supplemented by swapping training and validation periods, so that another model is derived from data in the second half of the total period and validated on the first half. Altogether, the two models provide hindcasts for each year of the total period, and the total series of hindcasts can be verified against observations. This is a special case of a general procedure in which the total data period is repeatedly split into training and validation periods, each hindcast from the resulting models is saved, and the total series of hindcasts is finally verified against observations. This procedure is known as cross-validation (Michaelsen, 1987).
For very short time series the training period should be kept as long as possible, i.e. a training period would typically consist of data from all time steps but one, and the resulting model would be used to make a hindcast for the one point in time which would not be included in the training period.
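In its leave-one-out form the procedure can be sketched as follows. The sketch is illustrative only and assumes a single scalar predictor and ordinary linear regression; for each target year the regression is trained on all remaining years, and hindcast and observed anomalies are taken with respect to the climatology of that training period (cf. the remarks on anomaly correlations below).

```python
import numpy as np

def loo_cross_validated_hindcasts(x, y):
    """Leave-one-out hindcasts of y from a single predictor x by linear regression.

    For every target year the regression is derived from all other years, and the
    hindcast and verifying observation are returned as anomalies with respect to
    the climatology of that training period.
    """
    n = len(y)
    hindcast_anom = np.empty(n)
    observed_anom = np.empty(n)
    for i in range(n):
        train = np.delete(np.arange(n), i)
        slope, intercept = np.polyfit(x[train], y[train], 1)
        clim = y[train].mean()                      # training-period climatology
        hindcast_anom[i] = slope * x[i] + intercept - clim
        observed_anom[i] = y[i] - clim
    return hindcast_anom, observed_anom

# Example: persistence-type hindcasts where x is January and y is February temperature.
rng = np.random.default_rng(1)
jan = rng.normal(0.0, 3.0, 14)
feb = 0.7 * jan + rng.normal(0.0, 2.0, 14)
h, o = loo_cross_validated_hindcasts(jan, feb)
print("cross-validated anomaly correlation:", np.corrcoef(h, o)[0, 1])
```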
When verified against observations this standard procedure yields for each predictand a skill score which is a measure of the predictive skill of the applied statistical method (e.g. linear regression) over the data period in question. No assessment is made of the uncertainty of the skill score.
Cross-validated skill scores for statistical models are not always reliable for future predictions, i.e. in unfortunate cases one may be led to believe that a statistical forecast model is more accurate than it actually is. Two things can seriously affect the reliability of the cross-validated skill score: (i) outliers in the predictor and predictand data sets, and (ii) variability in the predictor and predictand data on the same timescale as the length of the data period.
In order to get a first estimate of the uncertainty of the cross-validated skill score for predicted 2m temperature at Tranebjerg, an ensemble of hindcasts is made for the same target month and year. This is achieved by making several hindcasts based on different, shorter training periods (note that a training ``period'' need not consist of data from consecutive points in time). For a total period of, e.g., length 15, the standard cross-validation training period would have length 14; if only 13 points in time were included, the training data could be selected in as many ways as 13 different numbers can be picked out of 14, i.e. there would be 14 different possible training periods and hence 14 different hindcasts; if only 12 points in time were included, there would be 91 different possible training periods, and so on.
A skill score can be calculated for each of the many possible combinations of hindcasts to yield a distribution of skill scores, and from this distribution a skill score confidence interval can be derived.
The expected performance is estimated by computing skill scores for a set of hindcasts. Numerous different types of skill scores exist, and the choice of skill score depends on which aspects of the forecast are most important. In the following, the linear correlation in time between forecast anomalies and observed anomalies is used as skill score. Linear anomaly correlations are simple to compute and widely used as skill scores, but a systematic nonlinear relation between forecasts and observations may not be well reflected in the linear correlation coefficient. Also, the linear correlation is known to be sensitive to outliers in the two data sets that are correlated (Wilks, 1995).
When linear anomaly correlations are computed between cross-validated statistical hindcasts and observations, one should be aware of a couple of potential pitfalls:
Each hindcast and observed anomaly must be with respect to the climatology of the relevant training period.
Cross-validated anomaly correlations underestimate the true forecast skill. The closer to zero the true forecast skill is, the more severe is the underestimation of the cross-validated anomaly correlation. Thus, the cross-validated anomaly correlation can be close to -1 when the true forecast skill is close to zero (Barnston and van den Dool, 1993).
The cross-validation methodology does not apply to persistence hindcasts where the predicted anomaly for the target month is simply the observed anomaly for the preceding month. Therefore, in order to get a fair comparison of hindcast skill scores, persistence hindcasts must be computed in a way that allows cross-validation. So in the following, persistence hindcasts refer to linear regression hindcasts where the predictor is the observed monthly mean temperature for the month preceding the hindcast target month.
The total predictor and predictand data sets comprise observed monthly mean 2m temperatures for Tranebjerg for January and February 1980, Jan. and Feb. 1981, ..., Jan. and Feb. 1993, i.e. for a period of 14 years. Thirteen different hindcasts for Feb. 1980 are made by deriving a linear regression equation for Feb. temperature based on each possible training data set that includes precisely 12 of the 13 temperatures from Jan. 1981-93. This procedure is repeated for hindcasts for each of the remaining years, 1981-93, leading to a total of 14×13=182 hindcasts. The hindcasts for each year can be combined to give a total of 13¹⁴ hindcast time series for the 1980-93 period. A subset of 200 hindcast time series is picked at random, and for each of them the linear correlation with observed Feb. temperatures is computed. A 95%-confidence correlation interval is then obtained simply by finding the 2.5% and 97.5% sample percentiles.
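The following sketch illustrates this confidence-interval procedure under simplified assumptions (a single scalar predictor and, for each target year, observed anomalies taken with respect to the average training climatology). It forms the 13 possible hindcasts per target year, samples 200 random hindcast time series, and returns the 2.5% and 97.5% percentiles of their anomaly correlations.

```python
import numpy as np

def skill_confidence_interval(x, y, n_series=200, seed=0):
    """95%-confidence interval for the cross-validated anomaly correlation.

    For each target year i, hindcasts are made from every training set that
    omits year i and one further year (13 hindcasts per year when len(y) == 14).
    Random combinations of these hindcasts form alternative hindcast time series.
    """
    n = len(y)
    per_year = []                                   # hindcast anomalies for each target year
    obs_anom = np.empty(n)
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        hindcasts, climatologies = [], []
        for left_out in rest:                       # leave out one further year
            train = [j for j in rest if j != left_out]
            slope, intercept = np.polyfit(x[train], y[train], 1)
            clim = np.mean(y[train])                # training-period climatology
            hindcasts.append(slope * x[i] + intercept - clim)
            climatologies.append(clim)
        per_year.append(np.array(hindcasts))
        obs_anom[i] = y[i] - np.mean(climatologies)  # simplification: average climatology
    rng = np.random.default_rng(seed)
    correlations = []
    for _ in range(n_series):                       # random subset of the 13**14 possible series
        series = np.array([rng.choice(h) for h in per_year])
        correlations.append(np.corrcoef(series, obs_anom)[0, 1])
    return np.percentile(correlations, [2.5, 97.5])

# Illustrative use with synthetic January/February temperatures (14 years):
rng = np.random.default_rng(2)
jan = rng.normal(0.0, 3.0, 14)
feb = 0.7 * jan + rng.normal(0.0, 2.0, 14)
print(skill_confidence_interval(jan, feb))
```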
The simplest zero-lead forecasts are those based on persistence
of the monthly mean temperature anomaly. Figure 4 shows as an
example February persistence hindcasts and February observations
in the 1980-93 period. The correlation between the hindcast
ensemble mean anomalies and observed anomalies is 0.68 which is a
high skill score relative to the skill scores obtained for other
months. Tables 2-5 summarise anomaly correlation skill scores
for linear regression hindcasts for the sets of predictors listed
in Sect. 3.1. Simulated temperature at 850 hPa and simulated
mean sea level pressure were also tested as predictors, but they
were not found to add to the skill of the monthly hindcasts
obtained with the predictors listed in Tables 2-5.
Table 2. Anomaly correlation skill scores (and 95%-confidence intervals) for linear regression hindcasts of December, January and February monthly mean temperature.

Predictor | Dec Corr. | Dec 95%-conf.int. | Jan Corr. | Jan 95%-conf.int. | Feb Corr. | Feb 95%-conf.int.
T2m, obs., Tranebjerg stn. | - | - | 0.40 | [0.37; 0.44] | 0.68 | [0.64; 0.71]
Tsfc, analysed | - | - | - | - | 0.71 | [0.63; 0.76]
T2m, monthly sim., long lead | - | - | 0.24 | [0.11; 0.34] | 0.49 | [0.37; 0.58]
Above combined | - | - | 0.14 | [0.04; 0.24] | 0.57 | [0.50; 0.63]
T2m, monthly sim., short lead | - | - | | | |
T2m, 10-day sim., short lead | - | - | | | |
Table 3. As Table 2, but for March, April and May.

Predictor | Mar Corr. | Mar 95%-conf.int. | Apr Corr. | Apr 95%-conf.int. | May Corr. | May 95%-conf.int.
T2m, obs., Tranebjerg stn. | 0.65 | [0.60; 0.68] | 0.28 | [0.18; 0.39] | - | -
Tsfc, analysed | 0.69 | [0.64; 0.74] | 0.30 | [0.18; 0.47] | 0.36 | [0.27; 0.43]
T2m, monthly sim., long lead | 0.27 | [0.11; 0.41] | - | - | 0.34 | [0.24; 0.41]
Above combined | 0.68 | [0.61; 0.73] | - | - | 0.35 | [0.26; 0.41]
T2m, monthly sim., short lead | 0.64 | [0.59; 0.69] | | | |
T2m, 10-day sim., short lead | 0.70 | [0.67; 0.74] | | | |
Table 4. As Table 2, but for June, July and August.

Predictor | Jun Corr. | Jun 95%-conf.int. | Jul Corr. | Jul 95%-conf.int. | Aug Corr. | Aug 95%-conf.int.
T2m, obs., Tranebjerg stn. | 0.59 | [0.51; 0.65] | 0.28 | [0.16; 0.37] | 0.62 | [0.57; 0.65]
Tsfc, analysed | 0.44 | [0.37; 0.52] | 0.20 | [0.06; 0.32] | 0.48 | [0.41; 0.54]
T2m, monthly sim., long lead | 0.58 | [0.49; 0.62] | 0.32 | [0.18; 0.42] | 0.42 | [0.31; 0.53]
Above combined | 0.50 | [0.42; 0.57] | 0.38 | [0.25; 0.49] | 0.49 | [0.43; 0.54]
T2m, monthly sim., short lead | 0.34 | [0.22; 0.47] | | | |
T2m, 10-day sim., short lead | - | - | | | |
Table 5. As Table 2, but for September, October and November.

Predictor | Sep Corr. | Sep 95%-conf.int. | Oct Corr. | Oct 95%-conf.int. | Nov Corr. | Nov 95%-conf.int.
T2m, obs., Tranebjerg stn. | 0.25 | [0.11; 0.33] | - | - | 0.13 | [0.03; 0.26]
Tsfc, analysed | 0.30 | [0.16; 0.37] | - | - | - | -
T2m, monthly sim., long lead | - | - | - | - | - | -
Above combined | 0.28 | [0.13; 0.39] | - | - | - | -
T2m, monthly sim., short lead | - | - | | | |
T2m, 10-day sim., short lead | - | - | | | |
Tables 2-5 show that the predictive skill varies strongly with season: useful skill is found mainly for late winter, spring and summer target months, little or no skill is found for the autumn and early winter months, and the statistical predictors based on observed temperatures are generally at least as skilful as the dynamically simulated predictors.
Several notes on the skill scores in Tables 2-5 are in order:
Although a relatively high skill score is obtained for persistence hindcasts for June, the result is not robust. For other periods than 1979-93 (not shown), the predictive skill is considerably lower. This is not the case for the other predictable months. The relatively wide 95%-confidence interval in Table 4 for the persistence hindcasts for June also gives an indication that the skill score for June is less robust than for the other predictable months.
It is puzzling that the long lead time simulated monthly 2m temperature for June is a better predictor than the short lead time monthly simulation. The difference gives an indication of the level of uncertainty in the estimation of skill scores.
Although the 95%-confidence intervals in the tables give an indication of the skill score uncertainties, they do not take into account all sources of uncertainty and, in particular, they can be poor estimates of skill scores for other periods than for the period for which they are derived.
The dynamical model simulations are carried out with prescribed, observed SST. Over sea the simulated 2m temperature closely resembles the SST, so if simulated 2m temperature at sea grid points were included in the predictor field, the predictor would essentially contain observed temperature for the prediction target month, and the predictive skill would be greatly overestimated. In order to reduce this problem, simulated 2m temperature is included from model land points only. As the simulated 2m temperature over land can still be affected by SST at adjacent grid points, the predictive skill shown in Tables 2-5 for predictors based on simulated monthly mean 2m temperature should be regarded as an upper limit.
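Restricting the simulated 2m temperature predictor to land points is a simple masking operation; a minimal sketch, assuming a boolean land-sea mask on the model grid is available:

```python
import numpy as np

def land_only_predictor(t2m_field, land_mask):
    """Keep only land grid points of a simulated 2m temperature field.

    t2m_field : (n_years, n_lat, n_lon) simulated monthly mean 2m temperature
    land_mask : (n_lat, n_lon) boolean array, True over land
    Returns an (n_years, n_land_points) array suitable as input to the
    principal-component regression.
    """
    n_years = t2m_field.shape[0]
    return t2m_field.reshape(n_years, -1)[:, land_mask.ravel()]
```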
The short lead time dynamical simulations have not proven more skilful than the statistical predictions, but for the skilful months they provide valuable information about the reliability of the statistical predictions. In order to demonstrate this the predictive skill is re-calculated and shown in Tables 6-7 for each statistical hindcast, but including only those years for which the sign of the hindcast anomaly based on the 10-day simulations agrees (disagrees) with that of the statistical hindcast.
For the March and June hindcasts the skill scores are
substantially increased when the 10-day simulations agree with
the statistical hindcasts, i.e. for these months more reliable
forecasts can be expected when a medium-range forecast agrees
with the statistical monthly outlook. Conversely, the skill
scores are decreased when the 10-day simulations disagree with
the statistical hindcasts. In fact, for March the statistical
hindcasts show no predictive skill at all. Note that the skill
scores in Tables 6-7 should be treated with some caution as some
of them are based on as little as 7 years.
Table 6. Anomaly correlation skill scores for the years in which the sign of the anomaly based on the 10-day simulation agrees with that of the statistical hindcast.

Predictor | Dec Corr. | Dec 95%-conf.int. | Mar Corr. | Mar 95%-conf.int. | Jun Corr. | Jun 95%-conf.int. | Sep Corr. | Sep 95%-conf.int.
T2m, obs., Tranebjerg stn. | - | - | 0.79 | [0.75; 0.82] | 0.79 | [0.70; 0.85] | - | -
Tsfc, analysed | - | - | 0.84 | [0.79; 0.88] | 0.19 | [0.01; 0.29] | |
Table 7. As Table 6, but for the years in which the sign of the anomaly based on the 10-day simulation disagrees with that of the statistical hindcast.

Predictor | Dec Corr. | Dec 95%-conf.int. | Mar Corr. | Mar 95%-conf.int. | Jun Corr. | Jun 95%-conf.int. | Sep Corr. | Sep 95%-conf.int.
T2m, obs., Tranebjerg stn. | - | - | 0.34 | [0.24; 0.42] | | | |
Tsfc, analysed | - | - | 0.31 | [0.24; 0.40] | | | |
Table 8. Anomaly correlation skill scores for persistence hindcasts as a function of lead time and predictor month.

Lead time [months] | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec
0 | 0.62 | 0.60 | 0.32 | - | 0.34 | - | 0.60 | 0.28 | 0.19 | 0.16 | 0.13 | 0.45
1 | 0.49 | 0.56 | 0.37 | - | - | - | 0.26 | - | - | - | - | 0.34
2 | 0.33 | 0.20 | 0.26 | - | - | - | - | - | - | 0.19 | - | 0.21
3 | 0.23 | - | 0.21 | - | - | - | 0.15 | - | - | 0.22 | - | -
Relatively skilful monthly persistence hindcasts with a lead time of one month can be made in winter and spring; for the rest of the year the persistence of monthly temperature anomalies does not extend beyond the month following the predictor month. Due to the longer training period the skill scores in Table 8 are more robust than those in Tables 2-5.
Note that the months indicated in Table 8 are the predictor months so that, e.g., the predictive skill of February hindcasts with zero lead is found for lead time = 0 and predictor month = January. It would perhaps be more natural to show predictive skill for a combination of lead time and predictand month, but by showing the predictive skill as a function of predictor month, as in Table 8, it becomes clear that the predictive skill seems more closely linked to the predictor than to the predictand, so that temperature for certain months is a skilful predictor rather than temperature for certain months being more predictable. For example, for lead times of zero and one month, February is a better predictor month than March, so predictions of April mean temperatures should be based on February mean temperatures (correlation skill 0.56) rather than on March mean temperatures (correlation skill only 0.32).
Monthly predictability in north-western Europe is low compared to many other regions of the globe. The associated forecast uncertainty is most conveniently expressed in terms of forecast probabilities. For dynamical forecast models the probabilities can be derived from an ensemble of forecasts which can be produced by computing forecasts from a set of different initial conditions. For statistical models it is not so clear how an ensemble distribution translates to forecast probabilities when the ensemble is generated by varying the training period as described in Sect. 3.2. For both model types forecast reliability becomes a problem if the forecasts contain systematic biases. The hindcasts in Fig. 4 provide an example where the deviation between ensemble hindcasts and observations cannot simply be explained by sampling problems; there are years in which the ensemble forecast is nowhere near the observed temperature, e.g. in 1983 and 1986.
It is crucial that probability forecasts are reliable, i.e. over time the forecast probabilities should agree with the corresponding observed frequencies (Wilks, 1995). This can be ensured by estimating the joint probability density function for a set of hindcasts and observations and subsequently deriving the conditional probability density of observed temperature given a specific model forecast as illustrated by the following example.
Consider the observations and the ensemble means of the hindcasts shown in Fig. 4. A scatterplot is shown in Fig. 5. Hindcasts and observations are then both divided into categories such that the coldest 25% define the ``cold'' category, the middle 50% define the ``normal'' category, and the warmest 25% define the ``warm'' category. The categories are indicated in Fig. 5. The 25-50-25 division is an arbitrary choice; terciles are probably more widely used.
Note that the thresholds between categories are not the same for
hindcasts and observations. The use of different thresholds is a
way to correct for model bias when the predictions are expressed
in categorical terms. However, in the present example much of
the threshold differences are probably due to the uncertainty
associated with a small sample size (only 14 observations).
The joint distribution of hindcast and observation categories can
be described by a contingency table such as the one shown in
Table 9. Forecast probabilities can easily be derived from the
contingency table. If a statistical forecast for February 1994 was, say, -2°C, i.e. ``cold'', then the probabilities according to Table 9 would be 26/45 = 58% for ``cold'', 19/45 = 42% for ``normal'' and 0% for ``warm''.
Table 9. Contingency table of hindcast and observation categories for the February hindcasts (1980-93).

Hindcast \ Observation | Cold | Normal | Warm
Warm | 0 | 23 | 22
Normal | 13 | 62 | 17
Cold | 26 | 19 | 0
A 0% probability that February will be warm is unrealistic. For a larger sample of observations there would almost certainly be no zero entries in the contingency table. The example illustrates that it can be problematic to estimate probabilities from contingency tables that are based on small samples.
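For completeness, the category thresholds, the contingency table and the resulting probability forecast can be computed along the following lines. The sketch is illustrative only: it uses separate quartile thresholds for hindcasts and observations (the simple bias correction discussed above), counts category pairs, and returns the normalised row of the table corresponding to the category of a new deterministic forecast, with synthetic data standing in for the actual hindcast/observation pairs.

```python
import numpy as np

CATEGORIES = ("cold", "normal", "warm")

def categorise(values, lo, hi):
    """Cold/normal/warm labels from given lower and upper thresholds."""
    return np.where(values <= lo, "cold", np.where(values >= hi, "warm", "normal"))

def contingency_probabilities(hindcasts, observations, forecast):
    """Probability forecast for 'forecast' from a hindcast/observation contingency table."""
    h_lo, h_hi = np.percentile(hindcasts, [25, 75])     # hindcast thresholds
    o_lo, o_hi = np.percentile(observations, [25, 75])  # observation thresholds (may differ)
    h_cat = categorise(np.asarray(hindcasts), h_lo, h_hi)
    o_cat = categorise(np.asarray(observations), o_lo, o_hi)
    f_cat = categorise(np.asarray([forecast]), h_lo, h_hi)[0]
    row = np.array([np.sum((h_cat == f_cat) & (o_cat == oc)) for oc in CATEGORIES], float)
    return dict(zip(CATEGORIES, row / row.sum()))

# Example with synthetic hindcast/observation pairs (not the Feb. 1980-93 data):
rng = np.random.default_rng(3)
obs = rng.normal(0.0, 3.0, 182)
hc = 0.7 * obs + rng.normal(0.0, 2.0, 182)
print(contingency_probabilities(hc, obs, forecast=-2.0))
```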
If a larger hindcast sample is not available, an alternative approach is to fit the parameters of a two-dimensional parametric probability distribution to the pairs of hindcasts and observations. The natural starting point is to assume a bivariate normal (Gaussian) distribution. Five parameters need to be estimated: the mean and variance of the hindcasts, the mean and variance of the observations, and the linear correlation between hindcasts and observations (see, e.g., Wilks, 1995).
Let Tmodel = T0 be a deterministic model forecast. Then a probability forecast is given by the conditional probability density function f(T | Tmodel = T0), which is itself Gaussian with mean

μ = μobs + ρ (σobs/σmod) (T0 − μmod)

and standard deviation

σ = σobs √(1 − ρ²),

where μmod and σmod denote the mean and standard deviation of the model hindcasts, μobs and σobs those of the observations, and ρ the correlation between hindcasts and observations.
For the example shown in Fig. 5 the five parameters are estimated from the 14 pairs of values: the mean and standard deviation of the hindcast ensemble means, the mean and standard deviation of the observations, and the correlation between hindcast ensemble means and observations.
For a forecast, T0=-2°C, we find the probability density
which is shown in Fig. 6 together with the climatological
probability density, i.e. the probability density for all
observations. ``Cold'', ``normal'' and ``warm'' categories can
be defined by finding the 25% and 75% percentiles in the
climatological distribution. The thresholds are indicated in
Fig. 6. The forecast probabilities are found as the areas under
the forecast probability density to be 54.7% for ``cold'',
42.1% for ``normal'' and 3.2% for ``warm'', which happen to be
very close to the probabilities that resulted from use of Table
9.
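The calculation behind such a forecast can be sketched as follows. Since the parameter estimates behind Figs. 5 and 6 are not reproduced here, the numbers in the example call are purely illustrative; the conditional mean and standard deviation follow the expressions given above, and the category probabilities are the areas of the conditional density below the 25% and above the 75% climatological quantiles.

```python
import numpy as np
from scipy.stats import norm

def conditional_probabilities(t0, mu_mod, sigma_mod, mu_obs, sigma_obs, rho):
    """Cold/normal/warm probabilities from the conditional Gaussian f(T | Tmodel = t0)."""
    mu_cond = mu_obs + rho * (sigma_obs / sigma_mod) * (t0 - mu_mod)
    sigma_cond = sigma_obs * np.sqrt(1.0 - rho ** 2)
    # climatological 25% and 75% quantiles define the cold/normal/warm thresholds
    t_cold, t_warm = norm.ppf([0.25, 0.75], loc=mu_obs, scale=sigma_obs)
    p_cold = norm.cdf(t_cold, loc=mu_cond, scale=sigma_cond)
    p_warm = 1.0 - norm.cdf(t_warm, loc=mu_cond, scale=sigma_cond)
    return p_cold, 1.0 - p_cold - p_warm, p_warm

# Illustrative parameter values only (not the estimates behind Fig. 5/6):
print(conditional_probabilities(t0=-2.0, mu_mod=0.1, sigma_mod=1.5,
                                mu_obs=0.0, sigma_obs=2.0, rho=0.68))
```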
The parametric approach is appealing, particularly when the joint probability density function is based on a short verification period. The parametric approach is more robust than the approach that is based on a contingency table, but the underlying assumption about normality has not been justified in the above example, and it may indeed prove difficult to establish which parametric distribution to use, particularly when only a short verification period is available.
Figure 7 shows an example of cross-validated persistence
probability hindcasts assuming a bivariate normal distribution.
In each step of the cross-validation the parameters describing
the joint distribution of predictions and observations have been
computed from data in the training period. The probabilistic
hindcasts are shown as the estimated 80% confidence intervals.
The probability hindcasts appear to be well calibrated, as the
hindcast confidence intervals encompass the observations in 11 of
the 14 cases, which is very close to the expected 80%.
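Checking such calibration amounts to counting how often the stated intervals capture the verifying observations; a minimal sketch, assuming arrays with the lower and upper bounds of the cross-validated 80% intervals and the corresponding observations:

```python
import numpy as np

def interval_coverage(lower, upper, observations):
    """Fraction of cases in which the hindcast interval contains the observation."""
    inside = (observations >= lower) & (observations <= upper)
    return inside.mean()

# A coverage of 11 out of 14 cases corresponds to roughly 0.79, close to the nominal 80%.
```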
Hindcasts of monthly mean temperature in Denmark show large seasonal variation in predictive skill. Highest predictability is found in February, March, June and August; no predictability is found in October, November and December. High predictability is closely linked to high persistence which, in turn, is probably linked to the influence of local sea surface temperature.
The results demonstrate that the statistical predictions are at least as skilful as the dynamical simulations; in most months they are more skilful.
One of the main motivations for the present study has been to improve the monthly outlooks that DMI issues at the beginning of every month.
While the simple outlooks based only on the temperature in Denmark in the previous month appear to give the highest skill scores in several months of the year, larger predictor data sets can improve the outlooks for some months, particularly in the transition seasons. The relatively poor skill scores for the dynamical simulations (which can reasonably be taken as an upper estimate of the skill of dynamical predictions) suggest that, given the state of the present generation of GCMs, monthly outlooks should be based primarily on statistical predictions.
For the highly predictable months an operational medium-range forecast might be consulted as an indicator for the reliability of the actual statistical monthly prediction. If the medium-range forecast points in the same direction as the statistical forecast the latter can be expected to be much more reliable than otherwise.
For longer lead times the predictability drops. For a lead time of one month relatively skilful statistical forecasts can still be made for the winter and spring months, and dynamical model simulations appear to show significant skill for the summer months. For lead times beyond one month the skill is generally very low for both statistical and dynamical predictions.
Seasonal climate prediction is currently an area of intense research, and the quality of coupled ocean-atmosphere GCMs is likely to improve substantially over the next decade or so. Thus, although statistical methods presently perform better than dynamical models, dynamical monthly and seasonal predictions may be expected eventually to outperform the statistical methods.
It is crucial that probability forecasts are reliable, i.e. over time the forecast probabilities should agree with the corresponding observed frequencies. If this is not the case for an ensemble forecast then the forecast should be recalibrated, or if the raw model output cannot be described in probabilistic terms then probabilities should be derived in a way that ensures reliable probability forecasts. For the latter case, if the joint probability distribution of predictions and observations can be confidently described in terms of a parametric distribution (Gaussian or some other distribution), then a probability forecast can be based on this parametric distribution by deriving the conditional probability density of observed temperature given the model prediction.
Recalibration of an ensemble forecast is not so straightforward and has not been dealt with in the present study. The simple approach is to apply the above-mentioned conditional probability methodology to the ensemble mean, but as the spread of dynamical model ensembles is likely to give an indication of the reliability of the model (ensemble) forecast, there is a need to take into account more information about the ensemble when the ensemble forecasts are recalibrated. This problem will be pursued in future work.
Based on the findings of this report the following forecast strategy is recommended:
1. Issue monthly temperature forecasts only for those months and lead times for which predictive skill has been demonstrated, or let the lack of skill be reflected in the forecast probabilities.
2. Base the forecasts primarily on statistical (persistence-type) predictions, i.e. on regression on the observed temperature for the preceding month.
3. Express the forecasts as probabilities derived from the joint distribution of predictions and observations.
4. For the highly predictable months, use the operational medium-range forecast as an indicator of the reliability of the statistical forecast.
Re 1. For some months, particularly for one month lead time, the forecasts are expected to have no skill. In these cases one can choose to either not issue a forecast or to let the missing skill be reflected in the forecast probabilities.
Re 2. Zero-lead persistence forecasts for May, and possibly for July, are the only persistence forecasts that do not match the predictive skill of more advanced statistical or dynamical forecasts. The performance of dynamical monthly predictions should, however, be monitored continuously as dynamical models are likely to improve over the next few years.
Re 3. The parametric recalibration described in Sect. 4 is recommended provided that a parametric form of the joint probability distribution can be found (e.g. by transforming the temperature distribution so that the transformed variable follows a Gaussian distribution). Otherwise, contingency tables can be used to derive probabilities.
Re 4. For March and June it has been demonstrated (Table 6) that persistence forecasts are more accurate if they agree with the medium-range forecast for the first third of the target month. Consequently, the forecast probability density function should be sharpened if the medium-range forecast agrees with the persistence forecast, and broadened if it disagrees.