*by Aaron, 30th October 2017*

This is the first in a series of posts I hope to wrote on model selection and the use of *unit information priors*. I don’t claim any great novelty in this post, I’m just writing up some simple experiments I’m doing to help me develop more intuition. Hope you find it interesting, and don’t hesitate to contact if you have any comments! I want feedback, please :-)

The basic issue that I’m going to tackle is that you have two models and an observed dataset which you know was generated from one of those two models, and you want to estimate (“select”) which of the models was used to generate the dataset.

This post will use some very simple examples, for which there are often other simple methods to estimate (“select”) which model was used, but I focus on using unit information priors as they can often be easily extended to more complex models. I’m currently using these priors in my MCMC algorithm which does clustering in multi-dimensional data; I hope someday to discuss that algorithm on this blog.

We have a sample of \(n\) numbers drawn from distribution \(X\), where \(X\) is either \(H_1\) or \(H_2\). I.e. Normally distributed with mean zero and variance equal to \(1\) or \(2\):

\[ H_1 : x_i \sim \mathcal{N}(0,1) \] \[ H_2 : x_i \sim \mathcal{N}(0,2) \]

Given a vector of observations, \(\mathbf{x}\), we can compute its probability density (*likelihood*) under the two hypotheses in a straightforward manner. \[
\mathrm{P}_1 =
\mathrm{P}(\mathbf{x} | H_1) = \prod_{i=1}^n \mathcal{N}( x_i ; 0,1 )
\] \[
\mathrm{P}_2 =
\mathrm{P}(\mathbf{x} | H_2) = \prod_{i=1}^n \mathcal{N}( x_i ; 0,2 )
\]

where we define the following to compute the density of one observation, \(x\), under a given mean and variance: \[ \mathcal{N}( x ; \mu,\sigma^2 ) = \frac1{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

A very simple approach is to simply say that if \(\mathrm{P}_1{>}\mathrm{P}_2\) then we select model \(1\) as the true model, i.e. \(\sigma^2 = 1\). Otherwise, if \(\mathrm{P}_1{<}\mathrm{P}_2\) then we select model \(2\) as the true model, i.e. \(\sigma^2 = 2\).

More generally, we can compute the ratio of these two probability densities and also define a threshold. When the ratio is greater than the threshold, we select \(H_1\), otherwise we select \(H_2\). \[ \frac{\mathrm{P}_1}{\mathrm{P}_2} \] A higher or lower threshold can be set if you wish to control the error rates in a frequentist manner. Perhaps you want to assume \(H_1\) in general and only select \(H_2\) (reject \(H_1\)) when the evidence is very strong. This would require a very low threshold, much less than \(1.0\).

This is essentially the Neyman-Pearson lemma; when given two *point* hypotheses, you simply compute the probability of the data under both models and compare the two probabilities. The likelihoods are *sufficient statistics* and you can (and should) throw away the data and use only the two *likelihoods* to select a model, along with a threshold that you must select.

For simplicity, I will use a threshold of \(1.0\) in the remainder of this document. In other words, we simply select the model which has the larger likelihood.

\(H_1\) and \(H_2\) are called *point* hypotheses because the all the model parameters (mean and variance) are known and therefore the density can be calculated exactly from the observed data \(\mathbf{x}\)

Before proceeding, Let’s define \(\tau{=}\frac1{\sigma^2}\), the inverse of the *variance*. This is known as the *precision* and is often easier to work with mathematically. We can now refer to a Normal distribution as \(\mathcal{N}(\mu,\tau^{-1})\), which is equivalent to \(\mathcal{N}(\mu,\sigma^2)\).

So - returning to our models - what do we do if \(H_2\) is replaced with \(H_{\tau^{-1}}\), with unknown variance (unknown \(\sigma^2\))?

\[ H_{\tau^{-1}} : x_i \sim \mathcal{N}(0,\tau^{-1}) \]

How do we now select between \(H_1\) and \(H_{\tau^{-1}}\)? In this case, if we don’t know exactly what \(\tau\) is, we cannot exactly compute \(\mathrm{P}_{\tau^{-1}} = \mathrm{P}(\mathbf{x} | H_{\tau^{-1}})\). Given a dataset which we know was drawn from \(H_1\) or \(H_{\tau^{-1}}\), how do we select which of those was the true model?

Given an observed vector \(\mathrm{x}\), we can estimate its precision as the precision, \(\hat\tau\), which maximizes \(\mathrm{P}(\mathbf{x} | H_{\hat\tau^{-1}})\). This is the Maximum Likelihood Estimate (MLE):

\[ \hat\tau = \underset{\tau}{\operatorname{argmax}}{~} \mathrm{P}(\mathbf{x} | H_{\tau^{-1}}) ~~~~~~~~~~~ \] \[ = \underset{\tau}{\operatorname{argmax}}{~} \prod_{i=1}^n \mathcal{N}( x_i ; 0,{\tau^{-1}} ) \] \[ = \frac{n}{ \sum_{i=1}^n x_i^2 } ~~~~~~~~~~~~~~~~~~~~~~~~~ \] (The last expression above is simply the inverse of the variance estimate \(\hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n x_i^2\)

As we can’t compute the exact \(\mathrm{P}_{\tau^{-1}}\), we instead estimate it by using \(\hat\tau^{-1}\).

\[ \mathrm{\hat{P}}_{mle} = \mathrm{\hat{P}}(\mathbf{x} | H_{\hat\tau^{-1}}) = \prod_{i=1}^n \mathcal{N}( x_i ; 0,\hat\tau^{-1} ) \]

*Note that I use \(\mathrm{\hat{P}}\) to emphasize that this is merely an estimate of the probability density.*

It would be tempting now to compare \(\mathrm{\hat{P}}_{mle}\) to \(\mathrm{ P }_1\) and use that to decide whether to select \(H_{\hat\tau^{-1}}\) or \(H_1\). However, that technique is guaranteed to always select \(H_{\hat\tau^{-1}}\).

By construction, the \(\hat\tau^{-1}\) is the value which maximizes the likelihood and therefore the likelihood under \(\sigma^2 = \hat\tau^{-1}\) is guaranteed to be at least as large as that under \(H_1\):

\[ \mathrm{\hat{P}}_{mle} > \mathrm{ P }_1 \] (Equality is not possible in that, as \(\hat\tau^{-1}\) will never be exactly equal to \(1\))

So yes, we do want to estimate \(\mathrm{P}(\mathbf{x} | H_{\tau^{-1}})\), but \(\mathrm{\hat{P}}(\mathbf{x} | H_{\hat\tau^{-1}})\) is a very biased estimate of that and therefore can’t be used as is.

Before proceeding, I should note that the BIC or AIC can be used to adjust these estimates and reduce the bias. Go ahead and read up on those, they are interesting! The BIC is particularly relevant, as it also uses a unit information prior. But this post is about another approach.

How do we resolve this, and use an estimate \(\mathrm{P}(\mathbf{x} | H_{\hat\tau^{-1}})\) that is less biased?

We don’t know the true \(\tau\), but let’s assume we know it was drawn from a Gamma distribution: \[ \tau \sim \mathrm{Gamma}(\alpha,\beta) \]

If \(\alpha\) and \(\beta\) are known, we can compute the integral over \(\tau\): \[ \mathrm{P}_{exact} = \mathrm{P}_\tau(\mathbf{x} ) = \int \mathrm{P}(\mathbf{x} | H_{\tau^{-1}}) \mathrm{P}(\tau) ~\mathrm{d}\tau \]

We can then compare \(\mathrm{P}_{exact}\) to \(\mathrm{P}_1\) as usual. I call it “*exact*” because, if we do know the true distribution from which \(\tau\) was drawn, then we essentially have a point hypothesis again.

A brief note on computation, relevant to understanding the code below: In practice, we compute that integral in the code below via the following equation, with \(\tau\) on the right hand side replaced by an arbitrary value: The denominator here, and the right hand factor in the numerator, are the density of the arbitrary \(\tau\) under, respectively, the Gamma prior and the Gamma posterior. \[ \mathrm{P}(\mathbf{x}) = \frac{ \mathrm{P}(\mathbf{x} | H_{\tau^{-1}}) \mathrm{P}(\tau) }{ \mathrm{P}(\tau|\mathbf{x}) } \] where \(\mathrm{P}(\tau|\mathbf{x})\) is the posterior distribution, i.e. the conditional distribution of \(\tau\) given \(\mathrm{x}\). This distribution is \[ \tau|\mathbf{x} ~~ \sim ~~ \mathrm{Gamma}\left( \alpha + \frac{n}2, \beta + \frac12 \sum_{i=1}^n x_i^2 \right) \]

But we generally don’t know \(\alpha\) or \(\beta\), as we have no model of where \(\tau\) (assuming \(H_{\tau^{-1}}\) is the ‘true’ model) comes from. To decide which values to use in place of these, we first compare the prior and posterior directly: \[ \tau ~~ \sim ~~ \mathrm{Gamma}\left( \alpha, \beta \right) ~~~~~~~~~~~~~~~~~~~~ \] \[ \tau|\mathbf{x} ~~ \sim ~~ \mathrm{Gamma}\left( \alpha + \frac{n}2, \beta + \frac12 \sum_{i=1}^n x_i^2 \right) \]

The effect of the \(n\) observations on \(\alpha\) is to add \(\frac{n}2\) to it. Therefore, the effect of a single observation (*unit information*) is to add \(\frac12\) to \(\alpha\). Similarly, \(\frac1{2n} \sum_{i=1}^n x_i^2\) is added to \(\beta\) by one observation. We define the *zero information prior* as \(\mathrm{Gamma}(0,0)\) (useless as a prior, as it is *improper*). The *unit information prior (UIP)* is then defined as adding the effect of one observation to the zero information prior: \[\tau ~~ \sim ~~ \mathrm{Gamma}\left( \frac12, \frac1{2n} \sum_{i=1}^n x_i^2 \right)\] and therefore we use \(\alpha=\frac12\) and \(\beta = \frac1{2n} \sum_{i=1}^n x_i^2\) as our prior. Note that this prior does indeed depend on the observed data. However, it is a relatively diffuse prior as it’s based on the average effect of one observation instead of the combined effect of \(n\) observations.

To evaluate this method on these two simple models, we will do some simulations. I draw 1000 different \(\tau\) from a \(\mathrm{Gamma}(1,1)\), and then draw a sample (\(n=200\)) for each \(\tau\). I compute four log-likelihoods for each:

- \(\mathrm{P}_1\) - assuming Hypothesis 1 is true, i.e. \(\tau = 1 = \sigma^2\)
- \(\mathrm{P}_2\) - assuming Hypothesis 2 is true, i.e. \(\tau = 0.5; \sigma^2 = 2\)
- \(\mathrm{P}_{mle}\) - assuming the true variance equals the MLE estimate in the sample.
- \(\mathrm{P}_{uip}\) - log-likelihood, after integrating over our unit information prior.
- \(\mathrm{P}_{exact}\) - log-likelihood, after integrating over the “true” prior, \(\mathrm{Gamma}(1,1)\)

In all the plots below, the true \(\tau\) is on the x-axis. The interesting thing in each plot is to see when the does cross the horizontal line at \(y=0\), as that is where the ratio of the (integrated) likelihoods is \(1.0\). This line decides which model has been selected. Below that line means \(H_1\) has been selected. Above means \(H_1\) has been rejected.

On to the code: We define the function `a.few.loglikelihoods`

which takes a sample as input and computes those five log-likelihoods and returns them. In this code, I make heavy use of the `data.table`

package in `R`

. That’s an excellent package, with an excellent concise video tutorial.

```
set.seed(1337)
library(data.table)
library(magrittr) # for %>%
```

```
n=200
a.few.loglikelihoods <- function(x) {
stopifnot(n==length(x))
mle_var = mean(x*x)
P_x_H1 = sum(dnorm(x, mean=0, sd=1, log=T)) # H_1
P_x_H2 = sum(dnorm(x, mean=0, sd=sqrt(2), log=T)) # H_2
P_x_Hmle = sum(dnorm(x, mean=0, sd=sqrt(mle_var), log=T)) # H_mle
# compute the UIP parameters for the Gamma prior
alpha_prior = 1/2 # alpha is 'shape' in {r,g,p}gamma
beta_prior = mle_var / 2 # 1/(2*n) * sum(x^2) # beta is 'rate' in {r,g,p}gamma
alpha_posterior = alpha_prior + n/2
beta_posterior = beta_prior + 1/2 * sum(x^2)
# to compute the integrated likelihood, integrate over all tau,
# we can just take an arbitrary one, as discussed above
tau_arbitrary = 1
P_x_Huip = ( sum(dnorm(x, mean=0, sd=sqrt(1/tau_arbitrary), log=T))
+ dgamma(tau_arbitrary, shape=alpha_prior , rate=beta_prior , log=T)
- dgamma(tau_arbitrary, shape=alpha_posterior, rate=beta_posterior, log=T)
)
# repeat the above with the 'true' prior parameters
alpha_prior = 1
beta_prior = 1
alpha_posterior = alpha_prior + n/2
beta_posterior = beta_prior + 1/2 * sum(x^2)
# to compute the integrated likelihood, integrate over all tau,
# we can just take an arbitrary one, as discussed above
tau_arbitrary = 1
P_x_Hexact = ( sum(dnorm(x, mean=0, sd=sqrt(1/tau_arbitrary), log=T))
+ dgamma(tau_arbitrary, shape=alpha_prior , rate=beta_prior , log=T)
- dgamma(tau_arbitrary, shape=alpha_posterior, rate=beta_posterior, log=T)
)
data.table ( P_x_H1
, P_x_H2
, P_x_Hmle
, P_x_Huip
, P_x_Hexact
)
}
# Generate 1000 taus, from the true prior, Gamma(1,1)
many.taus = rgamma(1000, 1,1)
# For each tau, generate a sample with that precision (1/variance) equal to tau.
# Also, compute all the likelihoods and store it all in a table
lapply(many.taus, function(tau) {
x = rnorm(n, mean=0, sd=sqrt(1/tau))
data.table(tau=tau, a.few.loglikelihoods(x))
}) %>%rbindlist -> d
# order the results by tau
d = d[order(tau)]
# print a few lines from 'd'
print(d)
```

```
## tau P_x_H1 P_x_H2 P_x_Hmle P_x_Huip P_x_Hexact
## 1: 0.001217589 -67856.7186 -34089.5679 -935.51484 -938.31909 -943.41642
## 2: 0.001927188 -45575.8489 -22949.1330 -895.57994 -898.38419 -903.08290
## 3: 0.002053637 -44694.0337 -22508.2254 -893.61816 -896.42241 -901.10154
## 4: 0.002146544 -49571.1073 -24946.7622 -904.01559 -906.81984 -911.60272
## 5: 0.002420676 -42697.4766 -21509.9469 -889.02883 -891.83308 -896.46643
## ---
## 996: 5.948476623 -200.3327 -261.3749 -103.87934 -106.68359 -109.39025
## 997: 6.021871882 -199.0073 -260.7122 -95.52902 -98.33327 -101.45652
## 998: 6.266078584 -198.5073 -260.4622 -92.18880 -94.99305 -98.29419
## 999: 7.201577677 -199.8498 -261.1335 -100.91712 -103.72137 -106.57133
## 1000: 7.251246896 -197.2877 -259.8524 -83.53991 -86.34416 -90.13759
```

Now that the data has been prepared above, we close with a few plots to show which model is selected. Each plot compares \(H_1\) with one of the other four. The final figure confirms that the integrated log-loglikelihood based on the unit information prior, \(\mathrm{P}_{uip}\), doesn’t have the extreme bias of using the MLE, \(\mathrm{P}_{mle}\).

We know that \(\mathrm{P}_{exact}\) is the best possible in this case, as it uses the true distribution of \(\tau\) and therefore the Neyman-Pearson Lemma applies directly. In the last two plots, we can see that \(\mathrm{P}_{exact}\) and \(\mathrm{P}_{uip}\) get very similar results, confirming that \(\mathrm{P}_{uip}\) maintains good accuracy despite not knowing the true prior for \(\tau\), which was \(\mathrm{Gamma}(1,1)\).

In Figure 1, we see that the log-ratio is always positive comparing \(\mathrm{P}_{mle}\) to \(\mathrm{P}_1\), which confirms that \(\mathrm{P}_{mle}\) is not useful; it always rejects \(H_1\) even when \(\tau\) is very close to 1.

```
d[tau >= 0.2 & tau <= 5.0,{
plot( tau , P_x_Hmle - P_x_H1
, xlab = expression(tau) , ylab = expression('log-ratio of ' * P[mle] * ' and ' * P[1])
, log='x', main = 'Figure 1')
abline(h=0, col='grey')
}] %>% invisible
```

The two vertical grey lines at 0.5 and 1.0 represent when \(H_2\) and \(H_1\), respectively, are true. No surprise the log-ratio crosses 0 in between those points.

```
d[tau >= 0.25 & tau < 2.0,{
plot( tau, P_x_H2-P_x_H1
, xlab = expression(tau) , ylab = expression('log-ratio of ' * P[2] * ' and ' * P[1])
, log='x', main='Figure 2'
)
abline(h=0, col='grey')
abline(v=1, col='grey')
abline(v=0.5, col='grey')
}] %>% invisible
```

Using the “exact” prior, which is cheating. As expected, \(H_1\) is selected (y-axis < 0) when \(\tau\) is close to 1.0 and the rejected (y-axis > 0) when \(\tau\) is far from 1.0.

```
d[
tau >= 0.5 & tau < 2.0
,{
y = P_x_Hexact - P_x_H1
plot( tau, y
, xlab = expression(tau)
, ylab = expression('log-ratio of ' * P[exact] * ' and ' * P[1])
, log='x', main='Figure 3'
)
lines(tau, predict( loess( y ~ tau ))
, col='#0000d080'
, lwd=4
)
legend(x='topright',legend='fitted curve from loess()', lwd=4, col='#0000d080')
abline(h=0, col='grey')
abline(v=1, col='grey')
}] %>% invisible
```