# [Aaron McDaid](https://aaronmcdaid.github.io/)
Statistics and C++, and other fun projects
`aaron.mcdaid@gmail.com`
[`@aaronmcdaid`](https://twitter.com/aaronmcdaid)

#
***by [Aaron](https://aaronmcdaid.github.io/), 30th October 2017***
[(code for this document)](index.Rmd)

#
This is the first in a series of posts I hope to wrote on model selection and the use of
[*unit information priors*](https://stats.stackexchange.com/questions/26329/what-is-a-unit-information-prior).
I don't claim any great novelty in this post, I'm just writing up some simple experiments I'm doing to help me develop more intuition.
Hope you find it interesting, and don't hesitate to contact if you have any comments! I want feedback, please :-)
The basic issue that I'm going to tackle is that you have two models and an observed dataset which you know was generated from one of those two models,
and you want to estimate ("select") which of the models was used to generate the dataset.
This post will use some very simple examples, for which there are often other simple methods
to estimate ("select") which model was used, but I focus on using unit information priors as
they can often be easily extended to more complex models.
I'm currently using these priors in [my MCMC algorithm](https://aaronmcdaid.github.io/demos/mvnMCMC/mvnMCMC.html)
which does clustering in multi-dimensional data; I hope someday to discuss that algorithm on this blog.
## The Neyman-Pearson lemma
We have a sample of $n$ numbers drawn from distribution $X$, where $X$ is either
$H_1$ or $H_2$. I.e. Normally distributed with mean zero and variance equal to $1$ or $2$:
$$ H_1 : x_i \sim \mathcal{N}(0,1) $$
$$ H_2 : x_i \sim \mathcal{N}(0,2) $$
Given a vector of observations, $\mathbf{x}$, we can compute its probability density (*likelihood*) under the two hypotheses
in a straightforward manner.
$$
\mathrm{P}_1 =
\mathrm{P}(\mathbf{x} | H_1) = \prod_{i=1}^n \mathcal{N}( x_i ; 0,1 )
$$
$$
\mathrm{P}_2 =
\mathrm{P}(\mathbf{x} | H_2) = \prod_{i=1}^n \mathcal{N}( x_i ; 0,2 )
$$
where we define the following to compute the density of one observation, $x$, under a given mean and variance:
$$
\mathcal{N}( x ; \mu,\sigma^2 ) = \frac1{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
$$
A very simple approach is to simply say that if $\mathrm{P}_1{>}\mathrm{P}_2$
then we select model $1$ as the true model, i.e. $\sigma^2 = 1$.
Otherwise, if $\mathrm{P}_1{<}\mathrm{P}_2$
then we select model $2$ as the true model, i.e. $\sigma^2 = 2$.
More generally, we can compute the ratio of these two probability densities and also define a threshold.
When the ratio is greater than the threshold, we select $H_1$, otherwise we select $H_2$.
$$
\frac{\mathrm{P}_1}{\mathrm{P}_2}
$$
A higher or lower threshold can be set if you wish to control the error rates in a frequentist manner.
Perhaps you want to assume $H_1$ in general and only select $H_2$ (reject $H_1$) when the evidence
is very strong. This would require a very low threshold, much less than $1.0$.
This is essentially the [Neyman-Pearson lemma](https://en.wikipedia.org/wiki/Neyman%E2%80%93Pearson_lemma);
when given two *point* hypotheses, you simply compute the probability of the data under both models
and compare the two probabilities.
The likelihoods are *sufficient statistics* and you can (and should) throw away the data
and use only the two *likelihoods* to select a model,
along with a threshold that you must select.
For simplicity, I will use a threshold of $1.0$ in the remainder of this document.
In other words, we simply select the model which has the larger likelihood.
$H_1$ and $H_2$ are called *point* hypotheses because the all the model parameters (mean and variance) are known
and therefore the density can be calculated exactly from the observed data $\mathbf{x}$
## (non-)Point hypotheses
Before proceeding,
Let's define $\tau{=}\frac1{\sigma^2}$, the inverse of the *variance*.
This is known as the *precision* and is often easier to work with mathematically.
We can now refer to a Normal distribution as $\mathcal{N}(\mu,\tau^{-1})$,
which is equivalent to $\mathcal{N}(\mu,\sigma^2)$.
So - returning to our models - what do we do if $H_2$ is replaced with
$H_{\tau^{-1}}$, with unknown variance (unknown $\sigma^2$)?
$$ H_{\tau^{-1}} : x_i \sim \mathcal{N}(0,\tau^{-1}) $$
How do we now select between $H_1$ and $H_{\tau^{-1}}$?
In this case, if we don't know exactly what $\tau$ is, we cannot exactly compute
$\mathrm{P}_{\tau^{-1}} = \mathrm{P}(\mathbf{x} | H_{\tau^{-1}})$.
Given a dataset which we know was drawn from $H_1$ or $H_{\tau^{-1}}$, how do we select which of those was the true model?
## Naive approach, via an estimate $\hat\tau$
Given an observed vector $\mathrm{x}$, we can estimate its precision as the precision, $\hat\tau$, which maximizes
$\mathrm{P}(\mathbf{x} | H_{\hat\tau^{-1}})$.
This is the [Maximum Likelihood Estimate (MLE)](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation):
$$
\hat\tau
=
\underset{\tau}{\operatorname{argmax}}{~} \mathrm{P}(\mathbf{x} | H_{\tau^{-1}})
~~~~~~~~~~~
$$
$$
=
\underset{\tau}{\operatorname{argmax}}{~}
\prod_{i=1}^n \mathcal{N}( x_i ; 0,{\tau^{-1}} )
$$
$$
=
\frac{n}{ \sum_{i=1}^n x_i^2 } ~~~~~~~~~~~~~~~~~~~~~~~~~
$$
(The last expression above is simply the inverse of the variance estimate $\hat\sigma^2 = \frac{1}{n} \sum_{i=1}^n x_i^2$
As we can't compute the exact $\mathrm{P}_{\tau^{-1}}$,
we instead estimate it by using $\hat\tau^{-1}$.
$$
\mathrm{\hat{P}}_{mle} =
\mathrm{\hat{P}}(\mathbf{x} | H_{\hat\tau^{-1}}) = \prod_{i=1}^n \mathcal{N}( x_i ; 0,\hat\tau^{-1} )
$$
***Note that I use $\mathrm{\hat{P}}$ to emphasize that this is merely an estimate
of the probability density.***
It would be tempting now to compare
$\mathrm{\hat{P}}_{mle}$
to
$\mathrm{ P }_1$
and use that to decide whether to select $H_{\hat\tau^{-1}}$ or $H_1$.
However, that technique is guaranteed to always select $H_{\hat\tau^{-1}}$.
By construction, the $\hat\tau^{-1}$ is the value which maximizes the likelihood and
therefore the likelihood under $\sigma^2 = \hat\tau^{-1}$ is guaranteed to be at least
as large as that under $H_1$:
$$
\mathrm{\hat{P}}_{mle} > \mathrm{ P }_1
$$
(Equality is not possible in that, as $\hat\tau^{-1}$ will never be exactly equal to $1$)
So yes, we do want to estimate
$\mathrm{P}(\mathbf{x} | H_{\tau^{-1}})$,
but
$\mathrm{\hat{P}}(\mathbf{x} | H_{\hat\tau^{-1}})$
is a very biased estimate of that and therefore can't be used as is.
Before proceeding, I should note that the [BIC](https://en.wikipedia.org/wiki/Bayesian_information_criterion)
or [AIC](https://en.wikipedia.org/wiki/Akaike_information_criterion) can be used to adjust
these estimates and reduce the bias. Go ahead and read up on those, they are interesting!
The BIC is particularly relevant, as it also uses a unit information prior.
But this post is about another approach.
## Priors
How do we resolve this, and use an estimate $\mathrm{P}(\mathbf{x} | H_{\hat\tau^{-1}})$ that is less biased?
We don't know the true $\tau$, but let's assume we know it was drawn from a
Gamma distribution:
$$
\tau \sim \mathrm{Gamma}(\alpha,\beta)
$$
If $\alpha$ and $\beta$ are known, we can compute the integral over $\tau$:
$$
\mathrm{P}_{exact} =
\mathrm{P}_\tau(\mathbf{x} ) = \int \mathrm{P}(\mathbf{x} | H_{\tau^{-1}}) \mathrm{P}(\tau) ~\mathrm{d}\tau
$$
We can then compare $\mathrm{P}_{exact}$ to $\mathrm{P}_1$ as usual.
I call it "*exact*" because, if we do know the true distribution from which $\tau$ was drawn,
then we essentially have a point hypothesis again.
A brief note on computation, relevant to understanding the code below: In practice, we compute that integral in the code below via the following equation, with $\tau$ on the right hand side replaced by an arbitrary value:
The denominator here, and the right hand factor in the numerator, are the density of the
arbitrary $\tau$ under, respectively, the Gamma prior and the Gamma posterior.
$$
\mathrm{P}(\mathbf{x})
=
\frac{
\mathrm{P}(\mathbf{x} | H_{\tau^{-1}}) \mathrm{P}(\tau)
}{
\mathrm{P}(\tau|\mathbf{x})
}
$$
where
$\mathrm{P}(\tau|\mathbf{x})$ is the posterior distribution, i.e. the conditional distribution of $\tau$ given $\mathrm{x}$.
This distribution is
$$
\tau|\mathbf{x} ~~ \sim ~~ \mathrm{Gamma}\left( \alpha + \frac{n}2, \beta + \frac12 \sum_{i=1}^n x_i^2 \right)
$$
## Unit information
But we generally don't know $\alpha$ or $\beta$, as we have no model of where $\tau$ (assuming $H_{\tau^{-1}}$ is the 'true' model) comes from.
To decide which values to use in place of these, we first compare the prior and posterior directly:
$$
\tau ~~ \sim ~~ \mathrm{Gamma}\left( \alpha, \beta \right) ~~~~~~~~~~~~~~~~~~~~
$$
$$
\tau|\mathbf{x} ~~ \sim ~~ \mathrm{Gamma}\left( \alpha + \frac{n}2, \beta + \frac12 \sum_{i=1}^n x_i^2 \right)
$$
The effect of the $n$ observations on $\alpha$ is to add $\frac{n}2$ to it.
Therefore, the effect of a single observation (*unit information*) is to add $\frac12$ to $\alpha$.
Similarly, $\frac1{2n} \sum_{i=1}^n x_i^2$ is added to $\beta$ by one observation.
We define the *zero information prior* as $\mathrm{Gamma}(0,0)$ (useless as a prior, as it is [*improper*](https://en.wikipedia.org/wiki/Prior_probability#Improper_priors)).
The *unit information prior (UIP)* is then defined as adding the effect of one observation to the zero information prior:
$$\tau ~~ \sim ~~ \mathrm{Gamma}\left( \frac12, \frac1{2n} \sum_{i=1}^n x_i^2 \right)$$
and therefore we use
$\alpha=\frac12$ and $\beta = \frac1{2n} \sum_{i=1}^n x_i^2$ as our prior.
Note that this prior does indeed depend on the observed data. However, it is a relatively diffuse prior as it's based on the average effect of one observation instead of the combined effect of $n$ observations.
# Simulations
To evaluate this method on these two simple models, we will do some simulations.
I draw 1000 different $\tau$ from a $\mathrm{Gamma}(1,1)$, and then draw a
sample ($n=200$) for each $\tau$. I compute four log-likelihoods for each:
1. $\mathrm{P}_1$ - assuming Hypothesis 1 is true, i.e. $\tau = 1 = \sigma^2$
1. $\mathrm{P}_2$ - assuming Hypothesis 2 is true, i.e. $\tau = 0.5; \sigma^2 = 2$
1. $\mathrm{P}_{mle}$ - assuming the true variance equals the MLE estimate in the sample.
1. $\mathrm{P}_{uip}$ - log-likelihood, after integrating over our unit information prior.
1. $\mathrm{P}_{exact}$ - log-likelihood, after integrating over the "true" prior, $\mathrm{Gamma}(1,1)$
In all the plots below, the true $\tau$ is on the x-axis.
The interesting thing in each plot is to see when the does cross the horizontal
line at $y=0$, as that is where the ratio of the (integrated) likelihoods is $1.0$.
This line decides which model has been selected.
Below that line means $H_1$ has been selected. Above means $H_1$ has been rejected.
On to the code: We define the function `a.few.loglikelihoods` which takes a sample as input
and computes those five log-likelihoods and returns them.
In this code, I make heavy use of the `data.table` package in `R`.
That's an excellent package, with an [excellent concise video tutorial](https://www.datacamp.com/community/tutorials/data-table-r-tutorial).
```{r}
set.seed(1337)
library(data.table)
library(magrittr) # for %>%
```
```{r}
n=200
a.few.loglikelihoods <- function(x) {
stopifnot(n==length(x))
mle_var = mean(x*x)
P_x_H1 = sum(dnorm(x, mean=0, sd=1, log=T)) # H_1
P_x_H2 = sum(dnorm(x, mean=0, sd=sqrt(2), log=T)) # H_2
P_x_Hmle = sum(dnorm(x, mean=0, sd=sqrt(mle_var), log=T)) # H_mle
# compute the UIP parameters for the Gamma prior
alpha_prior = 1/2 # alpha is 'shape' in {r,g,p}gamma
beta_prior = mle_var / 2 # 1/(2*n) * sum(x^2) # beta is 'rate' in {r,g,p}gamma
alpha_posterior = alpha_prior + n/2
beta_posterior = beta_prior + 1/2 * sum(x^2)
# to compute the integrated likelihood, integrate over all tau,
# we can just take an arbitrary one, as discussed above
tau_arbitrary = 1
P_x_Huip = ( sum(dnorm(x, mean=0, sd=sqrt(1/tau_arbitrary), log=T))
+ dgamma(tau_arbitrary, shape=alpha_prior , rate=beta_prior , log=T)
- dgamma(tau_arbitrary, shape=alpha_posterior, rate=beta_posterior, log=T)
)
# repeat the above with the 'true' prior parameters
alpha_prior = 1
beta_prior = 1
alpha_posterior = alpha_prior + n/2
beta_posterior = beta_prior + 1/2 * sum(x^2)
# to compute the integrated likelihood, integrate over all tau,
# we can just take an arbitrary one, as discussed above
tau_arbitrary = 1
P_x_Hexact = ( sum(dnorm(x, mean=0, sd=sqrt(1/tau_arbitrary), log=T))
+ dgamma(tau_arbitrary, shape=alpha_prior , rate=beta_prior , log=T)
- dgamma(tau_arbitrary, shape=alpha_posterior, rate=beta_posterior, log=T)
)
data.table ( P_x_H1
, P_x_H2
, P_x_Hmle
, P_x_Huip
, P_x_Hexact
)
}
# Generate 1000 taus, from the true prior, Gamma(1,1)
many.taus = rgamma(1000, 1,1)
# For each tau, generate a sample with that precision (1/variance) equal to tau.
# Also, compute all the likelihoods and store it all in a table
lapply(many.taus, function(tau) {
x = rnorm(n, mean=0, sd=sqrt(1/tau))
data.table(tau=tau, a.few.loglikelihoods(x))
}) %>%rbindlist -> d
# order the results by tau
d = d[order(tau)]
# print a few lines from 'd'
print(d)
```
# Plots and analysis
Now that the data has been prepared above, we close with a few plots to show which model is selected.
Each plot compares $H_1$ with one of the other four.
The final figure confirms that the integrated log-loglikelihood based on the unit information
prior, $\mathrm{P}_{uip}$, doesn't have the extreme bias of using the MLE, $\mathrm{P}_{mle}$.
We know that $\mathrm{P}_{exact}$ is the best possible in this case, as it uses the true distribution
of $\tau$ and therefore the Neyman-Pearson Lemma applies directly.
In the last two plots, we can see that $\mathrm{P}_{exact}$ and $\mathrm{P}_{uip}$ get very similar results,
confirming that $\mathrm{P}_{uip}$ maintains good accuracy despite not knowing the true prior for $\tau$,
which was $\mathrm{Gamma}(1,1)$.
### Figure 1
In Figure 1, we see that
the log-ratio is always positive comparing $\mathrm{P}_{mle}$ to $\mathrm{P}_1$, which confirms that $\mathrm{P}_{mle}$ is not useful;
it always rejects $H_1$ even when $\tau$ is very close to 1.
```{r fig.align='center'}
d[tau >= 0.2 & tau <= 5.0,{
plot( tau , P_x_Hmle - P_x_H1
, xlab = expression(tau) , ylab = expression('log-ratio of ' * P[mle] * ' and ' * P[1])
, log='x', main = 'Figure 1')
abline(h=0, col='grey')
}] %>% invisible
```
### Figure 2
The two vertical grey lines at 0.5 and 1.0 represent when $H_2$ and $H_1$, respectively, are true. No surprise the log-ratio crosses 0 in between those points.
```{r fig.align='center'}
d[tau >= 0.25 & tau < 2.0,{
plot( tau, P_x_H2-P_x_H1
, xlab = expression(tau) , ylab = expression('log-ratio of ' * P[2] * ' and ' * P[1])
, log='x', main='Figure 2'
)
abline(h=0, col='grey')
abline(v=1, col='grey')
abline(v=0.5, col='grey')
}] %>% invisible
```
### Figure 3
Using the "exact" prior, which is cheating.
As expected, $H_1$ is selected (y-axis < 0) when $\tau$
is close to 1.0 and the rejected (y-axis > 0) when $\tau$ is far from 1.0.
```{r fig.align='center'}
d[
tau >= 0.5 & tau < 2.0
,{
y = P_x_Hexact - P_x_H1
plot( tau, y
, xlab = expression(tau)
, ylab = expression('log-ratio of ' * P[exact] * ' and ' * P[1])
, log='x', main='Figure 3'
)
lines(tau, predict( loess( y ~ tau ))
, col='#0000d080'
, lwd=4
)
legend(x='topright',legend='fitted curve from loess()', lwd=4, col='#0000d080')
abline(h=0, col='grey')
abline(v=1, col='grey')
}] %>% invisible
```
### Figure 4
Using our UIP prior.
As expected, $H_1$ is selected (y-axis < 0) when $\tau$ is close to
1.0 and the rejected (y-axis > 0) when $\tau$ is far from 1.0.
```{r fig.align='center'}
d[
tau >= 0.5 & tau < 2.0
,{
y = P_x_Huip - P_x_H1
plot( tau , y
, xlab = expression(tau)
, ylab = expression('log-ratio of ' * P[uip] * ' and ' * P[1])
, log='x', main = 'Figure 4'
)
lines(tau, predict( loess( y ~ tau ))
, col='#0000d080'
, lwd=4
)
legend(x='topright',legend='fitted curve from loess()', lwd=4, col='#0000d080')
abline(h=0, col='grey')
abline(v=1, col='grey')
}] %>% invisible
```
# Why use this approach?
Figure 3 and Figure 4 are similar to each other, confirming that $\mathrm{P}_{uip}$
makes very similar selections to $\mathrm{P}_{exact}$, and therefore is almost
as accurate as we could hope for in this model.
The models used in this blog post are quite simple; we're really just testing if the precision (or variance) is
different from 1.0. Other, more conventional, methods exist for this kind of test.
However, if you have more than two models, and those models are very complex, it can be difficult
to discover a suitable [test statistic](https://en.wikipedia.org/wiki/Test_statistic) and to correctly
identify a suitable threshold to apply to the observed test statistic.
For some, but not all, complex models, a *unit information prior* can be identified in a relatively straightforward
manner. This allows the integrated log-likelihoods to be computed easily, and therefore allows us to
apply the Neyman-Pearon lemma in this approximate manner.
I say "approximate" because we're not using the "true" prior (assuming one exists);
where the true prior is known (the "exact" case above), we essentially have a point hypothesis.
Very complex models can be used in this framework, as long as you have distributions that have a
[conjugate prior](https://en.wikipedia.org/wiki/Conjugate_prior#Continuous_distributions). We used Gamma above, as it is conjugate to the Normal distribution where
the mean is known.
Thanks for making it this far! I'd love feedback, see my Twitter handle and my email address at the top of the page.
***by [Aaron](https://aaronmcdaid.github.io/), 29th October 2017***