vignettes/math.Rmd
This vignette describes mathematically the statistical models of lgpr. We study the different arguments of the `lgp()` and `create_model()` modeling functions and what parts of the probabilistic model they customize. This is a concise description: the original publication (Timonen et al. (2021)) has more information about the motivation for the modeling approaches, and the tutorials have code examples.
# 1. Bayesian GP regression

The models in lgpr are models for the conditional distribution

$$p(y \mid f(\textbf{x}), \theta_{\text{obs}})$$

of response variable $y$ given covariates $\textbf{x}$, where $\theta_{\text{obs}}$ is a possible parameter of the observation model (like the magnitude of observation noise). The function $f$ has a Gaussian Process (GP) prior with covariance (kernel) function $k(\textbf{x}, \textbf{x}' \mid \theta_{\text{GP}})$ that has hyperparameters $\theta_{\text{GP}}$. In addition to the GP prior for $f$, there is a parameter prior distribution $p_\theta$ for $\theta = \{\theta_{\text{GP}}, \theta_{\text{obs}}\}$.

Given $N$ observations $\mathcal{D} = \{(\textbf{x}_n, y_n)\}_{n=1}^N$, the probabilistic models in lgpr have the form

$$
\begin{aligned}
\theta &\sim p_\theta \\
\textbf{f} \mid \theta &\sim p_{\textbf{f}} \\
y_n \mid f_n, \theta &\sim p(y_n \mid f_n, \theta_{\text{obs}}), \quad n = 1, \ldots, N,
\end{aligned}
$$

where $\textbf{f} = \left[f(\textbf{x}_1), \ldots, f(\textbf{x}_N)\right]^\top$ and $f_n = f(\textbf{x}_n)$. The parameter prior density $p_\theta$ is the product of the prior densities of each parameter, and the GP prior means that the prior for $\textbf{f}$ is the multivariate normal

$$p_{\textbf{f}} = \mathcal{N}(\textbf{f} \mid \textbf{0}, \textbf{K}),$$

where the $N \times N$ matrix $\textbf{K}$ has entries $\{\textbf{K}\}_{ij} = k(\textbf{x}_i, \textbf{x}_j \mid \theta_{\text{GP}})$.
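As a concrete illustration of the last formula, here is a minimal sketch in plain base R (not lgpr's internal implementation; the function and variable names are hypothetical) that builds a small kernel matrix $\textbf{K}$ from a kernel function, here the exponentiated quadratic kernel defined in Section 3 below:

```r
# Exponentiated quadratic kernel k(x, x') with lengthscale ell
kernel_eq <- function(x1, x2, ell) {
  exp(-(x1 - x2)^2 / (2 * ell^2))
}

x <- c(1.0, 2.5, 4.0)                   # three covariate values
K <- outer(x, x, kernel_eq, ell = 1.0)  # K[i, j] = k(x[i], x[j])
```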
The table below shows which parts of the above mathematical description are affected by which arguments of `lgp()` and `create_model()`. You can read more about them in the documentation of these functions.
| Argument | Affected model part |
|---|---|
| `formula` | the function $f$ and its kernel $k$ |
| `data` | the observations $\mathcal{D}$ |
| `likelihood` | the observation model $p(y \mid f(\textbf{x}), \theta_{\text{obs}})$ |
| `prior` | the parameter prior $p_\theta$ |
| `c_hat` | the constants $c_n$ in the inverse link transformation |
| `num_trials` | the numbers of trials $\eta_n$ (binomial and beta-binomial models) |
| `options` | the kernels (e.g. the variance mask parameters of the VM kernel) |
# 2. `likelihood` argument and observation models

The terms *observation model* and *likelihood* are used to refer to the same formula, i.e. $p(y \mid f(\textbf{x}), \theta_{\text{obs}})$, though the former means it as a function of $y$ and the latter as a function of $f$ and $\theta_{\text{obs}}$.

There are currently five observation models available, and they all involve an inverse link function transformation

$$h_n = g^{-1}\left(f(\textbf{x}_n) + c_n\right),$$

where the link function $g$ is determined by the `likelihood` argument and the constants $c_n$ by the `c_hat` argument. The table below shows what the link function is in different cases, and what parameter the corresponding observation model has.
| `likelihood` | Link function $g$ | Parameter |
|---|---|---|
| `gaussian` | identity | $\sigma$ |
| `poisson` | logarithm | - |
| `nb` | logarithm | $\phi$ |
| `binomial` | logit | - |
| `bb` | logit | $\gamma$ |
In the Gaussian observation model (`likelihood="gaussian"`),

$$y_n \sim \mathcal{N}(h_n, \sigma^2),$$

where $\sigma$ is a noise magnitude parameter.

The Poisson observation model (`likelihood="poisson"`) for count data is

$$y_n \sim \text{Poisson}(\lambda_n),$$

where the rate is $\lambda_n = h_n$. In the negative binomial (`likelihood="nb"`) model, $\lambda_n$ is gamma-distributed with parameters

$$\lambda_n \sim \text{Gamma}\left(\phi, \ \frac{\phi}{h_n}\right)$$

(shape-rate parametrization, so that $\text{E}[\lambda_n] = h_n$), and $\phi > 0$ controls overdispersion so that $\phi \rightarrow \infty$ corresponds to the Poisson model.
When selecting the binomial or beta-binomial observation model for count data, the number of trials $\eta_n$ for each observation $n = 1, \ldots, N$ has to be supplied using the `num_trials` argument. The binomial model (`likelihood="binomial"`) is

$$y_n \sim \text{Binomial}(\eta_n, \rho_n),$$

where the success probability is $\rho_n = h_n$. In the beta-binomial model (`likelihood="bb"`), $\rho_n$ is random so that

$$\rho_n \sim \text{Beta}\left(h_n \cdot \frac{1-\gamma}{\gamma}, \ (1 - h_n) \cdot \frac{1-\gamma}{\gamma}\right),$$

and the parameter $\gamma \in (0, 1)$ controls overdispersion so that $\gamma \rightarrow 0$ corresponds to the binomial model.
When using the Gaussian observation model with `sample_f=TRUE`, the continuous response variable is normalized to zero mean and unit variance, and $c_n = 0$ is set for all $n$. In this case the `c_hat` argument has no effect. With the other observation models, which require `sample_f = TRUE`, sensible defaults are used if `c_hat` is not specified. See the documentation of the `c_hat` argument of `lgp()` for exact details, and the *5. Model inference* section for information about the `sample_f` argument.
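As an example of how these arguments fit together, here is a sketch (not run; `df` and its column names are hypothetical) of defining a count model with the negative binomial observation model:

```r
library(lgpr)

# df is assumed to be a data.frame with columns y (counts),
# age (numeric) and id (a factor)
model <- create_model(
  formula    = y ~ gp(age) + zs(id),
  data       = df,
  likelihood = "nb"  # logarithm link; adds overdispersion parameter
)
```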
# 3. `formula` argument and kernel functions

The GP models of lgpr are additive, so that

$$f(\textbf{x}) = f^{(1)}(\textbf{x}) + \ldots + f^{(J)}(\textbf{x}).$$

This is equivalent to saying that we have $f$ modeled so that each component $f^{(j)}$ has a GP prior

$$f^{(j)} \sim \mathcal{GP}\left(0, k^{(j)}\right)$$

independently from the other components.

The number of components $J$ is equal to the number of terms in your formula. Terms are separated by a plus sign. An example formula with three terms could be `y ~ gp(age) + gp(age)*zs(id) + categ(group)`, where `y`, `age`, `id` and `group` have to be columns of `data`. Each formula term defines what the corresponding kernel $k^{(j)}$ will be like, and what covariates and parameters it depends on. Each term adds one magnitude parameter $\alpha^{(j)}$ to the GP parameter vector $\theta_{\text{GP}}$, and possibly additional parameters depending on the term.

Each term is a product (separated by `*`) of what we call *expressions*. At this point we are not using standard R formula terminology, because terms in lgpr are parsed in a custom way. Each expression corresponds to one kernel, and the kernel $k^{(j)}$ is the product of all the kernels in term $j$.

Inside parentheses, each expression must contain the name of one `data` variable, as in `gp(age)`. This determines what variable the kernel depends on. Most of the allowed expressions, their corresponding kernels, and allowed variable types are listed below.
| Expression | Corresponding kernel | Allowed variable type |
|---|---|---|
| `gp()` | Exponentiated quadratic (EQ) | Continuous |
| `zs()` | Zero-sum (ZS) | Categorical |
| `categ()` | Categorical (CAT) | Categorical |
| `gp_ns()` | Nonstationary (NS) | Continuous |
| `gp_vm()` | Variance-mask (VM) | Continuous |
Continuous covariates should be represented in `data` as numeric and categorical covariates as factors. Equations for the different kernels are listed briefly here. See Timonen et al. (2021) for more motivation and details about what kind of effects they can model alone and in combination.
The EQ kernel is

$$k_{\text{EQ}}(x, x') = \exp\left(-\frac{(x - x')^2}{2 \ell^2}\right)$$

and it has the lengthscale parameter $\ell > 0$. Each EQ kernel adds one lengthscale parameter to $\theta_{\text{GP}}$.
The ZS kernel is

$$k_{\text{ZS}}(z, z') = \begin{cases} 1 & \text{if } z = z' \\ \frac{1}{1 - M} & \text{otherwise,} \end{cases}$$

where $M$ is the number of different categories for covariate $z$.
The CAT kernel is

$$k_{\text{CAT}}(z, z') = \begin{cases} 1 & \text{if } z = z' \\ 0 & \text{otherwise.} \end{cases}$$
The NS kernel is

$$k_{\text{NS}}(x, x') = k_{\text{EQ}}\left(\omega(x), \omega(x')\right),$$

where

$$\omega(x) = 2 \cdot \left(\frac{1}{1 + e^{-a x}} - \frac{1}{2}\right)$$

is an input warping function with steepness parameter $a > 0$. Each NS kernel adds one lengthscale parameter $\ell$ and one warping steepness parameter $a$ to $\theta_{\text{GP}}$.
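To build intuition, here is a small base-R sketch (the helper name is hypothetical) of the warping function above; it squashes the input to the interval $(-1, 1)$, so the NS component can only vary near $x = 0$:

```r
# Input warping of the NS kernel: maps x to (-1, 1) with steepness a
warp_input <- function(x, a) {
  2 * (1 / (1 + exp(-a * x)) - 0.5)
}

curve(warp_input(x, a = 1), from = -5, to = 5)  # visualize the warping
```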
The VM kernel is

$$k_{\text{VM}}(x, x') = f^{\text{VM}}(x) \cdot k_{\text{NS}}(x, x') \cdot f^{\text{VM}}(x'),$$

where $f^{\text{VM}}$ is a sigmoid-shaped variance mask function. The parameters controlling the height of the mask and the steepness of its rise are determined by `opt$vm_params[1]` and `opt$vm_params[2]`, respectively, where `opt` is the `options` argument given to `lgp()`. Each VM kernel adds one lengthscale parameter $\ell$ and one warping steepness parameter $a$ to $\theta_{\text{GP}}$.
All kernels that work with continuous covariates are actually also multiplied by a binary mask (BIN) kernel, which returns $0$ if either $x$ or $x'$ is missing and $1$ if they are both available. Missing data should be encoded as `NA` or `NaN` in `data`.
There are also the `het()` and `unc()` expressions. They cannot be alone in a term, but have to be multiplied by EQ, NS or VM. They are not actually kernels alone, but instead edit the covariate or kernel of their term and add additional parameters. See the tutorials for example use cases (one sketch is given below) and Timonen et al. (2021) for their mathematical definitions.
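As an illustration, the following sketch (the column names `y`, `age`, `id`, `sex` and `dis_age` are hypothetical; `dis_age` would be a disease-related age with `NaN` for control individuals) combines several of the expressions above in one formula:

```r
# Shared age effect, individual-specific age deviation, heterogeneous
# disease effect with variance masking, and a sex-specific age effect
formula <- y ~ gp(age) + zs(id)*gp(age) + het(id)*gp_vm(dis_age) + zs(sex)*gp(age)
```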
# 5. Model inference

After the model is defined, lgpr uses the MCMC methods of Stan to obtain draws from the joint posterior $p(\theta, \textbf{f} \mid \mathcal{D})$ or the marginal posterior of the parameters, i.e. $p(\theta \mid \mathcal{D})$. Which one of these is done is determined by the `sample_f` argument of `lgp()` or `create_model()`.
## Sampling $\textbf{f}$ (`sample_f = TRUE`)

This option is always possible, but not recommended with `likelihood = "gaussian"`. The joint posterior that is sampled from is

$$\begin{equation}
p\left(\theta, \textbf{f} \mid \mathcal{D}\right) \propto p\left(\theta,
\textbf{f}\right) \cdot p(\textbf{y} \mid \textbf{f}, \theta)
\end{equation}$$

and sampling requires evaluating the right-hand side and its gradient thousands of times.
## Marginalizing $\textbf{f}$ (`sample_f = FALSE`)

This option is only possible (and is automatically selected by default) if `likelihood = "gaussian"`. This is because the marginal likelihood $p(\textbf{y} \mid \theta)$ is analytically available only in this case. The distribution that is sampled from is

$$\begin{equation}
p\left(\theta \mid \mathcal{D}\right) \propto p\left(\theta\right) \cdot
p(\textbf{y} \mid \theta)
\end{equation}$$

and now sampling requires repeatedly evaluating the right-hand side of this equation and its gradient. This analytical marginalization reduces the MCMC dimension by $N$ and likely improves sampling efficiency. The conditional posterior $p(\textbf{f} \mid \theta, \mathcal{D})$ also has an analytical form for a fixed $\theta$, so draws from the marginal posterior $p(\textbf{f} \mid \mathcal{D})$ could be obtained by first drawing $\theta$ and then $\textbf{f}$, according to the process

$$
\begin{aligned}
\theta &\sim p(\theta \mid \mathcal{D}) \\
\textbf{f} &\sim p(\textbf{f} \mid \theta, \mathcal{D}).
\end{aligned}
$$

By combining these, we again have draws from the joint posterior $p(\theta, \textbf{f} \mid \mathcal{D})$, but likely with less computational effort.
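To make the role of `sample_f` concrete, here is a sketch (not run; `df` is a hypothetical data frame as in the earlier example) contrasting the two inference modes:

```r
# Marginal posterior of theta (default when likelihood = "gaussian")
fit_marginal <- lgp(y ~ gp(age) + zs(id), data = df)

# Joint posterior of (theta, f): always possible, but not recommended
# for the Gaussian observation model
fit_joint <- lgp(y ~ gp(age) + zs(id), data = df, sample_f = TRUE)
```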