library(lgpr)
#> Attached lgpr 1.2.5, using rstan 2.32.6. Type ?lgpr to get started.

This vignette describes the statistical models of lgpr mathematically. We go through the arguments of the lgp() and create_model() modeling functions and the parts of the probabilistic model that each one customizes. This is a concise description; the original publication (Timonen et al. (2021)) contains more information about the motivation for the modeling approaches, and the tutorials contain code examples.

1. Bayesian GP regression

The models in lgpr are models for the conditional distribution $p(y \mid f(\textbf{x}), \theta_{\text{obs}})$ of a response variable $y$ given covariates $\textbf{x}$, where $\theta_{\text{obs}}$ is a possible parameter of the observation model (such as the magnitude of observation noise). The function $f$ has a Gaussian process (GP) prior

$$f \sim \mathcal{GP}\left(0, k\left(\textbf{x}, \textbf{x}' \mid \theta_{\text{GP}}\right)\right),$$

with covariance (kernel) function $k(\textbf{x}, \textbf{x}' \mid \theta_{\text{GP}})$ that has hyperparameters $\theta_{\text{GP}}$. In addition to the GP prior on $f$, there is a parameter prior distribution $p(\theta)$ for $\theta = \left\{ \theta_{\text{GP}}, \theta_{\text{obs}} \right\}$. Given $N$ observations $\mathcal{D} = \{y_n, \textbf{x}_n\}_{n=1}^N$, the probabilistic models in lgpr have the form

$$\begin{align}
p\left(\theta, \textbf{f}\right) &= p\left(\textbf{f} \mid \theta\right) \cdot p(\theta) & \text{(prior)} \\
p(\textbf{y} \mid \textbf{f}, \theta) &= \prod_{n=1}^N p(y_n \mid f(\textbf{x}_n), \theta_{\text{obs}}) & \text{(likelihood)},
\end{align}$$

where $\textbf{f} = \left[ f(\textbf{x}_1), \ldots, f(\textbf{x}_N) \right]^{\top}$ and $\textbf{y} = \left[y_1, \ldots, y_N\right]^{\top}$. The parameter prior density $p(\theta)$ is the product of the prior densities of the individual parameters, and the GP prior means that the prior for $\textbf{f}$ is the multivariate normal

$$p\left(\textbf{f} \mid \theta\right) = \mathcal{N}\left(\textbf{f} \mid \textbf{0}, \textbf{K} \right),$$

where the $N \times N$ matrix $\textbf{K}$ has entries $\{ \textbf{K} \}_{in} = k(\textbf{x}_i, \textbf{x}_n \mid \theta_{\text{GP}})$.
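
To make the GP prior concrete, the following minimal sketch in base R (not using lgpr itself; the kernel choice and hyperparameter values are arbitrary illustrations) draws one function from $p(\textbf{f} \mid \theta) = \mathcal{N}(\textbf{f} \mid \textbf{0}, \textbf{K})$ using an exponentiated quadratic kernel.

k_eq <- function(x1, x2, alpha = 1, ell = 2) {
  # entries K_ij = alpha^2 * exp(-(x1_i - x2_j)^2 / (2 * ell^2))
  alpha^2 * exp(-outer(x1, x2, "-")^2 / (2 * ell^2))
}
x <- seq(0, 10, length.out = 100)
K <- k_eq(x, x) + diag(1e-8, length(x))     # small jitter for numerical stability
f <- drop(t(chol(K)) %*% rnorm(length(x)))  # f ~ N(0, K) via the Cholesky factor
plot(x, f, type = "l")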

2. Connection between lgpr arguments and different model parts

The table below shows which parts of the above mathematical description are affected by each argument of lgp() or create_model(). You can read more about them in the documentation of those functions.

| Argument   | Affected model part |
|------------|---------------------|
| formula    | $k(\textbf{x}, \textbf{x}')$ |
| data       | $\mathcal{D}$ |
| likelihood | $p(y_n \mid f(\textbf{x}_n), \theta_{\text{obs}})$ |
| prior      | $p(\theta)$ |
| c_hat      | $p(y_n \mid f(\textbf{x}_n), \theta_{\text{obs}})$ |
| num_trials | $\mathcal{D}$ |
| options    | $k(\textbf{x}, \textbf{x}')$ |
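
As a schematic illustration of how these arguments fit together (not evaluated here; the data frame df is hypothetical, and the shown prior specification is just one possibility), a call could look like this:

fit <- lgp(
  formula    = y ~ gp(age) + zs(id),         # defines k(x, x')
  data       = df,                           # the dataset D
  likelihood = "gaussian",                   # observation model p(y_n | f(x_n), theta_obs)
  prior      = list(alpha = student_t(20))   # part of the parameter prior p(theta)
)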

3. The likelihood argument and observation models

The terms observation model and likelihood refer to the same formula, i.e. $p(y_n \mid f(\textbf{x}_n), \theta_{\text{obs}})$, though the former views it as a function of $\textbf{y}$ and the latter as a function of $\theta$. There are currently five observation models available, and they all involve an inverse link function transformation

$$h_n = g^{-1}\left( f(\textbf{x}_n) + \hat{c}_n \right),$$

where $g$ is determined by the likelihood argument and $\hat{c}_n$ by the c_hat argument. The table below shows the link function in each case and the parameter of the corresponding observation model.

| likelihood | Link function $g$ | Parameter $\theta_{\text{obs}}$ |
|------------|-------------------|---------------------------------|
| gaussian   | identity          | $\sigma$ |
| poisson    | logarithm         | - |
| nb         | logarithm         | $\phi$ |
| binomial   | logit             | - |
| bb         | logit             | $\gamma$ |
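
The corresponding inverse transformations $g^{-1}$ are standard; a small base R helper (illustrative only, not an lgpr function) makes them explicit:

inv_link <- function(f_plus_chat, likelihood) {
  switch(likelihood,
    gaussian = f_plus_chat,          # inverse of identity
    poisson  = exp(f_plus_chat),     # inverse of logarithm
    nb       = exp(f_plus_chat),     # inverse of logarithm
    binomial = plogis(f_plus_chat),  # inverse of logit
    bb       = plogis(f_plus_chat)   # inverse of logit
  )
}
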
  • In the Gaussian observation model (likelihood="gaussian"), $$p(y_n \mid f(\textbf{x}_n), \theta_{\text{obs}}) = \mathcal{N}(y_n \mid h_n, \sigma^2),$$ where $\theta_{\text{obs}} = \sigma$ is a noise magnitude parameter.

  • The Poisson observation model (likelihood="poisson") for count data is $y_n \sim \text{Poisson}\left(\lambda_n \right)$, where the rate is $\lambda_n = h_n$.

  • In the negative binomial (likelihood="nb") model, $\lambda_n$ is gamma-distributed with shape $\phi$ and rate $\frac{\phi}{h_n}$, so that $\text{E}[\lambda_n] = h_n$. The parameter $\phi > 0$ controls overdispersion so that $\phi \rightarrow \infty$ recovers the Poisson model (see the numerical sketch after this list).

  • When selecting the binomial or beta-binomial observation model for count data, the number of trials $\eta_n$ for each $n = 1, \ldots, N$ has to be supplied using the num_trials argument. The binomial model (likelihood="binomial") is $y_n \sim \text{Binomial}(\eta_n, \rho_n)$, where the success probability is $\rho_n = h_n$.

  • In the beta-binomial model (likelihood="bb"), $\rho_n$ is random so that $$\rho_n \sim \text{Beta}\left(h_n \cdot \frac{1 - \gamma}{\gamma}, \ (1 - h_n) \cdot \frac{1 - \gamma}{\gamma}\right),$$ and the parameter $\gamma \in [0, 1]$ controls overdispersion so that $\gamma \rightarrow 0$ recovers the binomial model.
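
The overdispersion mechanism of the negative binomial model can be verified numerically. The following base R sketch (with arbitrary values $h_n = 5$ and $\phi = 2$) draws from the gamma-Poisson mixture and shows that the variance of $y_n$ exceeds the Poisson value $h_n$:

h <- 5; phi <- 2
lambda <- rgamma(1e5, shape = phi, rate = phi / h)  # E[lambda] = h
y <- rpois(1e5, lambda)
c(mean = mean(y), var = var(y))  # approx. (5, 17.5); a Poisson model would give (5, 5)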

When using the Gaussian observation model with sample_f = TRUE, the continuous response y is normalized to zero mean and unit variance, and $\hat{c}_n = 0$ is set for all $n$; in this case the c_hat argument has no effect. For the other observation models, sensible defaults are used if c_hat is not specified. See the documentation of the c_hat argument of lgp() for exact details, and Section 5 (Model inference) for information about the sample_f argument.

4. The formula argument and kernel functions

Additive GP regression

The GP models of lgpr are additive, so that

$$k(\textbf{x}, \textbf{x}' \mid \theta_{\text{GP}}) = \sum_{j=1}^J \alpha_j^2 k_j(\textbf{x}, \textbf{x}' \mid \theta_{\text{GP}}).$$

This is equivalent to saying that $f = f^{(1)} + \ldots + f^{(J)}$, where each component $j = 1, \ldots, J$ has a GP prior

$$f^{(j)} \sim \mathcal{GP}\left(0, \alpha_j^2 k_j(\textbf{x}, \textbf{x}' \mid \theta_{\text{GP}}) \right),$$

independently of the other components.

Formulas and terms

The number of components $J$ is equal to the number of terms in the formula. Terms are separated by a plus sign. An example formula with three terms could be

y ~ gp(age) + gp(age)*zs(id) + categ(group)

where y, age, id and group have to be columns of data. Each formula term defines what the corresponding kernel $k_j$ will be like, and what covariates and parameters it depends on. Each term adds one $\alpha$ parameter to the GP parameter vector $\theta_{\text{GP}}$, plus possible additional parameters depending on the term.
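
For example (not evaluated here; df is a hypothetical data frame containing the columns y, age, id and group), the model can be created without fitting it, which allows inspecting what the three terms contribute:

model <- create_model(y ~ gp(age) + gp(age) * zs(id) + categ(group), df)
model  # printing the model object summarizes its components and parameters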

Expressions and kernels

Each term is a product (separated by *) of what we call expressions. We are deliberately not using standard R formula terminology here, because terms in lgpr are parsed in a custom way. Each expression corresponds to one kernel, and the kernel $k_j$ is the product of all the kernels in term $j$. Inside its parentheses, each expression must contain the name of one data variable, as in gp(age); this determines which variable the kernel depends on. Most of the allowed expressions, their corresponding kernels, and the allowed variable types are listed below.

| Expression | Corresponding kernel         | Allowed variable type |
|------------|------------------------------|-----------------------|
| gp()       | Exponentiated quadratic (EQ) | Continuous |
| zs()       | Zero-sum (ZS)                | Categorical |
| categ()    | Categorical (CAT)            | Categorical |
| gp_ns()    | Nonstationary (NS)           | Continuous |
| gp_vm()    | Variance-mask (VM)           | Continuous |

Continuous covariates should be represented in data as numeric and categorical covariates as factors; a hypothetical example of such a data frame is sketched below. After it, the equations for the different kernels are listed briefly. See Timonen et al. (2021) for more motivation and details about what kind of effects they can model, alone and in combination.
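
A hypothetical data frame with this layout (all names and values are invented for illustration):

df <- data.frame(
  id    = factor(rep(1:4, each = 3)),                   # categorical -> factor
  age   = rep(c(10, 20, 30), times = 4),                # continuous -> numeric
  group = factor(rep(c("case", "control"), each = 6)),  # categorical -> factor
  y     = rnorm(12)                                     # response
)
str(df)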

  • The EQ kernel is $$k_{\text{EQ}}(x, x' \mid \theta_{\text{GP}}) = \exp \left( -\frac{(x - x')^2}{2 \ell^2}\right),$$ and it has the lengthscale parameter $\ell$. Each EQ kernel adds one lengthscale parameter to $\theta_{\text{GP}}$. (Minimal base R implementations of these kernels are sketched after this list.)

  • The ZS kernel is $$k_{\text{ZS}}(z, z') = \begin{cases} 1 & \text{if } z = z' \\ \frac{1}{1 - M} & \text{if } z \neq z', \end{cases}$$ where $M$ is the number of different categories of covariate $z$.

  • The CAT kernel is $$k_{\text{CAT}}(z, z') = \begin{cases} 1 & \text{if } z = z' \\ 0 & \text{if } z \neq z'. \end{cases}$$

  • The NS kernel is $$k_{\text{NS}}(x, x' \mid a, \ell) = k_{\text{EQ}}\left(\omega_a(x), \omega_a(x') \mid \ell\right),$$ where $\omega_a: \mathbb{R} \rightarrow (-1, 1)$ is the input warping function $$\omega_a(x) = 2 \cdot \left(\frac{1}{1 + e^{-ax}} - \frac{1}{2} \right).$$ Each NS kernel adds one lengthscale parameter $\ell$ and one warping steepness parameter $a$ to $\theta_{\text{GP}}$.

  • The VM kernel is $$k_{\text{VM}}(x, x' \mid a, \ell) = f^a_{\text{VM}}(x) \cdot f^a_{\text{VM}}(x') \cdot k_{\text{NS}}(x, x' \mid a, \ell),$$ where $f^a_{\text{VM}}(x) = \frac{1}{1 + e^{-a h_2 (x - r)}}$ and $r = \frac{1}{a} \text{logit}(h_1)$. The parameters $h_1$ and $h_2$ are determined by opt$vm_params[1] and opt$vm_params[2], respectively, where opt is the options argument given to lgp(). Each VM kernel adds one lengthscale parameter $\ell$ and one warping steepness parameter $a$ to $\theta_{\text{GP}}$.
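
The following base R sketches (illustrative only; the magnitude factors $\alpha_j^2$ and the binary mask of the next section are omitted) implement the kernel equations above:

k_eq  <- function(x, x2, ell) exp(-outer(x, x2, "-")^2 / (2 * ell^2))
k_zs  <- function(z, z2, M) ifelse(outer(as.character(z), as.character(z2), "=="), 1, 1 / (1 - M))
k_cat <- function(z, z2) ifelse(outer(as.character(z), as.character(z2), "=="), 1, 0)
omega <- function(x, a) 2 * (plogis(a * x) - 0.5)  # warps the real line into (-1, 1)
k_ns  <- function(x, x2, a, ell) k_eq(omega(x, a), omega(x2, a), ell)
f_vm  <- function(x, a, h1, h2) plogis(a * h2 * (x - qlogis(h1) / a))
k_vm  <- function(x, x2, a, ell, h1, h2)
  outer(f_vm(x, a, h1, h2), f_vm(x2, a, h1, h2)) * k_ns(x, x2, a, ell)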

Masking missing covariates

All kernels that operate on continuous covariates are also multiplied by a binary mask (BIN) kernel $k_{\text{BIN}}(x, x')$, which returns $0$ if either $x$ or $x'$ is missing and $1$ if both are available. Missing data should be encoded as NA or NaN in data.
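
In base R, the mask could be sketched as follows (illustrative only; k_eq refers to the sketch in the previous section, and missing inputs must be replaced by a placeholder before the continuous kernel is evaluated, since it would otherwise return NA):

k_bin <- function(x, x2) outer(!is.na(x), !is.na(x2), "&") * 1
# K <- k_bin(x, x) * k_eq(ifelse(is.na(x), 0, x), ifelse(is.na(x), 0, x), ell = 1)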

Heterogeneous effects and covariate uncertainty

There are also the het() and unc() expressions. They cannot be alone in a term but have to be multiplied by EQ, NS or VM. They are not kernels by themselves; instead they modify the covariate or kernel of their term and add additional parameters. See the tutorials for example use cases and Timonen et al. (2021) for their mathematical definition.

5. Model inference

After the model is defined, lgpr uses the MCMC methods of Stan to obtain draws from either the joint posterior $p\left(\theta, \textbf{f} \mid \mathcal{D}\right)$ or the marginal posterior of the parameters, i.e. $p\left(\theta \mid \mathcal{D}\right)$. Which of these is sampled is determined by the sample_f argument of lgp() or create_model().

With sample_f = TRUE

This option is always possible, but it is not recommended with likelihood = "gaussian". The joint posterior that is sampled from is

$$p\left(\theta, \textbf{f} \mid \mathcal{D}\right) \propto p\left(\theta, \textbf{f}\right) \cdot p(\textbf{y} \mid \textbf{f}, \theta),$$

and sampling requires evaluating the right-hand side and its gradient thousands of times.
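
A schematic of the evaluated quantity in base R (illustrative only; K_fun, loglik_fun and logprior_fun are hypothetical placeholders for the kernel matrix, log-likelihood and log-prior functions of a given model):

log_joint <- function(f, theta, y, K_fun, loglik_fun, logprior_fun) {
  L <- t(chol(K_fun(theta)))                  # K = L L'
  lp_f <- -0.5 * sum(forwardsolve(L, f)^2) -  # log N(f | 0, K), up to a constant:
          sum(log(diag(L)))                   # -0.5 * f' K^{-1} f - 0.5 * log|K|
  logprior_fun(theta) + lp_f + sum(loglik_fun(y, f, theta))
}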

With sample_f = FALSE

This option is only possible (and is automatically selected by default) if likelihood = "gaussian". This is because the marginal likelihood

$$p\left(\textbf{y} \mid \theta\right) = \mathcal{N}\left(\textbf{y} \mid \textbf{0}, \textbf{K} + \sigma^2 \textbf{I} \right)$$

is analytically available only in this case. The distribution that is sampled from is

$$p\left(\theta \mid \mathcal{D}\right) \propto p\left(\theta\right) \cdot p(\textbf{y} \mid \theta),$$

and sampling now requires repeatedly evaluating the right-hand side of this equation and its gradient. This analytical marginalization reduces the MCMC dimension by $N$ and likely improves sampling efficiency. The conditional posterior $p\left(\textbf{f} \mid \theta, \mathcal{D}\right)$ also has an analytical form for fixed $\theta$, so draws from the marginal posterior $p\left(\textbf{f} \mid \mathcal{D}\right)$ can be obtained by first drawing $\theta$ and then $\textbf{f}$:

$$\begin{align}
\theta &\sim p\left(\theta \mid \mathcal{D}\right) \\
\textbf{f} &\sim p\left(\textbf{f} \mid \theta, \mathcal{D}\right).
\end{align}$$

Combining these again gives draws from the joint posterior $p\left(\theta, \textbf{f} \mid \mathcal{D}\right)$, but likely with less computational effort.
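
For the Gaussian model, the quantity evaluated at each step is the log marginal likelihood $\log p(\textbf{y} \mid \theta)$; a base R sketch of the standard Cholesky-based computation (illustrative only, not the lgpr implementation):

log_marg_lik <- function(y, K, sigma) {
  N <- length(y)
  L <- t(chol(K + diag(sigma^2, N)))        # K + sigma^2 I = L L'
  a <- backsolve(t(L), forwardsolve(L, y))  # a = (K + sigma^2 I)^{-1} y
  -0.5 * sum(y * a) - sum(log(diag(L))) - 0.5 * N * log(2 * pi)
}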

References

Timonen, Juho, Henrik MannerstrΓΆm, Aki Vehtari, and Harri LΓ€hdesmΓ€ki. 2021. β€œLgpr: An Interpretable Non-Parametric Method for Inferring Covariate Effects from Longitudinal Data.” Bioinformatics 37 (13): 1860–67. https://doi.org/10.1093/bioinformatics/btab021.