require("lgpr")
# Loading required package: lgpr
# Attached lgpr 1.2.4, using rstan 2.26.23. Type ?lgpr to get started.
require("ggplot2")
# Loading required package: ggplot2
require("rstan")
# Loading required package: rstan
# Loading required package: StanHeaders
# 
# rstan version 2.26.23 (Stan version 2.26.1)
# For execution on a local, multicore CPU with excess RAM we recommend calling
# options(mc.cores = parallel::detectCores()).
# To avoid recompilation of unchanged Stan programs, we recommend calling
# rstan_options(auto_write = TRUE)
# For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
# change `threads_per_chain` option:
# rstan_options(threads_per_chain = 1)

1 Introduction

In this tutorial we simulate and analyse a test data set that contains 6 case and 6 control individuals. The disease effect on case individuals is modeled using the disease-related age (diseaseAge) as a covariate. The disease-related age is defined as age relative to the observed disease initiation. The true disease effect time for each case individual \(q=1, \ldots,6\) is drawn from \(\mathcal{N}(36,4^2)\), but the disease initiation is observed only after an additional delay \(t_q\), which is drawn as \(t_q \sim \text{Exponential}(0.05)\).

set.seed(121)
relev           <- c(0,1,1,1,0,0)
effect_time_fun <- function(){rnorm(n = 1, mean = 36, sd = 4)}
obs_fun         <- function(t){min(t + stats::rexp(n = 1, rate = 0.05), 96 - 1e-5)}
  
simData <- simulate_data(N            = 12,
                         t_data       = seq(12, 96, by = 12),
                         covariates   = c(0,2,2,2),
                         relevances   = relev,
                         lengthscales = c(18,24, 1.1, 18,18,18),
                         t_effect_range = effect_time_fun,
                         t_observed   = obs_fun,
                         snr          = 3)

plot_sim(simData) + xlab('Age (months)')
# - Dots are noisy observations of the response var.
# - Line is the true signal mapped through inv. link fun.
# - Solid vert. line is the real effect time (used to generate signal) 
# - Dashed vert. line is the 'observed' disease initiation time

#plot_sim(simData, comp_idx = 3) # to visualize one generated component

Above, the blue line represents the data-generating signal and black dots are noisy observations of the response variable.

dat <- simData@data
str(dat)
# 'data.frame': 96 obs. of  7 variables:
#  $ id        : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 2 2 ...
#  $ age       : num  12 24 36 48 60 72 84 96 12 24 ...
#  $ diseaseAge: num  -36 -24 -12 0 12 24 36 48 -60 -48 ...
#  $ z1        : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 2 2 1 1 ...
#  $ z2        : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 2 2 2 2 ...
#  $ z3        : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 2 2 2 2 ...
#  $ y         : num  -1.507 -2.489 -2.495 -0.381 0.96 ...
simData@effect_times
# $true
#        1        2        3        4        5        6        7        8 
# 34.97856 36.43350 36.51112 35.67571 33.14527 42.46133      NaN      NaN 
#        9       10       11       12 
#      NaN      NaN      NaN      NaN 
# 
# $observed
#   1   2   3   4   5   6   7   8   9  10  11  12 
#  48  72  96  36  48  60 NaN NaN NaN NaN NaN NaN
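
The difference between the two vectors above is the detection delay of each case individual. As a small base R check, using only the values printed above (copied by hand, so they are rounded), one can confirm that every observed initiation time is later than the corresponding true effect time. The variable names are illustrative and not part of lgpr.

```r
# Values copied from the simData@effect_times output above (case individuals 1-6)
t_true <- c(34.97856, 36.43350, 36.51112, 35.67571, 33.14527, 42.46133)
t_obs  <- c(48, 72, 96, 36, 48, 60)

# Detection delay in months for each case individual
delay <- t_obs - t_true
print(round(delay, 2))

# TRUE if every case individual is detected after the true effect time
all(delay > 0)
```

The delays vary widely between individuals, which is why the observed initiation time alone is a poor proxy for the true effect time.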

2 Declaring effect time uncertainty

We will define a formula where the term unc(id)*gp_vm(diseaseAge) declares that the effect time for the nonstationary gp_vm term is uncertain and that one uncertainty parameter is needed for each level of id.

formula <- y ~ zs(id)*gp(age) + gp(age) + unc(id)*gp_vm(diseaseAge) + zs(z1)*gp(age) + zs(z2)*gp(age) + zs(z3)*gp(age)

Because diseaseAge is NaN for the control individuals, it is automatically taken into account that a separate uncertainty parameter is needed only for each case individual.
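
To illustrate how case individuals can be told apart by this NaN convention, here is a minimal base R sketch on a toy data frame. The data values and the names toy, is_control and case_ids are hypothetical. This is not how lgpr does the bookkeeping internally, only a way to see which individuals would need their own effect time parameter.

```r
# Toy data frame (hypothetical values): ids 1-2 are cases, ids 3-4 are controls
toy <- data.frame(
  id         = factor(rep(1:4, each = 3)),
  age        = rep(c(12, 24, 36), times = 4),
  diseaseAge = c(-12, 0, 12, -24, -12, 0, rep(NaN, 6))
)

# An individual is treated as a control if all of its diseaseAge values are NaN
is_control <- sapply(split(is.nan(toy$diseaseAge), toy$id), all)
case_ids   <- names(is_control)[!is_control]

case_ids          # ids that would need their own effect time parameter
length(case_ids)  # number of uncertainty parameters in this toy example
```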

3 Defining the effect time prior

Declaring a temporally uncertain component adds the parameters teff to the model. The vector teff has length equal to the number of case individuals, and we must define a prior for each of its elements. To do this, the prior argument must be a list containing elements named effect_time and effect_time_info. The first one is specified using any of the basic prior definition functions, such as uniform() or normal(). The second one, effect_time_info, must be a named list containing the fields

  • zero - this is a vector with same length as teff, and can be used to move the center of the prior
  • backwards - this is a boolean value, and the prior defined in effect_time will be for
    • (teff - zero) if backwards = FALSE
    • - (teff - zero) if backwards = TRUE
  • lower - this is a vector with same length as teff, and defines the lower bound for each teff parameter
  • upper - this is a vector with same length as teff, and defines the upper bound for each teff parameter

You can also give zero, lower, and upper as a single number, in which case each is turned into a vector that repeats the same value. The prior defined in effect_time is truncated at the lower and upper bounds.
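
The two conventions described above, recycling a single number and placing the prior on a shifted and possibly sign-flipped quantity, can be sketched in base R as follows. The helpers expand_to() and prior_arg() are hypothetical and written only for illustration. They are not lgpr functions.

```r
# Hypothetical helper, not part of lgpr: recycle a single number to length n
expand_to <- function(x, n) {
  if (length(x) == 1) rep(x, n) else x
}

n_case <- 6
zero   <- expand_to(0, n_case)                          # six zeros
lower  <- expand_to(18, n_case)                         # six 18s
upper  <- expand_to(c(48, 72, 96, 36, 48, 60), n_case)  # already a vector

# Hypothetical helper: the quantity that the effect_time prior is placed on
prior_arg <- function(teff, zero, backwards) {
  if (backwards) -(teff - zero) else teff - zero
}

teff <- c(40, 40, 40, 30, 40, 40)                 # candidate effect times
prior_arg(teff, zero = zero,  backwards = FALSE)  # teff itself
prior_arg(teff, zero = upper, backwards = TRUE)   # months before 'upper'
```

With backwards = TRUE and zero set to the upper bound, the transformed quantity is nonnegative for every teff inside the bounds, so a prior with positive support can be used for it.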

3.1 Prior for the effect time directly

We observed the disease onset at ages \(48,72,96,36,48,60\) months for the six case individuals, respectively. If we think that the true effect of the disease occurred for each individual at some time point before the detection of the disease, but not before the age of \(18\) months, we could set the prior as follows.

obs_onset <- c(48,72,96,36,48,60)
lb <- 18
ub <- obs_onset
effect_time_info <- list(zero = 0, backwards = FALSE, lower = lb, upper = ub)
my_prior <- list(
  effect_time = uniform(), # between lb and ub
  effect_time_info = effect_time_info,
  wrp = igam(14, 5) # see how to set this in the 'Basic usage' tutorial
)

3.2 Prior relative to a known time point

It is possible that we want a prior where values closer to the observed onset are more likely than those closer to birth. This can be done by defining for example an exponentially decaying prior for - (teff - obs_onset), as is done here.

lb <- 18
ub <- obs_onset
effect_time_info <- list(zero = ub, backwards = TRUE, lower = lb, upper = ub)
my_prior <- list(
  wrp = igam(14,5),
  effect_time_info = effect_time_info,
  effect_time = gam(shape = 1, inv_scale = 0.05) # = Exponential(rate=0.05)
)

Now our uncertainty priors are for time differences relative to the observed disease initiation time, and the backwards = TRUE argument sets the direction so that the prior runs “backwards” in time. We used gam() because the Gamma distribution with shape = 1 and inv_scale = lambda is equal to the Exponential distribution with rate = lambda.
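
The Gamma and Exponential equivalence can be checked numerically in base R, where the rate argument of dgamma() plays the role of inv_scale. The second half of the sketch draws effect times from the truncated prior by inverse-CDF sampling. That sampling scheme is an illustration under the bounds used above, not the way lgpr or Stan handles the truncation internally, and the names max_gap, gap and teff_draw are hypothetical.

```r
# Gamma(shape = 1, rate = lambda) and Exponential(rate = lambda) share a density
lambda <- 0.05
x <- seq(0, 80, by = 5)
all.equal(dgamma(x, shape = 1, rate = lambda), dexp(x, rate = lambda))

# One prior draw of teff per case individual from the truncated exponential
set.seed(1)
obs_onset <- c(48, 72, 96, 36, 48, 60)
lb <- 18
max_gap <- obs_onset - lb          # largest allowed value of obs_onset - teff
u   <- runif(length(obs_onset))
gap <- qexp(u * pexp(max_gap, rate = lambda), rate = lambda)
teff_draw <- obs_onset - gap       # lies between lb and obs_onset

round(teff_draw, 1)
```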

4 Fitting the model

fit <- lgp(formula   = formula,
            data     = dat,
            prior    = my_prior,
            iter     = 3000,
            chains   = 4,
            cores    = 4,
            verbose  = TRUE)
# Creating model... 
# Parsing formula...
# Formula interpreted as: y ~ zs(id) * gp(age) + gp(age) + unc(id) * gp_vm(diseaseAge) + zs(z1) * gp(age) + zs(z2) * gp(age) + zs(z3) * gp(age)
# Parsing covariates and components... 
# Parsing options... 
# Parsing response and likelihood... 
# Parsing prior...
# User-specified priors found for: {wrp, effect_time, effect_time_info}.
# If any of the following parameters are included in the model, default priors are used for them: {alpha, ell, sigma, phi, beta, gamma}.
# 
# Model created, printing it here. 
# An object of class lgpmodel. See ?lgpmodel for more info.
# Formula: y ~ zs(id) * gp(age) + gp(age) + unc(id) * gp_vm(diseaseAge) + zs(z1) * gp(age) + zs(z2) * gp(age) + zs(z3) * gp(age)
# Likelihood: gaussian
# Data: 96 observations, 7 variables
# 
#                   Component type ker het ns vm unc cat cont
# 1            zs(id)*gp(age)    2   0   0  0  0   0   1    1
# 2                   gp(age)    1   0   0  0  0   0   0    1
# 3 unc(id)*gp_vm(diseaseAge)    1   0   0  1  1   1   0    2
# 4            zs(z1)*gp(age)    2   0   0  0  0   0   2    1
# 5            zs(z2)*gp(age)    2   0   0  0  0   0   3    1
# 6            zs(z3)*gp(age)    2   0   0  0  0   0   4    1
# 
#     Variable #Missing
# 1        age        0
# 2 diseaseAge       48
# 
#   Factor #Levels Values
# 1     id      12    ...
# 2     z1       2   1, 2
# 3     z2       2   1, 2
# 4     z3       2   1, 2
# 
#    Parameter   Bounds                             Prior
# 1   alpha[1] [0, Inf)          alpha[1] ~ student-t(20)
# 2   alpha[2] [0, Inf)          alpha[2] ~ student-t(20)
# 3   alpha[3] [0, Inf)          alpha[3] ~ student-t(20)
# 4   alpha[4] [0, Inf)          alpha[4] ~ student-t(20)
# 5   alpha[5] [0, Inf)          alpha[5] ~ student-t(20)
# 6   alpha[6] [0, Inf)          alpha[6] ~ student-t(20)
# 7     ell[1] [0, Inf)          ell[1] ~ log-normal(0,1)
# 8     ell[2] [0, Inf)          ell[2] ~ log-normal(0,1)
# 9     ell[3] [0, Inf)          ell[3] ~ log-normal(0,1)
# 10    ell[4] [0, Inf)          ell[4] ~ log-normal(0,1)
# 11    ell[5] [0, Inf)          ell[5] ~ log-normal(0,1)
# 12    ell[6] [0, Inf)          ell[6] ~ log-normal(0,1)
# 13    wrp[1] [0, Inf)          wrp[1] ~ inv-gamma(14,5)
# 14  sigma[1] [0, Inf)     (sigma[1])^2 ~ inv-gamma(2,1)
# 15   teff[1] [18, 48]  - (teff[1] - 48) ~ gamma(1,0.05)
# 16   teff[2] [18, 72]  - (teff[2] - 72) ~ gamma(1,0.05)
# 17   teff[3] [18, 96]  - (teff[3] - 96) ~ gamma(1,0.05)
# 18   teff[4] [18, 36]  - (teff[4] - 36) ~ gamma(1,0.05)
# 19   teff[5] [18, 48]  - (teff[5] - 48) ~ gamma(1,0.05)
# 20   teff[6] [18, 60]  - (teff[6] - 60) ~ gamma(1,0.05)
# 
#                                   
# id                     1 2 3 4 5 6
# beta_or_teff_param_idx 1 2 3 4 5 6
# 
# Created on Sun Sep 24 09:29:20 2023 with lgpr 1.2.4. 
# 
# Sampling model... 
# Sampling done. 
# 
# Postprocessing... 
# Computing analytic function posteriors... 
# |   10%|   20%|   30%|   40%|   50%|   60%|   70%|   80%|   90%|  100%| 
# ======================================================================  
# Done. 
# Postprocessing done.

Printing the fit object summarizes the posterior:

print(fit)
# An object of class lgpfit. See ?lgpfit for more info.
# Inference for Stan model: lgp.
# 4 chains, each with iter=3000; warmup=1500; thin=1; 
# post-warmup draws per chain=1500, total post-warmup draws=6000.
# 
#            mean se_mean    sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
# alpha[1]   0.12    0.00  0.09  0.01  0.05  0.10  0.18  0.34  2057 1.00
# alpha[2]   0.84    0.01  0.33  0.41  0.62  0.77  0.99  1.66  2832 1.00
# alpha[3]   1.05    0.01  0.52  0.24  0.68  0.98  1.33  2.28  1728 1.00
# alpha[4]   0.77    0.01  0.37  0.33  0.52  0.68  0.93  1.75  3485 1.00
# alpha[5]   0.16    0.00  0.20  0.00  0.05  0.10  0.20  0.71  3853 1.00
# alpha[6]   0.15    0.00  0.18  0.00  0.04  0.10  0.19  0.66  4542 1.00
# ell[1]     1.64    0.04  2.19  0.12  0.44  0.96  2.00  7.21  3813 1.00
# ell[2]     0.46    0.00  0.16  0.15  0.35  0.46  0.57  0.78  1874 1.00
# ell[3]     1.48    0.03  1.67  0.21  0.61  1.04  1.76  5.61  2795 1.00
# ell[4]     0.91    0.01  0.73  0.17  0.41  0.74  1.16  2.85  2686 1.00
# ell[5]     1.95    0.04  2.45  0.13  0.49  1.16  2.47  8.69  3868 1.00
# ell[6]     1.74    0.04  2.48  0.13  0.49  0.96  2.04  8.01  3807 1.00
# wrp[1]     0.37    0.00  0.10  0.22  0.30  0.36  0.43  0.60  2523 1.00
# sigma[1]   0.53    0.00  0.06  0.42  0.49  0.53  0.57  0.64  1147 1.00
# teff[1,1] 37.24    0.14  7.16 20.96 32.67 38.31 43.10 47.32  2593 1.00
# teff[1,2] 39.83    0.35 10.10 24.81 34.53 37.75 41.42 69.56   852 1.00
# teff[1,3] 56.77    0.77 18.75 31.36 40.67 53.91 69.48 94.17   596 1.01
# teff[1,4] 31.05    0.09  4.25 20.45 28.77 32.31 34.43 35.83  2168 1.00
# teff[1,5] 37.48    0.17  6.70 19.91 34.54 38.73 42.24 46.95  1643 1.00
# teff[1,6] 47.71    0.32 10.35 21.80 41.12 49.56 57.01 59.77  1036 1.00
# 
# Samples were drawn using NUTS(diag_e) at Sun Sep 24 10:07:44 2023.
# For each parameter, n_eff is a crude measure of effective sample size,
# and Rhat is the potential scale reduction factor on split chains (at 
# convergence, Rhat=1).
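
The teff parameters have the lowest effective sample sizes in the table above. As a rough sketch, the n_eff and Rhat values can be copied by hand from the printed output and compared against thresholds. The cutoffs 0.1 and 1.05 used here are illustrative assumptions, not lgpr or rstan defaults, and the data frame teff_diag is a hypothetical name.

```r
# n_eff and Rhat of the teff parameters, copied from the print(fit) output above
teff_diag <- data.frame(
  param = paste0("teff[1,", 1:6, "]"),
  n_eff = c(2593, 852, 596, 2168, 1643, 1036),
  Rhat  = c(1.00, 1.00, 1.01, 1.00, 1.00, 1.00)
)

# Total post-warmup draws: 4 chains with 1500 draws each
total_draws <- 4 * 1500
teff_diag$ratio <- teff_diag$n_eff / total_draws

# Illustrative thresholds (assumptions, not package defaults)
teff_diag$flag <- teff_diag$ratio < 0.1 | teff_diag$Rhat > 1.05
teff_diag
```

A flagged parameter is not necessarily a problem, but it is a natural candidate for running more iterations.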

Printing the model information clarifies the model and priors:

model_summary(fit)
# Formula: y ~ zs(id) * gp(age) + gp(age) + unc(id) * gp_vm(diseaseAge) + zs(z1) * gp(age) + zs(z2) * gp(age) + zs(z3) * gp(age)
# Likelihood: gaussian
# Data: 96 observations, 7 variables
# 
#                   Component type ker het ns vm unc cat cont
# 1            zs(id)*gp(age)    2   0   0  0  0   0   1    1
# 2                   gp(age)    1   0   0  0  0   0   0    1
# 3 unc(id)*gp_vm(diseaseAge)    1   0   0  1  1   1   0    2
# 4            zs(z1)*gp(age)    2   0   0  0  0   0   2    1
# 5            zs(z2)*gp(age)    2   0   0  0  0   0   3    1
# 6            zs(z3)*gp(age)    2   0   0  0  0   0   4    1
# 
#     Variable #Missing
# 1        age        0
# 2 diseaseAge       48
# 
#   Factor #Levels Values
# 1     id      12    ...
# 2     z1       2   1, 2
# 3     z2       2   1, 2
# 4     z3       2   1, 2
# 
#    Parameter   Bounds                             Prior
# 1   alpha[1] [0, Inf)          alpha[1] ~ student-t(20)
# 2   alpha[2] [0, Inf)          alpha[2] ~ student-t(20)
# 3   alpha[3] [0, Inf)          alpha[3] ~ student-t(20)
# 4   alpha[4] [0, Inf)          alpha[4] ~ student-t(20)
# 5   alpha[5] [0, Inf)          alpha[5] ~ student-t(20)
# 6   alpha[6] [0, Inf)          alpha[6] ~ student-t(20)
# 7     ell[1] [0, Inf)          ell[1] ~ log-normal(0,1)
# 8     ell[2] [0, Inf)          ell[2] ~ log-normal(0,1)
# 9     ell[3] [0, Inf)          ell[3] ~ log-normal(0,1)
# 10    ell[4] [0, Inf)          ell[4] ~ log-normal(0,1)
# 11    ell[5] [0, Inf)          ell[5] ~ log-normal(0,1)
# 12    ell[6] [0, Inf)          ell[6] ~ log-normal(0,1)
# 13    wrp[1] [0, Inf)          wrp[1] ~ inv-gamma(14,5)
# 14  sigma[1] [0, Inf)     (sigma[1])^2 ~ inv-gamma(2,1)
# 15   teff[1] [18, 48]  - (teff[1] - 48) ~ gamma(1,0.05)
# 16   teff[2] [18, 72]  - (teff[2] - 72) ~ gamma(1,0.05)
# 17   teff[3] [18, 96]  - (teff[3] - 96) ~ gamma(1,0.05)
# 18   teff[4] [18, 36]  - (teff[4] - 36) ~ gamma(1,0.05)
# 19   teff[5] [18, 48]  - (teff[5] - 48) ~ gamma(1,0.05)
# 20   teff[6] [18, 60]  - (teff[6] - 60) ~ gamma(1,0.05)
# 
#                                   
# id                     1 2 3 4 5 6
# beta_or_teff_param_idx 1 2 3 4 5 6
# 
# Created on Sun Sep 24 09:29:20 2023 with lgpr 1.2.4.
rstan::get_elapsed_time(fit@stan_fit)
#          warmup   sample
# chain:1 1301.14 1002.390
# chain:2 1526.04  718.802
# chain:3 1318.06  805.970
# chain:4 1252.64  892.494
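
The elapsed times are reported in seconds, separately for warmup and sampling. A small base R sketch, with the numbers copied by hand from the output above, adds them up per chain. Taking the slowest chain as an approximation of the wall-clock time assumes that the four chains really ran in parallel, as requested with cores = 4.

```r
# Elapsed times in seconds, copied from the get_elapsed_time() output above
elapsed <- matrix(
  c(1301.14, 1002.390,
    1526.04,  718.802,
    1318.06,  805.970,
    1252.64,  892.494),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("chain:", 1:4), c("warmup", "sample"))
)

per_chain <- rowSums(elapsed)        # total seconds per chain
wall_min  <- max(per_chain) / 60     # slowest chain, in minutes

round(per_chain / 60, 1)
round(wall_min, 1)
```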

5 Visualizing the inferred effect times

We can visualize the inferred effect times for each case individual. We see that for individuals 2 and 3 the inferred effect time is much earlier than the observed one.

plot_effect_times(fit) + xlab('Age (months)')
#                                   
# id                     1 2 3 4 5 6
# beta_or_teff_param_idx 1 2 3 4 5 6
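
The same comparison can be made numerically. The sketch below copies the posterior medians (the 50% column of the print(fit) output), the observed initiation times, and the true effect times rounded to two decimals, and computes how far each median lies before the observed time. The data frame cmp and its column names are hypothetical.

```r
# Values copied by hand from the outputs shown earlier in this tutorial
cmp <- data.frame(
  id       = 1:6,
  true     = c(34.98, 36.43, 36.51, 35.68, 33.15, 42.46),
  median   = c(38.31, 37.75, 53.91, 32.31, 38.73, 49.56),
  observed = c(48, 72, 96, 36, 48, 60)
)

# Months by which the posterior median precedes the observed initiation time
cmp$shift <- cmp$observed - cmp$median

# Individuals ordered from the largest shift to the smallest
cmp[order(-cmp$shift), ]
```

Individuals 3 and 2 come first in this ordering, in line with the plot.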

Finally, we plot the inferred disease component:

t <- seq(0, 100, by = 1)
x_pred <- new_x(dat, t, x_ns = 'diseaseAge')
p <- pred(fit, x_pred, verbose = FALSE)
plot_f(fit, pred = p, comp_idx = 3, color_by = 'diseaseAge')  + xlab('Age (months)')

6 Computing environment

sessionInfo()
# R version 4.1.2 (2021-11-01)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Linux Mint 21.1
# 
# Matrix products: default
# BLAS:   /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
# LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
# 
# locale:
#  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#  [5] LC_MONETARY=fi_FI.UTF-8    LC_MESSAGES=en_US.UTF-8   
#  [7] LC_PAPER=fi_FI.UTF-8       LC_NAME=C                 
#  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
# [11] LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C       
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] rstan_2.26.23       StanHeaders_2.26.28 ggplot2_3.4.3      
# [4] lgpr_1.2.4          rmarkdown_2.23     
# 
# loaded via a namespace (and not attached):
#  [1] tidyselect_1.2.0     xfun_0.39            QuickJSR_1.0.6      
#  [4] bslib_0.5.0          reshape2_1.4.4       colorspace_2.1-0    
#  [7] vctrs_0.6.3          generics_0.1.3       htmltools_0.5.5     
# [10] stats4_4.1.2         loo_2.6.0            yaml_2.3.7          
# [13] utf8_1.2.3           rlang_1.1.1          pkgbuild_1.4.2      
# [16] jquerylib_0.1.4      pillar_1.9.0         glue_1.6.2          
# [19] withr_2.5.0          distributional_0.3.2 plyr_1.8.8          
# [22] matrixStats_1.0.0    lifecycle_1.0.3      stringr_1.5.0       
# [25] posterior_1.4.1      munsell_0.5.0        gtable_0.3.4        
# [28] codetools_0.2-18     evaluate_0.21        labeling_0.4.3      
# [31] inline_0.3.19        knitr_1.43           callr_3.7.3         
# [34] fastmap_1.1.1        ps_1.7.5             parallel_4.1.2      
# [37] fansi_1.0.4          bayesplot_1.10.0     highr_0.10          
# [40] rstantools_2.3.1.1   Rcpp_1.0.11          backports_1.4.1     
# [43] checkmate_2.2.0      scales_1.2.1         cachem_1.0.8        
# [46] RcppParallel_5.1.7   jsonlite_1.8.7       abind_1.4-5         
# [49] farver_2.1.1         gridExtra_2.3        tensorA_0.36.2      
# [52] digest_0.6.33        stringi_1.7.12       processx_3.8.2      
# [55] dplyr_1.1.3          grid_4.1.2           cli_3.6.1           
# [58] tools_4.1.2          magrittr_2.0.3       sass_0.4.6          
# [61] tibble_3.2.1         crayon_1.5.2         pkgconfig_2.0.3     
# [64] MASS_7.3-55          prettyunits_1.1.1    ggridges_0.5.4      
# [67] R6_2.5.1             compiler_4.1.2