require("lgpr")
# Loading required package: lgpr
# Attached lgpr 1.2.5, using rstan 2.32.7. Type ?lgpr to get started.
require("ggplot2")
# Loading required package: ggplot2
require("rstan")
# Loading required package: rstan
# Loading required package: StanHeaders
#
# rstan version 2.32.7 (Stan version 2.32.2)
# For execution on a local, multicore CPU with excess RAM we recommend calling
# options(mc.cores = parallel::detectCores()).
# To avoid recompilation of unchanged Stan programs, we recommend calling
# rstan_options(auto_write = TRUE)
# For within-chain threading using `reduce_sum()` or `map_rect()` Stan functions,
# change `threads_per_chain` option:
# rstan_options(threads_per_chain = 1)
In this tutorial we simulate and analyse a test data set which
contains 8 case and 8 control individuals, and the disease effect on
case individuals is modeled using the disease-related age
(diseaseAge) as a covariate. The disease-related age is
defined as age relative to the observed disease initiation. The disease
effect is simulated so that only 4 of the 8 case individuals are
affected.
set.seed(1213)
simData <- simulate_data(
N = 16,
t_data = seq(12, 72, by = 12),
covariates = c(0, 2, 2, 2),
relevances = c(1, 1, 1, 1, 0, 0),
lengthscales = c(18, 24, 1, 18, 18, 18),
t_effect_range = c(46, 48),
snr = 5,
N_affected = 4,
t_jitter = 0
)
plot_sim(simData)
# Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
# ℹ Please use tidy evaluation idioms with `aes()`.
# ℹ See also `vignette("ggplot2-in-packages")` for more information.
# ℹ The deprecated feature was likely used in the lgpr package.
# Please report the issue at <https://github.com/jtimonen/lgpr/issues>.
# This warning is displayed once every 8 hours.
# Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
# - Dots are noisy observations of the response var.
# - Line is the true signal mapped through inv. link fun.
# - Solid vert. line is the real effect time (used to generate signal)
# - Dashed vert. line is the 'observed' disease initiation time
# plot_sim.component(simData,3)
Above, the blue line represents the data-generating signal and black dots are noisy observations of the response variable.
dat <- simData@data
str(dat)
# 'data.frame': 96 obs. of 7 variables:
# $ id : Factor w/ 16 levels "1","2","3","4",..: 1 1 1 1 1 1 2 2 2 2 ...
# $ age : num 12 24 36 48 60 72 12 24 36 48 ...
# $ diseaseAge: num -36 -24 -12 0 12 24 -36 -24 -12 0 ...
# $ z1 : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 1 1 1 1 ...
# $ z2 : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 2 2 2 2 ...
# $ z3 : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 1 1 1 1 ...
# $ y : num 0.858 -0.714 -0.656 -0.677 -1.784 ...
simData@effect_times
# $true
# 1 2 3 4 5 6 7 8
# 46.26274 46.28976 47.70218 46.76020 46.01252 46.17979 46.02512 47.86529
# 9 10 11 12 13 14 15 16
# NaN NaN NaN NaN NaN NaN NaN NaN
#
# $observed
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
# 48 48 48 48 48 48 48 48 NaN NaN NaN NaN NaN NaN NaN NaN
We will define a formula where the term
het(id)*gp_vm(diseaseAge) declares that the
gp_vm term is heterogeneous and one level-specific
magnitude parameter is needed for each level of id.
formula <- y ~ zs(id) * gp(age) + gp(age) + het(id) * gp_vm(diseaseAge) + zs(z1) * gp(age) + zs(z2) * gp(age) + zs(z3) * gp(age)
Because diseaseAge is NaN for the control
individuals, it is automatically taken into account that an
individual-specific magnitude parameter is actually needed just for each
case individual.
fit <- lgp(
formula = formula,
data = dat,
prior = list(wrp = igam(14, 5)),
iter = 2000,
chains = 4,
cores = 4
)
Printing the fit object summarizes the posterior
print(fit)
# An object of class lgpfit. See ?lgpfit for more info.
# Inference for Stan model: lgp.
# 4 chains, each with iter=2000; warmup=1000; thin=1;
# post-warmup draws per chain=1000, total post-warmup draws=4000.
#
# mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
# alpha[1] 0.35 0.00 0.10 0.15 0.29 0.35 0.42 0.57 1622 1
# alpha[2] 0.75 0.01 0.36 0.33 0.50 0.66 0.90 1.73 2656 1
# alpha[3] 1.50 0.01 0.53 0.74 1.12 1.41 1.78 2.72 3945 1
# alpha[4] 0.50 0.01 0.35 0.12 0.27 0.41 0.63 1.42 3118 1
# alpha[5] 0.30 0.00 0.27 0.02 0.13 0.22 0.37 1.03 4241 1
# alpha[6] 0.38 0.00 0.30 0.05 0.19 0.30 0.47 1.22 3915 1
# ell[1] 1.51 0.03 1.54 0.46 0.83 1.13 1.67 4.94 1934 1
# ell[2] 0.68 0.01 0.31 0.15 0.46 0.66 0.87 1.33 2157 1
# ell[3] 0.91 0.01 0.66 0.12 0.40 0.77 1.23 2.63 3518 1
# ell[4] 1.93 0.03 1.64 0.38 1.04 1.56 2.29 5.79 2516 1
# ell[5] 1.85 0.05 2.44 0.16 0.57 1.04 2.12 8.19 2672 1
# ell[6] 2.33 0.05 2.68 0.33 0.92 1.51 2.77 8.97 2520 1
# wrp[1] 0.36 0.00 0.10 0.22 0.29 0.34 0.40 0.60 3373 1
# beta[1,1] 0.81 0.00 0.25 0.16 0.68 0.94 1.00 1.00 4435 1
# beta[1,2] 0.87 0.00 0.21 0.29 0.81 0.98 1.00 1.00 4705 1
# beta[1,3] 0.86 0.00 0.20 0.28 0.80 0.97 1.00 1.00 5002 1
# beta[1,4] 0.90 0.00 0.16 0.42 0.88 0.98 1.00 1.00 5430 1
# beta[1,5] 0.02 0.00 0.06 0.00 0.00 0.00 0.01 0.16 3008 1
# beta[1,6] 0.08 0.00 0.17 0.00 0.00 0.01 0.08 0.64 2639 1
# beta[1,7] 0.01 0.00 0.03 0.00 0.00 0.00 0.00 0.08 4552 1
# beta[1,8] 0.10 0.00 0.17 0.00 0.00 0.02 0.11 0.66 2649 1
# sigma[1] 0.45 0.00 0.05 0.35 0.41 0.45 0.48 0.56 2140 1
#
# Samples were drawn using NUTS(diag_e) at Thu Oct 30 21:16:56 2025.
# For each parameter, n_eff is a crude measure of effective sample size,
# and Rhat is the potential scale reduction factor on split chains (at
# convergence, Rhat=1).
Printing the model information clarifies the model and priors
model_summary(fit)
# Formula: y ~ zs(id) * gp(age) + gp(age) + het(id) * gp_vm(diseaseAge) + zs(z1) * gp(age) + zs(z2) * gp(age) + zs(z3) * gp(age)
# Likelihood: gaussian
# Data: 96 observations, 7 variables
#
# Component type ker het ns vm unc cat cont
# 1 zs(id)*gp(age) 2 0 0 0 0 0 1 1
# 2 gp(age) 1 0 0 0 0 0 0 1
# 3 het(id)*gp_vm(diseaseAge) 1 0 1 1 1 0 0 2
# 4 zs(z1)*gp(age) 2 0 0 0 0 0 2 1
# 5 zs(z2)*gp(age) 2 0 0 0 0 0 3 1
# 6 zs(z3)*gp(age) 2 0 0 0 0 0 4 1
#
# Variable #Missing
# 1 age 0
# 2 diseaseAge 48
#
# Factor #Levels Values
# 1 id 16 ...
# 2 z1 2 1, 2
# 3 z2 2 1, 2
# 4 z3 2 1, 2
#
# Parameter Bounds Prior
# 1 alpha[1] [0, Inf) alpha[1] ~ student-t(20)
# 2 alpha[2] [0, Inf) alpha[2] ~ student-t(20)
# 3 alpha[3] [0, Inf) alpha[3] ~ student-t(20)
# 4 alpha[4] [0, Inf) alpha[4] ~ student-t(20)
# 5 alpha[5] [0, Inf) alpha[5] ~ student-t(20)
# 6 alpha[6] [0, Inf) alpha[6] ~ student-t(20)
# 7 ell[1] [0, Inf) ell[1] ~ log-normal(0,1)
# 8 ell[2] [0, Inf) ell[2] ~ log-normal(0,1)
# 9 ell[3] [0, Inf) ell[3] ~ log-normal(0,1)
# 10 ell[4] [0, Inf) ell[4] ~ log-normal(0,1)
# 11 ell[5] [0, Inf) ell[5] ~ log-normal(0,1)
# 12 ell[6] [0, Inf) ell[6] ~ log-normal(0,1)
# 13 wrp[1] [0, Inf) wrp[1] ~ inv-gamma(14,5)
# 14 sigma[1] [0, Inf) (sigma[1])^2 ~ inv-gamma(2,1)
# 15 beta[1-8] [0, 1] beta[1-8] ~ beta(0.2, 0.2)
#
#
# id 1 2 3 4 5 6 7 8
# beta_or_teff_param_idx 1 2 3 4 5 6 7 8
#
# Created on Thu Oct 30 21:12:44 2025 with lgpr 1.2.5.
We can visualize the posterior distribution of the individual-specific disease effect magnitude parameters for each case individual. We see that for individuals 1-4 the parameters are close to 1 (affected individuals) and for individuals 5-8 there is considerable posterior mass close to 0.
plot_beta(fit)
#
# id 1 2 3 4 5 6 7 8
# beta_or_teff_param_idx 1 2 3 4 5 6 7 8
Finally we plot the inferred disease component, using posterior median hyperparameters, and see that the inferred effects have the same shape for all individuals, but are heterogeneous in magnitude.
t <- seq(0, 100, by = 1)
x_pred <- new_x(dat, t, x_ns = "diseaseAge")
p <- pred(fit, x_pred, verbose = FALSE, reduce = stats::median)
plot_f(fit, pred = p, comp_idx = 3, color_by = "diseaseAge")
sessionInfo()
# R version 4.4.1 (2024-06-14)
# Platform: aarch64-apple-darwin20
# Running under: macOS Sonoma 14.6.1
#
# Matrix products: default
# BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
# LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#
# locale:
# [1] C/UTF-8/C/C/C/C
#
# time zone: Europe/Helsinki
# tzcode source: internal
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] rstan_2.32.7 StanHeaders_2.32.10 ggplot2_4.0.0
# [4] lgpr_1.2.5 rmarkdown_2.30
#
# loaded via a namespace (and not attached):
# [1] tensorA_0.36.2.1 sass_0.4.10 generics_0.1.4
# [4] stringi_1.8.7 digest_0.6.37 magrittr_2.0.4
# [7] evaluate_1.0.5 grid_4.4.1 RColorBrewer_1.1-3
# [10] fastmap_1.2.0 plyr_1.8.9 jsonlite_2.0.0
# [13] pkgbuild_1.4.8 backports_1.5.0 gridExtra_2.3
# [16] QuickJSR_1.8.1 scales_1.4.0 codetools_0.2-20
# [19] jquerylib_0.1.4 abind_1.4-8 cli_3.6.5
# [22] crayon_1.5.3 rlang_1.1.6 withr_3.0.2
# [25] cachem_1.1.0 yaml_2.3.10 tools_4.4.1
# [28] inline_0.3.21 parallel_4.4.1 reshape2_1.4.4
# [31] checkmate_2.3.3 rstantools_2.5.0 dplyr_1.1.4
# [34] colorspace_2.1-2 curl_6.2.3 posterior_1.6.1
# [37] vctrs_0.6.5 R6_2.6.1 matrixStats_1.5.0
# [40] stats4_4.4.1 lifecycle_1.0.4 stringr_1.5.2
# [43] V8_6.0.0 MASS_7.3-60.2 pkgconfig_2.0.3
# [46] RcppParallel_5.1.11-1 pillar_1.11.1 bslib_0.9.0
# [49] gtable_0.3.6 loo_2.8.0 glue_1.8.0
# [52] Rcpp_1.1.0 xfun_0.53 tibble_3.3.0
# [55] tidyselect_1.2.1 knitr_1.50 farver_2.1.2
# [58] bayesplot_1.14.0 htmltools_0.5.8.1 labeling_0.4.3
# [61] compiler_4.4.1 S7_0.2.0 distributional_0.5.0