using Pumas
using Copulas
using AlgebraOfGraphics
using CairoMakie
using PairPlots
using StatsBase
using SummaryTables
using LinearAlgebra
using Random
using Statistics
# Set seed for reproducibility
Random.seed!(42)
Virtual Population Generation using Gaussian Copulas
This is Tutorial 4 of 6 in the Tumor Burden series.
| # | Tutorial | Status |
|---|---|---|
| 1 | Model Introduction | ✓ Complete |
| 2 | Global Sensitivity Analysis | ✓ Complete |
| 3 | Structural Identifiability | ✓ Complete |
| 4 | Copula Vpop Generation (current) | You are here |
| 5 | MILP Calibration | |
| 6 | VCT Simulation |
1 Learning Outcomes
Virtual populations are essential for conducting In Silico Clinical Trials (ISCTs). When parameters are estimated using Nonlinear Mixed Effects (NLME) models, they exhibit correlations that must be preserved when generating new virtual patients (Cortés-Ríos et al., 2025).
In this tutorial, you will learn:
- Why correlated parameter sampling matters in NLME modeling
- The theory behind Gaussian copulas and how they preserve correlations
- How to apply logit-normal transformations for bounded parameters
- A step-by-step implementation of the copula-based sampling algorithm
- How to validate that generated populations preserve target correlations
- Visualization techniques for virtual population characteristics
2 Prerequisites
This tutorial builds on the Tumor Burden Model introduced in Tutorial 1. We use the same parameter specifications.
3 Background
3.1 The Problem: Why Not Just Sample Independently?
Imagine you have fitted an NLME model and obtained estimates for three parameters: treatment-sensitive fraction (f), tumor growth rate (g), and death rate (k). Each parameter has a mean, variance, and importantly— correlations with other parameters.
A naive approach would be to sample each parameter independently from its marginal distribution. But this approach destroys the correlation structure that exists in the real patient population.
Consider the tumor burden model from Qi & Cao (2022). The parameters f and g have a correlation of r = -0.64, meaning:
- Patients with high f (more treatment-sensitive cells) tend to have lower g (slower tumor growth)
- This biological relationship must be preserved in the virtual population
If we sample f and g independently:
- The correlation will be approximately 0 instead of -0.64
- The virtual population won’t match real clinical heterogeneity
- ISCT predictions will be biased
3.2 What is a Copula?
A copula is a mathematical function that joins univariate marginal distributions to form a multivariate distribution. The key theorem is Sklar’s theorem (Sklar, 1959):
Any multivariate joint distribution can be written as a function of its marginal distributions and a copula that describes the dependence structure.
This means we can:
- Separate the marginal distributions (how each parameter is distributed individually)
- Capture the dependence structure in a copula
- Combine them to generate correlated samples
3.3 Why Gaussian Copulas?
The Gaussian copula uses the multivariate normal distribution to capture correlations (Nelsen, 2006). It has several advantages:
| Feature | Benefit |
|---|---|
| Easy to specify | Just need correlation matrix |
| Spearman correlation preserved | Rank correlations survive monotonic transformations |
| Works with any marginals | Can use normal, lognormal, logit-normal, etc. |
| Well-understood | Extensive statistical literature |
The Gaussian copula preserves Spearman (rank) correlations through any monotonic transformation. This is crucial because sampling with a copula consists of
- Sampling uniformly distributed values in the unit hypercube from the copula with the specified correlation structure
- Transforming sampled values to the target marginal distributions (e.g., logit-normal) by applying the respective quantile functions
Pearson correlations are NOT preserved through nonlinear transforms, but Spearman correlations are.
4 Libraries
Load the necessary packages for this tutorial.
What each package does:
Pumas: Pharmacometric modeling and simulationCopulas: Implements various copula families including Gaussian copulasStatsBase: Spearman correlation computationAlgebraOfGraphics+CairoMakie: Publication-quality visualizationsPairPlots: Corner plots showing pairwise relationships and marginal distributionsSummaryTables: Clean rendering of DataFrames in documents
5 Part 1: Parameter Specifications
We use the tumor burden model parameters from Tutorial 1 and Qi & Cao (2022).
5.1 Model Recap
From Tutorial 1, the tumor burden model describes:
\[N(t) = N_0 f \exp{(-k t)} + N_0 (1-f) \exp{(g t)}\]
The full non-linear mixed effect model is:
tumor_burden_model = @model begin
@metadata begin
desc = "Tumor Burden Model for NSCLC Chemotherapy"
timeu = u"d" # Time unit is days
end
@param begin
# Typical values (population parameters)
tvf ∈ RealDomain(lower = 0.0, upper = 1.0)
tvg ∈ RealDomain(lower = 0.0)
tvk ∈ RealDomain(lower = 0.0)
# Inter-individual variability (diagonal covariance matrix)
# Values on the transformed scale (logit for f, log for g and k)
Ω ∈ PDiagDomain(3)
# Residual error (standard deviation)
σ ∈ RealDomain(lower = 0.0)
end
@random begin
# Random effects vector (one per parameter)
η ~ MvNormal(Ω)
end
@covariates treatment
@pre begin
# Transform random effects to individual parameters
# f: logit-normal transformation (bounded 0-1)
f = logistic(logit(tvf) + η[1])
# g: log-normal transformation (positive value)
g = tvg * exp(η[2])
# k: log-normal transformation (positive value)
# Additionally, treatment affects death rate:
# treatment = false → control arm (no killing)
# treatment = true → treatment arm (killing at rate tvk * exp(η[3]))
k = treatment * tvk * exp(η[3])
end
# Initial conditions (normalized tumor = 1)
@init begin
N_sens = f # Sensitive cells
N_res = 1 - f # Resistant cells
end
@dynamics begin
# Sensitive cells: die at rate k (0 if control)
N_sens' = -k * N_sens
# Resistant cells: grow at rate g
N_res' = g * N_res
end
@derived begin
# Total normalized tumor size
Nt := @. N_sens + N_res
# Observation with residual error
tumor_size ~ @. Normal(Nt, σ)
end
endPumasModel
Parameters: tvf, tvg, tvk, Ω, σ
Random effects: η
Covariates: treatment
Dynamical system variables: N_sens, N_res
Dynamical system type: Matrix exponential
Derived: tumor_size
Observed: tumor_size
For details on the model biology and parameters, see Tutorial 1: Model Introduction.
5.2 Parameter Values from Literature
The parameter values come from Qi & Cao (2022), who fit this model to NSCLC clinical trial data:
# Population (typical) parameter values
pop_f = 0.27 # 27% of tumor is treatment-sensitive
pop_g = 0.0013 # Resistant cells grow ~0.13% per day
pop_k = 0.0091 # Sensitive cells die ~0.91% per day
# Inter-individual variability (standard deviations on transformed scale)
ω_f = 2.16 # High variability in sensitive fraction
ω_g = 1.57 # Moderate variability in growth rate
ω_k = 1.24 # Moderate variability in death rate
# Residual error (standard deviation)
resid_σ = 0.095
# Critical correlation from the data
cor_f_g = -0.64 # Negative correlation: high f ↔ low gThe correlation \(\rho(f, g) = -0.64\) is biologically meaningful:
Patients with more sensitive cells (high f) tend to have slower-growing resistant cells (low g).
Possible explanations:
- Tumor heterogeneity: Aggressive tumors may have both fewer sensitive cells AND faster-growing resistant cells
- Evolutionary pressure: In highly proliferative tumors, resistance mechanisms may be more prevalent
- Stem cell fraction: Resistant cells may correlate with cancer stem cell populations
This correlation is crucial for realistic virtual population generation (Tutorial 4).
6 Part 2: Why Independent Sampling Fails
Let us demonstrate why independent sampling breaks the correlation structure.
patients = [
Subject(;
id,
time = [0.0],
covariates = (; treatment = true),
) for id in 1:10_000
]
param = (;
tvf = pop_f,
tvg = pop_g,
tvk = pop_k,
Ω = Diagonal([ω_f^2, ω_g^2, ω_k^2]),
σ = resid_σ,
)
sims = simobs(tumor_burden_model, patients, param; simulate_error = false)Simulated population (Vector{<:Subject})
Simulated subjects: 10000
Simulated variables: tumor_size
vpop_indep = postprocess(sims) do generated, _
(;
f = only(unique(generated.f)),
g = only(unique(generated.g)),
k = only(unique(generated.k)),
)
end |> DataFrameThe correlation of the samples of f and g is
cor(vpop_indep.f, vpop_indep.g)-0.004113176838778901
corspearman(vpop_indep.f, vpop_indep.g)-0.006267630122676301
but the expected correlation was -0.64!
The independently sampled population has essentially zero correlation between f and g, when the true correlation should be -0.64. This would lead to a virtual population that doesn’t represent real patient heterogeneity.
7 Part 3: The Gaussian Copula Algorithm
The copula-based sampling algorithm is illustrated in Figure 1.
In Pumas this algorithm corresponds to the following steps:
flowchart TB
A["Step 1: Define Model Parameters"] --> B["Step 2: Define Correlation Matrix<br/>(Spearman correlations)"]
B --> C["Step 3: Create Gaussian Copula<br/>with correlation matrix"]
C --> D["Step 4: Create Joint Distribution of Random Effects<br/>from marginals and copula"]
D --> E["Step 5: Sample Correlated Random Effects"]
E --> F["Step 6: Perform Model Simulation"]
style A fill:#e1f5fe,stroke:#01579b
style F fill:#e8f5e9,stroke:#2e7d32
7.1 Step 1: Define Model Parameters
Already done above.
7.2 Step 2: Define the Correlation Matrix
Spearman correlation matrix from NLME estimation:
# r(f,g) = -0.64, r(f,k) = 0, r(g,k) = 0
Rho = [
1.0 cor_f_g 0.0
cor_f_g 1.0 0.0
0.0 0.0 1.0
]3×3 Matrix{Float64}:
1.0 -0.64 0.0
-0.64 1.0 0.0
0.0 0.0 1.0
7.3 Step 3: Create the Gaussian Copula
Create Gaussian copula with the specified correlation structure:
copula = GaussianCopula(Rho)Copulas.GaussianCopula{3, Matrix{Float64}}(Σ = [1.0 -0.64 0.0; -0.64 1.0 0.0; 0.0 0.0 1.0]))
7.4 Step 4: Create the Joint Distribution
correlated_randeffs_dist = SklarDist(
copula,
(Normal(0, ω_f), Normal(0, ω_g), Normal(0, ω_k)),
)Copulas.SklarDist{Copulas.GaussianCopula{3, Matrix{Float64}}, Tuple{Distributions.Normal{Float64}, Distributions.Normal{Float64}, Distributions.Normal{Float64}}}(
C: Copulas.GaussianCopula{3, Matrix{Float64}}(Σ = [1.0 -0.64 0.0; -0.64 1.0 0.0; 0.0 0.0 1.0]))
m: (Distributions.Normal{Float64}(μ=0.0, σ=2.16), Distributions.Normal{Float64}(μ=0.0, σ=1.57), Distributions.Normal{Float64}(μ=0.0, σ=1.24))
)
7.6 Step 6: Perform Model Simulation
We now have correlated random effects that we can use for performing model simulations.
vrandeffs = [(; η) for η in eachcol(correlated_randeffs)]
sims = simobs(
tumor_burden_model,
patients,
param,
vrandeffs;
simulate_error = false,
)Simulated population (Vector{<:Subject})
Simulated subjects: 10000
Simulated variables: tumor_size
We create a DataFrame of the simulated parameters f, g and k of the dynamical system:
vpop = postprocess(sims) do generated, _
(;
f = only(unique(generated.f)),
g = only(unique(generated.g)),
k = only(unique(generated.k)),
)
end |> DataFrame
table_one(vpop)| Total | |
| f | |
| Mean (SD) | 0.368 (0.312) |
| Median [Min, Max] | 0.288 [3.8e-5, 0.999] |
| g | |
| Mean (SD) | 0.00438 (0.0139) |
| Median [Min, Max] | 0.00128 [3.88e-6, 0.551] |
| k | |
| Mean (SD) | 0.0189 (0.0351) |
| Median [Min, Max] | 0.00893 [5.96e-5, 1] |
8 Part 4: Validation
8.1 Check Spearman Correlations
The most critical validation is checking that our target correlations are preserved.
Compute observed Spearman correlations:
param_matrix = Matrix(vpop[:, [:f, :g, :k]])
observed_corr = corspearman(param_matrix)3×3 Matrix{Float64}:
1.0 -0.621919 0.0108949
-0.621919 1.0 0.00207793
0.0108949 0.00207793 1.0
The absolute difference from the target correlation matrix is:
@. abs(observed_corr - Rho)3×3 Matrix{Float64}:
0.0 0.0180807 0.0108949
0.0180807 0.0 0.00207793
0.0108949 0.00207793 0.0
The observed correlations closely match the expected correlations. The small differences are due to sampling variability and decrease with larger sample sizes.
8.2 Check Marginal Distributions
Let’s verify that f, g and k follow their expected distributions by comparing the expected and observed median and omega (standard deviation of untransformed sample):
marginals_df = DataFrame(
parameter = ["f", "g", "k"],
median_exp = [pop_f, pop_g, pop_k],
median_obs = [median(vpop.f), median(vpop.g), median(vpop.k)],
omega_exp = [ω_f, ω_g, ω_k],
omega_obs = [std(logit.(vpop.f)), std(log.(vpop.g)), std(log.(vpop.k))],
)
simple_table(
marginals_df,
[
:parameter => "Parameter",
:median_exp => "Median (expected)",
:median_obs => "Median (observed)",
:omega_exp => "ω (expected)",
:omega_obs => "ω (observed)",
]
)| Parameter | Median (expected) | Median (observed) | ω (expected) | ω (observed) |
| f | 0.27 | 0.288 | 2.16 | 2.19 |
| g | 0.0013 | 0.00128 | 1.57 | 1.58 |
| k | 0.0091 | 0.00893 | 1.24 | 1.24 |
9 Part 5: Visualization
9.1 Parameter Distributions
specs = mapping(
[:f, :g, :k] .=> ["f (sensitive fraction)", "g (growth rate, d⁻¹)", "k (death rate, d⁻¹)"],
col = dims(1),
) * (
data(vpop) * histogram(; normalization = :pdf, datalimits = extrema) +
data((; f = [pop_f], g = [pop_g], k = [pop_k])) * visual(VLines; label = "expected median")
)
draw(
specs;
axis = (; xlabel = "Value", ylabel = "Density"),
facet = (; linkxaxes = false, linkyaxes = false),
legend = (; position = :bottom),
)9.2 Correlation Structure: Independent vs Copula Sampling
# Combine both methods into one DataFrame
vpop_total = vcat(
vpop_indep,
vpop;
source = "method" => ["Independent (r ≈ 0)", "Copula (r = -0.64)"],
)
# Create scatter plot with faceting by method
plt_scatter = data(vpop_total) * mapping(
:f => "f (sensitive fraction)",
:g => "g (growth rate, d⁻¹)",
layout = :method
) * visual(Scatter; marker = '○')
draw(plt_scatter)The difference is striking:
- Independent Sampling: Points show no apparent relationship between f and g
- Copula-Based Sampling: Points show a clear negative relationship— high f values tend to pair with low g values
This negative correlation is biologically meaningful: tumors with more treatment-sensitive cells tend to grow more slowly.
9.3 Corner Plot: All Pairwise Relationships
pairplot(
vpop => (
PairPlots.Scatter(; rasterize = 1),
PairPlots.MarginHist(),
PairPlots.Calculation(corspearman),
),
labels = Dict(
:f => "f (sensitive fraction)",
:g => "g (growth rate)",
:k => "k (death rate)",
)
)10 Practical Recommendations
When specifying the correlation matrix for a Gaussian copula:
- Use Spearman (rank) correlations from your NLME output
- Pearson correlations can be distorted by nonlinear transformations
- Most NLME software reports correlation matrices on the eta (random effect) scale
| Parameter Type | Bounds | Possible Distribution |
|---|---|---|
| Fractions (F, bioavailability) | [0, 1] | Logit-normal |
| Positive rates (CL, V) | [0, ∞) | Log-normal |
| Bounded positive (with physiological max) | [0, max] | Logit-normal |
| Unbounded | (-∞, ∞) | Normal |
- For ISCTs: 1,000-10,000 virtual patients typically sufficient
- Larger samples give more stable correlation estimates
Always validate your virtual population before running ISCTs:
- Check correlations match expected values
- Check marginal distributions match expected values
- Visualize parameter distributions and correlations
11 Summary
11.1 Key Takeaways
Independent sampling destroys correlations in the virtual population, leading to biased ISCT predictions
Gaussian copulas preserve rank (Spearman) correlations through any monotonic transformation
The algorithm in Pumas:
- Define parameter specifications
- Define Spearman correlation matrix
- Create Gaussian copula
- Create joint distribution from marginals of random effects and copula
- Sample random effects from joint distribution
- Perform simulations using the sampled random effects
- Validate results
Always validate correlations and marginals before using the virtual population
11.2 Quick Reference: Key Functions
| Function | Package | Purpose |
|---|---|---|
corspearman(X) |
StatsBase.jl | Compute Spearman correlations |
GaussianCopula(Rho) |
Copulas.jl | Create copula with correlation matrix |
SklarDist(copula, marginals) |
Copulas.jl | Create joint distribution from marginals and copula |
11.3 What’s Next
In Tutorial 5: MILP Calibration, we will:
- Learn why virtual population calibration is essential for realistic ISCTs
- Formulate calibration as a mixed-integer linear programming (MILP) problem
- Calibrate the virtual population to match clinical response rate distributions
- Validate calibration results with visualizations
