Virtual Population Generation using Gaussian Copulas

Authors

Vijay Ivaturi

David Müller-Widmann

Note: Tumor Burden Tutorial Series

This is Tutorial 4 of 6 in the Tumor Burden series.

Tutorial Status

| # | Tutorial | Status |
|---|----------|--------|
| 1 | Model Introduction | ✓ Complete |
| 2 | Global Sensitivity Analysis | ✓ Complete |
| 3 | Structural Identifiability | ✓ Complete |
| 4 | Copula Vpop Generation | You are here |
| 5 | MILP Calibration | |
| 6 | VCT Simulation | |

1 Learning Outcomes

Virtual populations are essential for conducting In Silico Clinical Trials (ISCTs). When parameters are estimated using Nonlinear Mixed Effects (NLME) models, they exhibit correlations that must be preserved when generating new virtual patients (Cortés-Ríos et al., 2025).

In this tutorial, you will learn:

  • Why correlated parameter sampling matters in NLME modeling
  • The theory behind Gaussian copulas and how they preserve correlations
  • How to apply logit-normal transformations for bounded parameters
  • A step-by-step implementation of the copula-based sampling algorithm
  • How to validate that generated populations preserve target correlations
  • Visualization techniques for virtual population characteristics

2 Prerequisites

This tutorial builds on the Tumor Burden Model introduced in Tutorial 1. We use the same parameter specifications.

3 Background

3.1 The Problem: Why Not Just Sample Independently?

Imagine you have fitted an NLME model and obtained estimates for three parameters: treatment-sensitive fraction (f), tumor growth rate (g), and death rate (k). Each parameter has a mean, a variance, and, importantly, correlations with the other parameters.

A naive approach would be to sample each parameter independently from its marginal distribution. But this approach destroys the correlation structure that exists in the real patient population.

Warning: The Correlation Problem

Consider the tumor burden model from Qi & Cao (2022). The parameters f and g have a correlation of r = -0.64, meaning:

  • Patients with high f (more treatment-sensitive cells) tend to have lower g (slower tumor growth)
  • This biological relationship must be preserved in the virtual population

If we sample f and g independently:

  • The correlation will be approximately 0 instead of -0.64
  • The virtual population won’t match real clinical heterogeneity
  • ISCT predictions will be biased

3.2 What is a Copula?

A copula is a mathematical function that joins univariate marginal distributions to form a multivariate distribution. The key theorem is Sklar’s theorem (Sklar, 1959):

Any multivariate joint distribution can be written as a function of its marginal distributions and a copula that describes the dependence structure.

This means we can:

  1. Separate the marginal distributions (how each parameter is distributed individually)
  2. Capture the dependence structure in a copula
  3. Combine them to generate correlated samples
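These three steps can be sketched from scratch. The following is a minimal illustration in Python/NumPy (the tutorial's own code uses Julia and Copulas.jl, which wrap exactly this construction); the marginals and names here are illustrative, not the model's:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Dependence structure: correlation matrix for the latent Gaussian
R = np.array([[1.0, -0.64],
              [-0.64, 1.0]])

# Copula step: correlated Uniform(0, 1) samples via a latent MVN
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=R, size=20_000)
u = stats.norm.cdf(z)  # probability integral transform, column-wise

# Marginal step: impose the target marginals via quantile functions
# (log-normal here, standing in for the eta -> parameter transforms)
g = stats.lognorm.ppf(u[:, 0], s=1.57)
k = stats.lognorm.ppf(u[:, 1], s=1.24)

# The rank (Spearman) correlation survives the marginal transforms
rho_s = np.corrcoef(stats.rankdata(g), stats.rankdata(k))[0, 1]
print(f"Spearman correlation of samples: {rho_s:.2f}")
```

The same recipe generalizes to any number of parameters and any continuous marginals with a computable quantile function.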

3.3 Why Gaussian Copulas?

The Gaussian copula uses the multivariate normal distribution to capture correlations (Nelsen, 2006). It has several advantages:

| Feature | Benefit |
|---|---|
| Easy to specify | Just need a correlation matrix |
| Spearman correlation preserved | Rank correlations survive monotonic transformations |
| Works with any marginals | Can use normal, lognormal, logit-normal, etc. |
| Well-understood | Extensive statistical literature |
Tip: Key Insight: Spearman Correlations

The Gaussian copula preserves Spearman (rank) correlations through any monotonic transformation. This is crucial because sampling with a copula consists of two steps:

  1. Sampling uniformly distributed values in the unit hypercube from the copula with the specified correlation structure
  2. Transforming sampled values to the target marginal distributions (e.g., logit-normal) by applying the respective quantile functions

Pearson correlations are NOT preserved through nonlinear transforms, but Spearman correlations are.
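A quick numerical check of this claim (a Python sketch; any language would do): apply a monotonic but strongly nonlinear transform to correlated normal draws and compare the two correlation measures before and after. The `spearman` helper is an illustrative name, not a library function.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
z = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=50_000)
x, y = z[:, 0], z[:, 1]

# Monotonic, nonlinear transform (like the log-normal eta -> parameter map)
xt, yt = np.exp(x), np.exp(3 * y)

def spearman(a, b):
    # Spearman = Pearson correlation of the ranks
    return np.corrcoef(stats.rankdata(a), stats.rankdata(b))[0, 1]

print(np.corrcoef(x, y)[0, 1], np.corrcoef(xt, yt)[0, 1])  # Pearson: ~0.8, then much smaller
print(spearman(x, y), spearman(xt, yt))                    # Spearman: unchanged
```

Because a strictly increasing transform leaves the ranks untouched, the Spearman values agree exactly, while the Pearson correlation collapses under the heavy-tailed transform.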

4 Libraries

Load the necessary packages for this tutorial.

using Pumas
using Copulas
using AlgebraOfGraphics
using CairoMakie
using PairPlots
using StatsBase
using SummaryTables

using LinearAlgebra
using Random
using Statistics

# Set seed for reproducibility
Random.seed!(42)

What each package does:

  • Pumas: Pharmacometric modeling and simulation
  • Copulas: Implements various copula families including Gaussian copulas
  • StatsBase: Spearman correlation computation
  • AlgebraOfGraphics + CairoMakie: Publication-quality visualizations
  • PairPlots: Corner plots showing pairwise relationships and marginal distributions
  • SummaryTables: Clean rendering of DataFrames in documents

5 Part 1: Parameter Specifications

We use the tumor burden model parameters from Tutorial 1 and Qi & Cao (2022).

5.1 Model Recap

From Tutorial 1, the tumor burden model describes:

\[N(t) = N_0 f \exp{(-k t)} + N_0 (1-f) \exp{(g t)}\]
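As a quick sanity check of this closed form (a Python sketch using the Qi & Cao (2022) typical values quoted later in this tutorial; `tumor_burden` is an illustrative name):

```python
import math

def tumor_burden(t, f=0.27, g=0.0013, k=0.0091, N0=1.0):
    """Closed-form N(t): shrinking sensitive pool plus growing resistant pool."""
    return N0 * f * math.exp(-k * t) + N0 * (1 - f) * math.exp(g * t)

print(tumor_burden(0.0))             # baseline, normalized to ~1
print(tumor_burden(90.0) < 1.0)      # burden first shrinks under treatment
print(tumor_burden(2000.0) > 1.0)    # then regrows as resistant cells dominate
```

The biphasic shape (initial shrinkage, later regrowth) falls directly out of the two exponential terms.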

The full non-linear mixed effect model is:

tumor_burden_model = @model begin
    @metadata begin
        desc = "Tumor Burden Model for NSCLC Chemotherapy"
        timeu = u"d"  # Time unit is days
    end

    @param begin
        # Typical values (population parameters)
        tvf ∈ RealDomain(lower = 0.0, upper = 1.0)
        tvg ∈ RealDomain(lower = 0.0)
        tvk ∈ RealDomain(lower = 0.0)

        # Inter-individual variability (diagonal covariance matrix)
        # Values on the transformed scale (logit for f, log for g and k)
        Ω ∈ PDiagDomain(3)

        # Residual error (standard deviation)
        σ ∈ RealDomain(lower = 0.0)
    end

    @random begin
        # Random effects vector (one per parameter)
        η ~ MvNormal(Ω)
    end

    @covariates treatment

    @pre begin
        # Transform random effects to individual parameters
        # f: logit-normal transformation (bounded 0-1)
        f = logistic(logit(tvf) + η[1])
        # g: log-normal transformation (positive value)
        g = tvg * exp(η[2])
        # k: log-normal transformation (positive value)
        # Additionally, treatment affects death rate:
        # treatment = false → control arm (no killing)
        # treatment = true → treatment arm (killing at rate tvk * exp(η[3]))
        k = treatment * tvk * exp(η[3])
    end

    # Initial conditions (normalized tumor = 1)
    @init begin
        N_sens = f           # Sensitive cells
        N_res = 1 - f        # Resistant cells
    end

    @dynamics begin
        # Sensitive cells: die at rate k (0 if control)
        N_sens' = -k * N_sens

        # Resistant cells: grow at rate g
        N_res' = g * N_res
    end

    @derived begin
        # Total normalized tumor size
        Nt := @. N_sens + N_res

        # Observation with residual error
        tumor_size ~ @. Normal(Nt, σ)
    end
end
PumasModel
  Parameters: tvf, tvg, tvk, Ω, σ
  Random effects: η
  Covariates: treatment
  Dynamical system variables: N_sens, N_res
  Dynamical system type: Matrix exponential
  Derived: tumor_size
  Observed: tumor_size

For details on the model biology and parameters, see Tutorial 1: Model Introduction.

5.2 Parameter Values from Literature

The parameter values come from Qi & Cao (2022), who fit this model to NSCLC clinical trial data:

# Population (typical) parameter values
pop_f = 0.27    # 27% of tumor is treatment-sensitive
pop_g = 0.0013  # Resistant cells grow ~0.13% per day
pop_k = 0.0091  # Sensitive cells die ~0.91% per day

# Inter-individual variability (standard deviations on transformed scale)
ω_f = 2.16  # High variability in sensitive fraction
ω_g = 1.57  # Moderate variability in growth rate
ω_k = 1.24  # Moderate variability in death rate

# Residual error (standard deviation)
resid_σ = 0.095

# Critical correlation from the data
cor_f_g = -0.64  # Negative correlation: high f ↔ low g
Tip: Why the Negative Correlation?

The correlation \(\rho(f, g) = -0.64\) is biologically meaningful:

Patients with more sensitive cells (high f) tend to have slower-growing resistant cells (low g).

Possible explanations:

  1. Tumor heterogeneity: Aggressive tumors may have both fewer sensitive cells AND faster-growing resistant cells
  2. Evolutionary pressure: In highly proliferative tumors, resistance mechanisms may be more prevalent
  3. Stem cell fraction: Resistant cells may correlate with cancer stem cell populations

This correlation is crucial for realistic virtual population generation, which is the subject of this tutorial.

6 Part 2: Why Independent Sampling Fails

Let us demonstrate why independent sampling breaks the correlation structure.

patients = [
    Subject(;
            id,
            time = [0.0],
            covariates = (; treatment = true),
        ) for id in 1:10_000
]
param = (;
    tvf = pop_f,
    tvg = pop_g,
    tvk = pop_k,
    Ω = Diagonal([ω_f^2, ω_g^2, ω_k^2]),
    σ = resid_σ,
)
sims = simobs(tumor_burden_model, patients, param; simulate_error = false)
Simulated population (Vector{<:Subject})
  Simulated subjects: 10000
  Simulated variables: tumor_size
vpop_indep = postprocess(sims) do generated, _
    (;
        f = only(unique(generated.f)),
        g = only(unique(generated.g)),
        k = only(unique(generated.k)),
    )
end |> DataFrame

The correlation of the samples of f and g is

cor(vpop_indep.f, vpop_indep.g)
-0.004113176838778901
corspearman(vpop_indep.f, vpop_indep.g)
-0.006267630122676301

but the expected correlation was -0.64!

Warning: Result: Correlation Lost

The independently sampled population has essentially zero correlation between f and g, when the true correlation should be -0.64. This would lead to a virtual population that doesn’t represent real patient heterogeneity.

7 Part 3: The Gaussian Copula Algorithm

The copula-based sampling algorithm is illustrated in Figure 1.

Figure 1: Schema of copula-based sampling by Cortés-Ríos et al. (2025) (Figure 2).

In Pumas this algorithm corresponds to the following steps:

flowchart TB
    A["Step 1: Define Model Parameters"] --> B["Step 2: Define Correlation Matrix<br/>(Spearman correlations)"]
    B --> C["Step 3: Create Gaussian Copula<br/>with correlation matrix"]
    C --> D["Step 4: Create Joint Distribution of Random Effects<br/>from marginals and copula"]
    D --> E["Step 5: Sample Correlated Random Effects"]
    E --> F["Step 6: Perform Model Simulation"]

    style A fill:#e1f5fe,stroke:#01579b
    style F fill:#e8f5e9,stroke:#2e7d32

7.1 Step 1: Define Model Parameters

Already done above.

7.2 Step 2: Define the Correlation Matrix

Spearman correlation matrix from NLME estimation:

# r(f,g) = -0.64, r(f,k) = 0, r(g,k) = 0
Rho = [
    1.0 cor_f_g 0.0
    cor_f_g 1.0 0.0
    0.0 0.0 1.0
]
3×3 Matrix{Float64}:
  1.0   -0.64  0.0
 -0.64   1.0   0.0
  0.0    0.0   1.0
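One practical caveat when writing this matrix by hand: GaussianCopula requires a valid correlation matrix (symmetric, unit diagonal, positive definite), and a hand-filled matrix with several strong pairwise correlations can easily violate positive definiteness. A quick check, sketched in Python/NumPy for illustration:

```python
import numpy as np

Rho = np.array([[1.0, -0.64, 0.0],
                [-0.64, 1.0, 0.0],
                [0.0, 0.0, 1.0]])

assert np.allclose(Rho, Rho.T), "must be symmetric"
assert np.allclose(np.diag(Rho), 1.0), "must have unit diagonal"

# Positive definite <=> all eigenvalues strictly positive
eigenvalues = np.linalg.eigvalsh(Rho)
is_valid = bool(eigenvalues.min() > 0)
print(eigenvalues, is_valid)
```

If the check fails, the matrix must be repaired (e.g. projected to the nearest valid correlation matrix) before constructing the copula.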

7.3 Step 3: Create the Gaussian Copula

Create Gaussian copula with the specified correlation structure:

copula = GaussianCopula(Rho)
Copulas.GaussianCopula{3, Matrix{Float64}}(Σ = [1.0 -0.64 0.0; -0.64 1.0 0.0; 0.0 0.0 1.0])

7.4 Step 4: Create the Joint Distribution

correlated_randeffs_dist = SklarDist(
    copula,
    (Normal(0, ω_f), Normal(0, ω_g), Normal(0, ω_k)),
)
Copulas.SklarDist{Copulas.GaussianCopula{3, Matrix{Float64}}, Tuple{Distributions.Normal{Float64}, Distributions.Normal{Float64}, Distributions.Normal{Float64}}}(
C: Copulas.GaussianCopula{3, Matrix{Float64}}(Σ = [1.0 -0.64 0.0; -0.64 1.0 0.0; 0.0 0.0 1.0]))
m: (Distributions.Normal{Float64}(μ=0.0, σ=2.16), Distributions.Normal{Float64}(μ=0.0, σ=1.57), Distributions.Normal{Float64}(μ=0.0, σ=1.24))
)

7.5 Step 5: Sample Correlated Random Effects

# nrandeffs x npatients matrix
correlated_randeffs = rand(correlated_randeffs_dist, length(patients))
3×10000 Matrix{Float64}:
 -0.446284   0.226305   2.8318    …  -3.71708   -3.07187  -1.19969
 -0.167261  -1.26999   -0.829734      1.8337     1.04111   1.83408
 -0.050187   1.19742    1.48547       0.548508   1.69084  -0.227118
Note: Understanding Copula Samples

The matrix correlated_randeffs contains values where each random effect is distributed according to a normal distribution with the specified standard deviation:

mean(correlated_randeffs; dims = 2)
3×1 Matrix{Float64}:
  0.046800289493784536
 -0.01854676228731305
 -0.031258101343258064
std(correlated_randeffs; dims = 2)
3×1 Matrix{Float64}:
 2.186562318032789
 1.5796135139991487
 1.24433225771112

Additionally, the samples are correlated according to our Gaussian copula:

corspearman(correlated_randeffs')
3×3 Matrix{Float64}:
  1.0        -0.621919    0.0108949
 -0.621919    1.0         0.00207793
  0.0108949   0.00207793  1.0

7.6 Step 6: Perform Model Simulation

We now have correlated random effects that we can use for performing model simulations.

vrandeffs = [(; η) for η in eachcol(correlated_randeffs)]
sims = simobs(
    tumor_burden_model,
    patients,
    param,
    vrandeffs;
    simulate_error = false,
)
Simulated population (Vector{<:Subject})
  Simulated subjects: 10000
  Simulated variables: tumor_size

We create a DataFrame of the simulated parameters f, g and k of the dynamical system:

vpop = postprocess(sims) do generated, _
    (;
        f = only(unique(generated.f)),
        g = only(unique(generated.g)),
        k = only(unique(generated.k)),
    )
end |> DataFrame

table_one(vpop)
|   | Total |
|---|---|
| f, Mean (SD) | 0.368 (0.312) |
| f, Median [Min, Max] | 0.288 [3.8e-5, 0.999] |
| g, Mean (SD) | 0.00438 (0.0139) |
| g, Median [Min, Max] | 0.00128 [3.88e-6, 0.551] |
| k, Mean (SD) | 0.0189 (0.0351) |
| k, Median [Min, Max] | 0.00893 [5.96e-5, 1] |

8 Part 4: Validation

8.1 Check Spearman Correlations

The most critical validation is checking that our target correlations are preserved.

Compute observed Spearman correlations:

param_matrix = Matrix(vpop[:, [:f, :g, :k]])
observed_corr = corspearman(param_matrix)
3×3 Matrix{Float64}:
  1.0        -0.621919    0.0108949
 -0.621919    1.0         0.00207793
  0.0108949   0.00207793  1.0

The absolute difference from the target correlation matrix is:

@. abs(observed_corr - Rho)
3×3 Matrix{Float64}:
 0.0        0.0180807   0.0108949
 0.0180807  0.0         0.00207793
 0.0108949  0.00207793  0.0
Tip: Validation Passed

The observed correlations closely match the target correlations. The differences are small; note, however, that part of the f-g gap is systematic rather than sampling noise, because a Gaussian copula parameterized directly with a Spearman matrix reproduces those Spearman correlations only approximately. The remainder is sampling variability, which decreases with larger sample sizes.

8.2 Check Marginal Distributions

Let’s verify that f, g and k follow their expected distributions by comparing the expected and observed medians and ω values (the standard deviation of the sample on the transformed scale: logit for f, log for g and k):

marginals_df = DataFrame(
    parameter = ["f", "g", "k"],
    median_exp = [pop_f, pop_g, pop_k],
    median_obs = [median(vpop.f), median(vpop.g), median(vpop.k)],
    omega_exp = [ω_f, ω_g, ω_k],
    omega_obs = [std(logit.(vpop.f)), std(log.(vpop.g)), std(log.(vpop.k))],
)
simple_table(
    marginals_df,
    [
        :parameter => "Parameter",
        :median_exp => "Median (expected)",
        :median_obs => "Median (observed)",
        :omega_exp => "ω (expected)",
        :omega_obs => "ω (observed)",
    ]
)
| Parameter | Median (expected) | Median (observed) | ω (expected) | ω (observed) |
|---|---|---|---|---|
| f | 0.27 | 0.288 | 2.16 | 2.19 |
| g | 0.0013 | 0.00128 | 1.57 | 1.58 |
| k | 0.0091 | 0.00893 | 1.24 | 1.24 |

9 Part 5: Visualization

9.1 Parameter Distributions

specs = mapping(
    [:f, :g, :k] .=> ["f (sensitive fraction)", "g (growth rate, d⁻¹)", "k (death rate, d⁻¹)"],
    col = dims(1),
) * (
    data(vpop) * histogram(; normalization = :pdf, datalimits = extrema) +
        data((; f = [pop_f], g = [pop_g], k = [pop_k])) * visual(VLines; label = "expected median")
)
draw(
    specs;
    axis = (; xlabel = "Value", ylabel = "Density"),
    facet = (; linkxaxes = false, linkyaxes = false),
    legend = (; position = :bottom),
)

Marginal distributions of virtual population parameters. Lines indicate expected medians from NLME estimation.

9.2 Correlation Structure: Independent vs Copula Sampling

# Combine both methods into one DataFrame
vpop_total = vcat(
    vpop_indep,
    vpop;
    source = "method" => ["Independent (r ≈ 0)", "Copula (r = -0.64)"],
)

# Create scatter plot with faceting by method
plt_scatter = data(vpop_total) * mapping(
    :f => "f (sensitive fraction)",
    :g => "g (growth rate, d⁻¹)",
    layout = :method
) * visual(Scatter; marker = '○')

draw(plt_scatter)

Comparison of correlation structure: independent sampling destroys the f-g correlation, while copula sampling preserves it.
Note: Visual Comparison

The difference is striking:

  • Independent Sampling: Points show no apparent relationship between f and g
  • Copula-Based Sampling: Points show a clear negative relationship: high f values tend to pair with low g values

This negative correlation is biologically meaningful: tumors with more treatment-sensitive cells tend to grow more slowly.

9.3 Corner Plot: All Pairwise Relationships

pairplot(
    vpop => (
        PairPlots.Scatter(; rasterize = 1),
        PairPlots.MarginHist(),
        PairPlots.Calculation(corspearman),
    ),
    labels = Dict(
        :f => "f (sensitive fraction)",
        :g => "g (growth rate)",
        :k => "k (death rate)",
    )
)

Corner plot showing all pairwise parameter relationships and marginal distributions. The negative correlation between f and g (r = -0.64) is clearly visible.

10 Practical Recommendations

Tip 1: Always Use Spearman Correlations

When specifying the correlation matrix for a Gaussian copula:

  • Use Spearman (rank) correlations from your NLME output
  • Pearson correlations can be distorted by nonlinear transformations
  • Most NLME software reports correlation matrices on the eta (random effect) scale
Tip 2: Choose Appropriate Distributions

| Parameter Type | Bounds | Possible Distribution |
|---|---|---|
| Fractions (F, bioavailability) | [0, 1] | Logit-normal |
| Positive rates (CL, V) | [0, ∞) | Log-normal |
| Bounded positive (with physiological max) | [0, max] | Logit-normal |
| Unbounded | (-∞, ∞) | Normal |
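The first two rows correspond to the transforms used in the model's @pre block. A language-agnostic sketch in Python (`lognormal_param` and `logitnormal_param` are illustrative helper names):

```python
import math

def lognormal_param(tv, eta):
    """Positive parameter (e.g. g, k): typical value scaled by exp(eta)."""
    return tv * math.exp(eta)

def logitnormal_param(tv, eta):
    """(0, 1)-bounded parameter (e.g. f): shift on the logit scale, map back."""
    logit_tv = math.log(tv / (1 - tv))
    return 1 / (1 + math.exp(-(logit_tv + eta)))

# eta = 0 recovers the typical value; the bounds hold for any eta
print(logitnormal_param(0.27, 0.0))   # ~0.27
print(lognormal_param(0.0013, 0.0))   # 0.0013
```

Either way, a normally distributed random effect on the transformed scale maps to a parameter that automatically respects its bounds.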
Tip 3: Sample Size Considerations

  • For ISCTs, 1,000-10,000 virtual patients are typically sufficient
  • Larger samples give more stable correlation estimates
Warning 4: Validate Before Using

Always validate your virtual population before running ISCTs:

  1. Check correlations match expected values
  2. Check marginal distributions match expected values
  3. Visualize parameter distributions and correlations

11 Summary

11.1 Key Takeaways

  1. Independent sampling destroys correlations in the virtual population, leading to biased ISCT predictions

  2. Gaussian copulas preserve rank (Spearman) correlations through any monotonic transformation

  3. The algorithm in Pumas:

    • Define parameter specifications
    • Define Spearman correlation matrix
    • Create Gaussian copula
    • Create joint distribution from marginals of random effects and copula
    • Sample random effects from joint distribution
    • Perform simulations using the sampled random effects
    • Validate results
  4. Always validate correlations and marginals before using the virtual population

11.2 Quick Reference: Key Functions

| Function | Package | Purpose |
|---|---|---|
| corspearman(X) | StatsBase.jl | Compute Spearman correlations |
| GaussianCopula(Rho) | Copulas.jl | Create a copula with a given correlation matrix |
| SklarDist(copula, marginals) | Copulas.jl | Create a joint distribution from marginals and a copula |

11.3 What’s Next

In Tutorial 5: MILP Calibration, we will:

  • Learn why virtual population calibration is essential for realistic ISCTs
  • Formulate calibration as a mixed-integer linear programming (MILP) problem
  • Calibrate the virtual population to match clinical response rate distributions
  • Validate calibration results with visualizations

11.4 References

Cortés-Ríos, J., Magee, M., Sher, A., Jusko, W. J., & Desikan, R. (2025). A step-by-step workflow for performing in silico clinical trials with nonlinear mixed effects models. CPT: Pharmacometrics & Systems Pharmacology. https://doi.org/10.1002/psp4.70122
Nelsen, R. B. (2006). An introduction to copulas (2nd ed.). Springer. https://doi.org/10.1007/0-387-28678-0
Qi, T., & Cao, Y. (2022). Virtual clinical trials: A tool for predicting patients who may benefit from treatment beyond progression with pembrolizumab in non-small cell lung cancer. CPT: Pharmacometrics & Systems Pharmacology, 12(2), 236–249. https://doi.org/10.1002/psp4.12896
Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publications de l’Institut de Statistique de l’Université de Paris, 8, 229–231.