using Pumas
using ModelingToolkit
using StructuralIdentifiability
using AlgebraOfGraphics
using CairoMakie
using DataFramesMeta
using SummaryTables
import Logging
Structural Identifiability Analysis
Can We Actually Estimate These Parameters?
This is Tutorial 3 of 6 in the Tumor Burden series.
| # | Tutorial | Status |
|---|---|---|
| 1 | Model Introduction | ✓ Complete |
| 2 | Global Sensitivity Analysis | ✓ Complete |
| 3 | Structural Identifiability (current) | ← You are here |
| 4 | Copula Vpop Generation | |
| 5 | MILP Calibration | |
| 6 | VCT Simulation |
1 Learning Outcomes
Before attempting parameter estimation in pharmacometric models, it is essential to verify that parameters can actually be uniquely determined from the available data. This prerequisite check is called structural identifiability analysis, and is a critical early step in the In Silico Clinical Trial (ISCT) workflow (Cortés-Ríos et al., 2025).
In this tutorial, you will learn:
- What structural identifiability means and why it matters for NLME modeling
- The three classifications: globally identifiable, locally identifiable, and non-identifiable
- How to use
StructuralIdentifiability.jlto analyze Pumas models - How different measurement scenarios affect identifiability
- How to interpret identifiability results and what actions to take
- How to find identifiable parameter combinations for reparameterization
This tutorial uses the tumor burden model introduced in Tutorial 1. You should be familiar with:
- The biological meaning of parameters f, g, and k
- The two-population model structure (sensitive vs resistant cells)
- Basic Pumas model syntax (
@param,@dynamics,@derived)
2 Background
2.1 The Problem: Can We Actually Estimate These Parameters?
Imagine you’ve built a pharmacometric model with several parameters. You collect clinical data, run your estimation algorithm, and… it doesn’t converge. Or worse, it converges to wildly different values depending on your initial estimates.
Why does this happen?
Often, the issue isn’t your data quality or your algorithm— it’s that some parameters are fundamentally impossible to estimate from the available measurements. This is a property of the model structure itself, and it’s called structural identifiability.
Structural identifiability asks: Given perfect, noise-free data measured over infinite time, can each parameter be uniquely determined?
If the answer is “no” for some parameters, then no amount of data or sophisticated algorithms will help— you need to change your model or add measurements.
2.2 What is Structural Identifiability?
Structural identifiability analyzes whether model parameters can be uniquely estimated from input-output data under ideal conditions:
- Perfect (noise-free) observations — no measurement error
- Infinite time horizon — unlimited observation duration
- Known model structure — the equations are correct
- Known initial conditions (or treated as parameters)
This is a theoretical property of the model structure, independent of actual data quality.
2.3 The Three Classifications
Every parameter falls into one of three identifiability classes:
| Classification | Symbol | What It Means | Practical Impact |
|---|---|---|---|
| Globally identifiable | globally |
Exactly one parameter value produces the observed output | Safe to estimate — unique solution exists |
| Locally identifiable | locally |
A finite number of values produce the same output | Estimation depends on initial values — may find wrong solution |
| Non-identifiable | nonidentifiable |
Infinitely many values produce identical output | Cannot be estimated — must fix or reparameterize |
2.4 A Simple Analogy
Think of identifiability like solving equations:
- Globally identifiable:
x + 3 = 5→ Only one solution:x = 2 - Locally identifiable:
x² = 4→ Two solutions:x = 2orx = -2 - Non-identifiable:
x + y = 5with only this equation → Infinite solutions
In pharmacometrics, the “equations” are your ODEs, and the “solutions” are your parameter values.
2.5 Why Check Identifiability Before Estimation?
- Estimation fails to converge — optimizer wanders in flat likelihood regions
- Different initial values give different final estimates — locally identifiable parameters
- Parameter confidence intervals are enormous — sign of non-identifiability
- Biologically implausible estimates — compensating for unidentifiable parameters
- Model predictions are good but parameters are wrong — many parameter sets give same output
3 Libraries
Load the necessary packages for this tutorial.
What each package does:
Pumas: Pharmacometric modeling and simulationStructuralIdentifiability: The core analysis package — implements algorithms to determine identifiability (Dong et al., 2023)ModelingToolkit: Symbolic math framework — needed to define measurement equationsAlgebraOfGraphics+CairoMakie: Visualization — for creating publication-quality plotsSummaryTables: Clean rendering of data frames in documents
4 Part 1: The Tumor Burden Model
Let’s start with the 3-parameter tumor burden model from cancer chemotherapy. This model is small enough to understand completely, making it perfect for learning identifiability concepts.
4.1 The Clinical Scenario
A patient receives chemotherapy for lung cancer. We measure tumor size over time using imaging (CT scans). The tumor initially shrinks (sensitive cells die), but eventually regrows (resistant cells take over).
Question: From tumor size measurements alone, can we estimate all the model parameters?
4.2 Model Definition
As introduced in Tutorial 1, the tumor burden model divides tumor cells into two populations:
flowchart LR
subgraph Tumor["Total Tumor: Nt = N_sens + N_res"]
direction TB
N_sens["N_sens<br/>(sensitive cells)<br/>Initial: f × N₀"]
N_res["N_res<br/>(resistant cells)<br/>Initial: (1-f) × N₀"]
end
N_sens -->|"dies at rate k"| Death["☠️ Death"]
N_res -->|"grows at rate g"| Growth["📈 Growth"]
Tumor -.->|"What we measure"| Obs["🔬 Observed:<br/>Nt"]
style N_sens fill:#ffcccc,stroke:#cc0000
style N_res fill:#ccffcc,stroke:#00cc00
style Obs fill:#cce5ff,stroke:#0066cc
tumor_burden_model = @model begin
@metadata begin
desc = "Tumor Burden Model for NSCLC Chemotherapy"
timeu = u"d" # Time unit is days
end
@param begin
# Typical values (population parameters)
tvf ∈ RealDomain(lower = 0.0, upper = 1.0)
tvg ∈ RealDomain(lower = 0.0)
tvk ∈ RealDomain(lower = 0.0)
# Inter-individual variability (diagonal covariance matrix)
# Values on the transformed scale (logit for f, log for g and k)
Ω ∈ PDiagDomain(3)
# Residual error (standard deviation)
σ ∈ RealDomain(lower = 0.0)
end
@random begin
# Random effects vector (one per parameter)
η ~ MvNormal(Ω)
end
@covariates treatment
@pre begin
# Transform random effects to individual parameters
# f: logit-normal transformation (bounded 0-1)
f = logistic(logit(tvf) + η[1])
# g: log-normal transformation (positive value)
g = tvg * exp(η[2])
# k: log-normal transformation (positive value)
# Additionally, treatment affects death rate:
# treatment = false → control arm (no killing)
# treatment = true → treatment arm (killing at rate tvk * exp(η[3]))
k = treatment * tvk * exp(η[3])
end
# Initial conditions (normalized tumor = 1)
@init begin
N_sens = f # Sensitive cells
N_res = 1 - f # Resistant cells
end
@dynamics begin
# Sensitive cells: die at rate k (0 if control)
N_sens' = -k * N_sens
# Resistant cells: grow at rate g
N_res' = g * N_res
end
@derived begin
# Total normalized tumor size
Nt := @. N_sens + N_res
# Observation with residual error
tumor_size ~ @. Normal(Nt, σ)
end
endPumasModel
Parameters: tvf, tvg, tvk, Ω, σ
Random effects: η
Covariates: treatment
Dynamical system variables: N_sens, N_res
Dynamical system type: Matrix exponential
Derived: tumor_size
Observed: tumor_size
The @dynamics block defines our ODEs:
N_sens' = -k * N_sens: Sensitive cells decrease exponentially with rate kN_res' = g * N_res: Resistant cells grow exponentially with rate g
The @derived block defines what we observe: total tumor = sensitive + resistant cells.
5 Part 2: Running Identifiability Analysis
5.1 Step 1: Extract the ODE System
Pumas models contain an underlying ModelingToolkit ODE system. This is what we’ll analyze for identifiability.
ode_system = tumor_burden_model.sysModel ##Pumas#232: Equations (2): 2 standard: see equations(##Pumas#232) Unknowns (2): see unknowns(##Pumas#232) N_sens(t) N_res(t) Parameters (3): see parameters(##Pumas#232) f g k
What happened: We extracted the symbolic ODE system from the Pumas model. This contains all the equations in a form that StructuralIdentifiability.jl can analyze mathematically.
5.2 Step 2: Define What We Can Measure
In clinical practice, we can only measure certain quantities. The choice of measurements directly affects which parameters are identifiable.
The same model can have different identifiability properties depending on what you measure. Adding a new biomarker or measurement can make previously unidentifiable parameters identifiable.
Which compartments receive inputs (doses) can also affect structural identifiability. In this tutorial, we do not consider inputs. To include them, you would define a separate ModelingToolkit ODE system where dose terms appear as variables that are neither ODE states nor outputs — StructuralIdentifiability.jl automatically treats these as known inputs. See the StructuralIdentifiability.jl documentation for details.
We create symbolic variables for measurements. As the existing variables in the ODE system, they depend on time.
t = ModelingToolkit.independent_variable(ode_system)
@variables Nt(t) N_sens_obs(t) N_res_obs(t)Here, @variables Nt(t) creates a symbolic variable Nt that depends on time variable t, the independent variable already present in the ODE system.
We consider two measurement scenarios:
In the first scenario, only total tumor burden is measured (most common clinically):
measured_total_only = [Nt ~ ode_system.N_sens + ode_system.N_res]Nt ~ ode_system.N_sens + ode_system.N_res defines the measurement equation: “Nt equals the sum of N_sens and N_res”
The ~ operator means “equals” in measurement equations. It connects the name of what we observe (left side) and the model expression being measured (right side).
In the second scenario, both populations of treatment-sensitive and treatment-resistant cells are measured separately (e.g., with biomarkers):
measured_both = [
N_sens_obs ~ ode_system.N_sens,
N_res_obs ~ ode_system.N_res,
]5.3 Step 3: Run Global Identifiability Analysis
Now we ask the key question: Can we estimate f, g, and k from total tumor measurements?
result_total = assess_identifiability(
ode_system;
measured_quantities = measured_total_only,
loglevel = Logging.Warn,
)OrderedCollections.OrderedDict{SymbolicUtils.BasicSymbolic{Real}, Symbol} with 5 entries:
N_sens(t) => :locally
N_res(t) => :locally
f => :nonidentifiable
g => :locally
k => :locally
6 Part 3: Interpreting the Results
6.1 Understanding the Output
Looking at the output above:
| Parameter | Status | What It Means |
|---|---|---|
| f | nonidentifiable |
The sensitive fraction CANNOT be estimated from total tumor data alone |
| g | locally |
The growth rate has a finite number of possible values (ambiguous) |
| k | locally |
The death rate has a finite number of possible values (ambiguous) |
| N_sens(t) | locally |
The sensitive cell trajectory is locally identifiable |
| N_res(t) | locally |
The resistant cell trajectory is locally identifiable |
The key finding: When we only measure total tumor burden, the parameter f (sensitive fraction) is completely non-identifiable! This means infinitely many values of f would produce the exact same tumor size curve.
6.2 Why is f Non-Identifiable?
Think about it intuitively: If we only see total tumor size, we can’t distinguish between:
- 80% sensitive cells dying fast, 20% resistant cells growing slow
- 60% sensitive cells dying medium, 40% resistant cells growing medium
Many combinations of (f, g, k) give the same total tumor curve!
6.3 What About Measuring Both Populations?
Let’s see if measuring both populations separately helps:
result_both = assess_identifiability(
ode_system;
measured_quantities = measured_both,
loglevel = Logging.Warn,
)OrderedCollections.OrderedDict{SymbolicUtils.BasicSymbolic{Real}, Symbol} with 5 entries:
N_sens(t) => :globally
N_res(t) => :globally
f => :nonidentifiable
g => :globally
k => :globally
Look at how the results changed:
| Parameter | Total Only | Both Populations |
|---|---|---|
| f | ❌ nonidentifiable | ❌ nonidentifiable |
| g | ⚠️ locally | ✅ globally |
| k | ⚠️ locally | ✅ globally |
| N_sens(t) | ⚠️ locally | ✅ globally |
| N_res(t) | ⚠️ locally | ✅ globally |
Key insight: By measuring both cell populations (perhaps using flow cytometry or specific biomarkers), we go from locally identifiable to globally identifiable for g and k!
But wait: Parameter f is STILL non-identifiable! Why?
6.4 Why is f Still Non-Identifiable?
The parameter f represents the initial fraction of sensitive cells. But look at our initial conditions:
N_sens(0) = f
N_res(0) = 1 - f
If we measure both N_sens and N_res at time 0, we directly observe f! The issue is that f isn’t really a dynamic parameter — it’s embedded in the initial conditions. The analysis treats initial conditions and parameters differently.
In practice, you would either:
- Estimate
fdirectly from the first measurement - Treat it as a known value
- Reparameterize so initial conditions become explicit parameters
7 Part 4: Finding Identifiable Combinations
When individual parameters aren’t identifiable, we can find combinations of parameters that ARE identifiable. This is extremely useful for reparameterization.
find_identifiable_functions(
ode_system;
measured_quantities = measured_total_only,
loglevel = Logging.Warn,
)2-element Vector{Num}:
g - k
g*k
The output shows:
- g - k: The difference between growth and death rates is identifiable
- g × k: The product of growth and death rates is identifiable
What this means practically:
Even though we can’t estimate g and k individually from total tumor data, we CAN estimate:
The net rate difference (g - k): How much faster resistant cells grow than sensitive cells die. If positive, tumor eventually grows; if negative, it shrinks.
The product g × k: From this, we can derive the geometric mean √(g×k), which represents a “typical” rate between g and k.
Example: If the true values were g = 0.04/day and k = 0.16/day:
- Product: g × k = 0.0064/day² (directly identifiable)
- Geometric mean: √(0.0064) = 0.08/day (derived from product)
- Difference: g - k = -0.12/day (tumor shrinks overall)
Clinical interpretation: We can determine the NET tumor behavior (is it overall growing or shrinking?) but not the individual rates.
Reparameterization strategy: Instead of estimating (g, k), estimate (g-k, g×k) and back-calculate if needed. Or fix one rate to a literature value and estimate the other.
8 Part 5: Visualization
Let’s create visualizations to communicate identifiability findings clearly.
8.1 Compile Results into DataFrames
# Tumor burden results
result_total_df = DataFrame(
parameter = [string(p) for p in keys(result_total)],
status = map(string, values(result_total))
)
result_both_df = DataFrame(
parameter = [string(p) for p in keys(result_both)],
status = map(string, values(result_both))
)
tumor_results = outerjoin(
result_total_df,
result_both_df;
on = :parameter,
renamecols = "_total" => "_both",
)
simple_table(
tumor_results,
[
:parameter => "Parameter",
:status_total => "Status (total only)",
:status_both => "Status (both populations)",
],
)| Parameter | Status (total only) | Status (both populations) |
| N_sens(t) | locally | globally |
| N_res(t) | locally | globally |
| f | nonidentifiable | nonidentifiable |
| g | locally | globally |
| k | locally | globally |
8.2 Identifiability Status Chart
long_tumor_results = vcat(
result_total_df,
result_both_df;
source = :scenario => [:total, :both],
)
specs = data(long_tumor_results) * mapping(
:parameter => "Parameter",
:scenario => renamer(:both => "Both populations", :total => "Total only") => "Measurement scenario",
color = :status => renamer(
"globally" => "Globally identifiable",
"locally" => "Locally identifiable",
"nonidentifiable" => "Non-identifiable",
) => "Status",
) * visual(Scatter; markersize = 30, marker = :rect)
draw(
specs,
scales(Color = (; palette = [:green, :orange, :red]));
axis = (; xticklabelrotation = π / 4, yticklabelrotation = π / 2),
)Row 1 (Total Only): When we only measure total tumor size:
fis non-identifiable (red) — cannot be estimatedg,k, state variables are locally identifiable (orange) — finite ambiguity
Row 2 (Both Populations): When we measure both cell types:
fis STILL non-identifiable — it’s a structural issueg,k, state variables become globally identifiable (green) — unique solutions!
Takeaway: Adding measurements dramatically improves identifiability for g and k.
8.3 Summary: Impact of Measurements
# Count parameters in each category, per scenario
tumor_results_counts = vcat(
combine(groupby(result_both_df, :status), nrow => :count),
combine(groupby(result_total_df, :status), nrow => :count);
source = :scenario => [:total, :both],
)
status_mapping = :status => renamer(
"globally" => "Globally identifiable",
"locally" => "Locally identifiable",
"nonidentifiable" => "Non-identifiable",
) => "Status"
specs = data(tumor_results_counts) * mapping(
:scenario => renamer(:both => "Both populations", :total => "Total only") => "Measurement scenario",
:count => "Number of parameters/states",
stack = status_mapping,
color = status_mapping,
) * visual(BarPlot)
draw(specs, scales(Color = (; palette = [:green, :orange, :red])))9 Part 6: Practical Recommendations
Based on what we’ve learned, here are actionable guidelines for the tumor burden model and beyond.
Run identifiability analysis as the first step after building your model, not as a debugging step when estimation fails. This saves enormous time and frustration.
Workflow:
Build model → Check identifiability → Fix issues → THEN estimate
When planning a clinical study:
- List the parameters you need to estimate
- Run identifiability analysis with your planned measurements
- If key parameters are non-identifiable, consider adding biomarkers
- Balance cost/invasiveness against identifiability gains
Example: In the tumor burden model, adding biomarkers to distinguish sensitive vs resistant cells would make g and k globally identifiable.
When you find non-identifiable parameters, you have four options:
| Strategy | When to Use | Example |
|---|---|---|
| Fix to literature value | Parameter well-characterized elsewhere | Fix k to published death rate |
| Reparameterize | Only combinations are identifiable | Estimate (g-k, g×k) instead of (g, k) |
| Add measurements | Feasible in study design | Add biomarker for cell types |
| Use informative Bayesian priors | Have prior knowledge | Strong prior on growth rate |
For complex models (>5 parameters, >3 state variables):
- Start with local identifiability — fast screening
- If locally identifiable — follow up with global analysis for complete picture
- If locally non-identifiable — definitely non-identifiable globally, fix first
This saves computational time on large models.
Remember: Structural identifiability is necessary but not sufficient! (Wieland et al., 2021)
| Type | Assumes | Answers |
|---|---|---|
| Structural | Perfect data, infinite time | Can parameters theoretically be identified? |
| Practical | Real data, finite samples | Will parameters be well-estimated in practice? |
Even structurally identifiable parameters may be practically non-identifiable due to:
- Limited data
- Poor sampling times
- High measurement noise
- Parameter values near boundaries
Always follow up with practical identifiability (FIM, profile likelihood, bootstrap) on your actual data (Raue et al., 2009).
10 Summary
10.1 Key Takeaways
Structural identifiability tells you whether parameters CAN be estimated, assuming perfect data
Three classifications matter:
- Globally identifiable: Safe to estimate (unique solution)
- Locally identifiable: Estimation depends on starting values (finite solutions)
- Non-identifiable: Cannot estimate (infinite solutions)
Measurements determine identifiability: The same model can be identifiable or not depending on what you observe
When parameters aren’t identifiable individually, look for identifiable combinations (reparameterization)
Always check BEFORE estimation to avoid wasted effort on impossible problems
10.2 Tumor Burden Model Findings
| Finding | Practical Implication |
|---|---|
f non-identifiable from total tumor |
Can’t separate sensitive/resistant fractions without biomarkers |
g-k and g×k identifiable |
Can estimate net tumor behavior, not individual rates |
| Measuring both populations helps | Biomarkers for cell types dramatically improve estimation |
10.3 Quick Reference: Key Functions
| Function | Purpose |
|---|---|
model.sys |
Extract ODESystem from Pumas model |
@variables x(t) |
Define symbolic measurement variable |
assess_identifiability() |
Global analysis (globally/locally/non) |
assess_local_identifiability() |
Local analysis (identifiable yes/no) |
find_identifiable_functions() |
Find identifiable parameter combinations |
10.4 What’s Next?
In Tutorial 4: Copula Vpop Generation, we’ll learn how to generate realistic virtual populations using Gaussian copulas that preserve the parameter correlations from NLME estimation.