Multimodal Health Data Modelling and Predictive Risk Evaluation
Produced by: Ave Bilişim – Research & Data Analytics Unit
Reference: ATR-2024-09-21
Type: Internal Technical Report (Prior Work)
Completion Date: 18 September 2024
Confidentiality: Internal – Abstract page may be made public as a summary
Authors: Research & Data Analytics Unit (5 researchers)
Ethics Note: All datasets were synthetic or de-identified. This is an analytics methodology study; it does not provide clinical recommendations.
1. Abstract
This internal report documents a multi-phase research programme to design, implement, and evaluate a multimodal health data modelling pipeline for predictive risk evaluation. The pipeline harmonises de-identified/synthetic health-like data to a common schema, engineers temporal and semantic features, and trains a calibrated ensemble that fuses (i) gradient-boosted tabular learners over labs/vitals/procedures/medications and (ii) a small transformer encoder over section-aware clinical-style notes. Reliability is enforced through probability calibration (temperature + isotonic), uncertainty-aware abstention, and decision-curve analysis. Interpretability is provided via SHAP attributions, section-level attention summaries, nearest prototypes, and counterfactual what-ifs (training-time only). On synthetic yet statistically realistic evaluations, the multimodal model improves discrimination (AUROC 0.89 vs 0.81 tabular-only), reduces calibration error (ECE 0.06 from 0.11), and shortens review times through explanations. The study focuses on methodology and governance: reproducible experiments, model cards, data sheets, fairness slices, change control, shadow scoring, and incident SOPs. Work was completed and archived in September 2024 as part of Ave Bilişim’s independent research portfolio.
2. Executive Summary
• Problem. Health-like operational data are heterogeneous: high-frequency time-series (labs/vitals), sparse coded events (procedures/medications), and unstructured notes. Siloed models miss cross-signal interactions and often produce uncalibrated scores, limiting trust when scarce review capacity must be allocated.
• Method. A five-layer architecture—ingestion → harmonisation → modelling → calibrated scoring → analyst UI—combines a lightweight text encoder (section-aware) with a tabular learner and fuses them under a calibrated meta-learner. Governance includes lineage, model cards, fairness monitoring, abstention policies, and SOPs.
• Outcome. On de-identified/synthetic data, AUROC improves from 0.81→0.89, Brier from 0.162→0.138, and ECE from 0.11→0.06. Rolling scoring surfaces signals earlier; explanations reduce triage time.
• Positioning. A domain-agnostic methodology for safety-critical analytics, not a clinical decision tool. The pipeline is portable to other multimodal risk contexts where calibrated ranking and auditability matter.
3. Problem Decomposition and Formalisation
Let an encounter be a tuple
$e = (X_{\text{labs}}, X_{\text{vitals}}, X_{\text{proc}}, X_{\text{med}}, N,\ t_{\text{adm}}, t_{\text{dis}},\ \text{dept}).$
Here $X_{\text{labs}}$ and $X_{\text{vitals}}$ are multivariate time-series; $X_{\text{proc}}$ and $X_{\text{med}}$ are sparse coded sequences; $N$ is de-identified, sectioned note text. The task is to estimate a calibrated risk
$\hat p = P(y=1 \mid e) \in [0,1]$ for prioritising limited review capacity.
Fusion model. Denote the tabular learner logit $z_{\text{tab}}$ and the text encoder logit $z_{\text{text}}$. A meta-learner fuses them:
$z = w_1 z_{\text{tab}} + w_2 z_{\text{text}} + b, \qquad \hat p = \mathrm{Calib}(z;\ T),$
with $\mathrm{Calib}$ a temperature-scaled logistic followed by isotonic regression.
Calibration metrics. Expected Calibration Error (ECE): $\mathrm{ECE}=\sum_{b=1}^{B}\frac{|S_b|}{n}\left|\mathrm{acc}(S_b)-\mathrm{conf}(S_b)\right|$; Brier score $=\frac{1}{n}\sum_i (\hat p_i - y_i)^2$.
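For reference, a minimal numpy sketch of these two metrics (the equal-width binning and bin count are illustrative assumptions, not the report's implementation):

import numpy as np

def brier_score(p_hat: np.ndarray, y: np.ndarray) -> float:
    # Mean squared error between predicted probabilities and binary outcomes.
    return float(np.mean((p_hat - y) ** 2))

def expected_calibration_error(p_hat: np.ndarray, y: np.ndarray, bins: int = 15) -> float:
    # Weighted average gap between bin confidence and bin empirical frequency.
    edges = np.linspace(0.0, 1.0, bins + 1)
    idx = np.digitize(p_hat, edges[1:-1])        # bin index in [0, bins - 1]
    ece = 0.0
    for b in range(bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(y[mask].mean() - p_hat[mask].mean())
    return float(ece)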
Operating modes. (i) Fixed workload (top-k% review list) maximises Recall@k; (ii) Fixed recall minimises workload at target recall. An abstention gate defers uncertain cases for manual review.
4. Data Assets, Ethics, and Harmonisation
4.1 Sources (de-identified/synthetic)

Ethics. No PHI; all data are synthetic or de-identified. Adversarial re-identification probes validate masking; retention and access obey least-privilege roles. This study is non-clinical and evaluates analytics methodology only.
4.2 Common data model & lineage
An OMOP-like schema maps codes and units to controlled vocabularies; time is normalised to UTC; encounter timelines pass integrity checks (no negative length, monotone timestamps). Lineage fields—dataset_id, transform_hash, feature_store_version—are attached to every record.
4.3 Quality gates and privacy checks
• Completeness: critical fields <5% missing (per table); imputation rules logged with flags.
• Consistency: unit harmonisation; timestamp monotonicity; duplicate suppression.
• Privacy: PHI tokens removed; adversarial probes score residual risk; threshold breaches trigger re-masking.
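As an illustration only, the completeness, monotonicity, and duplication gates could be expressed as pandas checks over the tables in Appendix C (the function name and return structure are assumptions):

import pandas as pd

def quality_gate(df: pd.DataFrame, critical: list, max_missing: float = 0.05) -> dict:
    report = {}
    # Completeness: critical fields must be <5% missing per table.
    report["missing_ok"] = bool((df[critical].isna().mean() < max_missing).all())
    # Consistency: per-encounter timestamps must be non-decreasing.
    monotone = df.groupby("encounter_id")["timestamp_utc"].apply(lambda s: s.is_monotonic_increasing)
    report["monotone_ok"] = bool(monotone.all())
    # Duplicate suppression: count exact duplicate rows for review.
    report["duplicate_rows"] = int(df.duplicated().sum())
    return report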
5. Feature Engineering (Expanded)
5.1 Time-series features (labs, vitals)
• Rolling means/variances; deltas from personal baseline; volatility indices; exponential moving averages.
• Abnormality counts per code (z-score > τ); deterioration composite combining multiple lab families.
• Rate-of-change and concavity (second derivative) markers for sudden drifts.
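A compact pandas sketch of a few of the features listed above, computed per encounter (the window length, the z-score threshold τ = 2, and the choice of the first observation as the personal baseline are illustrative assumptions):

import numpy as np
import pandas as pd

def lab_timeseries_features(labs: pd.DataFrame, tau: float = 2.0) -> pd.DataFrame:
    labs = labs.sort_values(["encounter_id", "timestamp_utc"])
    rows = {}
    for enc_id, s in labs.groupby("encounter_id")["value"]:
        z = (s - s.mean()) / (s.std(ddof=0) + 1e-9)
        rows[enc_id] = {
            "rolling_mean_6": s.rolling(6, min_periods=1).mean().iloc[-1],
            "ewm_mean": s.ewm(span=6).mean().iloc[-1],
            "delta_from_baseline": s.iloc[-1] - s.iloc[0],    # personal baseline = first value
            "volatility": float(s.diff().abs().mean()),       # NaN for single-observation encounters
            "abnormal_count": int((z.abs() > tau).sum()),     # z-score > τ abnormality counter
        }
    return pd.DataFrame.from_dict(rows, orient="index")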
5.2 Sparse coded events (procedures/medications)
• Frequency & recency decay embeddings; co-occurrence motifs; care-path transitions (dept→dept).
• Sparsity-aware encoders: hashed features with collision monitoring; explicit “unknown/rare” flags.
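A minimal sketch of the hashed, sparsity-aware encoding with an explicit rare/unknown flag (the bucket count, hashing scheme, and rarity cut-off are assumptions):

import hashlib
from collections import Counter
import numpy as np

def hashed_code_features(codes, n_buckets: int = 256, vocab_counts: Counter = None, rare_cutoff: int = 2):
    vec = np.zeros(n_buckets + 1)                      # last slot = unknown/rare flag
    for code, cnt in Counter(codes).items():
        idx = int(hashlib.md5(code.encode()).hexdigest(), 16) % n_buckets
        vec[idx] += cnt                                # collision monitoring would track distinct codes per bucket
        if vocab_counts is not None and vocab_counts.get(code, 0) < rare_cutoff:
            vec[-1] += 1                               # explicit "unknown/rare" marker
    return vec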
5.3 Text semantics (de-identified notes)
• Section-aware transformer encodes H/A/P sections; pooled embeddings per encounter.
• Negation/uncertainty cues; concept span density; lightweight sentiment orientation of assessment sections.
• Readability/structure markers: length, section completeness, repetition (template detection).
5.4 Cross-modal consistency features
A penalty factor $c<1$ is applied when the text implies deterioration but vitals/labs improve (and, vice versa, a bounded bonus $b>1$). Consistency stabilises the fused score across modalities.
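One possible reading of this rule, as a hedged sketch (the decision thresholds and the factor values c and b are assumptions, not the study's settings):

def consistency_factor(text_risk: float, tabular_trend: float,
                       c: float = 0.8, b: float = 1.1, eps: float = 0.1) -> float:
    # text_risk: text-encoder deterioration probability; tabular_trend: structured
    # deterioration composite (positive = worsening).
    text_worsening = text_risk > 0.5 + eps
    vitals_worsening = tabular_trend > eps
    if text_worsening != vitals_worsening:
        return c          # modalities disagree: damp the fused score (penalty, c < 1)
    if text_worsening and vitals_worsening:
        return b          # modalities agree on deterioration: bounded bonus, b > 1
    return 1.0            # both quiet: leave the fused score unchanged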
5.5 Selection and stability
• SHAP ranking over validation folds; correlation pruning at $|\rho| > 0.85$ (sketched after this list).
• Stability filter: features with >25% rank variance across folds removed.
• Final set ≈ 180 features (≈120 tabular/time; 60 text/meta/consistency).
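The pruning and stability filters can be sketched as follows (interpreting "rank variance" as a coefficient of variation of SHAP ranks is an assumption):

import pandas as pd

def prune_correlated(X: pd.DataFrame, threshold: float = 0.85) -> list:
    # Greedy pruning; columns are assumed pre-ordered by SHAP importance.
    corr = X.corr().abs()
    keep = []
    for col in X.columns:
        if all(corr.loc[col, k] <= threshold for k in keep):
            keep.append(col)
    return keep

def stable_features(rank_per_fold: pd.DataFrame, max_rel_var: float = 0.25) -> list:
    # rank_per_fold: rows = CV folds, columns = features, values = SHAP rank per fold.
    rel_var = rank_per_fold.std() / (rank_per_fold.mean() + 1e-9)
    return list(rank_per_fold.columns[rel_var <= max_rel_var])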
Toy example. Assessment notes hint “increasing dyspnea”; vitals show downward SpO₂ trend and rising RR → cross-modal boost. If notes are neutral and vitals stable, fusion falls back to tabular baseline.
6. Model Stack, Training, and Calibration
6.1 Components
1. Tabular learner: LightGBM/XGBoost on engineered time-series; monotonic constraints for clinically monotone markers (e.g., increasing abnormality → non-decreasing risk).
2. Text encoder: a small distilled transformer (6 layers × 256 hidden) fine-tuned on proxy tasks (e.g., disposition) over synthetic notes; section masks as inputs.
3. Fusion: logistic meta-learner over $(z_{\text{tab}}, z_{\text{text}})$ plus confidence signals; optional attention gate if one modality dominates confidence (see the sketch after this list).
4. Calibration: temperature scaling (T chosen on validation) + isotonic regression; reliability diagrams monitored.
5. Abstention: uncertainty-aware gate; low-confidence predictions routed to manual queue.
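A minimal sketch of items 3–4, fitting the logistic meta-learner, the temperature, and the isotonic step on a validation fold (the use of scikit-learn and scipy here is an illustrative assumption):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression
from scipy.optimize import minimize_scalar

def fit_fusion_and_calibration(z_tab, z_txt, y_val):
    # Logistic meta-learner over the two modality logits (learns w1, w2, b).
    Z = np.column_stack([z_tab, z_txt])
    meta = LogisticRegression(C=1.0).fit(Z, y_val)
    z_fused = meta.decision_function(Z)

    # Temperature scaling: minimise validation NLL of sigmoid(z / T).
    def nll(T):
        p = np.clip(1.0 / (1.0 + np.exp(-z_fused / T)), 1e-6, 1 - 1e-6)
        return -np.mean(y_val * np.log(p) + (1 - y_val) * np.log(1 - p))
    T = minimize_scalar(nll, bounds=(0.5, 5.0), method="bounded").x

    # Isotonic regression refines residual local mis-calibration.
    p_temp = 1.0 / (1.0 + np.exp(-z_fused / T))
    iso = IsotonicRegression(out_of_bounds="clip").fit(p_temp, y_val)
    return meta, float(T), iso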
6.2 Training protocol
• Splits: time-ordered 80/20; 5-fold time-series CV; final chronological 10% as holdout.
• Imbalance: focal loss (γ=2) and threshold optimisation by Youden’s J; operating points tuned to workload/recall targets.
• HP search: Bayesian TPE over trees, depth, lr (tabular) and lr/maxlen (text).
• Reproducibility: deterministic seeds; point-in-time joins prevent leakage.
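The focal loss used for imbalance handling can be sketched in PyTorch as follows (γ = 2 follows the text; the reduction and absence of class weighting are assumptions):

import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    # Standard binary focal loss: down-weights easy, well-classified examples.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                  # probability assigned to the true class
    return ((1.0 - p_t) ** gamma * bce).mean()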
6.3 Hyper-parameters (illustrative)
• LightGBM: trees=2000; depth=8–12; lr=0.03; L2=1e-2; colsample=0.8; subsample=0.8.
• Transformer: 6 layers; 8 heads; 256-d hidden; maxlen=256; dropout=0.1; lr=2e-5; warmup=2k steps; batch=64; epochs=6.
• Calibration: temperature $T \approx 1.6$–$1.9$; isotonic bins=15.
7. Evaluation Design and Metrics
Primary: AUROC, AUPRC, F1, Brier, ECE, Latency. Operational: time-to-alert (rolling scoring), decision-curve net benefit, analyst triage time, abstention rate.
Formulas.
• Brier $=\frac{1}{n}\sum_i(\hat p_i - y_i)^2$.
• ECE $=\sum_b \frac{|S_b|}{n}\left|\mathrm{acc}(S_b)-\mathrm{conf}(S_b)\right|$.
• Decision curve: $\mathrm{NB} = \frac{TP}{N} - \frac{FP}{N}\cdot\frac{p_t}{1-p_t}$ at threshold $p_t$.
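The net-benefit term can be computed directly from the formula above; a small sketch (the threshold grid and variable names are illustrative):

import numpy as np

def net_benefit(p_hat: np.ndarray, y: np.ndarray, p_t: float) -> float:
    pred = p_hat >= p_t
    tp = np.sum(pred & (y == 1))
    fp = np.sum(pred & (y == 0))
    n = len(y)
    return float(tp / n - (fp / n) * (p_t / (1.0 - p_t)))

# Example: sweep thresholds corresponding to the 10–40% workload band.
# curve = [net_benefit(p_hat, y, t) for t in np.linspace(0.05, 0.40, 8)]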
Holdout & robustness. The final 10% time chunk is held out; synthetic drift is injected (lab distribution shift, documentation style change) to test reliability; ablations remove modalities/features to measure marginal contributions.
8. Results
8.1 Aggregate performance (holdout)
• Tabular-only baseline: AUROC 0.81; Brier 0.162; ECE 0.11.
• Multimodal (fused + calibrated): AUROC 0.89; Brier 0.138; ECE 0.06.
Interpretation. Text captures semantic cues not present in structured signals; calibration improves reliability of probability outputs; latency remains within operational bounds.
8.2 Decision-curve and workload trade-offs
Across thresholds corresponding to 10–40% workload, the multimodal model yields higher net benefit than baselines. With rolling scoring every 15 minutes, time-to-alert improves; abstention reduces brittle decisions under drift.
8.3 Error & slice analysis
• FN-Sparse: short-stay encounters with few labs → add priors; use uncertainty abstention.
• FP-Template: templated phrases inflate risk → detect repetition; down-weight with structure markers.
• FP-Noise: bursty vitals without adverse outcome → add volatility context windows.
• Slices: parity gaps <4% across synthetic bands (age proxies, LOS, dept); mitigation playbooks defined if gaps exceed tolerance.
9. Interpretability and Analyst Workflow
9.1 Explanations
• SHAP bars for tabular features; section-level attention summaries for text (aggregate—no token-level PHI).
• Prototype retrieval highlights similar historical encounters.
• Counterfactuals (training-time) show minimal feature changes that would reduce risk below threshold—used to explain, not prescribe.
9.2 Analyst UI & triage
Ranked queue with per-case rationale: top SHAP features, attention summary, consistency flags, confidence/abstention status. Actions (confirm/dismiss/watch) are logged to a feedback store; thresholds reviewed weekly under governance approval.
10. Case Studies (Synthetic)
Case H-1 — Early multimodal signal
Context. Assessment notes: “worsening shortness of breath”; labs: mild elevation; vitals: rising RR, falling SpO₂ trend. Signals. Text attention high on “worsening”; tabular deterioration composite ↑. Outcome. Fused logit 2.0 ⇒ $\hat p = 0.88$; alerted ~4 hours before a rules surrogate would. Explanation. SHAP: SpO₂ delta, RR trend, lab composite; attention: Assessment section.
Case H-2 — Uncertainty-aware abstention
Context. Stable labs; notes neutral; sparse history. Signals. Conflicting minor features; low confidence. Outcome. Abstained; manual review later confirms negative outcome—good abstention. Lesson. Abstention improves reliability without inflating workload.
Case H-3 — Text style drift handled by calibration
Context. New templated note style; initial FP uptick. Mitigation. Structure marker boosts; recalibrated temperature; FP rate returns to baseline.
11. System Architecture, MLOps, and Security
11.1 Services
• Ingestion & harmonisation: batch + streaming, schema registry, validation.
• Feature store: point-in-time joins; Redis cache for hot features.
• Model services: PyTorch/LightGBM micro-services behind gRPC; model registry with lineage and approvals.
• Scoring: calibrated meta-learner; shadow scoring before promotion; canary deployment with rollback guards.
• Observability: metrics (Prometheus), traces (OpenTelemetry), structured logs; dashboards for reliability diagrams and drift monitors.
11.2 Performance & resilience
Throughput 30k encounters/min batch; streaming prototype for rolling updates; p95 latency within bounds; autoscaling and bounded queues; priority lanes for critical pipelines.
11.3 Security
Container hardening; image scanning; secrets in KMS; mTLS; least-privilege IAM; network policies and rate limits; audit logs retained per retention policy.
12. Governance, Fairness, and Documentation
• Model cards record scope, datasets, metrics, slices, limitations, failure modes, and approval history.
• Data sheets document provenance, de-identification, quality rules, retention.
• Fairness monitors track parity gaps; remediation (re-weighting, threshold per slice) requires governance sign-off.
• Change control enforces CRs, shadow periods, acceptance gates, and rollback criteria.
• Decision trace attached to every score (model id, data version, calibration version, commit hash).
13. Standard Operating Procedures (SOPs)
SOP-01: Model change control
1. Open Change Request (CR) with scope/KPIs.
2. Offline evaluation (frozen splits), calibration and fairness review.
3. Governance approval.
4. Shadow scoring ≥1 week; compare live KPIs vs incumbent.
5. Canary rollout with automated rollback triggers.
6. Full promotion; archive artefacts, publish decision note.
SOP-02: Incident response
• Triggers: drift spikes, FPR uptick, latency breaches.
• Actions: freeze promotion; revert; root-cause analysis (≤48h); corrective actions; post-mortem.
SOP-03: Threshold & abstention governance
• Weekly review of alert budgets, Recall@budget, FPR, abstention rate, triage times; threshold/abstention bounds adjusted with sign-off; change log versioned.
14. Extended Technical Notes
14.1 Calibration & uncertainty
Temperature $T$ minimises the NLL on validation:
$\min_T \sum_i -\left[\, y_i \log \sigma(z_i/T) + (1-y_i)\log\big(1-\sigma(z_i/T)\big) \right].$
Isotonic smoothing refines local mis-calibration. Conformal add-on (optional) yields coverage guarantees at the cost of extra abstentions.
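The optional conformal add-on can be sketched as split conformal prediction over calibrated probabilities (the nonconformity score and coverage level α are assumptions); an encounter whose prediction set still contains both labels is abstained:

import numpy as np

def conformal_threshold(p_cal: np.ndarray, y_cal: np.ndarray, alpha: float = 0.1) -> float:
    # Nonconformity = 1 - probability assigned to the true label, on a held-out calibration fold.
    scores = np.where(y_cal == 1, 1.0 - p_cal, p_cal)
    n = len(scores)
    k = min(int(np.ceil((n + 1) * (1 - alpha))) - 1, n - 1)
    return float(np.sort(scores)[k])

def prediction_set(p: float, q_hat: float) -> set:
    labels = set()
    if 1.0 - p <= q_hat:
        labels.add(1)                # the positive label cannot be excluded
    if p <= q_hat:
        labels.add(0)                # the negative label cannot be excluded
    return labels                    # {0, 1} (or empty) -> defer to manual review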
14.2 Robustness to drift
• Style drift in notes → structure markers + periodic text-encoder refresh.
• Distribution drift in labs → PSI monitors + scheduled recalibration.
• Data gaps → abstention and conservative priors, documented in model card.
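The PSI monitor referenced above can be sketched as follows (decile binning and the 0.2 alert threshold are common defaults, assumed here rather than taken from the report):

import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    # Population Stability Index between a reference (expected) and a recent (actual) sample.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(np.clip(actual, edges[0], edges[-1]), edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

# Rule of thumb: PSI > 0.2 flags material drift and triggers a recalibration review.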
14.3 Reproducibility
Deterministic seeds; environment manifests; versioned artefacts; exact data snapshots; CI checks for leakage (point-in-time auditors).
15. Limitations
• Synthetic labels proxy real outcomes; external validation is required before any operational use.
• Variable text quality (copy-forward/templating) can bias signals; monitored via repetition metrics.
• Long-horizon dependencies only partially captured; temporal transformers are planned.
• Interpretation bound: explanations support analyst sense-making but are not clinical advice.
16. Future Work and Roadmap
1. Weak supervision to exploit unlabelled notes and distant supervision signals.
2. Temporal transformers (long-range) for extended sequences.
3. Online calibration with streaming reliability and conformal thresholds.
4. Domain adaptation for new departments/documentation styles.
5. Public 2–3 page abstract sharing non-sensitive methodology learnings.
17. Conclusions
We present a reproducible, calibrated, and interpretable multimodal risk modelling pipeline for de-identified/synthetic health-like data. By fusing structured trajectories with section-aware semantics and enforcing calibration plus abstention, the approach improves discrimination and reliability while preserving auditability and governance. The system is not a clinical tool; it is a methodology for safety-critical analytics use-cases where calibrated ranking, explanations, and operational governance are essential.
The research was finalised and archived on 18 September 2024 as an internal study by Ave Bilişim’s Research & Data Analytics Unit.
Appendices
Appendix A – Metric Definitions (Extended)
• AUROC: threshold-free ranking quality; sensitive to separability.
• AUPRC: informative under class imbalance; area under precision–recall curve.
• F1: $2PR/(P+R)$; balances FP and FN.
• Brier: mean squared error of probability predictions.
• ECE: average absolute gap between confidence and empirical accuracy across bins.
• Latency: ingestion→score end-to-end; report p50/p95.
• Net Benefit (decision curve): utility that weights TP vs FP at chosen threshold.
• Abstention rate: share of cases routed to manual review.
Appendix B – Model Parameters
B.1 Tabular learner (LightGBM) Trees=2000; Depth=8–12; LR=0.03; L2=1e-2; Colsample=0.8; Subsample=0.8; Monotone constraints on key abnormality features.
B.2 Text encoder (Transformer) Layers=6; Heads=8; Hidden=256; Maxlen=256; Dropout=0.1; LR=2e-5; Warmup=2k; Batch=64; Epochs=6; Section masks: H/A/P.
B.3 Fusion & Calibration Meta-learner: Logistic (L2=1.0); temperature $T \approx 1.6$–$1.9$; isotonic bins=15; validation = last 10% chronologically.
B.4 Abstention Uncertainty metric from calibrated variance + distance to decision boundary; abstain if $\hat p \in [p_l, p_u]$ around the threshold; bounds tuned to budget.
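A minimal sketch of this gate, consistent with the should_abstain call in Appendix G (the band bounds and uncertainty cut-off are placeholders to be tuned to the alert budget):

def should_abstain(p_cal: float, uncertainty: float,
                   p_l: float = 0.45, p_u: float = 0.60,
                   max_uncertainty: float = 0.15) -> bool:
    # Abstain when the calibrated score sits in the grey zone around the threshold,
    # or when the uncertainty estimate exceeds its bound.
    return (p_l <= p_cal <= p_u) or (uncertainty > max_uncertainty)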
Appendix C – Data Schema & Quality Rules
Core tables: Encounters, Labs, Vitals, Procedures, Medications, Notes, Outcomes. Fields: encounter_id, patient_hash (de-id), timestamp_utc, unit, code, value, dept, section, text_hash. Rules: timestamp monotonicity; unit harmonisation; null thresholds; duplication checks; lineage fields; privacy flags.
Appendix D – De-identification Protocol
• Masking: direct identifiers removed; quasi-identifiers generalised.
• Validation: adversarial re-ID probes; residual risk thresholds.
• Governance: approvals logged; retention limits; access reviews; periodic re-assessment.
Appendix E – SOP: Model Lifecycle & Incident Handling
Change control: CR → offline eval (metrics, calibration, slices) → governance sign-off → shadow → canary → promote/rollback. Incident: detect (drift/latency/FPR), freeze, revert, RCA within 48h, corrective action, post-mortem. Documentation: versioned model cards & data sheets; decision logs; threshold/abstention change notes.

Appendix G – Pseudocode
for enc in rolling_stream:
    # Assemble time-series, coded-event, text, and consistency features.
    feats = assemble_features(enc)

    # Modality logits from the tabular learner and the text encoder.
    z_tab = gbdt.logit(feats.tabular)
    z_txt = txt_enc.logit(feats.text_emb, feats.section_mask)

    # Fusion, temperature scaling, and isotonic refinement.
    z_fuse = w1 * z_tab + w2 * z_txt + b
    p_raw = sigmoid(z_fuse / T_current)      # temperature scaling
    p_cal = isotonic_refine(p_raw)           # final calibrated probability

    # Route: abstain, alert, or pass through.
    if should_abstain(p_cal, feats.uncertainty):
        route(enc, "manual_review")
    elif p_cal >= threshold:
        emit_alert(enc, p_cal, explain(feats))   # SHAP + attention + prototypes
Appendix H – Glossary
Abstention: Deferring low-confidence cases to human review. Calibration: Agreement between predicted probabilities and empirical frequencies. Conformal prediction: Non-parametric method providing coverage guarantees. Decision-curve analysis: Evaluates clinical/operational utility across thresholds. SHAP: Feature attribution method for tabular models. Section-aware encoding: Text encoder that respects note sections (H/A/P).