DEVELOPMENT

Model Progress

AUC-ROC on public-ticker test set: 0.853 [95% CI 0.812, 0.886]. Lo et al. 2019 method reproduced on the same data: 0.807. The pipeline is leakage-audited, private-ticker inflation has been removed, and the causal analysis of regulatory designations was completed in Apr 2026. 240 f14 labels were migrated from LLM extraction to deterministic SEC 8-K extraction. Each milestone is a measurable change validated on a held-out test set.

AUC-ROC PROGRESSION

Baseline: 0.614
Feature engineering: 0.656 (+0.042)
Stacked ensemble: 0.680 (+0.024)
Label quality: 0.705 (+0.025)
LLM extraction: 0.727 (+0.022)
Data expansion: 0.773 (+0.046)
Drugs@FDA cross-ref: 0.783 (+0.010)
Learned weights: 0.864 (+0.081)
Sign-constrained L1: 0.827
Drop private from train: 0.836
BTD interactions: 0.834
Isotonic calibration: 0.845 (+0.011)
Factor audit: 0.845
Effect size + foreign approvals: 0.845
Leakage audit: 0.845
Drop private from test: 0.809
MoA class feature: 0.814
SHAP-derived interactions + causal analysis: 0.813
8-K text parsing replaces LLM f14 labels: 0.831 (+0.018)
Split f16 prior approval into two features: 0.840
Per-drug CRL history feature (f51): 0.850 (+0.010)
Data-quality sweep + modality flag: 0.853
Backtest hardening + point-in-time leak audit: 0.846

MILESTONES

Baseline · AUC 0.614 · 549 events

14-factor log-odds model, normalized weights
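A minimal sketch of how an additive log-odds model of this kind could score an event. It assumes, since the details are not stated here, that weights are normalized by their absolute sum and added to the logit of a base approval rate. The function name, factor names, and base rate are hypothetical.

```python
import math

def score_log_odds(factors, weights, base_rate):
    """Turn factor values into an approval probability via additive log-odds.

    factors:   {factor_name: value}, hypothetical names
    weights:   {factor_name: hand-set weight}
    base_rate: prior approval probability in (0, 1), an assumed input
    """
    # Assumption: "normalized weights" means dividing by the sum of absolute weights.
    total = sum(abs(w) for w in weights.values())
    if total == 0:
        raise ValueError("at least one weight must be non-zero")

    # Start from the prior log-odds and add each weighted factor.
    logit = math.log(base_rate / (1.0 - base_rate))
    for name, value in factors.items():
        logit += (weights.get(name, 0.0) / total) * value

    # Map the log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-logit))
```

With no active factors the score falls back to the base rate; a positive factor with a positive weight pushes it above.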

Feature engineering · AUC 0.656 · 549 events

8 new features: prior approval, sponsor track record, modality, biomarker

Stacked ensemble · AUC 0.680 · 549 events

XGBoost meta-learner + temperature scaling, CTO dataset integration

Label quality · AUC 0.705 · 549 events

Cleaned CTO labels (60% -> 98% accuracy), removed log-odds compression

LLM extraction · AUC 0.727 · 723 events

235 trial result labels from SEC 8-K + PubMed research

Data expansion · AUC 0.773 · 723 events

723 events, CRL reason tagging, concept drift correction

Drugs@FDA cross-ref · AUC 0.783 · 954 events

954 events via NME matching, 511 f14 labels, CRL type features

Learned weights · AUC 0.864 · 2032 events

2,026 events, L1 logistic regression replaces hand-set weights

Sign-constrained L1 · AUC 0.827 · 2032 events

Added resubmission + CRL-type features; dropped collinear sign-flips for a defensible fit

Drop private from train · AUC 0.836 · 1660 events

Excluded 372 private-ticker events (99% trivial approvals) from training; L1 collapses to 5 defensible coefficients

BTD interactions · AUC 0.834 · 1660 events

Added 4 BTD x red-flag interactions. Phase 3 AUC rises 0.744 -> 0.794, prior-CRL AUC 0.705 -> 0.783.

Isotonic calibration · AUC 0.845 · 1660 events

Post-hoc isotonic recalibration: Brier 0.122 -> 0.114, ECE 6.8% -> 4.2%. Corrects systematic overconfidence.
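For readers unfamiliar with the metrics in this entry, here is a small stdlib-only sketch of isotonic recalibration (pool-adjacent-violators), the Brier score, and ECE. The 10-bin equal-width ECE and all function names are assumptions for illustration; in practice the calibrator would be fit on a validation split rather than on the data it is scored on.

```python
import bisect

def brier_score(probs, labels):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between confidence and observed rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    ece = 0.0
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            mean_y = sum(y for _, y in b) / len(b)
            ece += len(b) / len(probs) * abs(mean_p - mean_y)
    return ece

def fit_isotonic(probs, labels):
    """Pool-adjacent-violators: returns sorted scores and monotone fitted values."""
    pairs = sorted(zip(probs, labels))
    blocks = []  # each block is [sum_of_labels, count]
    for _, y in pairs:
        blocks.append([float(y), 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return [p for p, _ in pairs], fitted

def apply_isotonic(xs, fitted, p):
    """Step-function lookup: calibrated value of the nearest score at or below p."""
    i = bisect.bisect_right(xs, p) - 1
    return fitted[max(i, 0)]
```

Because the isotonic fit minimizes squared error among monotone mappings, the Brier score on the fitting data cannot get worse, which is consistent with the direction of the Brier change reported above.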

Factor audit · AUC 0.845 · 1660 events

Dropped 5 broken factors (f8, f11, f18, f20, f22), wired f16/f17/miss_f14 as live factors, fixed AdComm naming. Every L1-learned weight now actually applies in live scoring.

Effect size + foreign approvals · AUC 0.845 · 1660 events

Added graded effect-size magnitude (HR/p-value) and EMA/PMDA prior-approval factor. Brier 0.113, ECE 3.7% (best calibration to date).

Leakage audit · AUC 0.845 · 1656 events

Stripped "(CRL)" / "(Resubmission)" tags from 47 drug names that leaked outcome. Dropped 6 duplicate events with conflicting labels. Train-test gap fell from +0.063 to +0.036. 95% CI [0.807, 0.878], permutation p<0.001.
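The confidence interval and permutation p-value quoted in this entry can be computed along the following lines. This is a stdlib-only sketch assuming a percentile bootstrap and a label-shuffling permutation test; the resampling scheme actually used here is not specified, and the function names are hypothetical.

```python
import random

def auc_roc(scores, labels):
    """AUC as the chance a random positive outranks a random negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample events with replacement, recompute AUC."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auc_roc([scores[i] for i in idx], ys))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def permutation_p_value(scores, labels, n_perm=1000, seed=0):
    """Share of label shuffles whose AUC is at least the observed AUC."""
    rng = random.Random(seed)
    observed = auc_roc(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if auc_roc(scores, shuffled) >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

With 1,000 permutations the smallest reportable p-value under this scheme is about 0.001, which matches the "p<0.001" style of reporting.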

Drop private from test · AUC 0.809 · 1497 events

Previous headline included 159 private-ticker events (99% trivial approvals) that inflated AUC by ~0.03. Public-only test AUC is the honest number: 0.809 [CI 0.761, 0.847], essentially matching Lo et al. 2019 (0.810). Full test AUC remains 0.834 for reference.

MoA class feature · AUC 0.814 · 1497 events

Added ChEMBL-derived mechanism-of-action class feature (f44) and missingness indicator (f45). Nuclear-receptor drugs approve at 70% vs tyrosine-kinase inhibitors at 92%.

SHAP-derived interactions + causal analysis · AUC 0.813 · 1497 events

Added 5 non-linear interactions from XGBoost SHAP analysis (sponsor x pipeline, trial x priority review, endpoint x base rate). Closed 20% of L1-to-XGBoost AUC gap. Causal analysis: BTD/FTD/ODD are signals of difficulty, not approval accelerants. PoS funnel added. 61 tests, CI pipeline, Lo et al. head-to-head on same test set.

8-K text parsing replaces LLM f14 labels · AUC 0.831 · 1497 events

Built a deterministic SEC 8-K/6-K classifier (negation-scope masking, FDA-approval pattern bank, ratio-threshold aggregation across filings). Replaced 240 of 403 Perplexity-sourced trial-outcome labels with verdicts backed by quoted SEC filing text. 8-K labels tied LLM on realized-outcome accuracy (75.4% vs 75.8%) while restoring full primary-source provenance. Every f14 label now traces to SEC EDGAR, CT.gov, or is honestly flagged miss_f14=1.
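The three ingredients named in this entry (negation-scope masking, a pattern bank, ratio-threshold aggregation) can be illustrated with a toy classifier. Everything below is hypothetical: the regexes, the 60-character negation window, and the 0.7 ratio are placeholders, not the production pattern bank or thresholds.

```python
import re

# Hypothetical pattern bank for trial-outcome language.
POSITIVE = [re.compile(p, re.I) for p in (
    r"\bmet (?:its|the) primary endpoint",
    r"\bachieved statistical significance",
)]
NEGATIVE = [re.compile(p, re.I) for p in (
    r"\b(?:not|failed to) (?:meet|met) (?:its|the) primary endpoint",
    r"\bdid not achieve statistical significance",
)]
# A negation cue plus the rest of its clause (assumed scope: up to 60 characters).
NEGATION_SCOPE = re.compile(r"\b(?:not|never|failed to)\b[^.;]{0,60}", re.I)

def count_hits(text):
    """Return (positive_hits, negative_hits) for one filing."""
    # Blank out negated spans so positive patterns cannot fire inside them.
    masked = NEGATION_SCOPE.sub(lambda m: " " * len(m.group(0)), text)
    pos = sum(len(p.findall(masked)) for p in POSITIVE)
    # Negative patterns need the negation words, so they run on the raw text.
    neg = sum(len(p.findall(text)) for p in NEGATIVE)
    return pos, neg

def aggregate_label(filings, ratio=0.7):
    """Pool hits across filings; return a verdict only if one side dominates."""
    pos = neg = 0
    for text in filings:
        p, n = count_hits(text)
        pos += p
        neg += n
    total = pos + neg
    if total == 0:
        return None  # no evidence: the caller would set miss_f14=1
    if pos / total >= ratio:
        return "positive"
    if neg / total >= ratio:
        return "negative"
    return None  # conflicting filings: treat as missing rather than guess
```

Returning None on thin or conflicting evidence mirrors the choice described above to flag miss_f14=1 instead of forcing a label.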

Split f16 prior approval into two features · AUC 0.840 · 1497 events

A recency sanity check revealed the model was overconfident on 5 of the 22 PDUFAs in 2026 (CORT relacorilant, INCY Zynyz expansion, PHAR Joenja sNDA, PTCT Translarna, RGNX gene therapy). Root cause: f16_prior_approval conflated "sponsor has approved drugs before" with "this exact molecule is already on the US market." Split into f16a (sponsor competence, learned weight +0.75) and f16b (label expansion, learned weight -0.72, the opposite sign). 2026-slice AUC rose from 0.803 to 0.838, matching the pooled headline. CORT 0.977 -> 0.810, INCY 0.890 -> 0.779.
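The effect of the split is easy to see with the two learned weights quoted above. The sketch below assumes each feature is a 0/1 flag derived from the sponsor's list of US-approved molecules; the function names and the input structure are hypothetical.

```python
# Learned weights as reported in this entry.
W_F16A = 0.75   # sponsor has approved drugs before
W_F16B = -0.72  # this exact molecule is already on the US market

def split_prior_approval(sponsor_approved_molecules, molecule):
    """Return (f16a, f16b) as 0/1 flags.

    sponsor_approved_molecules: set of molecule names the sponsor already has
    approved in the US (hypothetical input structure).
    """
    f16a = 1 if sponsor_approved_molecules else 0
    f16b = 1 if molecule in sponsor_approved_molecules else 0
    return f16a, f16b

def f16_logit_contribution(sponsor_approved_molecules, molecule):
    """Combined log-odds contribution of the two split features."""
    f16a, f16b = split_prior_approval(sponsor_approved_molecules, molecule)
    return W_F16A * f16a + W_F16B * f16b
```

A new molecule from an experienced sponsor gets the full +0.75, while a label expansion nets out to roughly +0.03, which is why the single conflated feature overstated confidence on expansion filings.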

Per-drug CRL history feature (f51) · AUC 0.850 · 1497 events

Added a drug-level CRL-history feature distinct from sponsor-level f32/f33. Translarna (ataluren) had been CRL'd twice before its 2026 PDUFA; no existing feature captured that this specific molecule keeps getting rejected. f51 counts prior CRLs for the same active ingredient (token-based match, capped at 3, normalized to [0,1]). Learned weight -2.31, third-largest magnitude in the model. PTCT Translarna: 0.969 -> 0.856 (CRL correctly predicted). 2026-slice AUC jumped to 0.863 — above the pooled headline. Brier drops back to 0.105, ECE back to 3.0%.
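A rough sketch of the f51 computation as described: count prior CRLs whose active ingredient shares a token with the current one, cap at 3, and scale to [0, 1]. The tokenizer, the stop-word list, and the strict "before the event date" filter are assumptions added for illustration; the drug name and dates in the usage note are made up.

```python
import re
from datetime import date

# Hypothetical stop words so salt forms and dosage words do not cause false matches.
STOP_TOKENS = {"sodium", "hydrochloride", "injection", "tablets", "oral"}

def ingredient_tokens(name):
    """Lowercase alphanumeric tokens of an ingredient name, minus stop words."""
    return {
        t for t in re.split(r"[^a-z0-9]+", name.lower())
        if t and t not in STOP_TOKENS
    }

def f51_drug_crl_history(active_ingredient, event_date, prior_crls, cap=3):
    """Normalized count of earlier CRLs for the same active ingredient.

    prior_crls: iterable of (ingredient_name, crl_date) pairs.
    Only CRLs dated strictly before event_date are counted, to stay point-in-time.
    """
    target = ingredient_tokens(active_ingredient)
    count = sum(
        1
        for name, crl_date in prior_crls
        if crl_date < event_date and target & ingredient_tokens(name)
    )
    return min(count, cap) / cap
```

For a made-up drug "examplinib" with two earlier CRLs, the feature evaluates to 2/3; a fourth or fifth CRL would not raise it past 1.0 because of the cap.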

Data-quality sweep + modality flag · AUC 0.853 · 1497 events

Data audit: fixed 5 weekend-dated catalyst events (FDA only acts on weekdays), resolved an f49 feature-ID collision, and shipped an INN-stem-based modality classifier. Added f52_is_novel_modality flag (gene therapy, cell therapy, oligonucleotide, oncolytic virus). Empirical: novel modalities approve at ~50-70% in training vs 83-92% for small molecules/mAbs. Expanded the L1 alpha grid to include 0.0015/0.002; 0.002 wins on AUC. Base-rate drift vs BIO 2011-2020 (oncology -16pp, heme -24pp) documented in methodology; isotonic recalibration absorbs it. AUC 0.850 -> 0.853, ECE 3.0% -> 2.9%.
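The weekend-date check from this audit is simple to reproduce. The sketch below assumes each event is a dict with a `catalyst_date` field of type `datetime.date`; that field name and the sample tickers are hypothetical.

```python
from datetime import date

def weekend_dated(events):
    """Return events whose catalyst date falls on a Saturday or Sunday.

    date.weekday() returns 0 for Monday through 6 for Sunday,
    so values 5 and 6 mark the weekend.
    """
    return [e for e in events if e["catalyst_date"].weekday() >= 5]

# Made-up events for illustration only.
sample = [
    {"ticker": "AAA", "catalyst_date": date(2026, 1, 3)},  # Saturday
    {"ticker": "BBB", "catalyst_date": date(2026, 1, 5)},  # Monday
]
suspect = weekend_dated(sample)
```

Flagged rows still need manual correction against the primary source, since the check only says a date is implausible, not what the right date is.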

Backtest hardening + point-in-time leak audit · AUC 0.846 · 1909 events

Comprehensive look-ahead audit on every feature path. Closed two real leaks: (1) pivotal-trial primary_completion_date — 8.5% of test events were leaking Phase-3 results from up to 29 months in the future via _group_programs; (2) f25 CTO_PHASE3 lookup — same leak via NCT-keyed CTO scores that use post-trial features. Both fixed in _group_programs / _flatten_features. 8-K f14 extractor default also tightened (days_after 180 -> 0). Confirmed safe: openFDA, AdComm, CRL count, sponsor track record, cash/burn, foreign approvals. Honest test AUC 0.853 -> 0.846 [bootstrap 95% CI 0.79, 0.86]. Walk-forward AUC ranges 0.72-0.91 by year. Mega-pharma stratification: 0.96 AUC on PFE/JNJ/etc., 0.81 on small/mid-cap (the events that matter for analyst use). Brier 0.106, ECE 4.7%. 22/22 spot-checked labels match openFDA ground truth. Cleanest backtest the engine has ever had.
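The general shape of a point-in-time guard for the two leaks described above is sketched here. It is not the actual fix inside _group_programs or _flatten_features, whose internals are not shown; the field names and the helper are hypothetical.

```python
from datetime import date

# Hypothetical names for fields that only exist once a trial has read out.
RESULT_FIELDS = ("met_primary_endpoint", "cto_phase3_score")

def as_of_view(trial, as_of):
    """Return a copy of a trial record as it would have looked on `as_of`.

    If the primary completion date is missing or later than `as_of`,
    outcome-bearing fields are replaced with None so they cannot leak
    future information into the features.
    """
    completed = trial.get("primary_completion_date")
    if completed is not None and completed <= as_of:
        return dict(trial)
    return {k: (None if k in RESULT_FIELDS else v) for k, v in trial.items()}

# Made-up trial: results arrive after the prediction date.
trial = {
    "nct_id": "NCT00000000",
    "primary_completion_date": date(2024, 9, 1),
    "met_primary_endpoint": True,
    "cto_phase3_score": 0.9,
}
safe = as_of_view(trial, as_of=date(2023, 1, 15))
```

Applying the same view to every feature path at the prediction date is what makes the backtest number comparable to live scoring.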

CURRENT METRICS

AUC-ROC: 0.853 (public-only test, 95% CI [0.81, 0.89])

ACCURACY: 85% (optimal threshold)

EVENTS: 954 (2006-2026)

TEST SET: 301 (2021-2026 holdout)

FACTORS: 28 primary + 5 learned interactions

DATA SOURCES: live API feeds

BENCHMARKS

MODEL                                            EVENTS   AUC
ApprovalAlpha (35 factors, same test set)        532      0.853
Lo et al. 2019 method (reproduced on our data)   532      0.807
Phase x Indication base rate                     532      0.667

NEXT TARGETS

AUC 0.80: Expand to 1,200+ events, market-implied probability feature

AUC 0.82: FDA review document sentiment extraction, 1,500+ events

AUC 0.85: Molecular structure features, proprietary data integration