DEVELOPMENT

Model Progress

AUC-ROC on the public-ticker test set: 0.853 [95% CI 0.812, 0.886]. The Lo et al. 2019 method, reproduced on the same data, scores 0.807. The model has been leakage-audited, the inflation from private-ticker events has been removed, and the causal analysis of FDA designations (BTD/FTD/ODD) was completed in April 2026. 240 f14 labels have been migrated from LLM extraction to deterministic SEC 8-K extraction. Each milestone represents a measurable improvement validated on a held-out test set.

AUC-ROC PROGRESSION
STEP · AUC · CHANGE (shown for gains only)
Baseline · 0.614
Feature engineering · 0.656 · +0.042
Stacked ensemble · 0.680 · +0.024
Label quality · 0.705 · +0.025
LLM extraction · 0.727 · +0.022
Data expansion · 0.773 · +0.046
Drugs@FDA cross-ref · 0.783 · +0.010
Learned weights · 0.864 · +0.081
Sign-constrained L1 · 0.827
Drop private from train · 0.836 · +0.009
BTD interactions · 0.834
Isotonic calibration · 0.845 · +0.011
Factor audit · 0.845
Effect size + foreign approvals · 0.845
Leakage audit · 0.845
Drop private from test · 0.809
MoA class feature · 0.814 · +0.005
SHAP-derived interactions + causal analysis · 0.813
8-K text parsing replaces LLM f14 labels · 0.831 · +0.018
Split f16 prior approval into two features · 0.840 · +0.009
Per-drug CRL history feature (f51) · 0.850 · +0.010
Data-quality sweep + modality flag · 0.853 · +0.003
Backtest hardening + point-in-time leak audit · 0.846
MILESTONES
Baseline · AUC 0.614 · 549 events
14-factor log-odds model, normalized weights
Feature engineering · AUC 0.656 · 549 events
8 new features: prior approval, sponsor track record, modality, biomarker
Stacked ensemble · AUC 0.680 · 549 events
XGBoost meta-learner + temperature scaling, CTO dataset integration
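Temperature scaling, named in the milestone above, divides a prediction's logit by a scalar before mapping it back to a probability. The following is a minimal sketch of that idea, not the engine's actual code; the function names and the clipping constant `EPS` are illustrative assumptions.

```python
import math

EPS = 1e-12  # illustrative clipping constant that keeps logit() finite


def logit(p: float) -> float:
    """Log-odds of a probability, clipped away from 0 and 1."""
    p = min(max(p, EPS), 1.0 - EPS)
    return math.log(p / (1.0 - p))


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def temperature_scale(p: float, temperature: float) -> float:
    """Rescale a probability by dividing its logit by `temperature`.

    temperature > 1 pulls predictions toward 0.5 (less confident);
    temperature < 1 pushes them toward 0 or 1 (more confident).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return sigmoid(logit(p) / temperature)


# A raw 0.90 prediction softened with temperature 2.0.
print(round(temperature_scale(0.9, 2.0), 3))
```

The temperature is normally fitted on a validation set by minimizing log loss; that fitting step is omitted here.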
Label quality · AUC 0.705 · 549 events
Cleaned CTO labels (60% -> 98% accuracy), removed log-odds compression
LLM extraction · AUC 0.727 · 723 events
235 trial result labels from SEC 8-K + PubMed research
Data expansion · AUC 0.773 · 723 events
723 events, CRL reason tagging, concept drift correction
Drugs@FDA cross-ref · AUC 0.783 · 954 events
954 events via NME matching, 511 f14 labels, CRL type features
Learned weights · AUC 0.864 · 2032 events
2,026 events, L1 logistic regression replaces hand-set weights
Sign-constrained L1 · AUC 0.827 · 2032 events
Added resubmission + CRL-type features; dropped collinear sign-flips for a defensible fit
Drop private from train · AUC 0.836 · 1660 events
Excluded 372 private-ticker events (99% trivial approvals) from training; L1 collapses to 5 defensible coefficients
BTD interactions · AUC 0.834 · 1660 events
Added 4 BTD x red-flag interactions: Phase 3 AUC jumps 0.744 -> 0.794, prior-CRL AUC 0.705 -> 0.783
Isotonic calibration · AUC 0.845 · 1660 events
Post-hoc isotonic recalibration: Brier 0.122 -> 0.114, ECE 6.8% -> 4.2%. Corrects systematic overconfidence.
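The two calibration metrics quoted above can be computed as in the sketch below. It assumes binary labels (1 = approval) and, for ECE, ten equal-width probability bins; the source does not state the binning actually used, so `n_bins` is an assumption.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted mean gap between average confidence and observed frequency.

    Predictions are grouped into `n_bins` equal-width bins on [0, 1]
    (an assumed binning scheme, not one confirmed by the source).
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(mean_confidence - observed_rate)
    return ece


# Toy data for illustration only.
toy_probs = [0.9, 0.9, 0.1, 0.1]
toy_labels = [1, 1, 0, 0]
print(brier_score(toy_probs, toy_labels))
print(expected_calibration_error(toy_probs, toy_labels))
```

Both metrics are lower-is-better, which is why the drop from 0.122 to 0.114 (Brier) and from 6.8% to 4.2% (ECE) counts as an improvement.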
Factor audit · AUC 0.845 · 1660 events
Dropped 5 broken factors (f8, f11, f18, f20, f22), wired f16/f17/miss_f14 as live factors, fixed AdComm naming. Every L1-learned weight now actually applies in live scoring.
Effect size + foreign approvals · AUC 0.845 · 1660 events
Added graded effect-size magnitude (HR/p-value) and EMA/PMDA prior-approval factor. Brier 0.113, ECE 3.7% (best calibration to date).
Leakage audit · AUC 0.845 · 1656 events
Stripped "(CRL)" / "(Resubmission)" tags from 47 drug names that leaked outcome. Dropped 6 duplicate events with conflicting labels. Train-test gap fell from +0.063 to +0.036. 95% CI [0.807, 0.878], permutation p<0.001.
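The tag leak described above is simple to illustrate: a drug name that carries an outcome tag hands the label to the model. The sketch below strips the two tags the audit names; the regular expression and function name are illustrative, and the example names are made up.

```python
import re

# Matches the two outcome-revealing tags named in the audit, plus any
# whitespace in front of them. The pattern itself is an assumption.
OUTCOME_TAG = re.compile(r"\s*\((?:CRL|Resubmission)\)", re.IGNORECASE)


def strip_outcome_tags(drug_name: str) -> str:
    """Remove "(CRL)" / "(Resubmission)" tags from a drug name."""
    return OUTCOME_TAG.sub("", drug_name).strip()


# Hypothetical names for illustration.
for raw in ["Exampledrug (CRL)", "Exampledrug (Resubmission) 10 mg", "Exampledrug"]:
    print(repr(strip_outcome_tags(raw)))
```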
Drop private from test · AUC 0.809 · 1497 events
Previous headline included 159 private-ticker events (99% trivial approvals) that inflated AUC by ~0.03. Public-only test AUC is the honest number: 0.809 [CI 0.761, 0.847], essentially matching Lo et al. 2019 (0.810). Full test AUC remains 0.834 for reference.
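Confidence intervals like the one quoted above are commonly obtained by bootstrapping the test set. The sketch below shows one way to do that with a percentile bootstrap; the resample count, the seed and the function names are assumptions, and the source does not say which bootstrap variant was used.

```python
import random
from typing import Sequence, Tuple


def auc_roc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def bootstrap_auc_ci(
    scores: Sequence[float],
    labels: Sequence[int],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for AUC, resampling events with replacement."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_labels = [labels[i] for i in idx]
        if len(set(resampled_labels)) < 2:
            continue  # skip resamples that lost one of the two classes
        stats.append(auc_roc([scores[i] for i in idx], resampled_labels))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data for illustration only.
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
print(auc_roc(toy_scores, toy_labels))
print(bootstrap_auc_ci(toy_scores, toy_labels, n_boot=200))
```

The pairwise AUC here is quadratic in the number of events, which is acceptable for a test set of a few hundred events but would need a rank-based formulation at larger scale.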
MoA class feature · AUC 0.814 · 1497 events
Added ChEMBL-derived mechanism-of-action class feature (f44) and missingness indicator (f45). Nuclear-receptor drugs are approved 70% of the time vs. 92% for tyrosine-kinase inhibitors.
SHAP-derived interactions + causal analysis · AUC 0.813 · 1497 events
Added 5 non-linear interactions from XGBoost SHAP analysis (sponsor x pipeline, trial x priority review, endpoint x base rate). Closed 20% of the L1-to-XGBoost AUC gap. Causal analysis: BTD/FTD/ODD are signals of difficulty, not approval accelerants. Also added: PoS funnel, 61 tests, CI pipeline, and a Lo et al. head-to-head on the same test set.
8-K text parsing replaces LLM f14 labels · AUC 0.831 · 1497 events
Built a deterministic SEC 8-K/6-K classifier (negation-scope masking, FDA-approval pattern bank, ratio-threshold aggregation across filings). Replaced 240 of 403 Perplexity-sourced trial-outcome labels with verdicts backed by quoted SEC filing text. 8-K labels tied LLM on realized-outcome accuracy (75.4% vs 75.8%) while restoring full primary-source provenance. Every f14 label now traces to SEC EDGAR, CT.gov, or is honestly flagged miss_f14=1.
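To make the two named techniques concrete, here is a toy sketch of negation-scope masking followed by ratio-threshold aggregation across filings. It is not the production classifier: the regular expressions, the four-word negation window, the 0.7 threshold and all function names are illustrative assumptions.

```python
import re
from typing import Optional, Sequence

# Illustrative patterns only; the real pattern bank is not shown in the source.
NEGATION_SCOPE = re.compile(
    r"\b(?:did not|failed to|not|no)\b(?:\W+\w+){0,4}", re.IGNORECASE
)
POSITIVE = re.compile(
    r"\b(?:met|meet|achieved|achieve)\s+(?:its|the)\s+primary\s+endpoint\b",
    re.IGNORECASE,
)


def filing_verdict(text: str) -> int:
    """Return +1 (positive), -1 (negative) or 0 (no signal) for one filing."""
    negated_hits = 0

    def mask(match):
        # A positive phrase inside a negation scope counts as a negative hit.
        nonlocal negated_hits
        if POSITIVE.search(match.group(0)):
            negated_hits += 1
        return " [NEG] "

    masked = NEGATION_SCOPE.sub(mask, text)
    positive_hits = len(POSITIVE.findall(masked))
    if positive_hits > negated_hits:
        return 1
    if negated_hits > positive_hits:
        return -1
    return 0


def aggregate_verdicts(texts: Sequence[str], threshold: float = 0.7) -> Optional[str]:
    """Label a trial only when one side holds at least `threshold` of the filings."""
    verdicts = [v for v in map(filing_verdict, texts) if v != 0]
    if not verdicts:
        return None  # no usable filing; the caller would set miss_f14=1
    positive_share = sum(v == 1 for v in verdicts) / len(verdicts)
    if positive_share >= threshold:
        return "positive"
    if positive_share <= 1 - threshold:
        return "negative"
    return None  # filings disagree too much to commit to a label


# Made-up filing snippets for illustration.
print(aggregate_verdicts([
    "The trial met its primary endpoint.",
    "The study achieved the primary endpoint.",
]))
print(aggregate_verdicts(["The trial did not meet its primary endpoint."]))
```

Masking before matching is what keeps "did not meet its primary endpoint" from being counted as a positive hit; returning `None` on disagreement mirrors the honest-missingness behaviour described above.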
Split f16 prior approval into two features · AUC 0.840 · 1497 events
A recency sanity check revealed the model was overconfident on 5 of the 22 PDUFAs dated 2026 (CORT relacorilant, INCY Zynyz expansion, PHAR Joenja sNDA, PTCT Translarna, RGNX gene therapy). Root cause: f16_prior_approval conflated "sponsor has approved drugs before" with "this exact molecule is already on the US market." Split into f16a (sponsor competence, learned weight +0.75) and f16b (label expansion, learned weight -0.72, the opposite sign). 2026-slice AUC rose from 0.803 to 0.838, matching the pooled headline. CORT 0.977 -> 0.810, INCY 0.890 -> 0.779.
Per-drug CRL history feature (f51) · AUC 0.850 · 1497 events
Added a drug-level CRL-history feature distinct from sponsor-level f32/f33. Translarna (ataluren) had been CRL'd twice before its 2026 PDUFA; no existing feature captured that this specific molecule keeps getting rejected. f51 counts prior CRLs for the same active ingredient (token-based match, capped at 3, normalized to [0,1]). Learned weight -2.31, third-largest magnitude in the model. PTCT Translarna: 0.969 -> 0.856 (CRL correctly predicted). 2026-slice AUC jumped to 0.863, above the pooled headline. Brier drops back to 0.105, ECE back to 3.0%.
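A feature of this shape (count prior CRLs for the same ingredient, cap at 3, divide by the cap) might be computed as sketched below. The record layout, the outcome encoding and the token-matching rule (any shared token of four or more characters) are assumptions; the source only says the match is token-based.

```python
import re
from datetime import date
from typing import Iterable, NamedTuple, Set


class PastDecision(NamedTuple):
    ingredient: str      # active-ingredient text recorded for the past event
    decision_date: date
    outcome: str         # hypothetical encoding, e.g. "CRL" or "APPROVED"


def ingredient_tokens(name: str) -> Set[str]:
    """Lowercased alphanumeric tokens; short tokens are dropped (illustrative rule)."""
    return {t for t in re.split(r"[^a-z0-9]+", name.lower()) if len(t) >= 4}


def f51_prior_crl(
    ingredient: str,
    event_date: date,
    history: Iterable[PastDecision],
    cap: int = 3,
) -> float:
    """Prior CRLs for the same ingredient, capped at `cap` and scaled to [0, 1]."""
    target = ingredient_tokens(ingredient)
    prior_crls = sum(
        1
        for past in history
        if past.outcome == "CRL"
        and past.decision_date < event_date           # only look backwards in time
        and target & ingredient_tokens(past.ingredient)
    )
    return min(prior_crls, cap) / cap


# Made-up history for a fictional ingredient.
history = [
    PastDecision("examplinib", date(2020, 1, 15), "CRL"),
    PastDecision("Examplinib tablets", date(2022, 6, 1), "CRL"),
    PastDecision("otherdrug", date(2021, 3, 3), "CRL"),
]
print(f51_prior_crl("examplinib", date(2026, 1, 1), history))
```

The strict `<` comparison on dates keeps a decision issued on the event date itself from counting as history.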
Data-quality sweep + modality flag · AUC 0.853 · 1497 events
Data audit: fixed 5 weekend-dated catalyst events (FDA only acts on weekdays), resolved an f49 feature-ID collision, and shipped an INN-stem-based modality classifier. Added f52_is_novel_modality flag (gene therapy, cell therapy, oligonucleotide, oncolytic virus). Empirical: novel modalities approve at ~50-70% in training vs 83-92% for small molecules/mAbs. Expanded the L1 alpha grid to include 0.0015/0.002; 0.002 wins on AUC. Base-rate drift vs BIO 2011-2020 (oncology -16pp, heme -24pp) documented in methodology; isotonic recalibration absorbs it. AUC 0.850 -> 0.853, ECE 3.0% -> 2.9%.
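The weekend-date check from the audit above is easy to automate. A minimal sketch, assuming events are available as (identifier, date) pairs; that layout and the function name are hypothetical.

```python
from datetime import date
from typing import Iterable, List, Tuple


def weekend_dated(events: Iterable[Tuple[str, date]]) -> List[Tuple[str, date]]:
    """Return events whose catalyst date falls on a Saturday or Sunday."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return [(event_id, d) for event_id, d in events if d.weekday() >= 5]


# Made-up events for illustration.
events = [
    ("event-a", date(2026, 4, 4)),  # Saturday
    ("event-b", date(2026, 4, 6)),  # Monday
    ("event-c", date(2026, 4, 5)),  # Sunday
]
for event_id, d in weekend_dated(events):
    print(event_id, d.isoformat())
```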
Backtest hardening + point-in-time leak audit · AUC 0.846 · 1909 events
Comprehensive look-ahead audit on every feature path. Closed two real leaks: (1) pivotal-trial primary_completion_date — 8.5% of test events were leaking Phase-3 results from up to 29 months in the future via _group_programs; (2) f25 CTO_PHASE3 lookup — same leak via NCT-keyed CTO scores that use post-trial features. Both fixed in _group_programs / _flatten_features. 8-K f14 extractor default also tightened (days_after 180 -> 0). Confirmed safe: openFDA, AdComm, CRL count, sponsor track record, cash/burn, foreign approvals. Honest test AUC 0.853 -> 0.846 [bootstrap 95% CI 0.79, 0.86]. Walk-forward AUC ranges 0.72-0.91 by year. Mega-pharma stratification: 0.96 AUC on PFE/JNJ/etc., 0.81 on small/mid-cap (the events that matter for analyst use). Brier 0.106, ECE 4.7%. 22/22 spot-checked labels match openFDA ground truth. Cleanest backtest the engine has ever had.
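Both leaks closed above come down to the same rule: a feature may only use records dated on or before the moment the prediction is made. The sketch below illustrates that rule under stated assumptions. The `Trial` record, the helper names and the reading of `days_after` as "days past the as-of date that a filing may still be used" are all hypothetical; the real fixes live in `_group_programs` and `_flatten_features`, whose code is not shown in the source.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Sequence


@dataclass(frozen=True)
class Trial:
    nct_id: str
    primary_completion_date: date


def trials_known_as_of(trials: Sequence[Trial], as_of: date) -> List[Trial]:
    """Keep only trials whose primary completion is not in the future."""
    return [t for t in trials if t.primary_completion_date <= as_of]


def filings_in_window(
    filing_dates: Sequence[date], as_of: date, days_after: int = 0
) -> List[date]:
    """Keep filings dated no later than `as_of + days_after` days.

    With days_after=0 nothing published after the as-of date is admitted.
    """
    cutoff = as_of + timedelta(days=days_after)
    return [d for d in filing_dates if d <= cutoff]


# Made-up records for illustration; the NCT identifiers are placeholders.
as_of = date(2024, 1, 1)
trials = [
    Trial("NCT00000001", date(2023, 6, 30)),
    Trial("NCT00000002", date(2025, 3, 31)),  # completes after as_of: excluded
]
print([t.nct_id for t in trials_known_as_of(trials, as_of)])
print(filings_in_window([date(2023, 12, 1), date(2024, 3, 1)], as_of))
```

Under this reading, a 180-day window would have admitted the 2024-03-01 filing in the example, which is the kind of look-ahead the tightened default removes.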
CURRENT METRICS
AUC-ROC · 0.853 · public-only test, 95% CI [0.81, 0.89]
ACCURACY · 85% · optimal threshold
EVENTS · 954 · 2006-2026
TEST SET · 301 · 2021-2026 holdout
FACTORS · 33 · 28 primary + 5 learned interactions
DATA SOURCES · 5 · live API feeds
BENCHMARKS
MODEL · EVENTS · AUC
ApprovalAlpha (35 factors, same test set) · 532 · 0.853
Lo et al. 2019 method (reproduced on our data) · 532 · 0.807
Phase x Indication base rate · 532 · 0.667
NEXT TARGETS
AUC 0.80 · Expand to 1,200+ events, market-implied probability feature
AUC 0.82 · FDA review document sentiment extraction, 1,500+ events
AUC 0.85 · Molecular structure features, proprietary data integration