Beyond curation: held-out resolution of gene-therapy targets

Wilson Mudaki¹

¹Mishale, Inc., Wilmington, Delaware, USA.

Technical note · 16 June 2026

Does a target-assessment engine reason about biology, or recite what it was configured to know? The distinction is measurable, and it matters to anyone relying on the output. Across 37 monogenic diseases held out of Mishale’s curated set, the engine, given only a disease name, committed to an answer for 20 and recovered the established causal gene in 75% of those (85% within its top three). It declined rather than guess on the other 17, and it resolved to a wrong gene in 3 of 37. The benchmark runs against every release.

75%: exact gene when it commits (15/20; 95% CI 53–89%)

85%: correct gene in top three (17/20; 95% CI 64–95%)

3/37: resolved to a wrong gene; declines rather than guess

The benchmark

Mishale resolves a disease to its causal gene through a general pipeline — ontology mapping, a knowledge graph assembled from authoritative public sources, and ranking by the strength of the causal relationship. None of that path depends on the disease being one we have previously studied. To quantify how the engine behaves on unfamiliar biology, we hold out 37 monogenic diseases, each with a single, canonical causal gene and each deliberately outside the curated strategy set.

The conditions are strict by design. The engine receives only the disease name and must discover the gene itself. Held-out status is verified case by case (the resolved disease identifier is confirmed absent from our curated entries), so the score cannot borrow credit from prior curation. And three outcomes are scored separately: the engine may return the correct gene, return a wrong gene, or decline. For the earliest, most consequential decision in a programme, these are not equivalent: a wrong gene commits capital in the wrong direction with false confidence, whereas a decline returns the decision to the team intact.
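The scoring rules above can be sketched in a few lines. This is an illustrative reading of the text, not Mishale's implementation: the function names, the outcome labels and the placeholder identifiers are all assumptions, and a separate top-three label is included only because the reported figures distinguish that tier.

```python
from typing import Optional, Sequence, Set


def is_held_out(disease_id: str, curated_ids: Set[str]) -> bool:
    """A case counts as held out only if its resolved identifier is not curated."""
    return disease_id not in curated_ids


def score_case(ranked_genes: Optional[Sequence[str]], canonical_gene: str) -> str:
    """Classify one case from the engine's ranked candidates.

    `ranked_genes` is None or empty when the engine declines (hypothetical
    convention for this sketch).
    """
    if not ranked_genes:
        return "declined"
    if ranked_genes[0] == canonical_gene:
        return "exact_top1"
    if canonical_gene in ranked_genes[:3]:
        return "correct_in_top3"
    return "wrong"


# Toy inputs with placeholder names; these are not benchmark outputs.
curated = {"DISEASE:0001"}
print(is_held_out("DISEASE:0002", curated))              # True
print(score_case(["GENE_A", "GENE_B"], "GENE_A"))        # exact_top1
print(score_case(["GENE_B", "GENE_A"], "GENE_A"))        # correct_in_top3
print(score_case(["GENE_B", "GENE_C"], "GENE_A"))        # wrong
print(score_case(None, "GENE_A"))                        # declined
```

Keeping "declined" as its own label, rather than folding it into "wrong", is what lets precision on committed answers and the abstention rate be reported separately.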

Result

From the disease name alone, the engine recovered the established gene across metabolic, lysosomal, ophthalmic, neurological and haematologic disorders it had never been taught — among them choroideremia (CHM), Alport syndrome (COL4A5), Usher syndrome type 1B (MYO7A) and familial Mediterranean fever (MEFV). Among the diseases it committed to, it returned the exact causal gene in 75% of cases (15/20; 95% Wilson CI 53–89%) and placed the correct gene within its top three in 85% (17/20; 95% CI 64–95%). It resolved to a wrong gene in only 3 of 37; on the rest it declined.
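For readers who want to check the intervals, the following is a minimal sketch of the standard Wilson score interval for a binomial proportion, applied to the 15/20 and 17/20 counts reported above. The function name is an assumption, and the z value is the usual two-sided 95% normal quantile.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (default: 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


for successes in (15, 17):
    low, high = wilson_interval(successes, 20)
    print(successes, round(100 * low), round(100 * high))
# 15 53 89
# 17 64 95
```

With only 20 committed cases the intervals are wide, which is why the point estimates are best read together with their bounds.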

Exact gene (top-1) — 15 (41%)

Correct in top three but not top-1 — 2 (5%)

Declined — 17 (46%)

Wrong gene — 3 (8%)

Outcome distribution across the 37 held-out diseases. The engine resolves the clear cases precisely and abstains on the rest. “Declined” is a deliberate hand-off, not a failure to compute.

Outcome across 37 held-out diseases	n	Share
Resolved — exact causal gene (top-1)	15	41%
Resolved — correct gene within top three	17	46%
Declined to resolve (abstained)	17	46%
Resolved — wrong gene	3	8%

Top-three is inclusive of top-1. Rows are not mutually exclusive across the two “correct” tiers.
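The relationship between the mutually exclusive counts in the figure and the inclusive top-three row in the table can be reproduced with simple arithmetic. The sketch below uses only the counts reported in this note; the dictionary keys are illustrative labels.

```python
# Mutually exclusive outcome counts, as reported in the figure.
counts = {
    "exact_top1": 15,
    "top3_not_top1": 2,
    "declined": 17,
    "wrong": 3,
}
total = sum(counts.values())  # 37 held-out diseases


def share(n: int) -> int:
    """Percentage of all held-out diseases, rounded to a whole number."""
    return round(100 * n / total)


# The table's top-three row includes the top-1 hits.
top3_inclusive = counts["exact_top1"] + counts["top3_not_top1"]

print(total)                                        # 37
print(top3_inclusive, share(top3_inclusive))        # 17 46
print({name: share(n) for name, n in counts.items()})
# {'exact_top1': 41, 'top3_not_top1': 5, 'declined': 46, 'wrong': 8}
```

Because the top-three row already contains the 15 exact hits, the table's shares add up to more than 100%, while the four exclusive categories in the figure add up to 100%.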

Abstention is a design choice

A 46% abstention rate is easily misread as the engine “working half the time.” The opposite is closer to the truth. The purpose of an early feasibility assessment is to protect the most expensive decision in a programme before any capital is committed. An instrument that resolves the clear cases precisely and signals “this one needs a human” on the rest is behaving exactly as a triage instrument should: high precision when it speaks, an explicit hand-off when it cannot. Mishale is built to abstain rather than fabricate, the same property reported in our validation study, where no adversarial input produced an invented target.
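One way to make the distinction concrete is to report coverage (how often the engine commits) separately from precision on the committed cases. The sketch below is an assumption-laden illustration: the function and key names are hypothetical, and the committed set is taken to be the 20 cases implied by the 15/20 and 17/20 denominators.

```python
from typing import Dict


def selective_metrics(exact: int, top3_only: int, wrong: int, declined: int) -> Dict[str, float]:
    """Separate 'how often it answers' from 'how good the answers are'."""
    committed = exact + top3_only + wrong
    total = committed + declined
    return {
        "coverage": committed / total,                # share of cases answered
        "abstention_rate": declined / total,          # share handed to a human
        "precision_when_committed": exact / committed,
        "top3_when_committed": (exact + top3_only) / committed,
        "wrong_rate_overall": wrong / total,
    }


metrics = selective_metrics(exact=15, top3_only=2, wrong=3, declined=17)
for name, value in metrics.items():
    print(f"{name}: {value:.0%}")
# coverage: 54%
# abstention_rate: 46%
# precision_when_committed: 75%
# top3_when_committed: 85%
# wrong_rate_overall: 8%
```

Read this way, the 46% abstention rate and the 75% precision describe different things: the first is how often the decision goes back to the team, the second is how reliable the engine is when it does answer.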

The current edge

We name our limits. The incorrect resolutions cluster at a single hard boundary: fine-grained subtype disambiguation, for instance telling spinocerebellar ataxia type 3 apart from type 36 when given only a free-text name. Disease families with many causal genes are where we counsel the most human oversight, and where the engine most often, and correctly, declines.

Scope

This benchmark measures target resolution — whether the engine identifies the right gene. It is distinct from the downstream guide, vector and dosing assessments, each of which carries its own stated confidence and validation status in every memo. The 37 diseases are clean single-gene disorders chosen for an unambiguous ground truth; the figures are indicative of generalization on well-characterised monogenic biology, with intervals that reflect the sample size. “Correct” denotes agreement with the canonical gene–disease assignment, not a claim about clinical outcome. Every Mishale assessment is a computational analysis that scopes and de-risks laboratory work; it does not replace it.

Why we publish this

The value of a computational feasibility tool rests on whether its output can be trusted — and on whether it tells you where not to trust it. So we measure generalization on held-out biology, we run that measurement on every release rather than asserting it once, and we publish the figure, the abstention rate and the edges. That is the standard of evidence we hold ourselves to.

To see how Mishale reasons about a target you care about, request a feasibility memo or write to contact@mishale.bio.