Every statistic you encounter is a finished argument — something was counted, someone chose what to count, a window was set, a denominator was picked, and the failures that did not survive long enough to be counted were quietly omitted. The headline gives you the conclusion. The argument's scaffolding is gone. This lesson teaches four questions that rebuild it — and explains why precision in a number is not the same thing as accuracy.
By the end, you will be able to apply a four-question interrogation to any market statistic before it influences a decision, and you will have a concrete simulator exercise to practice the habit under realistic pressure.
A Statistic Is an Argument With Parts Removed
When a fund manager, journalist, or product deck says "returns of X%," a useful thought is: this is the conclusion of an argument I have not been shown. Somewhere upstream, choices were made about what to include, how to define success, which calendar window to use, and which participants to measure. Those choices can be defensible or motivated. You cannot tell from the headline. The four questions are a framework for recovering the missing parts.
The Four Questions
1. Compared to what?
A number without a reference point is decoration. That a fund returned 12% last year is a fact. Whether that fact is impressive, mediocre, or a warning sign depends entirely on what a passive alternative returned over the same period. A 12% return in a year when a comparable index returned 27% is a claim worth examining, not celebrating. Before any statistic moves your decision, name the benchmark. If no benchmark is offered, ask why.
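The 12% versus 27% comparison can be made concrete with a few lines of arithmetic. This is a minimal sketch: the function names are hypothetical, and the two returns are the illustrative figures from the paragraph above, not data from any real fund.

```python
def excess_return(fund_return: float, benchmark_return: float) -> float:
    """Fund return minus benchmark return, both as decimals."""
    return fund_return - benchmark_return


def relative_wealth(fund_return: float, benchmark_return: float) -> float:
    """Ending wealth in the fund per unit of ending wealth in the benchmark."""
    return (1 + fund_return) / (1 + benchmark_return)


# Illustrative figures from the text: fund +12%, comparable index +27%.
fund, index = 0.12, 0.27
gap = excess_return(fund, index)
ratio = relative_wealth(fund, index)

print(f"excess return: {gap:+.1%}")
print(f"wealth relative to benchmark: {ratio:.3f}")
```

A relative wealth below 1.0 means the investor ended the year with less than the passive alternative would have delivered, whatever the headline return says.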
2. Who's missing?
The most dangerous distortion in financial media is not a lie but an omission. When a data set is assembled from the funds, strategies, or assets that still exist at the time of measurement, the failures have already left the room. Funds that closed, strategies that blew up, and accounts that were zeroed out are no longer in the sample. The measured average now belongs to a population that was selected, partly, by its ability to survive. This is survivorship bias, and it pushes performance averages upward wherever failure is what removes an entity from the sample. The question "who's missing from this sample?" is not optional. For performance datasets it is often the most corrective of the four.
3. Over what window?
Any run of returns, however short, can be presented as a track record. A strategy that made money during a single unusual regime (a low-volatility bull market, a brief momentum cycle, a specific interest-rate environment) may show a compelling number over that window. Extend the window to include the regime that preceded it, or the one that followed, and the picture changes. Window selection is one of the most readily available tools for producing a statistically impressive but practically misleading result. The question is not just "how long?" but "what happened before and after the window you are showing me?"
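One way to see the effect of window selection is to compound the same return series over every possible window and compare the best one with the full period. The sketch below uses a made-up series of annual returns (`annual_returns` is hypothetical and purely illustrative) and a three-year window chosen arbitrarily.

```python
from math import prod

# Hypothetical annual returns for one strategy, oldest first (illustrative only).
annual_returns = [-0.18, 0.04, 0.22, 0.19, 0.15, -0.25, 0.03, -0.08]


def cumulative(returns):
    """Compound a sequence of periodic returns into one total return."""
    return prod(1 + r for r in returns) - 1


def windows(returns, length):
    """Yield (start_index, total_return) for every contiguous window."""
    for start in range(len(returns) - length + 1):
        yield start, cumulative(returns[start:start + length])


best_start, best = max(windows(annual_returns, 3), key=lambda w: w[1])
worst_start, worst = min(windows(annual_returns, 3), key=lambda w: w[1])
full = cumulative(annual_returns)

print(f"best 3-year window starts at year {best_start}: {best:+.1%}")
print(f"worst 3-year window starts at year {worst_start}: {worst:+.1%}")
print(f"full period: {full:+.1%}")
```

The same strategy supports a strong headline, a poor one, or a flat one depending only on which years are shown, which is why the start date, the end date, and the surrounding years all matter.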
4. Out of how many?
A single impressive result, or even a short series of them, is consistent with luck. A medical test that is 90% accurate will still produce thousands of false positives when it is run across a large enough population. A fund that outperforms its benchmark for three consecutive years in a universe of ten thousand funds is not necessarily skilled. It may simply sit in the upper tail of what chance alone would produce. Before a performance claim changes your view, ask how many entities were running a similar approach. The answer determines whether the result reflects something real or is simply what you would expect from a large enough sample of coin flips.
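The coin-flip intuition can be put into numbers. The sketch below assumes, purely for illustration, that each fund has an independent 50% chance of beating its benchmark in any given year. Real funds are not independent coin flips, so treat the output as a baseline for what luck alone could produce, not as an estimate. The function names are hypothetical.

```python
def expected_lucky_streaks(universe_size: int, streak_years: int, p_win: float = 0.5) -> float:
    """Expected number of funds that win every year of the streak by chance alone."""
    return universe_size * p_win ** streak_years


def prob_at_least_one(universe_size: int, streak_years: int, p_win: float = 0.5) -> float:
    """Probability that at least one fund in the universe completes the streak by chance."""
    return 1 - (1 - p_win ** streak_years) ** universe_size


for size in (12, 5_000, 10_000):
    lucky = expected_lucky_streaks(size, streak_years=3)
    chance = prob_at_least_one(size, streak_years=3)
    print(f"universe={size:>6}: expected lucky 3-year streaks={lucky:8.1f}, "
          f"P(at least one)={chance:.3f}")
```

Under these assumptions, a three-year streak in a large universe is something luck produces in bulk, so the streak by itself says little about skill.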
Survivorship: The Missing Failures
The survivorship problem deserves its own treatment because it is so systematically underestimated. When you look at the fund performance data available at any point in time, you are not seeing the full history. You are seeing the history of the funds that survived. The ones that were closed, merged, or liquidated after poor performance are gone from the dataset. This means the average return of "existing funds" tends to be higher than the average return of "all funds that ever ran," because the worst outcomes were removed when those funds failed.
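A small simulation shows the mechanism. This is a sketch under stated assumptions, not a model of any real fund universe: every hypothetical fund draws annual returns from the same distribution (so none has more skill than another), and a fund is closed the first year-end its value falls below 80% of its starting value. The parameters `mu`, `sigma`, and `close_below` are invented for illustration.

```python
import random
from statistics import mean


def simulate_fund(rng, years=10, mu=0.06, sigma=0.15, close_below=0.8):
    """Return (final_value, survived) for one hypothetical fund.

    The fund starts at 1.0 and is closed the first year-end its value
    drops below `close_below`; its value is frozen at that point.
    """
    value = 1.0
    for _ in range(years):
        value *= 1 + rng.gauss(mu, sigma)
        if value < close_below:
            return value, False
    return value, True


rng = random.Random(42)  # fixed seed so the run is repeatable
funds = [simulate_fund(rng) for _ in range(5000)]

all_values = [value for value, _ in funds]
survivor_values = [value for value, alive in funds if alive]

print(f"funds launched:         {len(all_values)}")
print(f"funds surviving:        {len(survivor_values)}")
print(f"mean value, all funds:  {mean(all_values):.3f}")
print(f"mean value, survivors:  {mean(survivor_values):.3f}")
```

Because every closed fund ends below the closure threshold and every survivor ends at or above it, the survivor-only mean exceeds the all-funds mean even though no fund in the simulation has any edge.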
The S&P SPIVA scorecard directly addresses this. The SPIVA methodology uses the CRSP Survivor-Bias-Free US Mutual Fund Database, which corrects for this distortion by including funds that were liquidated or merged during the measurement period. The result is a materially different picture than a database of only surviving funds would show. According to the SPIVA US Year-End 2024 scorecard (S&P Dow Jones Indices), 84.34% of actively managed US large-cap equity funds underperformed the S&P 500 over the 10-year period ending December 31, 2024. Over the 15-year period, 89.50% underperformed. These figures are survivorship-bias-corrected — the denominator includes funds that did not survive the full measurement window. Without that correction, the underperformance figures would appear smaller, because the weakest performers would already be gone.
Eugene Fama and Kenneth French, in their 2010 paper "Luck versus Skill in the Cross-Section of Mutual Fund Returns" (Journal of Finance, Vol. 65, No. 5, pp. 1915–1947), used bootstrap simulations to examine whether mutual fund outperformance reflected genuine skill or chance. Their conclusion, in the language of the paper's abstract: "few funds produce benchmark-adjusted expected returns sufficient to cover their costs." That framing is precise. It does not say no fund has skill. It says that few funds show skill sufficient to overcome their own cost structure. The distinction matters when reading any claim that a particular manager's past returns are evidence of durable edge.
Number-Anchoring: Precision Feels Like Accuracy
There is a second trap layered on top of the four structural questions: the brain treats a precise number as a more credible one. A claim of "84.34% of funds underperformed" feels more authoritative than "roughly 85% underperformed," even though precision and accuracy are separate properties. Precision is a formatting choice. Accuracy is a matter of whether the measurement is correct, appropriately defined, and honestly contextualized. The 84.34% figure from SPIVA is both precise and well-supported because the methodology is documented and survivorship-corrected. But when you encounter a precise number in a media headline or a product deck, precision is not evidence of rigor. It can be the opposite: a way of lending false authority to an imprecise or cherry-picked measurement. Precision should prompt questions, not suppress them.
The practical implication: a round number that comes with a transparent methodology is more trustworthy than a two-decimal figure with no sourcing. Always ask for the denominator and the method before the precision of a number moves your confidence.
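To see why the denominator matters more than the decimal places, the sketch below attaches an approximate 95% margin of error to the same two-decimal figure at several sample sizes. The sample sizes are hypothetical and are not taken from any SPIVA report. The formula is the normal (Wald) approximation for a proportion, which is a simplifying assumption, and `margin_of_error` is a name invented for this example.

```python
from math import sqrt


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an observed proportion.

    Uses the normal (Wald) approximation, a simplifying assumption
    that is rough for small samples or extreme proportions.
    """
    return z * sqrt(p * (1 - p) / n)


reported = 0.8434  # a headline figure quoted to two decimals
for n in (50, 500, 5000):  # hypothetical sample sizes
    moe = margin_of_error(reported, n)
    print(f"n={n:>5}: {reported:.2%} +/- {moe:.1%}")
```

The formatted figure is identical in every row, while the uncertainty behind it changes by roughly a factor of ten. The decimals carry no information about which row you are looking at.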
The Discipline: Four Questions as a Gate
The process fix is mechanical: before any statistic influences a decision, run it through all four questions as a written gate. Not a mental note — written. The exercise of writing forces you to commit to what you know and what you do not.
- Compared to what? Name the benchmark. If you cannot, the claim is incomplete.
- Who's missing? Ask whether the sample includes failures or only survivors. If the source does not address this, treat the figure as an upper bound on performance, not a representative average.
- Over what window? Find the start and end date. Ask what came immediately before the window began and whether the claimed regime was typical or exceptional.
- Out of how many? Identify the sample size and the universe it was drawn from. A top-performing fund out of five thousand is a different claim than a top-performing fund out of twelve.
A statistic that answers all four questions may still be wrong. But one that cannot answer any of them should not be allowed to move a decision. The goal is not to become a statistician — it is to build a consistent gate that prevents a headline from doing work it has not earned.
Note that this interrogation is distinct from two related disciplines: Source Hygiene (vetting the incentives and reliability of the entity producing the number) and Auditing a Market Narrative (testing whether a broader thesis holds across multiple evidence types). Those skills operate at the level of the source and the argument. The four questions here operate at the level of the individual number — before you decide whether to let it anchor your thinking at all.
Speed Run Exercise: The Headline Stat Gate
Open Abu Terminal and start a Speed Run in any era. At some point in the run, Abu will present an event card that references a market statistic — a return figure, a success rate, a performance comparison. When you reach one of these cards, pause before marking your answer and work through the four questions in writing, on paper or in a note.
For each question, mark it as one of three states: Answered (the card provides enough to respond), Unknown (the card does not give you the information), or Not applicable (the question does not apply to this type of claim). After working through all four, generate a verdict:
- If all four are Answered: the statistic is provisionally usable — proceed, but note what you assumed.
- If one or two are Unknown: the statistic is incomplete — hold it with low confidence; it can inform but should not anchor.
- If three or four are Unknown: the statistic is not yet trustworthy for decision purposes — set it aside and note what you would need to verify it.
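The three verdict rules above can be sketched as a small function. The names (`State`, `QUESTIONS`, `verdict`) are hypothetical and are not part of Abu Terminal. One assumption goes beyond the rules as written: a question marked Not applicable is not counted as Unknown, so the verdict depends only on the number of Unknown answers.

```python
from enum import Enum


class State(Enum):
    ANSWERED = "answered"
    UNKNOWN = "unknown"
    NOT_APPLICABLE = "not applicable"


QUESTIONS = (
    "Compared to what?",
    "Who's missing?",
    "Over what window?",
    "Out of how many?",
)


def verdict(states):
    """Map the four question states to a verdict string.

    Assumption: NOT_APPLICABLE does not count against the statistic;
    only UNKNOWN answers lower the verdict.
    """
    if len(states) != len(QUESTIONS):
        raise ValueError("expected one state per question")
    unknown = sum(1 for state in states if state is State.UNKNOWN)
    if unknown == 0:
        return "provisionally usable"
    if unknown <= 2:
        return "incomplete"
    return "not yet trustworthy"


# Example: the card names a benchmark and a window but nothing else.
card = [State.ANSWERED, State.UNKNOWN, State.ANSWERED, State.UNKNOWN]
for question, state in zip(QUESTIONS, card):
    print(f"{question:<20} {state.value}")
print("verdict:", verdict(card))
```

Writing the states down before computing the verdict mirrors the written gate: the point is the record of what was unknown, not the label at the end.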
After the Speed Run, review the debrief. Look for any event where your decision was influenced by a number on the card. Check whether you ran the four questions before acting on it. If you did not, that event is your behavioral data point — the specific conditions under which the gate fails to activate. Speed and loss pressure are the most common suppressors. The simulator is the place to notice them before they cost money.
The goal is not a perfect score. The goal is to find the conditions under which you skip the gate and to make those conditions visible before they matter outside the simulator.
Related Reading
- Source Hygiene: Vetting Where Information Comes From covers how to evaluate the incentives and reliability of a source, the layer upstream of the number itself.
- Auditing a Market Narrative: Tests Before You Believe a Theme applies a broader multi-test audit to an entire thesis rather than a single figure.
- Survivorship Bias in Data: You Are Only Seeing the Winners examines how the missing failures distort backtests and strategy lists beyond fund performance.
- Backtest Honesty: Why a Beautiful Historical Test Can Be Worthless addresses overfitting and look-ahead bias, two ways a backtest produces a precise and misleading number without technically lying.
Updated: June 13, 2026
Educational simulator content, not financial advice.