You say "70% confident" before a decision. It resolves against you. You say it again on the next one, and the one after that, and those calls keep resolving against you far more often than three times in ten. At some point you will notice that your 70% calls are not being right 70% of the time. But without a record, that noticing comes late, usually after the damage, and blurred by the decisions where the label happened to match the outcome. This article is about closing that gap: measuring whether your stated confidence is informative or whether it is a comfortable fiction you have been carrying undisturbed inside your journal.

By the end, you will be able to run a calibration self-audit — assigning decisions to confidence buckets before outcomes are known, aggregating results per bucket, and reading the resulting table to see which bucket is well-calibrated, which is overconfident, and which is not yet trustworthy because the sample is too small.

What Calibration Actually Means

Calibration is a precise term, not a synonym for accuracy. A forecast is calibrated if, over many repetitions of the same stated confidence level, events occur at roughly that stated rate. If you say "70% confident" across 40 decisions and the outcome you predicted comes true on about 28 of them, your 70% calls are calibrated. If those 40 calls come true on only 20 of them, you were systematically overconfident in that bucket: your stated confidence exceeded your actual hit rate by a wide margin.

The definition requires repetition. Calibration cannot be assessed on one decision or even ten. It is a long-run property — a comparison between a distribution of stated confidences and a distribution of observed outcomes. This is why a single high-confidence call that wins is not evidence of good calibration, and a single high-confidence call that loses is not evidence of poor calibration. Both are samples of one, and samples of one carry almost no information about the underlying rate. The Sample Size and Variance article develops this point in full: the confidence interval around a small observed rate spans so many percentage points that any conclusion drawn from it is more noise than signal.

Calibration is about judgment quality, not profitability. A well-calibrated trader has a confidence score that is informative: when they say 80%, you should expect to see 80% hit rates in that bucket. That informativeness makes decisions more honest — better sizing, less self-deception about edge — but it does not guarantee profitable outcomes. Markets reward skill, timing, risk-taking, and luck in proportions that no calibration table can fully untangle. The goal here is to measure whether your confidence labels mean anything, not to promise returns if they do.

How to Build a Calibration Table

The method has three requirements that are non-negotiable. Each one exists because skipping it destroys the measurement.

Record confidence before the outcome is known. This is the foundational constraint. A confidence label assigned after you already know the result is not a forecast — it is a memory, and memory systematically revises toward the actual outcome. Research on hindsight bias documents this consistently: when people reconstruct what they believed before an event, they anchor on what happened. The calibration table only works if confidence is recorded at decision time, in the same moment as the trade rationale and the entry parameters, before price moves confirm or deny the thesis. A decision journal with a dedicated confidence field is the minimum infrastructure.
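One way to make the "before the outcome" rule hard to break is to let the journal record itself refuse later edits to the confidence field. The sketch below is a minimal illustration under assumed names: `JournalEntry`, its fields, and `resolve` are hypothetical and not part of any particular journaling tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class JournalEntry:
    """One journal row (hypothetical schema, not a prescribed format)."""
    rationale: str
    confidence_bucket: str  # e.g. "70-80%", written at decision time
    recorded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    outcome_correct: Optional[bool] = None  # stays None until the decision resolves

    def __setattr__(self, name, value):
        # Allow the first assignment (from __init__), reject any later revision.
        if name == "confidence_bucket" and "confidence_bucket" in self.__dict__:
            raise AttributeError("confidence is locked once recorded")
        super().__setattr__(name, value)

    def resolve(self, correct: bool) -> None:
        # Record the outcome exactly once, after the confidence label exists.
        if self.outcome_correct is not None:
            raise ValueError("outcome already recorded")
        self.outcome_correct = correct


entry = JournalEntry(rationale="breakout retest held", confidence_bucket="70-80%")
entry.resolve(correct=False)
print(entry.confidence_bucket, entry.outcome_correct)
```

The timestamp is there so that, if you ever doubt a label, you can check that it predates the outcome. A spreadsheet with a locked column serves the same purpose.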

Use discrete buckets, not continuous values. Continuous precision (writing "67% confident") looks precise but is not — people cannot distinguish 63% from 67% in their own judgment. Buckets absorb the noise: 50–60%, 60–70%, 70–80%, 80%+. Each bucket pools multiple decisions, which is what you need to compute a meaningful hit rate. Finer granularity is not more informative at the sample sizes available to individual traders; it just fragments the buckets until none has enough data to say anything.
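If you already have entries with a raw number attached, the bucketing step can be sketched as a small mapping function. The function name `to_bucket` and the choice to send boundary values (exactly 60%, 70%, 80%) to the higher bucket are assumptions made for illustration; the article does not specify how to handle boundaries.

```python
def to_bucket(confidence: float) -> str:
    """Map a stated confidence in [0.5, 1.0] to one of the four buckets.

    Assumption: a value sitting exactly on a boundary goes to the higher bucket.
    """
    if not 0.5 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.5 and 1.0")
    if confidence < 0.6:
        return "50-60%"
    if confidence < 0.7:
        return "60-70%"
    if confidence < 0.8:
        return "70-80%"
    return "80%+"


# 63% and 67% are indistinguishable in practice, so they share a bucket.
print(to_bucket(0.63), to_bucket(0.67))
```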

Aggregate per bucket and compute hit rate. Once you have a block of decisions recorded — a minimum of 30 to 50 before any bucket analysis is reliable, and more is better — group them by the stated confidence bucket and tally: how many decisions were made in this bucket, and in how many did the predicted outcome occur? That ratio is the observed hit rate. Compare it to the midpoint of the bucket. If your 70–80% bucket shows a 72% hit rate over 35 decisions, that bucket is developing calibration. If it shows a 48% hit rate, you are calling 70–80% confident on setups you are winning less than half the time.
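The aggregation step can be sketched in a few lines. This is an illustration under stated assumptions: `calibration_table` is a hypothetical helper, decisions are passed as (bucket label, outcome) pairs, and the open-ended 80%+ bucket is given an 85% midpoint so that it has something to be compared against. The sample data is made up.

```python
from collections import defaultdict

# Midpoints used for comparison; 0.85 for the open-ended bucket is an assumption.
MIDPOINTS = {"50-60%": 0.55, "60-70%": 0.65, "70-80%": 0.75, "80%+": 0.85}


def calibration_table(decisions):
    """decisions: iterable of (bucket_label, outcome_correct) pairs."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [decisions, correct]
    for bucket, correct in decisions:
        if bucket not in MIDPOINTS:
            raise ValueError(f"unknown bucket: {bucket!r}")
        counts[bucket][0] += 1
        counts[bucket][1] += int(correct)

    rows = []
    for bucket, midpoint in MIDPOINTS.items():
        n, hits = counts[bucket]
        if n == 0:
            continue  # nothing recorded in this bucket yet
        hit_rate = hits / n
        rows.append({
            "bucket": bucket,
            "n": n,
            "correct": hits,
            "hit_rate": hit_rate,
            "gap": hit_rate - midpoint,  # negative means overconfident
        })
    return rows


# Made-up sample: ten decisions across two buckets.
sample = [("60-70%", True)] * 4 + [("60-70%", False)] * 2 \
       + [("70-80%", True)] * 2 + [("70-80%", False)] * 2
for row in calibration_table(sample):
    print(f"{row['bucket']:>7}  n={row['n']:>2}  "
          f"hit rate={row['hit_rate']:.0%}  gap={row['gap'] * 100:+.0f} pp")
```

Ten decisions are far too few to interpret; the sample only shows the shape of the output.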

A simple worked example using hypothetical numbers shows the structure:

Stated confidence bucket	Decisions	Outcomes correct	Observed hit rate	Calibration gap (vs. midpoint)
50–60%	18	14	78%	+23 pp (under-confident)
60–70%	22	14	64%	−1 pp (near-calibrated)
70–80%	16	8	50%	−25 pp (overconfident)
80%+	7	4	57%	−28 pp (overconfident, small n)

Reading this table: the 60–70% bucket is the most calibrated — stated confidence roughly matches reality. The 50–60% bucket shows under-confidence — those decisions were winning more often than the label implied. The 70–80% bucket is the most actionable finding: the trader was systematically calling high-confidence on situations where they were right only half the time. The 80%+ bucket shows the same direction but has too few decisions to interpret confidently — a note to accumulate more data there before drawing conclusions.

The fix is a process adjustment, not a result verdict. The 70–80% calibration gap does not mean those 16 decisions were bad decisions. It means the trader's internal confidence signal at that level is not reliably informative yet, and sizing or conviction that treats those calls as strong should be reconsidered. A companion skill here is Bayesian Updating: How to Change Your Mind Without Whiplash — once you have calibration data, it becomes a legitimate input for updating how much weight you place on your own "high-confidence" calls.

The Brier Score: A Single Number for Calibration Quality

The calibration table gives you a per-bucket picture. The Brier score gives you one aggregate number across all your forecasts. Defined by meteorologist Glenn W. Brier in his 1950 paper "Verification of Forecasts Expressed in Terms of Probability" (Monthly Weather Review, 78(1), pp. 1–3), it is calculated as the mean squared difference between stated probability and actual binary outcome (1 if the predicted outcome occurred, 0 if it did not). The range is 0 to 1 in the modern formulation, and lower is better: a perfect forecaster, one who states 100% every time and is right every time, would score 0; a forecaster who consistently states 100% confidence and is wrong every time would score 1. A score of 0.25 corresponds to the performance of someone who states 50% on every decision regardless of the situation, which is the uninformed baseline.

The Brier score is useful because it captures both calibration (is your 70% actually 70%?) and resolution (do you use 90% when appropriate, or do you cluster everything near 50% to protect your score?). A trader who hedges all confidence statements toward 50% will look cautious and be hard to criticize, but will score poorly on resolution — the forecasts add no information above the baseline. The full Brier score penalizes both overconfident outliers and non-committal hedging. For practical journaling, computing a Brier score on a batch of 50+ decisions is a one-column spreadsheet operation.
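As a rough sketch of that calculation: the score is the average of (stated probability minus outcome) squared, with the outcome coded as 1 or 0. The function name `brier_score` is hypothetical, and if you journal in buckets rather than exact numbers, using the bucket midpoint as the stated probability is an assumption you would be adding.

```python
def brier_score(forecasts):
    """forecasts: sequence of (stated_probability, outcome), outcome is 1 or 0."""
    if not forecasts:
        raise ValueError("need at least one forecast")
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)


# Stating 50% on everything scores 0.25 whatever the outcomes are.
always_fifty = [(0.5, 1), (0.5, 0), (0.5, 1), (0.5, 0)]
print(brier_score(always_fifty))

# Stating 100% and being wrong every time scores 1.
always_wrong = [(1.0, 0), (1.0, 0)]
print(brier_score(always_wrong))
```

In a spreadsheet, the same thing is one helper column holding the squared difference for each row and one cell averaging it.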

The Evidence That Overconfidence Is the Default

Systematic measurement of calibration across large expert populations consistently finds the same pattern: experts are broadly overconfident, and the degree of miscalibration tends to be larger in domains with delayed, ambiguous, or infrequent feedback. Philip Tetlock's study of 284 political and economic forecasters, reported in Expert Political Judgment (Princeton University Press, 2005), tracked tens of thousands of forecasts over roughly two decades. The result was striking: the expert group was broadly overconfident, performed no better than simple algorithmic baselines on aggregate, and — critically — showed no systematic calibration advantage in the areas of their stated specialty; in Tetlock's data, knowing more tended to increase confidence faster than it improved accuracy.

In Superforecasting (with Dan Gardner; Crown, 2015), Tetlock documented a minority of forecasters — the "superforecasters" — who were notably well-calibrated and consistently outperformed the expert average across a range of questions. The defining characteristic of the superforecasters was not superior domain knowledge. It was a disciplined practice of numerical probability estimation, active revision when new information arrived, and ongoing self-measurement of calibration. The skill was learnable, but it required the infrastructure: records, numerical commitments, and willingness to see the gap between stated and observed rates.

Trading shares the feedback properties that make calibration most difficult. Outcomes arrive with noise attached: a correctly reasoned decision can produce a loss, and a poorly reasoned decision can produce a win. As a result, the feedback loop does not automatically correct overconfidence. It often reinforces it. When a confident call wins, the win is attributed to the confidence being warranted. When a confident call loses, the loss is attributed to randomness or execution. Without a calibration table, this asymmetric attribution has no corrective mechanism. The Overconfidence After Wins article covers one specific channel through which this process accelerates during positive streaks.

A Historical Example: Where Calibration Has Been Measured

The domain where probabilistic forecast calibration has been most rigorously documented is meteorology. Beginning with systematic forecast-verification research in the 1960s and 1970s, probabilistic precipitation forecasts were tracked against observations over long records. The accumulated evidence, documented by Allan Murphy and Robert Winkler (1977) and in subsequent National Weather Service studies, showed that weather forecasters, operating in an environment with rapid and unambiguous feedback, consistently produced well-calibrated probability estimates. When they said "30% chance of precipitation," rain fell on approximately 30% of those days over large samples. When they said 70%, it rained approximately 70% of the time.

This stands out because it demonstrates that calibration is achievable in practice, not merely a theoretical benchmark. It also identifies the conditions under which calibration develops: clear numerical probability expression, regular feedback, institutional tracking of stated probabilities against outcomes, and ongoing training. Remove those conditions — replace numerical expressions with verbal labels, remove the systematic tracking, slow the feedback loop — and calibration deteriorates. Trading, left to natural instinct and vague confidence language, recreates the conditions for poor calibration almost perfectly.

Honest Risk Notes

Three structural limits of calibration tables deserve explicit acknowledgment before you build one.

Small samples are noisy. A bucket with 8 decisions is not producing a reliable estimate of anything. The observed hit rate could move 20 percentage points in either direction with a few more observations. This connects directly to the Sample Size and Variance discussion: the minimum for a bucket to carry any interpretive weight is roughly 20 decisions; 35 to 50 is more robust. Below that threshold, note the bucket and accumulate data rather than drawing conclusions.
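To see how wide the uncertainty is for a small bucket, you can put an interval around the observed hit rate. The sketch below uses the Wilson score interval, which is a choice made here for illustration and not a method the article prescribes; `wilson_interval` is a hypothetical helper and the 1.96 multiplier corresponds to an approximate 95% interval.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a hit rate of hits/n."""
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need n > 0 and 0 <= hits <= n")
    p = hits / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


# A 7-decision bucket with 4 correct, then a 40-decision bucket with 28 correct.
for hits, n in [(4, 7), (28, 40)]:
    low, high = wilson_interval(hits, n)
    print(f"{hits}/{n}: {low:.0%} to {high:.0%}")
```

For 4 correct out of 7, the interval runs from roughly 25% to 84%, which is compatible with almost any confidence label. That is what "too small to interpret" looks like in numbers.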

Calibration does not transfer automatically across environments. A 70–80% bucket that is well-calibrated in trending conditions may be overconfident in range-bound conditions, because the class of setups you are labeling "70–80% confident" behaves differently under different regimes. Segment your data if you have enough of it; otherwise, treat calibration scores as provisional until you have sampled multiple market environments.

Calibration is a measure of judgment quality, not a predictor of profitability. A perfectly calibrated trader can lose money if their expected value per decision is negative — if they lose more when wrong than they win when right, or if they take less risk on high-confidence calls and more on low-confidence ones. Calibration makes your confidence labels informative. What you do with that information — position sizing, bet selection, frequency — is a separate and subsequent question that calibration alone cannot answer.
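A short piece of arithmetic makes the point concrete. The numbers are hypothetical: a bucket that is perfectly calibrated at 70%, where each win earns 1 unit of risk and each loss costs 3. The helper name `expected_value` is an assumption for illustration.

```python
def expected_value(hit_rate: float, avg_win: float, avg_loss: float) -> float:
    """Average result per decision, in the same units as avg_win and avg_loss."""
    return hit_rate * avg_win - (1 - hit_rate) * avg_loss


# Calibrated at 70%, but losses are three times the size of wins:
# 0.7 * 1 - 0.3 * 3 is about -0.2 units per decision.
print(round(expected_value(0.70, avg_win=1.0, avg_loss=3.0), 2))

# The same hit rate with symmetric payoffs is positive: about +0.4 units.
print(round(expected_value(0.70, avg_win=1.0, avg_loss=1.0), 2))
```

The confidence label was accurate in both cases. Only the payoff structure changed, and that is outside what a calibration table measures.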

Simulator Exercise

Open Abu Terminal and start a Speed Run in any era. Before each decision — before you select a choice card — assign it a confidence bucket: 50–60%, 60–70%, 70–80%, or 80%+. Record it in a separate note or journal column. Do not adjust the bucket after the outcome is revealed. Complete a minimum of 20 decisions in a single session; 30 or more gives the buckets more to work with.

After the run, build the four-row table: for each bucket, count how many decisions fell there and how many resolved in the direction you predicted. Compute the hit rate per bucket as a simple ratio. Compare each observed hit rate to the bucket's midpoint (55%, 65%, 75%, 85%).

Identify the most miscalibrated bucket — the one where the gap between stated confidence and observed hit rate is largest. That bucket is the starting point for a process note: what type of setup were you labeling with that confidence range, and what was systematically wrong about the label? Were you labeling high-confidence on setups that had an obvious alternative explanation you were downweighting? Were you labeling low-confidence on setups that turned out to be cleaner than your stated uncertainty implied?

Repeat across three or four sessions before forming any strong conclusions. One session's calibration table is itself a small sample. The pattern that persists across multiple sessions — the bucket that is consistently miscalibrated — is the diagnostic signal worth acting on.
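One way to pick out the pattern that persists is to compare the per-bucket gaps from each session and keep only the buckets that miss in the same direction every time. This is a sketch under an assumed definition: "persistent" here means the gap has the same sign in every session, which is a simplification chosen for illustration. The function name and the sample gaps are hypothetical.

```python
def persistent_gaps(sessions):
    """sessions: list of dicts mapping bucket label -> gap (hit rate minus midpoint).

    Returns buckets whose gap has the same sign in every session, with the
    mean gap, ordered from largest to smallest absolute mean gap.
    """
    if not sessions:
        return {}
    # Only consider buckets that appear in every session.
    shared = set(sessions[0]).intersection(*sessions[1:])
    result = {}
    for bucket in shared:
        gaps = [s[bucket] for s in sessions]
        if all(g > 0 for g in gaps) or all(g < 0 for g in gaps):
            result[bucket] = sum(gaps) / len(gaps)
    return dict(sorted(result.items(), key=lambda kv: abs(kv[1]), reverse=True))


# Made-up gaps from three sessions.
sessions = [
    {"60-70%": 0.02, "70-80%": -0.25},
    {"60-70%": -0.03, "70-80%": -0.15},
    {"60-70%": 0.01, "70-80%": -0.20},
]
print(persistent_gaps(sessions))
```

In this made-up data the 60-70% bucket flips sign between sessions and drops out, while the 70-80% bucket is overconfident in all three.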

Confidence Calibration: Measuring Whether Your Stated Confidence Means Anything