Every number a price feed returns looks like a fact. It arrived on time, it carries a timestamp, and it sits in a clean column alongside thousands of others. That appearance of precision is the problem. A number can be technically present in a feed and still be wrong — wrong because an automated system misfired, wrong because a corporate action was never applied, wrong because the records have a silent gap, or wrong because the dataset quietly omitted the entities that failed. None of those errors announce themselves.

This article teaches a four-step sanity check you should run on any price series before you compute anything from it. By the end you will be able to identify the four most common classes of data-quality anomaly — bad ticks, unadjusted corporate actions, data gaps, and survivorship bias — and you will have a routine to catch them before they corrupt a calculation.

How this differs from the Data Audit Trail. The Data Audit Trail article addresses provenance: recording where your own inputs came from, when you captured them, and what method produced them. That is a discipline about your internal recordkeeping. This article addresses a prior question: whether the third-party numbers you received from a data vendor or price feed are technically valid in the first place. Provenance tracking does not help you if the underlying numbers were already corrupted before you wrote them down. These two disciplines are complementary and sequential — check the data first, then document it.

Anomaly 1: Bad Ticks

A bad tick is a single recorded price that is physically implausible given its neighbors. The canonical form is a spike: a series reading ..., 48, 50, 49, 312, 51, 50, ... where one value is five or six times larger than everything around it. The 312 is not a real price — it is almost certainly a feed error, a keystroke error upstream, or a transmission glitch that placed a decimal point in the wrong position.

Bad ticks are dangerous because they pass straight through most automated pipelines. They are just numbers; the pipeline does not know they are impossible. If you feed that series into a calculation (a range, a volatility measure, a moving average, a percentage-change signal), the outlier distorts every result. A volatility measure that includes the spike will report an implausibly large number. A percentage-change calculation on the series above will treat the move from 49 to 312 as a 537% gain, followed by an 84% collapse back to 51. Neither move happened. Both distortions propagate into anything built on that series.
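To make the distortion concrete, here is a minimal sketch that computes simple percentage changes over the example series quoted above. The function name `pct_changes` is a hypothetical helper introduced for illustration, not part of any vendor toolkit.

```python
def pct_changes(prices):
    """Return simple percentage changes between consecutive prices."""
    return [(b - a) / a * 100.0 for a, b in zip(prices, prices[1:])]

# The example series from the text, with the suspect 312 left in place.
series = [48, 50, 49, 312, 51, 50]
changes = pct_changes(series)

# Print each step so the move into the spike and the move back out stand out.
for prev, cur, chg in zip(series, series[1:], changes):
    print(f"{prev:>4} -> {cur:<4} {chg:+8.1f}%")
```

The two steps that touch the spike dwarf every other change in the list, which is why any statistic built on these returns inherits the error.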

The check: scan for values that are an extreme multiple of their immediate neighbors. The exact threshold depends on the instrument and the timeframe, but a value more than three to five times the level of its neighbors is a candidate for flagging rather than feeding. Illustrative hypothetical: in a series showing ..., 99, 101, 100, 500, 101, 100, ... the 500 stands out immediately as a likely bad tick, not a price event. (Numbers invented for illustration.)
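One way to automate that scan is sketched below. It compares each value to the median of its nearest neighbors and flags anything beyond a chosen multiple. The function name `flag_bad_ticks`, the window of three neighbors per side, and the default ratio of 3.0 are all assumptions made for illustration; a suitable threshold depends on the instrument and timeframe.

```python
from statistics import median

def flag_bad_ticks(prices, window=3, ratio=3.0):
    """Return indices of prices that are more than `ratio` times, or less than
    1/ratio of, the median of up to `window` neighbors on each side."""
    flagged = []
    for i, price in enumerate(prices):
        # Neighbors exclude the value being tested.
        neighbors = prices[max(0, i - window):i] + prices[i + 1:i + 1 + window]
        if not neighbors:
            continue
        reference = median(neighbors)
        if reference > 0 and (price > ratio * reference or price < reference / ratio):
            flagged.append(i)
    return flagged

# Invented series from the text; the 500 sits at index 3.
series = [99, 101, 100, 500, 101, 100]
print(flag_bad_ticks(series))
```

The median is used rather than the mean so that the spike itself does not drag the reference level upward when it appears among another point's neighbors. The function only flags; deciding what to do with a flagged value remains a judgment call.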

Anomaly 2: Unadjusted Corporate Actions

Stock splits and dividend distributions create step-changes in raw price series that look like large price moves but represent no economic change whatsoever. In a 2-for-1 split, the price halves on the ex-date because shareholders receive twice as many shares — their total holding value is unchanged. In a raw, unadjusted series, this appears as a 50% single-day decline. A reader who does not check the corporate action calendar will conclude the stock lost half its value. It did not lose anything.

Data vendors such as CRSP address this by computing a cumulative adjustment factor and retroactively applying it to all historical prices before the ex-distribution date. The result is an adjusted-close series, in which the price history is scaled so that the ex-date step-change disappears and all returns reflect genuine economic performance. An adjusted series is smooth across the split; an unadjusted series has a hard step at the exact ex-date.
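The idea of back-adjustment can be sketched for a single split as follows. This is a simplified illustration, not any vendor's actual method: it assumes one 2-for-1 split, assumes the ex-date price already reflects the split so that only earlier prices are scaled, and uses invented prices. The name `back_adjust` is hypothetical.

```python
def back_adjust(prices, split_index, ratio):
    """Scale every price before `split_index` by 1/ratio.

    `split_index` is the position of the first post-split price (the ex-date),
    and `ratio` is new shares per old share (2.0 for a 2-for-1 split).
    """
    factor = 1.0 / ratio
    return [p * factor if i < split_index else p for i, p in enumerate(prices)]

# Invented raw closes; the 2-for-1 split takes effect at index 3.
raw = [100.0, 102.0, 101.0, 50.5, 51.0]
adjusted = back_adjust(raw, split_index=3, ratio=2.0)

raw_return = raw[3] / raw[2] - 1.0                  # looks like a 50% loss
adjusted_return = adjusted[3] / adjusted[2] - 1.0   # no economic change

print(adjusted)
print(f"raw: {raw_return:+.1%}  adjusted: {adjusted_return:+.1%}")
```

With several corporate actions, the per-event factors would be multiplied together into a cumulative factor; dividend adjustments follow the same pattern with a factor closer to 1.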

The check: when you see a single-day drop (or a rise, in the case of a reverse split) of 25%, 33%, 50%, or another simple fraction, look up whether an ex-date fell on that day before treating it as a real move. The size of the drop matters: forward splits produce drops that correspond to the share ratio (2-for-1 = 50%; 3-for-1 = 67%; 3-for-2 = 33%). Dividend-related drops are usually smaller but can be meaningful for high-yield instruments. If you are building any return calculation over a multi-year horizon, confirm whether the series you are using is adjusted-close or raw, and understand the implication of that choice for every return figure you derive. The choice of adjustment method is itself a variable that changes results: the same underlying history produces different return numbers depending on how corporate actions were treated.
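A rough screen for split-shaped moves might look like the sketch below. The list of ratios, the 1% tolerance, and the name `split_candidates` are assumptions chosen for illustration. A match is only a prompt to consult a corporate action calendar; it does not confirm that a split occurred.

```python
# Hypothetical lookup of common forward-split ratios as (new shares, old shares).
COMMON_SPLITS = [(2, 1), (3, 1), (3, 2), (4, 3), (4, 1), (5, 1)]

def split_candidates(prev_close, close, tolerance=0.01):
    """Return labels of split ratios whose implied price multiple (old/new)
    is within `tolerance` of the observed close-to-close multiple."""
    observed = close / prev_close
    matches = []
    for new, old in COMMON_SPLITS:
        implied = old / new  # e.g. 0.5 for a 2-for-1 split
        if abs(observed - implied) <= tolerance:
            matches.append(f"{new}-for-{old}")
    return matches

# Invented closes for illustration.
print(split_candidates(100.0, 50.2))  # near a halving
print(split_candidates(90.0, 60.0))   # near a one-third drop
print(split_candidates(100.0, 97.0))  # an ordinary move
```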

Anomaly 3: Data Gaps

A data gap is a period for which records simply do not exist in the feed. The cause may be a market holiday, a venue halt, a vendor outage, or a retrieval error that was never noticed. Gaps are particularly easy to miss because silence is invisible: a chart skips seamlessly from one side of the gap to the other, and the visual continuity suggests there is nothing missing.

Gaps distort any calculation that assumes continuity. A moving average that passes over a five-day gap is computing over a different set of observations than you intended. A return series that jumps from day T to day T+6 without acknowledging the missing days produces a single return that conflates six sessions of price change. Volatility measures understate actual volatility when gaps are filled with flat values, and overstate it when the accumulated multi-day move across the gap is treated as a single-session return.

The check: count the records in your series and compare to the expected count for your date range and asset class — accounting for known market holidays. A raw equity series for a US stock from January 2 to December 31 of any given year should have roughly 252 trading-day records. A meaningful shortfall is a gap signal. A more direct method: sort the series by date and compute the spacing between consecutive timestamps. Any spacing more than one trading day (excluding holidays) is a gap. Flag it before computing anything that depends on continuity. See also the Source Hygiene framework for evaluating whether a given vendor's gap rate is acceptable for your purpose.
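The timestamp-spacing method can be sketched with the standard library alone. The function name `missing_sessions` is hypothetical, the record dates are invented, and the holiday set is an input you would supply from an exchange calendar. The sketch treats every non-holiday weekday as an expected session, which is an assumption that fits daily equity data but not every asset class.

```python
from datetime import date, timedelta

def missing_sessions(dates, holidays=frozenset()):
    """Return (previous_record, next_record, missing_days) for each pair of
    consecutive records with expected sessions missing between them."""
    gaps = []
    ordered = sorted(dates)
    for prev, nxt in zip(ordered, ordered[1:]):
        missing = []
        day = prev + timedelta(days=1)
        while day < nxt:
            # Expected session: a weekday that is not a listed holiday.
            if day.weekday() < 5 and day not in holidays:
                missing.append(day)
            day += timedelta(days=1)
        if missing:
            gaps.append((prev, nxt, missing))
    return gaps

# Invented record dates; 9-11 January are absent from the feed.
records = [date(2024, 1, 2), date(2024, 1, 3), date(2024, 1, 4), date(2024, 1, 5),
           date(2024, 1, 8), date(2024, 1, 12), date(2024, 1, 16)]
holidays = {date(2024, 1, 15)}  # Martin Luther King Jr. Day, supplied by the caller

for prev, nxt, missing in missing_sessions(records, holidays):
    print(f"gap between {prev} and {nxt}: {len(missing)} missing session(s)")
```

Weekends and the listed holiday are not reported, so the only output is the three-session hole in the middle of the series.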

Anomaly 4: Survivorship Bias

Survivorship bias enters when a dataset silently excludes entities that failed, merged, or were delisted before the data was assembled. Because failing entities tend to have worse records than entities that survived, a dataset of survivors systematically overstates the performance characteristics of the population it purports to represent.

The foundational academic documentation of this problem in financial data comes from Elton, Gruber, and Blake (1996), "Survivorship Bias and Mutual Fund Performance," published in The Review of Financial Studies, 9(4), pp. 1097–1120. Their finding, applied to mutual fund databases: databases that omit funds that disappeared — through closure, merger, or poor performance — overstate the measured performance of the fund universe, because the disappearing funds disproportionately had poor records. The qualitative direction is clear and documented: survival-filtered datasets produce upward-biased performance estimates. The precise magnitude depends on the dataset, the time period, and the construction method, so the quantitative effect should be verified against the specific data source you are using rather than assumed from any single published figure.

In an equity context, survivorship bias takes the form of studying only the stocks that are currently in a major index, or that are currently listed. Companies that went bankrupt, were acquired, or were delisted during your study period are absent from that dataset. If you use that dataset to study how stocks behave under certain conditions, your conclusions are conclusions about how stocks that survived those conditions behaved — which is a fundamentally different question than how stocks in general behaved.

The check: ask, explicitly, whether your data source includes delisted or failed entities. Most retail and some institutional data sources do not, by default. If they do not, any pattern you identify in that data is conditioned on survival, and performance figures are likely to be biased upward relative to the true population average. This is not an error you can correct after the fact; it is a dataset-construction choice that needs to be known before you interpret anything. For more on how selection effects distort statistical conclusions, see Statistics Traps in Media.
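A toy calculation shows the direction of the effect. Every ticker and return below is invented for illustration, and the size of the difference says nothing about the magnitude in any real dataset.

```python
from statistics import mean

# Invented one-year returns (%) and a flag for whether the entity is still listed.
universe = {
    "AAA": (12.0, True),
    "BBB": (8.0, True),
    "CCC": (5.0, True),
    "DDD": (-40.0, False),   # delisted during the window
    "EEE": (-100.0, False),  # went bankrupt during the window
}

survivors_only = mean(r for r, listed in universe.values() if listed)
full_population = mean(r for r, _ in universe.values())

print(f"survivors only:  {survivors_only:+.2f}%")
print(f"full population: {full_population:+.2f}%")
```

A dataset that contains only the three listed names reports a positive average, while the population that actually existed over the window had a negative one.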

The Four-Step Sanity-Check Routine

Run these four checks in order before you compute anything from a new price series. They take less than five minutes on a series you can plot.

1. Eyeball the extremes. Plot or sort the series and look at the minimum and maximum values. Do they make physical sense for this instrument? A price that is five or ten times any neighboring value is a bad-tick candidate. Flag it; do not delete it silently. Note that you flagged it and why, so that the flag itself becomes part of your Data Audit Trail.
2. Check for adjustment. Identify whether the series is adjusted-close or raw. If you do not know, look for single-day step-changes of exact fractional magnitudes (50%, 33%, 25%) and cross-reference corporate action calendars. If the series is unadjusted and you are computing multi-period returns, find an adjusted-close series or apply adjustments yourself before proceeding.
3. Count for gaps. Compare the record count to the expected count for your date range and asset class. Sort by date and check timestamp spacing. Any unexplained gap longer than one trading day should be noted and its impact on your calculations assessed before you proceed.
4. Ask what is missing. Determine whether the dataset includes entities that failed or were delisted during your study window. If it does not, all performance figures are survival-conditioned. Note this limitation explicitly when interpreting any result, and consider whether the conclusion you are drawing depends on the excluded entities behaving differently from the survivors, because, on average, they did.

Clean-looking data can pass all four checks and still contain errors. Adjustment methodology varies between vendors; a series that one vendor labels "adjusted-close" may treat dividends differently than another vendor's series with the same label. The sanity-check routine reduces risk; it does not eliminate it. Calibrate your confidence in the data in proportion to how thoroughly you have checked it. For how data-quality choices interact with backtest results specifically, see the Backtest Honesty article.

The Knight Capital Illustration

On August 1, 2012, Knight Capital Americas experienced a catastrophic systems failure when a software deployment activated a dormant code component in its automated trading system (SMARS). Over approximately 45 minutes, the system routed millions of unintended orders into US equity markets before the problem was identified and stopped. Knight lost over $460 million in that period. The failure was documented in detail by the SEC in Release No. 70694, issued October 16, 2013.

The Knight Capital incident belongs in a discussion of data anomalies for a specific reason that requires precise framing: it is not an example of a corrupted incoming price feed. Knight was receiving accurate market data. The failure was an internal systems and controls failure that caused its own automated pipeline to produce erroneous orders at scale. The output of Knight's system was garbage; the input was not.

What this illustrates for data hygiene is the more general principle: automated pipelines — whether they are executing orders or computing analytics — can generate outputs that look syntactically valid but are semantically wrong. The same principle applies to any automated data-collection or data-transformation pipeline you depend on. A pipeline that ran correctly yesterday and is running today does not guarantee that every output it produced is correct. Code bugs, version conflicts, and configuration errors can all cause a clean-looking output to be wrong. The sanity check described above is your manual verification layer that sits outside the automated pipeline — the layer that catches what the pipeline does not know to catch about itself.

Risk Note: Clean-Looking Data Can Still Be Wrong

Running the four-step check significantly reduces the chance that you are computing from corrupted data. It does not make your data provably clean. Several residual risks are worth naming explicitly.

First, adjustment method choices are not standardized. Two vendors can apply different dividend adjustment approaches to the same security and produce materially different historical return figures. Neither is necessarily wrong — they are applying different methodological choices. You need to know which choice was made and whether it is appropriate for your specific application.

Second, bad ticks can pass the extremes check if they are only moderately anomalous, or if the surrounding data is also noisy. A value that is 30% higher than its neighbors may or may not be a tick error — context matters, and context requires judgment.

Third, survivorship bias is difficult to quantify retrospectively. You can know it is present; estimating its magnitude for a specific dataset and time period requires data on what was excluded, which is often not available.

Fourth, data can be internally consistent but wrong at the level of what it represents. A series that accurately records closing prices for an illiquid instrument may still be a poor proxy for the price at which you could actually transact, because closing prices in illiquid markets can be stale or thinly-formed. Technical validity is a necessary condition, not a sufficient one.

Simulator Exercise

Abu Terminal's Speed Run mode presents price context from real historical eras. Before you engage with any Speed Run scenario, treat the era's price series as an unknown feed and mentally run the four-step check as a deliberate warm-up. Ask yourself: in this era, which corporate actions were common? (The dot-com era, for example, was dense with stock splits in technology names.) Are there any instruments in this era that are no longer traded — and if so, does the scenario include them or only the survivors? What would a data gap look like in the context of this era's price chart?

A more structured drill: in the Abu Arena, request an intentional anomaly round (if available in your training level) where a planted bad tick, a split step-change, or a gap has been inserted into a short series. Your task is not to trade — it is to identify which anomaly type you are looking at and state the mechanism by which it would corrupt a return calculation if left uncorrected. The value of this drill is not the correct answer; it is the habit of looking at data critically before computing with it. That habit does not develop from knowing the theory. It develops from running the check, even on a short invented series, until the four questions become automatic.

A reflection prompt after any Speed Run session: did the price context you were given feel fully trustworthy? If the era included index-level data, was the index survivorship-adjusted to include companies that failed during the period? If you are not sure, that uncertainty is precisely the right output from practicing data skepticism — it means the habit is forming.


Market Data Anomalies: A Four-Step Sanity Check Before You Compute