Every industry that has killed people at scale eventually learned the same lesson. The lesson is not simply that systems fail. It is that oversight fails when it shares a failure mode with the thing it is supposed to oversee.
On 1 June 2009, Air France Flight 447 disappeared over the Atlantic. Its three pitot tubes – the sensors that infer airspeed from pressure – iced over together because they shared the same design, the same mounting, and the same atmospheric exposure. Two hundred and twenty-eight people died in a mechanically intact aircraft whose instruments degraded along the same causal chain as the emergency they were supposed to diagnose.1 In 2023, a French criminal court acquitted both Air France and Airbus: the failures were real, but distributed across so many actors and decisions that no single party could be held to account under the available legal framework.2 The victims were failed twice – first by the event, then by a structure that diffused responsibility until accountability arrived blurred, delayed, and structurally weakened.
The operational response, in every mature domain that has faced this kind of failure, has been the same: define what the system is supposed to do, measure deviation from it with independent instruments, and force consequences when the assumptions stop holding. AI governance has not yet learned this.
This paper argues that AI governance should be rebuilt around calibrated, independent measurement infrastructure – not around trust in the systems' own self-descriptions or the good intentions of their operators. That means commissioned operating scopes, external monitoring with explicit uncertainty, mandatory governance transitions when assumptions break, and auditable evidence trails. The argument that follows explains why.
Redundancy protects only when parallel channels fail independently. When they share a vulnerability, adding more of them changes nothing. Air France Flight 447's three pitot tubes were variants of the same design, mounted in the same places, exposed to the same atmosphere. The Airbus A330 carried three precisely so that one failure would be absorbed by the others. The conditions that mattered defeated all three together.1 3
Reliability engineering calls this common-cause failure. IEC 61508 treats diversity – different implementations, different design lineages – as the core defence.4 After the crash, the European Union Aviation Safety Agency's airworthiness directive required mixed-manufacturer pitot configurations, institutionalising sensor diversity as a regulatory requirement.3
AI governance is building the same architecture of false redundancy. Proposals from developers and evaluators alike to monitor frontier AI systems using other models from similar training pipelines are sold as oversight. They may instead be the equivalent of three pitot tubes in one weather system. When the monitor shares weaknesses with the system being monitored, adding more monitors of the same kind provides no real protection. Dung and Mai applied common-cause failure analysis directly to AI and found that the methods most commonly used to steer AI behaviour – training on human preference ratings, training on other AI models' ratings, training weaker models to supervise stronger ones – share many correlated failure modes.5
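The arithmetic behind false redundancy can be sketched with a beta-factor style model, a common simplification in reliability engineering. The function name, the parameter values, and the model itself are illustrative assumptions, not figures from the Air France Flight 447 investigation or from any AI evaluation: each channel fails with probability of roughly `p`, and a fraction `beta` of that failure probability comes from a cause shared by every channel.

```python
def p_all_fail(p: float, beta: float, n: int) -> float:
    """Probability that all n redundant channels are failed at once.

    Assumed model (beta-factor style, hypothetical numbers):
    with probability beta * p a shared cause takes out every channel;
    separately, each channel fails on its own with probability (1 - beta) * p.
    """
    shared = beta * p
    independent_all = ((1.0 - beta) * p) ** n
    # All channels are down if the shared cause fires OR every channel
    # happens to fail independently.
    return 1.0 - (1.0 - shared) * (1.0 - independent_all)


if __name__ == "__main__":
    for n in (1, 3, 10):
        no_shared = p_all_fail(p=0.01, beta=0.0, n=n)
        some_shared = p_all_fail(p=0.01, beta=0.1, n=n)
        print(f"n={n:2d}  independent: {no_shared:.2e}  shared cause: {some_shared:.2e}")
```

With fully independent channels, going from one channel to three cuts the joint failure probability by roughly four orders of magnitude in this toy setting. With even a tenth of failures shared, three channels and ten channels both sit at about one in a thousand, because the shared term dominates. Adding monitors of the same kind moves only the term that was already negligible.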
That is why the lesson for AI governance is not merely technical. When the monitoring system shares weaknesses with the system it oversees, risk becomes harder to detect in the present – and accountability becomes harder to enforce afterward. What happened next aboard Air France Flight 447 shows exactly how the two compound.
The most dangerous failure in any governed system is the one that changes what the system can safely do without telling anyone. On Air France Flight 447, this is exactly what happened. The autopilot disconnected. The flight control software degraded from its normal protective mode – which would have physically prevented the crew from pitching the aircraft into a stall – to a reduced mode that removed those protections. The aircraft did not announce this. The range of conditions the crew could safely handle shrank at exactly the moment they needed it most, and nobody told them.1
AI governance produces the same failure when a model's performance drifts outside the conditions it was tested for and no mechanism exists to flag the change. The system continues operating. The governance status does not change. The assumptions that justified deployment are no longer valid, but nothing in the architecture registers this.
The architectural response is what this paper calls a validity state protocol: a governance mechanism that tracks whether the assumptions underlying deployment still hold and enforces explicit transitions between defined states, with mandatory actions at each transition. For example, the states might be Commissioning, where the system is being tested; Valid, where it is operating within known bounds; Suspect, where evidence suggests those bounds may have been breached; and Invalid, where they clearly have been. When assumptions break, the degradation is never silent.
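A validity state protocol of this kind can be sketched as a small state machine. The four state names come from the text; the transition table, the wording of the mandatory actions, and the class and method names are hypothetical placeholders that a real regime would have to define.

```python
from enum import Enum
from typing import Dict, List, Tuple


class ValidityState(Enum):
    COMMISSIONING = "commissioning"
    VALID = "valid"
    SUSPECT = "suspect"
    INVALID = "invalid"


S = ValidityState

# Hypothetical table: each permitted transition carries a mandatory action.
MANDATORY_ACTIONS: Dict[Tuple[ValidityState, ValidityState], str] = {
    (S.COMMISSIONING, S.VALID): "record commissioning evidence and operating scope",
    (S.VALID, S.SUSPECT): "notify operator and regulator; tighten monitoring",
    (S.SUSPECT, S.VALID): "document why the bounds are judged intact",
    (S.SUSPECT, S.INVALID): "suspend or restrict deployment",
    (S.VALID, S.INVALID): "suspend or restrict deployment",
    (S.INVALID, S.COMMISSIONING): "re-test before any redeployment",
}


class ValidityProtocol:
    """Tracks governance status and refuses silent or undefined transitions."""

    def __init__(self) -> None:
        self.state = S.COMMISSIONING
        self.log: List[Tuple[ValidityState, ValidityState, str, str]] = []

    def transition(self, new_state: ValidityState, evidence: str) -> str:
        key = (self.state, new_state)
        if key not in MANDATORY_ACTIONS:
            raise ValueError(f"transition {key[0].name} -> {key[1].name} is not permitted")
        action = MANDATORY_ACTIONS[key]
        # Every change of status leaves an auditable record of why it happened
        # and what had to follow from it.
        self.log.append((self.state, new_state, evidence, action))
        self.state = new_state
        return action
```

The design choice worth noting is that the only way to change state is through `transition`, which always returns an action and always writes a log entry. A status change without a recorded reason and a required consequence is not expressible.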
A monitoring system can invert under broken assumptions, so that the correct action triggers the alarm and the fatal action silences it. This is a perverse reversal – a failure mode in which the feedback signal does not merely degrade but actively rewards the wrong behaviour. On Air France Flight 447, the stall warning sounded – but the Airbus system's validity logic meant that when the wing's angle to the oncoming air became extreme enough, the computer classified the underlying data as unreliable and the warning stopped. When the crew briefly pushed the nose down – the correct recovery action – the data re-entered the valid range and the warning returned. The correct response reactivated the alarm. The fatal response silenced it.1 6
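The inversion can be reproduced in a few lines. This is a toy model, not the aircraft's actual logic: the function name and both thresholds are illustrative assumptions, and the validity check is assumed here to be keyed to measured airspeed. The point is only that an alarm which returns "off" for "no stall" and for "data rejected as invalid" alike cannot distinguish safety from blindness.

```python
def stall_warning(aoa_deg: float, measured_speed_kt: float,
                  aoa_limit_deg: float = 10.0,
                  min_valid_speed_kt: float = 60.0) -> bool:
    """Toy stall alarm with a validity gate (all thresholds are hypothetical)."""
    data_valid = measured_speed_kt >= min_valid_speed_kt
    if not data_valid:
        # Readings outside the validated range are discarded, so the alarm
        # goes quiet exactly when the situation is at its worst.
        return False
    return aoa_deg > aoa_limit_deg


if __name__ == "__main__":
    print("deep stall, nose held up:  ", stall_warning(aoa_deg=40.0, measured_speed_kt=45.0))
    print("nose pushed down (correct):", stall_warning(aoa_deg=30.0, measured_speed_kt=80.0))
```

In this sketch the deeper stall produces silence and the start of a recovery produces the alarm, which mirrors the reversal described above.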
This is not a case where something was gaming a metric – the well-known problem that any measure used as a target will be distorted.7 Nothing in the aircraft was strategically optimising. It is better understood as a monitor whose signal flips from informative to misleading because the conditions under which the monitor was reliable no longer hold. The measure stops tracking the thing it was supposed to track, and starts rewarding the wrong behaviour.
AI evaluation faces a deliberate version of the same perverse reversal. A system that strategically underperforms on capability evaluations – appearing less capable than it is – inverts the feedback signal in the same structural way. The system that games the test looks safest; the system that does not looks dangerous. Anthropic's sketches of what a safety case for highly capable AI systems would need to demonstrate make the problem recursive: the evaluations used to detect deception can themselves become targets of deception.8 Apollo Research's work shows current frontier models can already reason about deception and conceal it from evaluators.9
Measurement-based governance does not abolish this. It does three humbler things: it makes the relationship between what is measured and what matters explicit and inspectable; it makes the failure of that relationship itself a triggering event for governance action; and it responds to gaming by layering multiple independent monitoring approaches rather than giving up on measurement.
Diagnostic information that exists inside a system but never reaches the humans who need it is useless for governance. The BEA found exactly this on Air France Flight 447: the aircraft's systems had identified the origin of the failures but had not communicated that diagnosis in a form the crew could use. The information was there. It just never got out.1
AI systems produce the same failure when they hold internal information about uncertainty, error, or drift but surface only opaque scores or pass/fail labels. Regulators should require a small set of inspectable, technology-neutral obligations – what this paper calls regulatory join keys. There are five: what was logged (the record of the system's actions); how deviation was scored (the method used to detect unusual behaviour); what was demonstrated before deployment (the evidence that the system works within defined conditions); what happens when those conditions break (the rules governing transitions between governance states); and what evidence is retained for later audit. These obligations exist so that diagnostic information reaches the people who need it in a form they can act on. A regulator inspecting them does not need to understand the underlying statistical methods, only to verify that each obligation is met, documented, and current.
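As a rough illustration, the five join keys could be carried as a single record that a regulator checks for completeness without reading any of the statistics behind it. The field names and the example strings below are hypothetical; the paper specifies the five obligations, not a schema. The check covers only whether each obligation is documented, not whether the documentation is current.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class JoinKeyRecord:
    """One deployment's answers to the five obligations (hypothetical schema)."""
    action_log: Optional[str] = None              # what was logged
    deviation_scoring: Optional[str] = None       # how deviation was scored
    predeployment_evidence: Optional[str] = None  # what was demonstrated before deployment
    transition_rules: Optional[str] = None        # what happens when conditions break
    retained_evidence: Optional[str] = None       # what is kept for later audit


def missing_obligations(record: JoinKeyRecord) -> List[str]:
    """Return the names of obligations that are absent or left empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


if __name__ == "__main__":
    record = JoinKeyRecord(
        action_log="append-only decision log",
        deviation_scoring="documented scoring method, version 3",
        predeployment_evidence="commissioning report",
        transition_rules="validity state table",
    )
    print(missing_obligations(record))
```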
Many fatal failures are not single dramatic events but sequences of individually tolerable steps that collectively exit the safe regime. Air France Flight 447 remained mechanically intact and responsive until impact. The automatic trim system had steadily tilted the tail surface to thirteen degrees nose-up in about a minute, each increment small, the cumulative effect unrecoverable, and it stayed there for the rest of the descent. No single threshold check would have caught it.1
AI systems fail the same way. An individual interaction stays within tolerance while the trajectory as a whole drifts outside the conditions the system was tested for. The statistical tools to detect this kind of drift exist and are maturing rapidly. Methods from the conformal prediction literature provide formal guarantees about how often a monitoring system's claims will be wrong – guarantees that hold without assumptions about the shape of the underlying data, that degrade in quantifiable rather than catastrophic ways when conditions change, and that can be adapted for continuous post-deployment monitoring. Crucially, sequential versions of these methods accumulate evidence over time like a running tally, so that their conclusions remain valid at any point – not just at pre-planned checkpoints. That is how they catch slow drift that snapshot checks miss.10 11 12
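The "running tally" idea can be sketched with a conformal test martingale. This is one simple construction from that literature (a power martingale over smoothed conformal p-values), chosen here for brevity and not necessarily the method the cited works use. The function names, `epsilon`, and `alpha` are illustrative assumptions. If the incoming scores are exchangeable, meaning nothing has changed, the p-values behave like independent uniform draws, the tally is a non-negative martingale starting at 1, and the chance that it ever reaches `1 / alpha` is at most `alpha`, at any stopping point.

```python
import math
from typing import Iterable, Optional, Sequence


def smoothed_p_value(new_score: float, past_scores: Sequence[float], theta: float) -> float:
    """Conformal p-value of new_score against all earlier scores.

    Higher scores mean "more unusual". theta is a fresh uniform draw in [0, 1)
    used to break ties, which makes the p-value exactly uniform under
    exchangeability.
    """
    greater = sum(1 for s in past_scores if s > new_score)
    ties = sum(1 for s in past_scores if s == new_score) + 1  # the new score ties with itself
    return (greater + theta * ties) / (len(past_scores) + 1)


def first_alarm(p_values: Iterable[float], epsilon: float = 0.5,
                alpha: float = 0.01) -> Optional[int]:
    """Index at which accumulated evidence first reaches 1 / alpha, else None."""
    log_wealth = 0.0
    threshold = math.log(1.0 / alpha)
    for i, p in enumerate(p_values):
        p = max(p, 1e-12)  # guard against log(0)
        # Power-martingale bet: multiply the tally by epsilon * p ** (epsilon - 1).
        log_wealth += math.log(epsilon) + (epsilon - 1.0) * math.log(p)
        if log_wealth >= threshold:
            return i
    return None
```

A run of small p-values, each unremarkable alone, compounds until the threshold is crossed, which is the behaviour a per-observation cutoff lacks. One known weakness of this plain version is that the tally shrinks while nothing is changing, so it can be slow to recover; practical variants average over several values of `epsilon` or use other betting schemes.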
Consider what these methods actually measure. They do not measure "safety" or "alignment" directly – those are judgments about a system, not quantities a sensor can read. What is measured are stand-in quantities: scores that capture how unusual the system's behaviour is compared to its baseline, the size of the uncertainty ranges the methods produce, and the accumulated weight of evidence that something has changed. Each of these stand-ins comes with an explicit statement of how uncertain it is and how it connects to the governance concern it is intended to track. The mathematics does not deliver absolute truth. It delivers calibrated bounds on how wrong the conclusion is likely to be under stated assumptions. That is more than trust-based governance usually offers.13 14
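A minimal sketch of how such a calibrated bound is produced, using split conformal prediction with absolute residuals as the score. The function names and the numbers are hypothetical, and the guarantee rests on the stated assumption that calibration data and future data are exchangeable: under that assumption, the interval built this way contains the true value at least a fraction `1 - alpha` of the time, whatever the shape of the underlying distribution.

```python
import math
from typing import Sequence, Tuple


def conformal_radius(calibration_residuals: Sequence[float], alpha: float) -> float:
    """Half-width of a split-conformal interval.

    calibration_residuals are |observed - predicted| on held-out data that
    the model was not fitted on.
    """
    n = len(calibration_residuals)
    k = math.ceil((n + 1) * (1.0 - alpha))  # finite-sample corrected rank
    if k > n:
        # Too little calibration data to support this confidence level:
        # the honest answer is an unbounded interval.
        return math.inf
    return sorted(calibration_residuals)[k - 1]


def prediction_interval(prediction: float, radius: float) -> Tuple[float, float]:
    return (prediction - radius, prediction + radius)


if __name__ == "__main__":
    residuals = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    radius = conformal_radius(residuals, alpha=0.2)
    print(prediction_interval(2.0, radius))
```

The width of the interval is itself a stand-in quantity of the kind described above. When it grows, or when the method can only return an unbounded interval, the evidence no longer supports a confident claim, and that fact is visible rather than hidden inside a pass/fail label.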
When risk is known and measurable but no governance gate forces action at threshold, the risk gets deferred indefinitely – until it kills. The pitot tube vulnerability on Air France Flight 447 was not unknown. Thirty-two prior incidents had been recorded. Airbus knew. The European Union Aviation Safety Agency knew. Air France had begun replacing the probes two days before the crash. The fix cost $222,000 for the entire fleet. No governance gate forced action at threshold.1
Public-sector AI shows the same pattern. When the Department for Work and Pensions published its fairness assessment of the Universal Credit Advances model,15 it concluded that there were minimal concerns of discrimination while flagging age and nationality disparities for further attention, and noting that for reported illness, higher referral rates were accompanied by higher rates of correct referral. The Public Law Project, in written evidence to the Public Accounts Committee drawing on the same published data,16 raised concerns about the model's differential accuracy across groups and the disproportionate referral exposure faced by non-UK nationals. Different readers drawing different fairness conclusions from the same data is exactly the kind of disagreement that measurement infrastructure should surface rather than hide. The Algorithmic Transparency Recording Standard is now mandatory across central government,17 but transparency records document what AI is being used for; they do not function as live trip-wires that stop a system when its performance deteriorates. Only 19 per cent of respondents in the most recent Ada Lovelace Institute and Alan Turing Institute survey had heard of AI being used to assess welfare eligibility.18
The United Kingdom's Data (Use and Access) Act 2025 moves toward a more permissive framework for solely automated significant decisions. That makes measurement infrastructure more urgent, not less. Where monitoring affects individuals, the right governance approach is authorisation tied to a defined scope: consent where consent is meaningful, notice and challenge rights where consent is not the right legal basis, statutory authority where the state acts coercively – each tied to the specific conditions the system was tested for, not open-ended.19
When causation is distributed, accountability frameworks built for singular blame cannot deliver justice. The Air France Flight 447 acquittal illustrates this precisely: the judges acknowledged the failures but found that no single party's negligence met the legal threshold when the causation was spread across design, certification, operations, and training. Each failure was individually insufficient. Together they were fatal.2
The gap is structural, not incidental. A framework that asks "who was negligent?" cannot answer the question when the answer is "everyone partially, no one sufficiently." Measurement-based accountability asks a different set of questions. Was the system's scope of operation defined before deployment? Were the stand-in measures for risk calibrated under stated assumptions? Was the system monitored while running? Were the rules about what must happen when thresholds are crossed actually followed? Was the evidence preserved? Those questions can be answered regardless of whether any single actor was "negligent" in the traditional sense.
Trust-based governance rarely gives way because someone writes a good memo. It gives way because failure makes the old arrangement indefensible. The Piper Alpha oil platform explosion killed 167 workers in the North Sea in 1988; the United Kingdom moved to the Safety Case regime, requiring operators to identify hazards, quantify risks, and demonstrate that risk has been reduced to acceptable levels, and empirical work over seventeen years found a sustained decline in dangerous incidents.20 21 The 1956 Grand Canyon mid-air collision killed 128 people; aviation moved from visual separation to instrumented surveillance and later to programmes that monitor routine flight operations for early signs of degradation – a transition that depended on just culture and de-identification, because monitoring data used only for punishment will be resisted, manipulated, or rendered useless.22 The United States Nuclear Regulatory Commission's Reactor Oversight Process replaced discretionary judgment with structured monitoring, escalating from routine to shutdown-level intervention through an explicit decision framework.23 24 Post-market drug safety monitoring exists because clinical trials before approval are structurally incapable of revealing the full harm profile of a drug; Hazell and Shakir found a median underreporting rate of 94 per cent, and the system still generates enough signal for label changes, restrictions, and withdrawals.25
The pattern is the same across all four domains. Trust-based governance persists until failure reveals its structural limits. Then monitoring, calibration, and explicit intervention thresholds become core infrastructure. Regulatory economics explains why: oversight is made robust not by placing the smartest observer inside the same system, but by separating the monitoring process from the thing being monitored – the same principle that sensor diversity mandates enforce.26 27 28 29
Measurement infrastructure designed only by technical operators optimises for what it can already see. People affected by automated systems notice harm before dashboards do, and notice different harms. Welfare advisers, legal aid lawyers, disability advocates, and frontline workers observe failure modes that never appear in internal monitoring data. Robust governance needs shared reference scenarios, public challenge routes, and distributed scrutiny that can pressure the measurement system itself. Mature measurement disciplines already work this way – cross-laboratory comparison in clinical chemistry, distributed reporting networks in drug safety, reporting cultures in aviation.30
Not all important harms are measurable. Bias, dignity loss, chilling effects, and democratic distortion do not collapse into one sensor reading. Measurement-based governance separates the evidence question from the values question so each can be addressed with the right tools – and community intelligence is how the values question stays connected to lived experience.
The hardest honest objection: advanced AI systems are strategic, not passive. The statistical guarantees described in this paper degrade gracefully when conditions change – but they are not designed to withstand a system actively trying to fool the monitor. The sober response is diversity, separation, multiple layers, and no pretence that one detector solves the whole problem. External monitoring raises the cost of evasion. It does not make evasion impossible. Claiming otherwise would be the kind of false precision this paper argues against.
A final objection: institutional capacity. Regulation that specifies outcomes rather than methods works only when regulators can measure those outcomes at acceptable cost.31 Until measurement capability matures, process-based obligations – internal risk planning, documentation, change control, reporting duties – may be necessary scaffolding. But scaffolding is not the building. For capability to mature well, it must be built in the open – versioned, reproducible methods rather than proprietary silos – because closed measurement systems are easier to capture, harder to challenge, and more likely to replicate the blind spots of the systems they inspect.
Trust remains, but demoted. It attaches to the institutions that calibrate, inspect, challenge, and stop systems – not to the system's self-description.
Air France Flight 447's instruments shared a failure mode with the conditions they were meant to detect. The flight software degraded silently. The stall warning punished the correct response. The diagnostic information existed but was never surfaced. The cumulative drift was individually tolerable and collectively fatal. The risk was known and no mandatory rule forced action. The legal system could not assign accountability because the failure was distributed.1 2 Every one of these is an architectural failure that AI governance is reproducing.
The goal of AI governance is not to certify that a system is "safe." It is to be able to answer a set of concrete questions. Was the deployment properly commissioned? Were the stand-in measures for risk calibrated under stated assumptions? Did live evidence remain within bounds? When assumptions failed, did the governance status change explicitly, with mandatory consequences? Could affected people challenge what the instruments missed?
If those questions are answered well, safety returns to its proper place: not as a marketing term or a dashboard badge, but as the retrospective social judgment that a risk-governing regime functioned acceptably over time, under scrutiny, in the world.
Safety is retrospective. Governance happens now.