Joseph Alise

A user filed a report on a system I operate: "this worked two days ago." The same input had triggered a crisis intervention on one run and sailed through with an ordinary answer on the next. Same input. Opposite outputs. Filed as a regression.

It wasn't a regression. It was worse.

The investigation almost lied to me twice

The first verdict came from a screenshot. The second came from a git log scoped to a single file — used to defend a claim about a change that touched four. Neither conclusion had been measured. Both were inference wearing a verdict's uniform, and both were mine.

The third pass did it the slow way: the exact input, run through every safety check at both code versions, the failure-handling path read line by line. That one held.

A safety verdict requires running the actual input through the actual code. Screenshots and scoped git logs are hypothesis generators, not measurement. I knew that. I still reached for the fast answer twice before doing the work.

What the slow pass found

The highest-risk input class had no deterministic coverage. Its only net was a probabilistic classifier — a coin flip on ambiguous phrasing, catching the same input on some runs and missing it on others.

And on any error, any non-OK response, any timeout past two seconds, the control returned "no finding" and let the input through.

The failure branch was commented:

// FAIL SAFE

It was the opposite. It failed open.

Nothing had changed in the code. The two contradictory outputs were the control working exactly as built: a coin flip in fair weather, absent entirely in foul. The green test suite proved the control was configured. It never proved the control was reachable when its dependency failed — because nobody had tested the failure path. Only the happy one.

The fix was old discipline, not new prompts

On a ship, every critical valve has a designed failure position — fail open or fail shut — and the watchstander knows which, because someone decided it on purpose and wrote it down. Nobody discovers a valve's failure position during the casualty.

That's the standard this control now meets:

- Deterministic checks fire first. The highest-risk input class moved onto pattern checks that run before any network call. The coin flip became a guarantee.
- The classifier was demoted to defense-in-depth. Still there, still useful — a second layer, never again the sole net.
- Failure polarity is a named decision. The failure branch is now a function whose comment explains the choice: fail open, fail closed, or fail to a safe default — with the rationale written down. A mislabeled branch that nobody read is not a decision. It's a gap wearing the uniform of one.

And then the test that actually matters: a suite that stubs the classifier out entirely and proves the protection still fires with the dependency dark. It passed this week. That green check is the only one I trust, because it's the only one that interrogated failure instead of configuration.

What transfers

1. A green test suite proves configuration, not behavior under failure. Test the path where the dependency dies. Happy-path coverage on a safety control is not safety coverage.
2. Measure, don't infer. On a safety verdict, run the actual input through the actual code, and scope the conclusion to the whole change — not the file you happened to read.
3. Treat a persistent, specific anomaly report as signal. The user who said "this worked two days ago" was wrong about the cause and right about everything that mattered.

The full case study — context, investigation, root cause, resolution — is published at Blue Jacket Consultancy: The Safeguard That Passed Every Test and Failed Open, with a downloadable PDF.

Steady on. ⚓

The canonical case study is published by Blue Jacket Consultancy at bluejacket.io and © 2026 Blue Jacket Businesses LLC, d/b/a Blue Jacket Consultancy; it may be shared in unmodified form with attribution, and not used for machine-learning training without written permission. This log entry is personal commentary around that artifact.

The investigation almost lied to me twice

What the slow pass found

The fix was old discipline, not new prompts

What transfers

Subscribe to the Operating Log