Executive summary
If you sign contracts with AI vendors or buy AI-powered software, you have been told that your vendors do "safety testing." Almost certainly, you have not been told what that means. You have not been told which specific behaviors were tested for, which were not, whether the comparisons your vendor cites favor themselves, or whether the version of the model you are actually paying for is the version that was tested.
This paper gives you the five questions to ask in your next AI vendor review. It uses iFixAi, an open-source AI misalignment diagnostic, as a concrete reference for what a defensible answer looks like. You do not need to run iFixAi or any other tool to use the framework. You need to ask the questions and require the answers in writing.
The five questions, in summary:
- What specifically did you test, by inspection name?
- Did you run cross-vendor comparisons, and if so, how did you disclose the bias of self-scoring?
- Can you produce the manifest and transcripts of every test run, on demand?
- What categories of misalignment were not tested, and why?
- When was the last full run, and against which version of the deployed model?
A vendor that can answer all five with specifics has done the work. A vendor that cannot has signed you up to vouch for them without evidence.
Why this matters now
Three things are converging.
Regulatory exposure is no longer theoretical. The EU AI Act began phased enforcement in 2025. Several US states have enacted or proposed AI-disclosure requirements. The FTC and the UK's CMA have both opened formal inquiries into AI vendor practices. When the question arrives, "we trusted the vendor's safety claims" is not an answer.
Customer trust is being repriced. B2B buyers are increasingly inserting AI-disclosure and AI-testing clauses into procurement language. Your own customers will start asking you the same questions about the AI vendors in your stack. The chain of "we trust them, they trust them" terminates at someone with the receipts. You want to be that someone, or several links higher.
Brand exposure is asymmetric. One incident in which your AI-powered product behaves in a way a journalist can characterize as misaligned, biased, or unsafe, traced back to a vendor who ran "safety testing" that turns out to mean nothing, becomes your story. The vendor moves on. You inherit the news cycle.
The cost of doing this evaluation up front is five extra questions in a procurement call. The cost of not doing it shows up later, larger, and on your balance sheet.
The five questions
1. What specifically did you test, by inspection name?
The single most useful question. A vendor that has done real testing can name the inspections. A vendor that has not will respond with categories ("bias," "hallucination," "harmful content") and no further detail.
iFixAi, as the reference, runs 32 named inspections across five named categories. The names are public. A vendor running it can tell you exactly which 32 inspections were executed against their model. A vendor running their own internal test regime should be able to do the same. If the answer is a category list with no inspection names, the work was not done at the level a regulator will require.
What to require: the inspection list, in writing, in the contract or in a side letter. Not "we test for safety." The names of the tests.
2. Did you run cross-vendor comparisons, and how did you disclose the bias?
Most AI vendors will, at some point, tell you their model "outperforms" or "is safer than" a competitor's. Behind those numbers is almost always a comparison run with the vendor's own evaluation infrastructure, using the vendor's own provider credentials, scored against the vendor's own benchmarks. The methodology is structurally biased toward whichever vendor commissioned the comparison.
A vendor with intellectual honesty discloses this. iFixAi, the reference, refuses to produce cross-vendor comparative scores at all if you supply credentials for only one provider. The manifest itself disclaims any such result. That is what acknowledging the bias looks like.
What to require: any comparative claim the vendor makes about safety must be accompanied by a written disclosure of the credential setup, the evaluation infrastructure, and the funding source for the comparison. If your vendor cannot produce that disclosure, treat their comparative claims as marketing, not evidence.
3. Can you produce the manifest and transcripts of every test run, on demand?
This is the audit-trail question, and it is the one that separates real testing from a slide deck.
A real safety test regime produces artifacts: a manifest of what was tested, transcripts of the inputs and outputs of every test run, and a timestamp. iFixAi writes all of this to a runs/<run_id>/ directory by default. A vendor running it should be able to give you a run identifier and produce the directory contents on request, redacted as appropriate.
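For a team that wants to check a delivered run directory mechanically rather than by eye, a minimal sketch follows. Only the runs/<run_id>/ location comes from this paper. The file name manifest.json, the transcripts subdirectory, and the manifest field names are assumptions made for illustration, not iFixAi's documented output format, so adjust them to whatever your vendor actually delivers.

```python
import json
from pathlib import Path

# ASSUMPTION: these field names are hypothetical, chosen for illustration.
# They are not taken from iFixAi's documentation.
REQUIRED_MANIFEST_FIELDS = ("run_id", "timestamp", "inspections")


def missing_artifacts(run_dir: Path) -> list[str]:
    """Return a human-readable list of gaps in a delivered run directory."""
    gaps = []

    # ASSUMPTION: the manifest is a JSON object stored as manifest.json.
    manifest_path = run_dir / "manifest.json"
    if not manifest_path.is_file():
        gaps.append("manifest.json is missing")
    else:
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
        for field in REQUIRED_MANIFEST_FIELDS:
            if not manifest.get(field):
                gaps.append(f"manifest field '{field}' is missing or empty")

    # ASSUMPTION: transcripts live in a transcripts/ subdirectory.
    transcripts_dir = run_dir / "transcripts"
    if not transcripts_dir.is_dir() or not any(transcripts_dir.iterdir()):
        gaps.append("no transcripts were delivered")

    return gaps
```

An empty list means the three artifact types named above (manifest, transcripts, timestamp) are present. It says nothing about whether their contents are credible, which still needs a human reviewer.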
A vendor that has been doing this for a while will have run identifiers going back months. A vendor that has not will explain why they cannot produce the artifacts, usually citing internal policy or a tool that "does not export."
What to require: contractual obligation to retain manifests and transcripts for the duration of the contract plus a defined window (24 months is a defensible standard), and a defined procedure to surface specific run output on request within a defined timeframe (10 business days is generous).
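The two windows in that requirement reduce to date arithmetic that a contract-management script can track. The sketch below is illustrative only: the function names are hypothetical, "business day" is assumed to mean Monday through Friday with public holidays ignored, and your counsel's definition may differ.

```python
import calendar
from datetime import date, timedelta


def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping to the last day of the target month."""
    total = d.month - 1 + months
    year, month = d.year + total // 12, total % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def add_business_days(start: date, days: int) -> date:
    """Advance by weekdays only. ASSUMPTION: public holidays are ignored."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


def retention_end(contract_end: date, extra_months: int = 24) -> date:
    """Last date on which the vendor must still hold manifests and transcripts."""
    return add_months(contract_end, extra_months)


def response_deadline(request_date: date, business_days: int = 10) -> date:
    """Date by which requested run output is due."""
    return add_business_days(request_date, business_days)
```

For a contract ending 30 June 2026, retention_end gives 30 June 2028. A request sent on Monday 3 March 2025 is due by Monday 17 March 2025.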
4. What categories of misalignment were not tested, and why?
The honesty test. Any real safety regime has scope. iFixAi tests five categories. It does not, for example, audit the training data pipeline. It says so plainly.
A vendor that claims to have tested everything has either tested nothing rigorously or is misrepresenting the scope of their testing. A vendor that can name the categories they did not test and explain why is doing the work.
What to require: a statement of out-of-scope categories, in writing. This is not a confession. It is a feature. Every defensible safety regime in any industry, financial controls included, has explicitly out-of-scope items. The ones that claim to cover everything are the ones to worry about.
5. When was the last full run, and against which version of the deployed model?
Models change. Prompt regimes change. Provider-side updates that change behavior land with no SLA governing them. A safety test run in February against version 1.2 of the model says nothing about the version 1.5 the vendor is shipping in November.
iFixAi's manifests record the model version and the run timestamp. A vendor that has been running it should be able to tell you exactly which version of the deployed model the most recent full run was against, and when.
What to require: a contractual obligation to re-run the full diagnostic suite on every major model version change, and to disclose the run identifier to you within a defined window after the change. Without this, a vendor can claim to have done safety testing on the version of the model they shipped in 2024 and continue invoking that claim while shipping a different model in 2026.
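The re-run obligation can be checked with a simple comparison between the version the vendor is shipping and the versions named in the run records they have disclosed. The sketch below is one way to express that check. RunRecord and coverage_status are hypothetical names, the 30-day default is an assumed disclosure window, and versions are compared as exact strings, which is a simplification.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class RunRecord:
    """One disclosed test run. ASSUMPTION: fields mirror what a manifest records."""
    run_id: str
    model_version: str
    run_date: date


def coverage_status(
    deployed_version: str,
    deployed_since: date,
    runs: list[RunRecord],
    disclosure_window_days: int = 30,
    today: Optional[date] = None,
) -> str:
    """Classify whether the deployed model version is covered by a disclosed run.

    Returns "covered" if any run names the deployed version, "pending" if no
    run does but the disclosure window is still open, and "overdue" otherwise.
    """
    today = today or date.today()
    if any(run.model_version == deployed_version for run in runs):
        return "covered"
    days_deployed = (today - deployed_since).days
    return "pending" if days_deployed <= disclosure_window_days else "overdue"
```

With only a February run against version 1.2 on file, a vendor that began shipping version 1.5 on 1 November is "pending" through 1 December and "overdue" after that, until a run against 1.5 is disclosed.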
What good looks like
The reference implementation used throughout this paper is iFixAi, an open-source command-line tool that:
- Runs 32 named inspections across five named categories.
- Writes manifests and transcripts to a versioned directory per run.
- Refuses to produce cross-vendor comparative scores without paired provider credentials, and disclaims them explicitly when they are produced.
- Names what it does not test and why.
- Records the model version and timestamp in every run.
iFixAi is not the only acceptable answer. There are commercial diagnostic regimes that meet the same bar. The point is the bar, not the tool. A vendor whose safety regime can produce equivalent artifacts is meeting the standard. A vendor whose safety regime cannot produce them is not, regardless of what their marketing claims.
You do not need to run any of these tools yourself. You need to require that your vendors produce the equivalent artifacts and that the contract reflects the obligation.
Red flags
Each of the following is a warning sign that a vendor's safety claims are marketing, not evidence:
- Category-only descriptions of testing ("we test for bias and hallucination") with no inspection names.
- Comparative claims with no disclosure of credential setup or funding.
- Inability to produce a manifest or transcripts for a recent run within a reasonable window.
- No statement of out-of-scope categories.
- No statement of which model version was last tested.
Any single one of these is a yellow flag. Two or more is a vendor whose claims you should not put weight on without independent verification.
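If you track vendor reviews in a spreadsheet or internal tool, the rule above is easy to encode. The sketch below is illustrative: the flag keys and the assess function are hypothetical names, and the three outcome labels paraphrase this section's wording.

```python
# Keys are hypothetical identifiers for the five red flags listed in this section.
RED_FLAGS = {
    "category_only_descriptions": "Testing described by category, no inspection names",
    "undisclosed_comparisons": "Comparative claims without credential or funding disclosure",
    "no_artifacts": "Cannot produce a manifest or transcripts for a recent run",
    "no_out_of_scope_statement": "No statement of out-of-scope categories",
    "no_tested_version": "No statement of which model version was last tested",
}


def assess(observed: set[str]) -> str:
    """Apply the rule: one flag is a yellow flag, two or more need verification."""
    unknown = observed - set(RED_FLAGS)
    if unknown:
        raise ValueError(f"unrecognized red flags: {sorted(unknown)}")
    if not observed:
        return "no red flags observed"
    if len(observed) == 1:
        return "yellow flag"
    return "require independent verification"
```

The function rejects unrecognized keys on purpose, so a typo in a review record cannot silently lower a vendor's flag count.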
What to put in the contract
The legal language varies by jurisdiction and counsel, but the substance is consistent. In any AI vendor contract:
- The vendor's specific testing regime is referenced by name, ideally including the inspection list.
- The vendor commits to retain manifests and transcripts for the contract term plus 24 months.
- The vendor commits to produce specific run output on request within 10 business days.
- The vendor commits to re-run the full diagnostic suite on every major model version change and disclose the run identifier within 30 days.
- The vendor states out-of-scope categories.
- Any comparative safety claim the vendor makes externally is accompanied by a methodology disclosure.
These are not extraordinary asks. They are the equivalent of any other vendor compliance obligation in any other category of enterprise software. AI vendors are starting to expect them. The ones not yet expecting them are the ones to be most rigorous with.
Closing
This framework is not about turning your team into an AI safety engineering function. You do not need to run iFixAi. You do not need to read 32 inspection definitions. You need to ask five questions and require the answers in writing.
The work this paper describes adds perhaps an hour to a vendor evaluation. The work it prevents (an incident, a regulatory finding, a story you are quoted in) takes a quarter of focus to recover from. The math is straightforward.
For the engineering side of this conversation, the companion blog post reviews iFixAi specifically.
