AI Features Need Contracts

The fastest way to impress people with an AI feature is to show a demo. Point a model at an image or a block of messy text, extract a few fields, and watch the UI fill itself in.

The fastest way to disappoint them is to ship that exact demo straight into production.

I like AI features. I think good vision and language models can remove a lot of repetitive work and make products feel dramatically more helpful. But the thing that keeps making or breaking these features is not usually the model itself. It is the boundary around it.

My default now is simple: if model output will update records, trigger workflows, or feed another system, the feature needs a contract.

The model is good at compression, not at being your database

A model can look at noisy input and produce something useful from it very quickly. That is real leverage. What it is not good at is serving as an untyped source of truth just because the first ten examples looked clean.

Free-form prose is fine when the output is meant for a human to read and judge. A chat response. A summary. A draft description. The moment the output starts mutating state, free-form text becomes a bad interface. Now one invented enum value or one missing field is no longer a harmless quirk. It is a broken workflow, a dirty record, or an expensive cleanup.

That is why I like schema-constrained output so much. Both OpenAI and Google's Gemini API now expose structured output modes built around JSON Schema. That matters less as a vendor feature and more as a design signal. The contract no longer needs to live only in your prompt. It can exist as an explicit, typed boundary.
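
To make that concrete, here is a sketch of what handing a schema to a structured-output endpoint can look like. The field names follow the shape of OpenAI's "json_schema" response format for chat completions; the model name and schema name are placeholders, and the exact payload shape varies by vendor.

```python
# Sketch of a structured-output request payload. Field names follow
# OpenAI's chat completions response_format; other vendors differ.
def build_request(prompt: str, schema: dict) -> dict:
    return {
        "model": "gpt-4o-mini",  # placeholder model choice
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "product_fields",  # hypothetical schema name
                "strict": True,            # ask the API to enforce the schema
                "schema": schema,
            },
        },
    }
```

The point is not the vendor syntax. It is that the contract now lives in a typed artifact you can version, review, and test, instead of a sentence buried in a prompt.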

A schema is a product decision

When people talk about structured AI output, they often frame it as an implementation detail. I do not think it is. Designing the schema is part of designing the feature.

Which fields are required? Which values are allowed? Which ambiguity gets preserved for human review instead of being forced into a category? Which fields are safe for the model to suggest, and which ones should only be set by deterministic application logic? Those are product choices.

JSON Schema's object rules make this concrete. You can say that fields are required. You can restrict extra keys with additionalProperties. You can constrain enums and nested shapes. That is not glamour work, but it is where reliability starts.

{
  "type": "object",
  "required": ["brand", "category", "confidence"],
  "additionalProperties": false,
  "properties": {
    "brand": { "type": "string" },
    "category": { "enum": ["shoe", "bag", "watch"] },
    "confidence": { "type": "number" }
  }
}

I like small, strict contracts. They force the feature to admit what it is actually trying to do.
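
Enforcing that contract is a few lines of code. Here is a minimal hand-rolled check for the three rules used above: required keys, no unexpected keys (additionalProperties: false), and enum membership. In practice you would reach for a real validator such as the jsonschema package; this sketch just makes the contract concrete.

```python
# The schema from above, expressed as a Python dict.
SCHEMA = {
    "type": "object",
    "required": ["brand", "category", "confidence"],
    "additionalProperties": False,
    "properties": {
        "brand": {"type": "string"},
        "category": {"enum": ["shoe", "bag", "watch"]},
        "confidence": {"type": "number"},
    },
}

def violations(payload: dict, schema: dict) -> list[str]:
    """Return every contract violation instead of silently accepting output."""
    errors = []
    props = schema.get("properties", {})
    # Required keys must be present.
    for key in schema.get("required", []):
        if key not in payload:
            errors.append(f"missing required field: {key}")
    # additionalProperties: false means no surprise keys.
    if schema.get("additionalProperties") is False:
        errors += [f"unexpected field: {k}" for k in payload if k not in props]
    # Enum values must come from the allowed set.
    for key, rules in props.items():
        if key in payload and "enum" in rules and payload[key] not in rules["enum"]:
            errors.append(f"invalid value for {key}: {payload[key]!r}")
    return errors
```

An invented category like "sneaker" or a missing confidence field now surfaces as an explicit violation instead of a dirty record.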

Validation is necessary. It is not enough.

This is the part that trips people up. A valid object can still be wrong. The schema can guarantee shape. It cannot guarantee truth.

That means the job is not finished once the payload parses. You still need semantic checks. Does the category make sense for this seller? Does the detected brand conflict with existing catalog data? Is the confidence low enough that a human should review it? Did the extracted value change a regulated field that should never be auto-applied?

I think this is where a lot of AI features quietly become normal software engineering again, which is good news. You can validate. You can compare. You can build guardrails. You can hold uncertain cases in a review queue instead of pretending every output deserves the same level of trust.

Keep decisions deterministic when you can

One pattern I like is separating extraction from decision-making. Let the model propose structured facts or candidate labels. Let deterministic code decide what happens next.

That sounds small, but it changes the reliability profile of the whole system. "The model thinks this image is a black leather boot with brand X" is one kind of output. "Therefore publish this product, price it like that, and sync it to four external channels" is a much bigger leap.

I would rather let the model narrow the search space and let application code apply business rules, confidence thresholds, inventory rules, and side effects. The model can be smart. The system around it should still be explicit.
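
In code, the split can look like this. The dataclass, rules, and action names are invented for illustration; the structure is the point: the model only fills in the Proposal, and deterministic logic owns everything that has side effects.

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    """What the model is allowed to suggest -- facts, not actions."""
    category: str
    confidence: float

def decide(p: Proposal, in_stock: bool) -> list[str]:
    # Business rules, thresholds, and inventory rules live here,
    # not in the prompt.
    if p.category in {"shoe", "bag", "watch"} and p.confidence >= 0.9 and in_stock:
        return ["publish", "sync_channels"]
    return ["queue_for_review"]
```

The model narrows the search space; decide() stays boring, testable, and easy to audit.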

Evals beat vibes

Another thing I have become more convinced of is that prompt feel is a bad production metric. You need examples, failure buckets, and a repeatable way to test changes.

NIST's AI Risk Management Framework puts a lot of weight on test, evaluation, verification, and validation across the AI lifecycle, and its generative AI profile calls out the need to measure, monitor, and document erroneous outputs and reliability in operation. I think that is the right instinct. Model quality is not a static property. It is something you keep checking.

The practical version of this is not mysterious. Save representative samples. Track regressions. Keep a set of cases that used to fail. Re-run them when you change prompts, models, schemas, or post-processing. OpenAI's official structured-outputs eval example is worth studying even if you use a different stack, because the habit matters more than the tooling.
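
The steps above fit in a harness small enough to run on every change. Here, extract stands in for your real pipeline, and the case format is an assumption, not a standard.

```python
# A minimal regression harness: representative samples plus cases that
# used to fail, re-run on every change to prompts, models, schemas,
# or post-processing.
def run_evals(extract, cases: list[dict]) -> dict:
    failures = []
    for case in cases:
        got = extract(case["input"])
        if got != case["expected"]:
            failures.append({"input": case["input"], "got": got,
                             "expected": case["expected"]})
    return {"total": len(cases), "failed": len(failures),
            "failures": failures}
```

Wire it into CI and "the prompt feels better" becomes "zero of the forty known-bad cases regressed."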

Free text still has its place

I am not arguing that everything should become JSON. Plenty of AI features are supposed to sound human. Drafting copy, helping a support agent, summarizing a thread, or explaining a result to a user. Those are good places for free-form output.

I only get strict when the output crosses a trust boundary. If it writes to the database, feeds a search index, triggers a workflow, or calls another API, I want a contract and I want validation. That is where "close enough" stops being good enough.

The demos people remember usually come from the model. The production features people trust usually come from the contract around it.

A few things that informed this