Every LLM feature demos beautifully. The distance between that demo and something you can put in front of customers is mostly boring engineering — and it is exactly where projects die.
Before tuning a single prompt we collect 50–200 real inputs with agreed-correct outputs. That set is the contract: every prompt change, model upgrade or provider switch runs against it before deploy. Without it, “it feels better now” is the only metric you have — and feelings do not survive a model deprecation.
Providers rate-limit, sessions expire, models time out. A trustworthy feature has a degradation ladder decided in advance: switch provider, serve cached result, fall back to the non-AI path, or fail visibly with a human handover. In one of our production systems the render pipeline silently fails over between two generation providers; users notice a slower answer, not an outage.
Log the full chain — input, retrieved context, prompt version, model, latency, output, user reaction. When a customer reports a wrong answer three weeks later, you replay it instead of guessing. Tag every output with the prompt version that produced it; “which prompt was live on Tuesday?” should be a query, not an archaeology project.