Our answer is simulation: synthetic users interact with your agent the same way real users do, and an LLM-based judge evaluates whether the agent responded correctly – across the entire conversation arc, not just in a single turn. Three things make this work:

Scenario creation + real conversation import – Our scenario-creation agent bootstraps your test suite from a description of your agent. But real users find paths a generator wouldn't expect, so we also ingest your production conversations and automatically extract test cases from them. Your coverage evolves with your users.
Mock Tool Platform – Agents call tools. Running simulations against the real API is slow and flaky. Our mock tool platform lets you define a tool's schema, behavior, and return values, so simulations can exercise tool selection and decision making without touching production systems.
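To make this concrete, here is a minimal sketch of what a mocked tool can look like. This is illustrative only – the names (`MockTool`, `lookup_order`) and structure are assumptions, not Cekura's actual API – but it shows the idea: fixed schema, canned responses keyed by input, no production calls.

```python
# Hypothetical sketch of a mocked tool for agent simulations.
# MockTool and lookup_order are illustrative names, not a real API.
from dataclasses import dataclass, field


@dataclass
class MockTool:
    name: str
    schema: dict                                   # JSON-schema-style argument spec
    behavior: dict = field(default_factory=dict)   # canned responses keyed by arguments
    default: dict = field(default_factory=dict)    # fallback return value

    def call(self, **kwargs):
        # Deterministic lookup: known inputs get fixed responses,
        # unknown inputs get the fallback. Production is never touched.
        key = tuple(sorted(kwargs.items()))
        return self.behavior.get(key, self.default)


lookup_order = MockTool(
    name="lookup_order",
    schema={"order_id": {"type": "string"}},
    behavior={(("order_id", "A123"),): {"status": "shipped", "eta": "2 days"}},
    default={"status": "not_found"},
)

print(lookup_order.call(order_id="A123"))  # {'status': 'shipped', 'eta': '2 days'}
```

Because the mock is a pure lookup, the same simulation run always sees the same tool outputs, which matters for the determinism point below.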
Deterministic, structured test cases – LLMs are stochastic, and a CI test that passes "most of the time" is useless. Instead of free-form prompts, our test cases are defined as structured conditional action trees: explicit conditions that trigger specific responses, with support for fixed messages when word-for-word accuracy matters. The synthetic user behaves consistently across runs – same branches, same inputs – so a failure is a real regression, not noise.
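A conditional action tree can be sketched as a deterministic rule list: each node matches a condition on the agent's last message and emits a fixed reply. The conditions and replies below are made up for illustration; the point is that there is no LLM in the loop on the user side, so every run takes the same branch on the same input.

```python
# Hypothetical sketch: a synthetic user as a conditional action tree.
# Conditions are checked in order; the first match wins, and each match
# returns a fixed message, so runs are fully reproducible.
def synthetic_user(agent_message: str) -> str:
    tree = [
        (lambda m: "date of birth" in m.lower(), "It's 1990-04-12."),
        (lambda m: "phone" in m.lower(), "555-0100."),
        (lambda m: "name" in m.lower(), "Jane Doe."),
    ]
    for condition, reply in tree:
        if condition(agent_message):
            return reply
    return "Sorry, can you rephrase that?"


print(synthetic_user("What is your phone number?"))  # 555-0100.
```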
Cekura also monitors your live agent traffic. The obvious choice here is a tracing platform like Langfuse or LangSmith – and they are great tools for debugging individual LLM calls. But conversational agents fail differently: the bug is not in any one turn, it is in how the turns relate to each other. Say you have a verification flow that requires name, date of birth, and phone number before proceeding – if the agent skips asking for date of birth and moves on anyway, each individual turn looks fine in isolation. The failure is visible only when you evaluate the entire session as a unit. Cekura is built around this from the ground up: where tracing platforms evaluate turn by turn, Cekura evaluates the whole session. Imagine a banking agent where the user fails verification in step 1, but the agent hallucinates and proceeds anyway. A turn-based evaluator looks at step 3 (address verification) and marks it green – the correct question was asked. Cekura's judge views the full transcript and flags the session as a failure, because verification was never completed.
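The verification example above can be boiled down to a toy session-level check. This is a simplified sketch (in practice the judge is an LLM, not a keyword rule, and `REQUIRED` is invented here): a turn-level evaluator would score each question individually and pass them all, while a session-level rule fails the run because the agent proceeded without ever asking for one required field.

```python
# Hypothetical sketch of session-level evaluation: scan the whole
# transcript instead of scoring turns in isolation. The session passes
# only if every required field was asked for before the agent proceeded.
REQUIRED = {"name", "date of birth", "phone"}


def evaluate_session(transcript: list[dict]) -> bool:
    asked = set()
    for turn in transcript:
        if turn["role"] != "agent":
            continue
        text = turn["text"].lower()
        for field in REQUIRED:
            if field in text:
                asked.add(field)
        if "you're verified" in text:
            # Agent proceeded: were all required fields actually asked?
            return asked == REQUIRED
    return False


transcript = [
    {"role": "agent", "text": "What's your name?"},
    {"role": "user", "text": "Jane Doe"},
    {"role": "agent", "text": "And your phone number?"},
    {"role": "user", "text": "555-0100"},
    {"role": "agent", "text": "Great, you're verified!"},  # skipped date of birth
]
print(evaluate_session(transcript))  # False
```

Each agent turn here looks reasonable on its own; only the whole-session view reveals that the date-of-birth step never happened.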
Try us at https://www.cekura.ai – 7-day free trial, no credit card required. Paid plans from $30/month.
We also posted a product video if you want to see it in action: https://www.youtube.com/watch?v=n8FFKv1-nMw. The first minute covers onboarding – and if you want to jump straight to the results, skip to 8:40.
We'd love to hear from the HN community – how are you testing for behavioral regressions in your agents? Which failure mode has bitten you hardest? Happy to dig in!